About the Role
The NOC Team Lead will provide leadership for a distributed team of NOC Engineers responsible for monitoring a global cloud infrastructure for several StackPath platforms. The NOC Team has end-to-end ownership of incident management and response: performing initial troubleshooting of production issues, escalating problems, performing testing, and communicating issues internally during and after incident resolution.
This role will report to our Platform Operations Manager.
Essential Duties and Responsibilities
- Provide team leadership for the 24x7 NOC Team spanning multiple continents and time zones.
- Work closely with Shift Supervisors and Platform Operations Management to ensure adherence to standard policies and procedures for alert handling and incident response.
- Ensure training and communication of policies and procedures for the 24x7 NOC Team.
- Ensure all NOC team members have access to all tools needed to perform their duties and are trained on their use.
- Ensure shift handovers occur reliably and effectively according to industry best practices.
- Monitor and report on team KPIs and provide guidance to NOC team members to improve performance.
- Evaluate NOC procedures and policies on a continuous basis and provide recommendations for improvement to Platform Operations Management.
- Maintain documentation of technical procedures and playbooks used by the NOC.
- Track escalations to other teams and work with other team leads / managers to drive escalations to resolution.
- Track follow-up action items relating to platform incidents and work with other teams to drive these items to resolution.
- Participate in and organize projects involving the NOC team.
Desired Skills and Experience
- ITIL 4 Foundation certification or higher.
- 3+ years working in a 24x7x365 high-availability environment.
- 5+ years working with Linux in a distributed server environment.
- 3+ years of experience in a team leadership role.
- Some networking experience preferred.
- Exceptional written and verbal communication skills.
- Exceptional troubleshooting and problem-solving skills.
- Demonstrated ability to work remotely with a team.
- Demonstrated punctuality and reliability.
- Ability to work nights/weekends/holidays as needed and to participate in an on-call rotation for incident response.