Must Have Technical/Functional Skills
- 6+ years of IT Service Management experience, with a minimum of 3 years in a dedicated Major Incident Management or Incident Commander role in a large enterprise (Fortune 500 / FTSE 100 equivalent complexity).
- ITIL 4 Managing Professional or ITIL 4 Specialist: High Velocity IT certification (ITIL 4 Foundation minimum required).
- Demonstrable experience managing Azure platform incidents: working knowledge of Azure Monitor, Azure Service Health, Log Analytics, Application Insights, and Microsoft support escalation paths.
- Proven ability to command high-pressure P1 incidents involving 20+ stakeholders across technical and executive levels simultaneously.
- Expert-level proficiency in ServiceNow ITSM, including the Incident, Problem, and Change modules and dashboard/report building.
- Strong data analysis skills: ability to analyze incident trends, build KPI dashboards, and present actionable insights to senior leadership.
Roles & Responsibilities
Major Incident Command & Coordination
- Serve as the single accountable owner for all P1 and P2 major incidents across on-premises and Azure-hosted services, from initial declaration through resolution and post-incident closure.
- Convene and chair live incident bridge calls and virtual war rooms using Microsoft Teams, coordinating across 10+ internal technical resolver groups, managed service partners, and Microsoft Azure Support (Unified Support escalations).
- Drive swift triage by leveraging Azure Service Health, Resource Health, and Azure Monitor dashboards to rapidly establish scope, affected services, and blast radius within the first 15 minutes of an incident.
- Make and enforce escalation decisions, including engaging Microsoft CSS P1 Severity A support cases and activating DR runbooks where service restoration via normal means is not achievable within RTO.
- Maintain clear, timely, and audience-appropriate stakeholder communications throughout the incident lifecycle, including CEO/CISO executive briefings for business-critical outages.
Post-Incident Review & Continual Improvement
- Facilitate structured, blameless Post-Incident Reviews (PIRs) within agreed SLAs (P1: 48 hours; P2: 5 business days); produce high-quality PIR reports consumed by the CTO and the Board Technology Committee.
- Own the incident action item registry; chair weekly SIP (Service Improvement Plan) reviews to ensure commitments are delivered on time and to quality.
- Identify systemic incident patterns through trend analysis using ServiceNow and Log Analytics; collaborate with Problem Management to drive root cause elimination for repeat incidents.
- Define, track, and report on enterprise incident management KPIs (MTTD, MTTR, incident recurrence rate, SLA compliance, and customer impact hours), presented to IT leadership in monthly operational reviews.
Process Ownership & ITSM Governance
- Own, maintain, and continuously improve the enterprise Major Incident Management process, policy, playbooks, and runbooks, aligned to ITIL 4 and the organization's IT Risk and Control Framework.
- Define and govern the incident severity classification matrix and escalation decision tree; ensure consistent adoption across all IT towers and managed service partners.
- Maintain and test the enterprise crisis communication framework, including stakeholder notification trees, bridge protocols, and executive communication templates.
- Collaborate with Change Management to ensure CAB processes adequately assess change-induced incident risk; maintain correlation tracking between changes and incidents.
Azure Operations & Cloud Incident Specifics
- Develop and maintain Azure-specific incident playbooks covering platform scenarios: AKS node/pod failures, Azure SQL failover events, ExpressRoute circuit drops, Azure Active Directory (Entra ID) authentication outages, and Azure region-wide service incidents.
- Maintain working relationships with the Microsoft TAM (Technical Account Manager) and the Azure Rapid Response team; ensure escalation paths to Microsoft CSS are exercised and SLAs are understood.
- Monitor Azure Service Health and the Microsoft 365 Service Health Dashboard proactively; initiate pre-emptive incident declarations for advisory/degraded-service notifications affecting business-critical services.
- Participate in Azure Operational Reviews with Cloud Platform and SRE teams to identify observability gaps, alerting blind spots, and runbook deficiencies before they manifest as major incidents.
Capability Building & Stakeholder Engagement
- Design and deliver MIM process training programmes for Level 1/2 Service Desk, resolver groups, and technology leadership; conduct quarterly simulation exercises (GameDay / IncidentEx).
- Act as a subject matter expert in enterprise-wide DR and BCP exercises; validate incident response readiness across all Azure-hosted Tier-0 services.
- Build and manage a network of Incident Coordinators across global IT towers to support follow-the-sun incident coverage.
Salary Range: $100,000-$120,000 a year
TCS Employee Benefits Summary:
Discretionary Annual Incentive.
Comprehensive Medical Coverage: Medical & Health, Dental & Vision, Disability Planning & Insurance, Pet Insurance Plans.
Family Support: Maternal & Parental Leaves.
Insurance Options: Auto & Home Insurance, Identity Theft Protection.
Convenience & Professional Growth: Commuter Benefits, Certification & Training Reimbursement.
Time Off: Vacation, Time Off, Sick Leave & Holidays.
Legal & Financial Assistance: Legal Assistance, 401K Plan, Performance Bonus, College Fund, Student Loan Refinancing.