ESSENTIAL DUTIES AND RESPONSIBILITIES
The following is a summary of the essential functions of this job. Other duties, both major and minor, may be performed that are not mentioned below. Specific activities may change from time to time.
Production Support Leadership & Accountability
Own end-to-end production support operations for multiple mission-critical applications supporting key lines of business, ensuring availability, stability, and performance meet defined SLAs and SLOs. Provide accountable, visible leadership for 24x7 operational support, including on-call models, escalation paths, and incident response effectiveness. Act as the senior escalation point for major incidents, ensuring swift recovery, accurate root cause analysis, and durable remediation.
Incident & Problem Management
Lead cross-functional incident recovery efforts in partnership with Incident Management, engineering teams, infrastructure, and business stakeholders. Ensure timely root cause analysis (RCA), post-incident reviews, and corrective actions that prevent recurrence. Establish and mature a production knowledge base, documenting known issues, recovery procedures, and architectural insights.
Engineering-First & SRE Practices
Drive adoption of Site Reliability Engineering (SRE) and lean engineering principles, including:
Reduction of toil through automation
Engineering-based reliability metrics (error budgets, SLIs/SLOs)
Proactive resilience and failure prevention practices
Champion automation of repetitive and manual operational tasks, including incident detection, response, validation, and recovery where feasible. Promote a culture of preventative engineering, partnering with development teams to improve system reliability upstream.
Monitoring, Observability & AI Enablement
Implement and continuously improve real-time monitoring, alerting, and observability across applications and infrastructure. Measure and optimize the effectiveness of monitoring and alerting to eliminate noise and shorten mean time to detect (MTTD) and mean time to recover (MTTR). Leverage AI and advanced analytics to correlate telemetry data (logs, metrics, traces) and proactively identify emerging risks and root causes. Champion the safe and responsible use of AI within production operations by adhering to enterprise guardrails and protecting sensitive data and system integrity.
Operational Readiness & Change Enablement
Oversee operational readiness across releases, disaster recovery and failover testing, and certificate and dependency lifecycle management. Ensure production support is actively embedded in change planning, minimizing risk from releases and infrastructure changes.
People, Vendor & Financial Management
Lead one or more Agile teams (Scrum, Kanban), including onshore and offshore engineers, fostering high performance and accountability. Manage workforce vendors and partners, setting expectations, reviewing performance, and ensuring delivery quality. Own the budget and staffing plan, aligned to application criticality, operational risk, and business growth objectives.
Risk Management & Governance
Act as the first line of defense in production operations by proactively identifying and mitigating technology, operational, and resiliency risks. Partner effectively with second-line Risk, Audit, and Regulatory teams, ensuring findings are addressed and controls are continuously improved. Ensure compliance with internal policies, regulatory requirements, and external audit expectations. Own and drive remediation plans for risk, audit, and regulatory findings, ensuring timely, effective and sustainable resolution. Lead responses to audit and regulatory inquiries, including providing evidence, clarifying controls, and appropriately challenging findings based on documented compliance.
Strategy, Influence & Continuous Improvement
Serve as a trusted advisor to senior Technology and Business leaders, communicating operational health, risk posture, and improvement roadmaps. Lead or contribute significantly to large-scale initiatives, platform transformations, or regulatory-driven efforts. Continuously assess organizational maturity and lead initiatives to improve reliability, efficiency, and talent capability.
Management Responsibilities
Agile & Operating Model Expectations
Act as an Agile and DevOps champion, embedding production support within fast-moving delivery models.
Balance “keep-the-lights-on” operational excellence with continuous engineering improvement.
Drive measurable outcomes such as improved uptime, reduced incident volume, faster recovery, and improved customer experience.