Job Overview:
We are seeking a dynamic, motivated, and experienced Recovery Manager to join our Production Services organization. This Assistant Vice President (AVP) role is a senior individual contributor position with a high degree of operational influence. In this role, you will lead recovery efforts for major and critical production incidents, guide cross‑functional teams through effective resolution, and drive high‑quality post‑incident outcomes. You will play a key role in executing recovery strategies, improving recovery readiness, strengthening runbooks and SOPs, and embedding Site Reliability Engineering (SRE) practices into day‑to‑day operations.
This is an exciting opportunity for a hands‑on operational leader to drive measurable improvements in stability, resilience, and the advisor and investor experience, while partnering closely with engineering, infrastructure, observability, and automation teams.
Responsibilities:
Lead and coordinate cross‑functional technical teams during major and critical incidents, ensuring timely recovery and effective stakeholder engagement.
Serve as a recovery lead during declared major incidents, maintaining focus on service restoration and customer impact.
Participate in and facilitate post‑incident reviews and post‑mortems, ensuring outcomes are actionable and measurable.
Drive high‑quality root cause analysis for major incidents using structured techniques such as 5‑Why, Fishbone, and Blameless RCA.
Ensure contributing factors (process, technology, observability, automation, or human factors) are clearly identified and documented.
Partner with domain teams to translate findings into concrete remediation actions.
Develop, document, and maintain incident recovery plans, SOPs, runbooks, and playbooks in collaboration with domain owners.
Support and execute mock drills, recovery tests, and readiness exercises to improve response effectiveness.
Ensure recovery documentation remains accurate, consumable, and operationally relevant.
Work with application, infrastructure, and platform teams to improve diagnostic accuracy and time‑to‑engage during incidents.
Help establish clear ownership, escalation paths, and recovery patterns to reduce dependency on ad‑hoc tribal knowledge.
Promote repeatable recovery patterns across services.
Identify opportunities to improve service reliability, operational maturity, and recovery effectiveness.
Analyze incident data and trends to recommend targeted improvements across people, process, and technology.
Support adoption of SRE‑aligned practices, including error budgets, readiness reviews, and failure mode awareness.
Provide structured feedback to Observability, Automation, Resiliency, and Domain teams on gaps in monitoring, alerts, and diagnostics; single points of failure; and architectural or design weaknesses that impact recoverability.
Act as an operational voice to ensure post‑incident learnings inform engineering and platform decisions.
Mentor junior recovery managers or operational staff through hands‑on incident participation and coaching.
Contribute to operational training sessions, tabletop exercises, and knowledge‑sharing initiatives.
Maintain awareness of industry best practices in production operations, incident management, and SRE.
Goals & Success Metrics:
Achieve a 30% reduction in MTTD and MTTR within the first year
Correctly identify the offending service and probable root cause for 70%+ of major incidents within 15–20 minutes of triage
Improve recovery readiness through regular mock drills, training sessions, and improved documentation
Build strong working relationships with key technology and business partners to support consistent, effective recovery outcomes
Requirements:
5+ years of experience in Production Services, Incident Management, Recovery Management, Problem Management, SRE, DevOps, or related disciplines
2+ years of experience with application, infrastructure, and/or cloud technologies, enabling effective triage and informed recovery leadership
2+ years of experience using observability tools, logs, metrics, and diagnostics to troubleshoot production issues
Core Competencies:
Strong communication and interpersonal skills, with the ability to collaborate effectively across technical and non‑technical teams
Comfortable engaging with senior leaders and executives, translating technical incidents into clear business impact
Demonstrated ability to influence without direct authority
Analytical and detail‑oriented, with the ability to translate incident data into actionable improvements
Experience identifying incident trends and contributing to measurable operational improvements
Ability to ask the right technical questions and guide teams toward faster resolution (development background preferred but not required)
#LI-Hybrid
Pay Range:
$112,476.00 - $187,460.00
Actual base salary varies based on factors including, but not limited to, relevant skill, prior experience, education, base salary of internal peers, demonstrated performance, and geographic location. Additionally, the LPL Total Rewards package is highly competitive and designed to support your success at work, at home, and at play, including 401(k) matching, health benefits, employee stock options, paid time off, volunteer time off, and more. Your recruiter will be happy to discuss all that LPL has to offer!