Perficient

Incident Recovery Manager

Perficient: $81K - $149K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in Production Services, Recovery, and Problem Management with a background in development.
  • Strong knowledge of infrastructure, cloud, and application technology components.
  • Hands-on experience with observability tools and logs for troubleshooting.
  • Familiarity with cloud platforms (Azure, AWS, GCP) and disaster recovery architectures.
  • Experience with ITSM tools like ServiceNow and automated incident management workflows.
  • Proven success in large-scale enterprise production support.
  • Understanding of CI/CD pipelines and their impact on production stability.

Responsibilities

  • Lead cross-functional teams to address major incidents and conduct post-mortem reviews.
  • Conduct root cause analysis of incidents using various techniques like Blameless RCA.
  • Develop and test incident recovery plans, creating SOPs and knowledge bases.
  • Create playbooks to enhance self-sufficiency in diagnosing issues.
  • Identify process improvements for IT service reliability and operational risk reduction.
  • Provide feedback to teams on enhancing observability and addressing failure points.
  • Mentor team members and promote a culture of ongoing learning and professional growth.

Benefits

  • Opportunities for continuous learning and professional development.
  • Collaborative work environment with cross-functional teams.
  • Engagement in cutting-edge projects involving cloud modernization.
  • Involvement in driving innovation and operational excellence.
  • Potential for career advancement within a dynamic organization.

Full Job Description

We currently have a career opportunity for an Incident Recovery Manager to join our team located in Charlotte, NC.

Job Overview:

The Incident Recovery Manager will lead the recovery of critical incidents following major disruptions and drive strategies to ensure effective, timely restoration of services. This role is responsible for managing major incident recovery, leading post-mortem reviews, developing SOPs and runbooks, and implementing proactive measures to prevent future failures through the adoption of SRE principles.

As a dynamic and motivated leader, you will play a key role in shaping and enhancing production support and operational capabilities. You will help identify and implement best-fit solutions that improve stability, reliability, and efficiency across the organization.

This is an opportunity to drive meaningful change and enhance the end user experience. If you are passionate about production operations, stability, SRE, and observability, and have a proven track record of success, we invite you to join us in advancing resilience and operational excellence.

Responsibilities

  • Major Incident Support: Drive cross-functional teams to resolve critical incidents and attend post-mortem/post-incident reviews.
  • Root Cause Analysis (RCA): Investigate underlying causes of major incidents, using techniques such as 5 Whys, Fishbone diagrams, and Blameless RCA
  • Recovery Strategy & Planning: Develop and test incident recovery plans, establish SOPs and a knowledge base, and run mock drills
  • Self-Sufficiency: Develop playbooks in coordination with domain owners to improve team self-sufficiency and diagnostic accuracy
  • Process Improvement: Identify opportunities to improve IT service reliability and reduce operational risks related to people, process and technology
  • Feedback Loop: Provide continuous feedback to Observability, Automation, Resiliency and Domain teams on improving observability posture and automation, and on addressing single points of failure and architectural and design gaps
  • Training and Development: Mentor and develop other team members, providing training. Stay current with industry best practices and technologies, fostering a culture of continuous learning and professional growth.
  • Performance Monitoring & Analytics: Utilize analytical and technical skills to assess system performance, monitor incident trends, and drive continuous improvement initiatives.
  • Cross-Functional Collaboration: Collaborate closely with engineering teams and third-party vendors during major incidents and on system design, feasibility, and architecture to improve stability and meet resilience objectives.


Qualifications

  • Progressive, proven experience and expertise in Production Services, Recovery and Problem Management, SRE, DevOps, or related fields; a development background is preferred
  • Strong understanding of foundational technology components across infrastructure, cloud and applications, sufficient to diagnose issues, ask the right questions and effectively lead the recovery of a critical incident
  • Hands-on experience with observability tools, logs and diagnostics to troubleshoot issues and coach others
  • Experience with cloud platforms (e.g., Azure, AWS, GCP) including high availability and disaster recovery architectures
  • Experience with incident management and ITSM tools (e.g., ServiceNow, PagerDuty, Opsgenie) and automated workflows
  • Demonstrated success in complex, large-scale enterprise production support and operations environments, including experience working with large geographically distributed teams
  • Understanding of CI/CD pipelines and deployment strategies (e.g., blue-green, canary) and their impact on production stability
  • Excellent communication and interpersonal skills, with a focus on collaboration and relationship-building
  • Able to communicate effectively with CXOs and translate complex technical details into business terms
  • Ability to influence and drive change across the organization
  • Analytical mindset with the ability to translate data into actionable insights
  • Experience in analyzing incident trends and implementing process improvements to enhance operational efficiency
  • Strong decision-making skills under high-pressure incident scenarios with the ability to balance speed and risk
  • Proactive mindset with a focus on prevention, continuous improvement, and operational excellence
  • Strong ownership and accountability with a bias toward action and results
  • Ability to mentor, coach, and elevate technical and operational capabilities across teams


ABOUT THE TEAM

Our App Modernization team helps businesses transform legacy systems and build future-ready applications. We deliver end-to-end solutions that combine cloud migration, custom application development, multi-cloud strategies, and modern UI and API integration. With expertise in DevSecOps, modern frameworks, and enterprise platforms, our team of engineers, architects, and project leaders partners with leading brands to drive innovation, accelerate delivery, and create lasting business impact. We also integrate AI-driven capabilities, such as intelligent automation, predictive analytics, and generative development tools, to enhance scalability, performance, and user experience.

ADDITIONAL INFORMATION

Applications will be accepted until the position is filled or the posting is removed.

The salary range for this position takes into consideration a variety of factors, including but not limited to skill sets, level of experience, applicable office location, training, licensure and certifications, and other business and organizational needs. The new hire salary range displays the minimum and maximum salary targets for this position across all US locations, and the range has not been adjusted for any specific state differentials. It is not typical for a candidate to be hired at or near the top of the range for their role, and compensation decisions are dependent on the unique facts and circumstances regarding each candidate. A reasonable estimate of the current salary range for this position is $81,978 to $149,880. Please note that the salary range posted reflects the base salary only and does not include benefits or any potential variable compensation programs. Information regarding the benefits available for this position is in our benefits overview.


About Perficient

Perficient is a leading digital consultancy that helps companies transform their businesses and operations through technology. They deliver solutions to clients that range from Fortune 500 companies to emerging businesses. Perficient has a broad range of capabilities, including strategy, design, technology, and operations. They have expertise in a variety of industries, including healthcare, financial services, retail, and energy. Perficient has been recognized as a top employer and a top company for women technologists. They are committed to giving back to their communities through philanthropy and volunteerism.
Size: 6,079 employees
Market Cap: $2.4 billion
Industry
Net Income: $30.1 million
Founded: 1998
5 Year Trend: +9.3%
Revenue: $612.1 million
NASDAQ
