Site Reliability Engineer, Manager

Joint Activities

$135K - $216K
US-Anywhere (Remote in United States)
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years of experience in site reliability engineering or similar roles in complex, multi-vendor environments.
  • In-depth knowledge of cloud-native infrastructures and container orchestration (e.g., Kubernetes).
  • Experience with automation tools like Terraform, Ansible, or Chef.
  • Proficient in observability technologies such as Prometheus and Grafana.
  • Strong programming skills in languages like Python or Go for automation.
  • Expertise in defining SLIs, SLOs, and error budgets.
  • Excellent communication skills for collaboration across teams.

Responsibilities

  • Design and implement reliability frameworks, including SLOs and automated incident response systems.
  • Lead the development of observability platforms using advanced monitoring tools.
  • Coordinate with vendors and internal teams to manage diverse systems and ensure reliability standards.
  • Drive incident response strategies and lead root cause analysis.
  • Mentor engineering teams and advocate best practices in reliability engineering.
  • Collaborate with product development and security teams for seamless integration of reliability.
  • Prepare executive-level presentations to communicate technical challenges and business impacts.

Benefits

  • Opportunities for career advancement within a large-scale cloud ecosystem.
  • Leadership role with significant autonomy in decision-making.
  • Collaboration with diverse teams across the organization.
  • Involvement in strategic initiatives that impact a vast user base.
  • Support for continuous learning and professional development.

Full Job Description

Responsibilities

Peraton is seeking a Site Reliability Engineer (SRE), Manager: a highly experienced professional responsible for ensuring the availability, reliability, and performance of complex systems in a multi-vendor environment. This role combines deep technical expertise in infrastructure, automation, and system architecture with leadership and collaboration skills to drive reliability frameworks, proactive monitoring, and incident response across diverse platforms and teams.

The Site Reliability Engineer, Manager operates with significant autonomy, architecting solutions that enhance system observability, scalability, and fault tolerance. They lead reliability initiatives, mentor engineering teams, and collaborate with multiple vendors and internal stakeholders to align reliability strategies with business objectives and customer needs. This role is ideal for a highly skilled engineer who excels in technical leadership, complex system architecture, and multi-stakeholder environments. Principal Site Reliability Engineers are key to building resilient systems that scale efficiently while minimizing downtime and risk.

This opportunity will support the modernization of a large-scale, multi-tenant cloud ecosystem, providing critical enterprise-wide support for more than 40 million users in a complex stakeholder environment. This position requires senior-level leadership skills combined with modern cloud and industry-leading technical capabilities, including product development, strict security compliance, current cloud solutions, reliable application delivery with SaaS and Artificial Intelligence integrations, and rapid continuous delivery.

Core Responsibilities

  • Reliability Architecture and Automation: Design, implement, and oversee reliability frameworks, including SLOs, error budgets, and automated incident response systems. Develop and maintain CI/CD pipelines to ensure seamless deployment and procedural efficiency.
  • Observability and Monitoring: Lead the creation and enhancement of observability platforms using metrics, logging, and tracing tools. Utilize modern technologies like OpenTelemetry, AI/ML for anomaly detection, and streaming data platforms to proactively detect and resolve issues.
  • Multi-Vendor Collaboration: Coordinate with external vendors and internal teams to integrate and manage diverse systems and tools. Ensure consistent reliability standards and practices are maintained across different technology stacks and service providers.
  • Incident Management and Risk Mitigation: Drive incident response strategy by leading root cause analysis, post-mortem reviews, and continuous improvement efforts. Identify potential risks and implement mitigation strategies to prevent service disruptions. 

Leadership and Collaboration

  • Technical Leadership: Mentor site reliability and engineering teams, fostering a culture of reliability, automation, and continuous learning. Advocate for best practices in system design and reliability engineering.
  • Cross-Functional Partnership: Work closely with product development, DevOps, and security teams to integrate reliability into the software development lifecycle. Influence platform strategy and roadmap based on reliability insights.
  • Strategic Influence: Collaborate with senior stakeholders and vendors on long-term reliability goals. Prepare executive-level presentations that translate technical challenges into business impact.
  • Agile and DevOps Practices: Lead and refine agile workflows to enhance team productivity and reliability outcomes. Champion DevOps methodologies to align development and cloud services efforts.

**Position could support/work across multiple enterprise-wide efforts within Peraton.**

Qualifications

Key Skills and Qualifications:

  • Extensive experience (10+ years) in site reliability engineering or related roles, preferably in multi-vendor and complex environments. 
  • Deep knowledge of cloud-native infrastructure, container orchestration (e.g., Kubernetes), and automation tools such as Terraform, Ansible, or Chef.
  • Proficiency in observability technologies such as Prometheus, Grafana, OpenTelemetry, and log aggregation systems.
  • Strong programming and scripting skills for automation and tooling (Python, Go, or similar).
  • Expertise in defining and implementing SLIs, SLOs, and error budgets.
  • Excellent communication skills for collaboration with diverse teams and external vendors.
  • Proven ability to lead large-scale reliability initiatives and mentor engineering teams.
  • Strategic thinker with a focus on aligning reliability engineering with business priorities and customer experience. 

Clearance Requirements:

  • U.S. Citizenship required
  • Ability to obtain agency clearance (public trust)

Preferred Qualifications:

  • Top Secret clearance

Target Salary Range: $135,000 - $216,000. This represents the typical salary range for this position. Salary is determined by various factors, including but not limited to, the scope and responsibilities of the position, the individual’s experience, education, knowledge, skills, and competencies, as well as geographic location and business and contract considerations. Depending on the position, employees may be eligible for overtime, shift differential, and a discretionary bonus in addition to base pay.
