Red Hat

Senior Site Reliability Engineer

Red Hat | $118K - $195K
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 4+ years of software engineering experience in cloud environments
  • Strong proficiency in Go with a focus on production-quality code
  • Deep hands-on Kubernetes experience, including building operators and controllers
  • Solid understanding of AWS fundamentals like EC2 and IAM
  • Proven experience owning production systems with real SLOs and incident response
  • Ability to ramp quickly on complex systems and make contributions within weeks
  • Strong systems thinking with a focus on scalability, reliability, and operability

Responsibilities

  • Contribute production-grade software to upstream open source projects
  • Bring a systems perspective to architecture decisions for scalability and operability
  • Operate as a full-stack systems owner and participate in on-call duties
  • Drive improvements based on operational learning for product enhancements
  • Design and evolve observability metrics and SLOs
  • Raise the technical bar through documentation, code review, and knowledge transfer
  • Work autonomously to identify and lead impactful problems from concept to adoption

Benefits

  • Highly autonomous and collaborative work environment
  • Opportunity to contribute to open source projects
  • Focus on both software engineering and production reliability
  • Chance to work on a complex, high-scale platform
  • Engagement with cutting-edge AI-assisted development tools
Full Job Description
Red Hat is seeking a Senior Software Engineer to join the HCP Platform Engineering team, building and operating ROSA (Red Hat OpenShift Service on AWS) Hosted Control Planes (HCP). ROSA HCP is Red Hat's managed Kubernetes platform on AWS, built on a multi-tenant architecture where Red Hat operates shared control plane infrastructure while customers run workloads in their own AWS accounts.

This is not a standard engineering role. You will write and ship production-grade code, contribute to upstream open source projects, and take ownership of production systems through on-call. All three are equally important.

You'll work at the intersection of software engineering and production reliability on one of Red Hat's most complex and high-scale platforms. The system spans multiple upstream open source projects and shared, multi-tenant infrastructure, requiring strong engineering judgment and end-to-end ownership.

The team is small, highly autonomous, and trusted to solve meaningful problems, from design through production.

What you will do
  • Contribute production-grade software to upstream open source projects including HyperShift and OpenShift, owning features end-to-end from design and implementation through deployment and long-term lifecycle in production
  • Bring a product and systems lens to architecture decisions, ensuring designs account for scalability, operability, and real-world production constraints from the start
  • Operate as a full-stack systems owner, participating in on-call rotations and taking end-to-end responsibility for diagnosing, fixing, and preventing production issues
  • Drive improvements that eliminate entire classes of failures by turning operational learning into durable product and platform enhancements
  • Design and evolve observability (metrics, logs, traces) and SLOs as part of the software lifecycle, ensuring systems are measurable, debuggable, and resilient by design
  • Raise the technical bar across the team through design docs, code review, pairing, and knowledge transfer during complex engineering work and incidents
  • Work in a high-autonomy engineering team where you identify the most impactful problems and lead them from concept through implementation and production adoption
  • Partner as a peer with product and platform engineering teams to influence architecture, challenge assumptions, and ensure systems are built for scale, reliability, and long-term operability
  • Integrate AI-assisted development tools (GitHub Copilot, Cursor, Claude Code) into daily workflows for design, implementation, and debugging - using human judgment to maintain high engineering standards while increasing delivery velocity and system quality


What you will bring

We're looking for system builders - engineers who design, ship, and own with curiosity, range, and sharp engineering judgment. You go deep in your domain and broad enough across adjacent disciplines to make decisions with full context. You think in systems, communicate with precision, and treat AI as a force multiplier for your craft - not a substitute for your judgment.
  • 4+ years of software engineering experience building and shipping production systems in cloud environments, including microservices, platforms, or distributed systems
  • Strong proficiency in Go - you write production-quality code, review it critically, and ramp quickly on large, unfamiliar codebases
  • Deep, hands-on Kubernetes experience from a builder's perspective: you've written operators, controllers, and CRDs in real-world, multi-tenant environments - not just operated clusters others built
  • Solid understanding of AWS fundamentals (EC2, IAM, networking) and how Kubernetes platforms behave and scale on AWS
  • Proven experience owning production systems under real SLOs, including participating in on-call and leading incident response with a focus on root cause and long-term fixes
  • You ramp fast on complex, unfamiliar systems - forming a mental model and making meaningful contributions within weeks
  • Highly self-directed builder mindset: you identify high-impact problems, propose solutions, and drive them end-to-end without waiting for direction
  • Strong systems thinking - you naturally connect design decisions to their downstream impact on scalability, reliability, and operability in production
  • Clear and effective communicator, able to collaborate with engineers on design, architecture, and tradeoffs


Nice to have:
  • Experience with HyperShift, OpenShift, or ROSA in production environments
  • Familiarity with multi-tenant Kubernetes challenges such as noisy neighbors, control plane scaling, and fleet-level lifecycle management
  • Contributions to open source projects, particularly in the Kubernetes ecosystem
  • Experience designing and operating observability at scale (Prometheus, Grafana, Dynatrace, or similar) across large fleets
  • Experience leveraging AI-assisted development tools (e.g., coding agents, AI-driven code review, spec-driven workflows) to accelerate development and improve quality


The salary range for this position is $118,600.00 - $195,680.00. Actual offer will be based on your qualifications.

Pay Transparency

Red Hat determines compensation based on several factors including but not limited to job location, experience, applicable skills and training, external market value, and internal pay equity. Annual salary is one component of Red Hat's compensation package. This position may also be eligible for bonus, commission, and/or equity. For positions with Remote-US locations, the actual salary range for the position may differ based on location but will be commensurate with job duties and relevant work experience.

About Red Hat

Red Hat, Inc. is a leading provider of open source software solutions, including Linux, Kubernetes, and Ansible. The company was founded in 1993 and is headquartered in Raleigh, North Carolina. Red Hat operates in over 100 countries and has more than 13,000 employees worldwide. The company is committed to open source innovation and has a strong community of developers and partners. Red Hat was acquired by IBM in 2019 and is now part of IBM's Hybrid Cloud division.