About the role:
As a Senior Site Reliability Engineer, you will lead the architectural strategy and operational excellence for Anyscale's global production systems. You will move beyond day-to-day maintenance to design autonomous, self-healing infrastructure that aligns with our rapid scaling goals. You will be a key driver in establishing a culture of reliability, ensuring that every engineering team at Anyscale can deploy with confidence and high visibility.
As part of this role, you will:
- Architect and develop a unified perspective on how cloud components are utilized across the company, taking into account diverse needs and requirements.
- Ensure that deployment methodologies align with the company's reliability goals.
- Design and implement systems that promote understanding of production environments, facilitating quick identification of issues through robust observability infrastructure for metrics, logging, and tracing.
- Create monitoring and alerting systems at different levels, enabling teams to easily contribute and enhance the overall monitoring capabilities.
- Establish testing infrastructure to support the team in writing and executing tests effectively.
- Define and champion organization-wide Service Level Objectives (SLOs) and Error Budgets, ensuring they are integrated into the product development lifecycle.
- Implement best practices and on-call systems, ensuring efficient incident management and up-leveling the incident management system at Anyscale.
- Coordinate the creation and deployment of cloud-based services, including tracking deployments and establishing effective communication channels for issue resolution.
We'd love to hear from you if you have:
- At least 5 years of relevant work experience in a Site Reliability or DevOps role, with a proven track record in high-growth environments.
- Deep experience managing large-scale distributed systems and microservices architectures in multi-cloud environments (AWS, GCP, or Azure).
- Advanced proficiency in at least one programming language (e.g., Python, Go) and extensive experience with IaC tools like Terraform.
- Hands-on experience architecting and troubleshooting production-grade Kubernetes clusters.
- Demonstrated ability to mentor junior engineers, lead complex technical projects, and influence engineering culture without direct authority.
- Strong ability to leverage data from logging and tracing infrastructure to identify long-term architectural trends and bottlenecks.