Google

Staff Site Reliability Engineer

Google: $207K to $301K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science or a related technical field, or equivalent practical experience.
  • 8 years of experience in building and developing infrastructure or distributed systems.
  • 5 years of troubleshooting and debugging experience.
  • 5 years of experience architecting production-quality Machine Learning (ML) systems.
  • 5 years of programming experience in C, Go, or Python.

Responsibilities

  • Lead initiatives to reduce support costs through intelligent alerting and system design improvements.
  • Advance the SRE team from on-call incident responders to proactive system partners.
  • Establish trust and influence with key stakeholders for effective system scaling.
  • Identify and resolve pain points for the team, partners, and customers with balanced solutions.
  • Collaborate with critical customers to enhance the reliability of their user experiences.

Benefits

  • Opportunity to work on large-scale, fault-tolerant systems with Google Cloud.
  • Mentorship and support to encourage learning and professional development.
  • Collaborative and inclusive team environment that fosters diverse perspectives.
  • Focus on automating work to optimize existing systems.

Full Job Description

Minimum qualifications:
  • Bachelor's degree in Computer Science, a related technical field, or equivalent practical experience.
  • 8 years of experience building and developing infrastructure or distributed systems.
  • 5 years of experience in troubleshooting and debugging.
  • 5 years of experience building and architecting production quality Machine Learning (ML) systems.
  • 5 years of experience programming in C, Go, or Python.

Preferred qualifications:
  • Master's degree in Computer Science, or a related technical field.
  • Experience in Site Reliability Engineering.
  • Experience in troubleshooting and supporting applications such as web services, data storage, databases, data pipelines, and commerce engines, on Linux/Unix or other operating systems.


About the job

Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services, both our internally critical and our externally visible systems, have reliability, uptime appropriate to customers' needs, and a fast rate of improvement. Additionally, SREs will keep an ever-watchful eye on our systems' capacity and performance.

Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

In this role, you will drive the supportability and reliability of Woodshed and Napa, two key data intelligence systems underlying Google's AI push.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $207,000 - $301,000 (USD); 20% bonus target; bonus, equity, and benefits

Responsibilities
  • Lead the team in our top 2026 challenge: reducing the support cost of the products via correct provisioning, intelligent alerting, and system design and deployment improvements.
  • Grow the Site Reliability Engineering (SRE) team from trained on-callers and incident responders to system partners.
  • Build trust with and influence over key stakeholders to drive successful scaling of the supportability of complex systems.
  • Identify problems and pain points of the team, dev partner teams, and customers, and drive solutions balancing short-term and long-term needs.
  • Work with critical customers to give them the reliability they need for their key user journeys.


About Google

Google is a multinational technology company that specializes in Internet-related services and products. These include online advertising technologies, a search engine, cloud computing, software, and hardware. Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University. The company has grown tremendously since then and has become one of the most valuable companies in the world. Google's mission is to organize the world's information and make it universally accessible and useful.
Size: 156,500 employees
Market Cap: $1,115.4 billion
Net Income: $40.2 billion
Founded: 1998
5 Year Trend: +23.3%
Revenue: $182.5 billion
Exchange: NASDAQ
