Google

Tech Lead Site Reliability Engineer, Cloud Reliability Intelligence

Google: $207K - $301K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science or related field or equivalent experience.
  • 8 years of experience with data structures and algorithms.
  • 3 years of leading projects and troubleshooting distributed systems.
  • 3 years of technical leadership overseeing projects.
  • Experience with full-stack architectures, linking backend data automation to frontend engineering.

Responsibilities

  • Own the technical roadmap and architecture for the Evergreen platform.
  • Design and scale high-performance backend pipelines and data-rich user interfaces.
  • Prototype and implement LLM-based features for incident data processing.
  • Collaborate with Product Management and Data Science to align policy measurement and enforcement.

Benefits

  • Supportive environment with mentorship opportunities.
  • Collaborative culture that encourages innovation and risk-taking.
  • Focus on intellectual curiosity and problem-solving.
  • Opportunities for self-directed work on meaningful projects.
Full Job Description
Minimum qualifications:
  • Bachelor's degree in Computer Science or a related technical field or equivalent practical experience.
  • 8 years of experience with data structures and algorithms.
  • 3 years of experience leading projects and designing, analyzing, and troubleshooting distributed systems.
  • 3 years of experience in a technical leadership role, overseeing projects.
  • Experience overseeing full-stack architectures, ensuring cohesion between backend data automation layers and frontend engineering.

Preferred qualifications:
  • Experience in applying LLMs or Generative AI to automate workflows.
  • Experience designing and scaling high-performance backend pipelines (Go, Java) and data-rich user interfaces (TypeScript, Angular).
  • Familiarity with large-scale reliability analysis or policy conformance frameworks.


About the job
Site Reliability Engineering (SRE) combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Google Cloud's services (both our internally critical and our externally visible systems) have reliability and uptime appropriate to customers' needs, along with a fast rate of improvement. Additionally, SREs keep an ever-watchful eye on our systems' capacity and performance.
Much of our software development focuses on optimizing existing systems, building infrastructure and eliminating work through automation. On the SRE team, you'll have the opportunity to manage the complex challenges of scale which are unique to Google Cloud, while using your expertise in coding, algorithms, complexity analysis and large-scale system design. SRE's culture of intellectual curiosity, problem solving and openness is key to its success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on meaningful projects, while we also strive to create an environment that provides the support and mentorship needed to learn and grow.

The Reliability Outcome Enablement team develops the products, core infrastructure, and datasets that drive and sustain Google Cloud Platform's (GCP's) reliability promises. We build the Evergreen intelligence platform, the core system that automates resilience across the GCP ecosystem. Every product team at Google (from BigQuery to Spanner) relies on our infrastructure and integrated data lake to keep their services bulletproof.

We are currently expanding our platform to integrate Generative AI and LLM-driven workflows, moving from reactive tracking to a predictive system that catches failures and automates risk mitigation.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $207,000 - $301,000 (USD), 20% bonus target, equity, benefits

Learn more about benefits at Google.

Responsibilities
  • Own the technical roadmap and long-term architecture for the Evergreen platform, including a unified data model for promise delivery across GCP.
  • Design and scale high-performance backend pipelines (Go, Java) and data-rich user interfaces (TypeScript, Angular) used by over 10,000 Google engineers.
  • Prototype and productionize LLM-based features to parse unstructured incident data, automatically file risk tickets, and suggest reliability fixes.
  • Partner closely with Product Management, Data Science, and leadership to align multiple organizations on a unified approach to policy measurement and enforcement.


About Google

Google is a multinational technology company that specializes in Internet-related services and products. These include online advertising technologies, a search engine, cloud computing, software, and hardware. Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University. The company has grown tremendously since then and has become one of the most valuable companies in the world. Google's mission is to organize the world's information and make it universally accessible and useful.
Size: 156,500 employees
Market Cap: $1,115.4 billion
Net Income: $40.2 billion
Founded: 1998
5 Year Trend: +23.3%
Revenue: $182.5 billion
Exchange: NASDAQ
