AI Accelerator Reliability Uber Tech Lead

Google • $262K — $365K *

Sunnyvale, CA 94087 • In-Person

Technical Services

8 - 10 years of experience

Today

Job Overview by Ladders

Qualifications

Bachelor's degree in a technical field or equivalent experience.
8 years in reliability, systems, hardware, or software engineering with a focus on reliability.
5 years in a technical leadership role overseeing hardware/software reliability.
Experience using reliability analysis techniques like FMEA and FTA.
Familiarity with the product development lifecycle from concept to production.

Responsibilities

Define and implement the end-to-end reliability strategy for a new AI accelerator system.
Establish and enforce reliability principles and best practices for all system components.
Lead cross-functional teams to ensure reliability is integrated into product development.
Conduct fault injection testing and stress tests to assess system reliability.
Develop robust monitoring and diagnostic systems for quick recovery from failures.

Benefits

Access to comprehensive health, dental, and vision plans.
Generous paid time off and flexible work schedule.
Significant retirement savings plan with company matching contributions.
Employee development opportunities and continuous learning programs.

Full Job Description

Minimum qualifications:

Bachelor's degree in Computer Science, Electrical Engineering, Computer Engineering, a related technical field, or equivalent practical experience.
8 years of experience in reliability engineering, systems engineering, hardware engineering, or software engineering with a focus on system-level reliability.
5 years of experience in a technical leadership role, managing reliability for hardware/software systems.
Experience with reliability analysis techniques (e.g., FMEA, FTA, reliability prediction).
Experience with the product development lifecycle, from concept to production.

Preferred qualifications:

Master's degree or PhD in Computer Science, Electrical Engineering or a related technical field or equivalent practical experience.
15 years of experience in reliability engineering, with significant experience in server/data centers.
Experience with data center operations, SRE practices, and designing for serviceability and maintainability at scale.
Proven track record of defining and implementing reliability strategies for novel, large-scale compute or accelerator systems.
Familiarity with AI/ML accelerator architectures and workloads.
Deep understanding of silicon, packaging, PCB, power, and thermal reliability failure mechanisms and mitigation techniques.

About the job

As a Staff Technical Lead, you will own and drive the end-to-end reliability, availability, and serviceability (RAS) for a groundbreaking, next-generation AI accelerator system. This is a unique opportunity for you to lead the reliability engineering efforts for a complex, large-scale hardware/software co-designed platform that will power future critical AI workloads across Google. You will be responsible for defining the reliability strategy, establishing best practices, and influencing a large cross-functional team of hardware, software, and silicon engineers to ensure this new system meets Google's stringent production standards. Your leadership will be instrumental in delivering a robust, resilient, and maintainable platform from concept through to full-scale deployment.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $262,000 - $365,000 (USD), 25% bonus target, plus equity and benefits.

Responsibilities

Define, own, and drive the end-to-end reliability, availability, and serviceability (RAS) strategy for a novel, large-scale AI accelerator system.
Establish and enforce reliability engineering principles, standards, and best practices across all components of the system, including custom ASICs, trays, racks, power, cooling, and the full software stack (firmware, system software, runtime, and orchestration).
Lead and influence cross-functional teams, including Hardware Engineering, Silicon Design, Software Engineering, Supply Chain, Manufacturing, and Site Reliability Engineering (SRE), to ensure reliability is designed in and validated throughout the entire product lifecycle.
Drive the design and implementation of fault injection testing, stress testing, and DiRT-style exercises to validate system behavior under failure conditions.
Define and oversee the development of robust error handling, monitoring, telemetry, and diagnostic capabilities to enable rapid detection, root cause analysis, and recovery from failures.

About Google

Google is a multinational technology company that specializes in Internet-related services and products. These include online advertising technologies, a search engine, cloud computing, software, and hardware. Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University. The company has since grown into one of the most valuable companies in the world. Google's mission is to organize the world's information and make it universally accessible and useful.

Size: 156,500 employees
Market Cap: $1,115.4 billion
Industry: Enterprise Technology
Net Income: $40.2 billion
Founded: 1998
5 Year Trend: +23.3%
Revenue: $182.5 billion
NASDAQ: GOOGL

* Ladders Estimates
