Minimum qualifications:
- Bachelor's degree in Computer Science, Electrical Engineering, Computer Engineering, a related technical field, or equivalent practical experience.
- 8 years of experience in reliability engineering, systems engineering, hardware engineering, or software engineering with a focus on system-level reliability.
- 5 years of experience in a technical leadership role, managing reliability for hardware/software systems.
- Experience with reliability analysis techniques (e.g., FMEA, FTA, reliability prediction).
- Experience with the product development lifecycle, from concept to production.
Preferred qualifications:
- Master's degree or PhD in Computer Science, Electrical Engineering, or a related technical field, or equivalent practical experience.
- 15 years of experience in reliability engineering, with significant experience in server and data center environments.
- Experience with data center operations, SRE practices, and designing for serviceability and maintainability at scale.
- Proven track record of defining and implementing reliability strategies for novel, large-scale compute or accelerator systems.
- Familiarity with AI/ML accelerator architectures and workloads.
- Deep understanding of silicon, packaging, PCB, power, and thermal reliability failure mechanisms and mitigation techniques.
About the job
As a Staff Technical Lead, you will own and drive the end-to-end reliability, availability, and serviceability (RAS) for a groundbreaking, next-generation AI accelerator system. This is a unique opportunity for you to lead the reliability engineering efforts for a complex, large-scale hardware/software co-designed platform that will power future critical AI workloads across Google. You will be responsible for defining the reliability strategy, establishing best practices, and influencing a large cross-functional team of hardware, software, and silicon engineers to ensure this new system meets Google's stringent production standards. Your leadership will be instrumental in delivering a robust, resilient, and maintainable platform from concept through to full-scale deployment.
Individual pay is determined by factors including job-related skills, experience, and relevant education or training.
US: $262,000 - $365,000 (USD) + bonus (25% target) + equity + benefits
Learn more about benefits at Google.
Responsibilities
- Define, own, and drive the end-to-end reliability, availability, and serviceability (RAS) strategy for a novel, large-scale AI accelerator system.
- Establish and enforce reliability engineering principles, standards, and best practices across all components of the system, including custom ASICs, trays, racks, power, cooling, and the full software stack (firmware, system software, runtime, and orchestration).
- Lead and influence cross-functional teams, including Hardware Engineering, Silicon Design, Software Engineering, Supply Chain, Manufacturing, and Site Reliability Engineering (SRE), to ensure reliability is designed in and validated throughout the entire product lifecycle.
- Drive the design and implementation of fault injection testing, stress testing, and DiRT-style exercises to validate system behavior under failure conditions.
- Define and oversee the development of robust error handling, monitoring, telemetry, and diagnostic capabilities to enable rapid detection, root cause analysis, and recovery from failures.