Technical Program Manager III, GPU Infrastructure Reliability, Google Cloud

Google • $163K — $237K *

Sunnyvale, CA 94087 • In-Person

Information Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

Bachelor's degree in a technical field or equivalent experience.
5 years of experience in program management.
Experience with infrastructure reliability systems.
Background with GPUs or GPU Systems.
Proven cross-functional project management experience preferred.
Strong technical program management skills in software engineering and ML infrastructure.

Responsibilities

Lead the overall development and delivery of AI Infra GPU products from inception to production.
Oversee software qualifications and release strategies for AI hypercompute clusters.
Manage escalations and actively mitigate project risks.
Coordinate with TPMs and ACI leadership on cross-functional AI initiatives.
Develop management software for monitoring Cloud ML solutions.

Benefits

Comprehensive health, dental, vision, life, and disability insurance.
401(k) retirement plan with company match.
20 days of vacation plus additional sick leave.
Generous maternity leave and baby bonding options.
13 paid holidays per year.

Full Job Description

In accordance with Washington state law, we are highlighting our comprehensive benefits package, which is available to all eligible US-based employees. Benefits for this role include:

Health, dental, vision, life, disability insurance
Retirement Benefits: 401(k) with company match
Paid Time Off: 20 days of vacation per year, accruing at a rate of 6.15 hours per pay period for the first five years of employment
Sick Time: 40 hours/year (increased to 69 hours/year for Seattle) including 5 discretionary sick days per instance
Maternity Leave (Short-Term Disability + Baby Bonding): 28-30 weeks
Baby Bonding Leave: 18 weeks
Holidays: 13 paid days per year

Note: By applying to this position you will have an opportunity to share your preferred working location from the following: Sunnyvale, CA, USA; Kirkland, WA, USA.

Minimum qualifications:

Bachelor's degree in a technical field, or equivalent practical experience.
5 years of experience in program management.
Experience with infrastructure reliability.
Experience with GPUs or GPU Systems.

Preferred qualifications:

5 years of experience managing cross-functional or cross-team projects.
5 years of experience in technical program management, with a focus on software engineering and ML infrastructure projects.
Knowledge of software development, distributed systems, and ML infrastructure or GPU systems.
Ability to think critically and solve problems.
Excellent project management skills, and experience with project planning, execution, and risk management.
Excellent communication and collaboration skills, with the ability to build relationships and influence across all levels of the organization.

About the job

A problem isn't truly solved until it's solved for all. That's why Googlers build products that help create opportunities for everyone, whether down the street or across the globe. As a Technical Program Manager at Google, you'll use your technical expertise to lead complex, multi-disciplinary projects from start to finish. You'll work with stakeholders to plan requirements, identify risks, manage project schedules, and communicate clearly with cross-functional partners across the company. You're equally comfortable explaining your team's analyses and recommendations to executives as you are discussing the technical tradeoffs in product development with engineers.

The mission is to empower AI innovation by accelerating the delivery of cloud-based accelerator (GPU) new product introductions (NPIs) built into large-scale supercomputer clusters, including next-gen cross-functional development, customer and vendor partnerships, and ML workload monitoring and diagnostic tooling.

As a GPU Technical Program Manager for Google Cloud's AI and Computing Infrastructure (ACI) team, you will be at the forefront of AI innovation, leading the end-to-end development and delivery of next-generation Cloud GPU products from initial concept to full-scale production. You will take charge of software qualification and release strategies for AI hypercompute clusters, collaborating deeply with engineering, product, and capacity planning teams to align customer and business priorities. Beyond managing critical escalations and mitigating risks, this is a unique opportunity to shape cross-functional initiatives alongside ACI leadership and Technical Program Managers (TPMs) across the broader organization to streamline customer onboarding and scaled support for our largest, most complex Cloud ML solutions.

The ML, Systems, & Cloud AI (MSCA) organization at Google designs, implements, and manages the hardware, software, machine learning, and systems infrastructure for all Google services (Search, YouTube, etc.) and Google Cloud. Our end users are Googlers, Cloud customers and the billions of people who use Google services around the world.

We prioritize security, efficiency, and reliability across everything we do - from developing our latest TPUs to running a global network, while driving towards shaping the future of hyperscale computing. Our global impact spans software and hardware, including Google Cloud's Vertex AI, the leading AI platform for bringing Gemini models to enterprise customers.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $163,000 - $237,000 (USD) + 15% bonus target + equity + benefits

Learn more about benefits at Google.

Responsibilities

Lead the end-to-end development, project planning, and delivery of next-gen AI Infra GPU products from concept to production.
Lead software qualifications, release strategy, and test infrastructure management for AI hypercompute clusters.
Manage escalations and critical incidents while proactively identifying and mitigating risks that could impact project success.
Coordinate with TPMs in AI2 (e.g., ACI, Platforms, and CSCO) and ACI leadership on cross-functional initiatives related to AI Infra customer onboarding and production support.
Participate in the development of core management software, monitoring, and diagnostic tooling for scalable Cloud ML solutions.

About Google

Google is a multinational technology company that specializes in Internet-related services and products. These include online advertising technologies, a search engine, cloud computing, software, and hardware. Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University. The company has grown tremendously since then and has become one of the most valuable companies in the world. Google's mission is to organize the world's information and make it universally accessible and useful.

Size: 156,500 employees
Market Cap: $1,115.4 billion
Industry: Enterprise Technology
Net Income: $40.2 billion
Founded: 1998
5 Year Trend: +23.3%
Revenue: $182.5 billion
NASDAQ: GOOGL

* Ladders Estimates
