Principal Software Engineer, At-Scale Reliability and Fleet Intelligence - CSP Engagements

NVIDIA Corporation • $272K — $431K *

Santa Clara, CA 95051 • In-Person

Technical Services

11 - 15 years of experience

Today

Job Overview by Ladders

Qualifications

15+ years of experience in systems software at datacenter scale or reliability engineering focused on at-scale challenges
BS or MS in Computer Science, Electrical Engineering, Statistics or equivalent experience
Deep expertise in multi-NUMA and rack-scale system software and firmware, and statistical failure analysis methods
Experience with fleet-level telemetry and observability systems like time-series databases
Understanding of hardware failure modes in large-scale GPU/accelerator deployments
Experience in defining burn-in, stress testing, or certification frameworks for complex hardware systems
Strong communication skills for presenting technical findings to various audiences

Responsibilities

Drive reliability work streams with CSP engineering teams, ensuring shared understanding of MTBI measurement methodology
Gather and synthesize CSP fleet reliability data to identify failure patterns and advocate for improvements
Define a consistent MTBI measurement methodology applicable across CSP operational practices
Conduct fleet-scale failure pattern analysis using statistical methods to classify failures
Drive fleet health monitoring integration architecture to align with CSP workflows
Define and validate burn-in reliability test environments and cluster certification criteria with quality teams
Collaborate with CSPs to ensure completion of reliability integration work before launch
Develop predictive failure models from fleet telemetry and validate their effectiveness in customer environments

Benefits

Eligibility for equity
Access to comprehensive benefits package
Work with leading technology professionals in the industry
Opportunity to drive major impact on reliability at scale
Flexible work environment and schedule

Full Job Description

We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for fleet-scale reliability, working directly with the engineering teams of key CSP / hyperscale customers to ensure NVIDIA platforms achieve their target MTBI (Mean Time Between Interruptions) in production. In this role, you will augment NVIDIA's internal software/firmware and quality teams with a dedicated CSP-facing focus. You will drive work streams with CSP engineering teams to build a shared understanding of reliability software/firmware architecture and methodology, incorporate their fleet telemetry and failure data into NVIDIA's improvement priorities, and validate that reliability improvements measured in the lab translate to real customer environments. Your cross-CSP visibility lets you distinguish systemic architectural gaps from environmental or configuration-specific issues, a distinction that no single customer engagement could make alone.

What you'll be doing:

Drive reliability work streams with CSP engineering teams - ensuring shared understanding of MTBI measurement methodology, failure classification, and health monitoring architecture
Gather and synthesize CSP fleet reliability data - identify failure patterns that appear across multiple customers and champion improvements back into NVIDIA's firmware, driver, and hardware teams
Define a consistent MTBI measurement methodology that works across different CSP monitoring environments and operational practices
Conduct fleet-scale failure pattern analysis using statistical methods (Pareto, survival analysis, Weibull) to classify failures as systemic, environmental, or configuration-specific
Drive fleet health monitoring integration architecture - ensure NVIDIA's health agents, telemetry, and reporting align with CSP operational workflows and automation
Define the burn-in reliability test environment and cluster certification criteria in collaboration with quality teams, validating with customers that the criteria are meaningful
Collaborate with CSPs to ensure reliability-related integration work (health monitoring deployment, telemetry pipeline, alerting configuration) is complete ahead of at-scale launch
Develop predictive failure models using fleet telemetry and validate their effectiveness in customer environments

What we need to see:

15+ years of experience in systems software at datacenter scale, or reliability engineering with a focus on at-scale challenges
BS or MS in Computer Science, Electrical Engineering, Statistics, or related field (or equivalent experience)
Deep expertise in multi-NUMA, rack-scale system software and firmware
Deep expertise in statistical failure analysis methods: MTBF/MTBI calculation, Pareto analysis, root cause classification
Experience with fleet-level telemetry and observability systems: time-series databases, anomaly detection, health scoring, event correlation
Understanding of hardware failure modes in large-scale GPU/accelerator deployments - ability to classify and prioritize across compute, interconnect, memory, power, and thermal domains
Experience defining or operating burn-in, stress testing, or certification frameworks for complex hardware systems
Familiarity with predictive maintenance or anomaly detection approaches applied to fleet health data
Customer obsession - genuine passion for understanding fleet reliability challenges at scale and translating them into actionable engineering priorities
Strong communication - ability to present statistical reliability findings to both deep technical audiences and executive leadership
Demonstrated success driving cross-functional improvements across hardware, firmware, and software teams without direct authority

Ways to stand out from the crowd:

Experience in fleet reliability or hardware health at a leading CSP/hyperscaler
Familiarity with NVIDIA GPU error taxonomy (Xid errors, NVLink error counters, thermal events, CPER records)
Experience building health scoring or predictive failure models for accelerator or HPC infrastructure
Background in defining MTBI/MTBF measurement standards or certification programs for complex multi-component systems
Understanding of how reliability data flows from device firmware through telemetry pipelines to fleet-level dashboards and automated remediation

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 272,000 USD - 431,250 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 30, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers careers and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.

Learn more about NVIDIA Corporation

Size: 22,473 employees
Market Cap: $350.4 billion
Industry: Manufacturing & Automotive
Net Income: $4.3 billion
Founded: 1993
5 Year Trend: +31.3%
Revenue: $16.6 billion
NASDAQ: NVDA

* Ladders Estimates
