NVIDIA Corporation

Senior Reliability Engineer, DGX Cloud

NVIDIA Corporation, $168K - $333K *
Enterprise Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years of industry experience in reliability engineering or related fields.
  • B.S. or M.S. degree, or equivalent experience in operating large-scale systems.
  • Deep, hands-on experience with large-scale production systems.
  • Proven experience in establishing and maintaining SLO programs.
  • Strong software engineering skills in Go, Python, or similar languages.
  • Understanding of failure modes in complex systems, including cascading failures.
  • Ability to influence teams through expertise and credibility.

Responsibilities

  • Build an organization-wide reliability strategy for a 24/7 operational environment.
  • Establish and maintain a rigorous SLO program across teams.
  • Lead incident response for high-severity incidents, ensuring effective resolution.
  • Enhance our data platform and related tooling through production code improvements.
  • Implement chaos engineering and resilience testing as standard practices.
  • Set the standard for operational excellence through hands-on leadership.

Benefits

  • Comprehensive benefits package for you and your family.
  • Eligibility for equity in the company.
  • Supportive work environment recognized in the tech industry.
  • Opportunity to work at the forefront of operational excellence.

Full Job Description

Are you passionate about building world-class reliability systems? Join NVIDIA as a Sr. Reliability Engineer, DGX Cloud, and be a pivotal part of a team that is redefining how DGX Cloud approaches reliability. It is an outstanding opportunity to develop strategy and drive innovation. We're looking for a seasoned engineer with experience running large-scale systems and a deep understanding of operational practices.

What you'll be doing:
  • Build org-wide reliability strategy, guiding how NVIDIA matures its operational practices in a 24/7 environment.
  • Stand up a rigorous SLO program, defining and maintaining high standards across teams.
  • Lead incident response for high-severity incidents, ensuring low-drama, high-signal resolution.
  • Build and improve production code daily, enhancing our data platform and related tooling.
  • Implement chaos engineering, failure injection, and resilience testing to elevate our team's standard practices.
  • Improve standards by setting an example with your hands-on experience and leadership.


What we need to see:
  • Deep, hands-on experience running large-scale production systems with a proven track record.
  • A detailed understanding of failure modes in large systems, including cascading dependencies and retry storms.
  • Strong software engineering skills with current, hands-on experience in Go, Python, or similar languages.
  • Proven experience in establishing and maintaining an SLO program with operational rigor.
  • Practical experience in reliability fields such as chaos engineering and failure injection.
  • The ability to influence across team boundaries through credibility and expertise.
  • 10+ years of industry experience with a Bachelor's or Master's degree, or equivalent experience operating systems at scale.


Ways to stand out from the crowd:
  • Experience within a world-class reliability function like Google SRE or Meta production engineering.
  • Expertise in operating GPU, HPC, or AI training infrastructure and its distinctive failure modes.
  • A track record of measurable reliability improvements within an organization.
  • Proficiency with modern observability and operational tools like Prometheus, OpenTelemetry, Grafana, PagerDuty, and Rootly.


Widely considered to be one of the technology world's most desirable employers, NVIDIA offers highly competitive salaries and a comprehensive benefits package. As you plan your future, see what we can offer to you and your family at www.nvidiabenefits.com/

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 168,000 USD - 270,250 USD for Level 4, and 208,000 USD - 333,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 26, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers careers and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in the computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.

Learn more about NVIDIA Corporation

Size: 22,473 employees
Market Cap: $350.4 billion
Industry:
Net Income: $4.3 billion
Founded: 1993
5 Year Trend: +31.3%
Revenue: $16.6 billion
NASDAQ
