NVIDIA Corporation

Senior System Architect, Infrastructure Reliability

$184K - $356K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 6+ years of systems programming experience with a BS, MS, or PhD in Computer Science or Electrical Engineering.
  • Proven experience in building automated Root Cause Analysis (RCA) pipelines for HPC or cloud-scale setups.
  • In-depth understanding of CPU architecture metrics, including IPC, cache contention, and NUMA.
  • Expertise in C++ and Python, with a focus on performance monitoring tools.
  • Familiarity with cluster resource managers like Slurm, LSF, or Kubernetes for job management.

Responsibilities

  • Architect scalable failure attribution frameworks for EDA jobs.
  • Develop automated diagnostics for correlating hardware and software failures.
  • Implement low-overhead logging and tracing across multi-node clusters.
  • Create machine learning models to automate the classification of job failures.
  • Collaborate with teams to establish signals for anticipating failures.

Benefits

  • Eligibility for equity compensation and benefits.

Full Job Description

NVIDIA is seeking a Senior System Architect: Heterogeneous EDA Systems to solve a complex challenge in accelerated computing: Failure Attribution at Scale. As EDA workloads scale across thousands of heterogeneous nodes, a single failure can cause massive resource waste. We need an engineer to design and build an automated framework that ingests telemetry from CPU and GPU clusters to identify the root cause of job failures in real time, distinguishing between hardware faults, infrastructure instability, and software defects.


What you'll be doing:

  • Architect Failure Attribution Frameworks: Build a scalable "flight recorder" for EDA jobs that captures high-fidelity state across the CPU, GPU, and Fabric at the moment of failure.

  • Build automated diagnostics that correlate GPU XID errors, PCIe bus failures, and CUDA memory exceptions with system-level events such as OOM kills or NUMA-related hangs.

  • Distributed Logging & Tracing: Implement low-overhead tracing mechanisms (using tracing tools or custom agents) that provide visibility into job execution across multi-node Slurm or Kubernetes clusters.

  • Root Cause Automation: Develop heuristics and machine learning models that classify failures as "Hardware Fault," "Software Bug," or "Environment Issue," reducing the Mean Time to Identify (MTTI) for R&D teams.

  • Resiliency Engineering: Work closely with hardware and infrastructure teams to define "signals of impending failure," enabling proactive job migration or checkpointing before a crash occurs.


What we need to see:

  • Distributed Systems Mastery: BS, MS, or PhD in Computer Science or Electrical Engineering (or equivalent experience) with 6+ years in systems programming.

  • Experience building automated RCA (Root Cause Analysis) pipelines for HPC or cloud-scale environments.

  • CPU Architecture Deep-Dive: Expert knowledge of x86/ARM node-level metrics: IPC (Instructions Per Cycle), cache contention, NUMA imbalance, and hardware interrupts.

  • Programming Proficiency: Strong C++ and Python skills, with the ability to build high-performance daemons that monitor system health without impacting workload performance.

  • Scale Experience: Familiarity with cluster resource managers (Slurm, LSF, or Kubernetes) and how they manage job lifecycle and signal propagation.


Ways To Stand Out From The Crowd:

  • Low-Level Diagnostics: Expert knowledge of the Linux kernel and its error-reporting interfaces (/dev/mcelog, dmesg, journald), including how the kernel handles hardware exceptions and memory faults.

  • GPU Infrastructure Proficiency: Deep experience with the NVIDIA DCGM (Data Center GPU Manager) and NVIDIA Management Library (NVML) for monitoring device health and capturing state-dumps.

  • Experience with tools for non-intrusive monitoring of application health and syscall-level failure patterns.

  • Experience with checkpoint/restore technologies (like CRIU) and their application in long-running EDA flows.


#LI-Hybrid 

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until March 1, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers career and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in the computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.
Size: 22,473 employees
Market Cap: $350.4 billion
Net Income: $4.3 billion
Founded: 1993
5 Year Trend: +31.3%
Revenue: $16.6 billion
Exchange: NASDAQ
