NVIDIA Corporation

Senior AI Infrastructure Engineer - DGX Cloud

$152K - $287K *
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • BS degree in Computer Science or related technical field (or equivalent experience)
  • 5+ years of experience in a related role
  • Proven ability to initiate and collaborate on projects
  • Experience in infrastructure automation and distributed systems architecture
  • Proficient in one or more programming languages (Python, Go, C/C++, Java)
  • Strong understanding of Linux, Networking, Storage, and Containers
  • Familiarity with Public Cloud, Infrastructure as Code (IaC), and Terraform

Responsibilities

  • Design, build, deploy, and maintain internal tooling for AI training and inference platforms
  • Conduct performance analysis on multi-GPU and multi-node clusters
  • Manage the full lifecycle of services from inception to refinement
  • Provide system design consulting and develop software tools for capacity management
  • Monitor and maintain system health, availability, and latency once services are live
  • Scale and evolve systems through automation and reliability improvements
  • Participate in on-call rotation and sustainable incident response

Benefits

  • Equity opportunities
  • Comprehensive health benefits
  • Flexible work arrangements
  • Access to cutting-edge technology and resources
  • Continuous learning and development opportunities

Full Job Description

NVIDIA is looking for an outstanding, passionate, and dedicated Senior AI Infrastructure Engineer to join our DGX Cloud group. This engineering role will design, build and maintain large-scale production systems with high efficiency and availability using a combination of software and systems engineering practices. This role demands knowledge across different systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud-enabling technologies like Kubernetes and OpenStack.

The DGX Cloud SRE at NVIDIA ensures our GPU cloud services deliver maximum reliability and uptime. They carefully prepare and plan changes to the system. They also manage capacity and performance.

What You'll Be Doing:
  • Design, build, deploy, and run internal tooling for a large-scale AI training and inference platform built on top of cloud infrastructure.
  • Conduct in-depth performance characterization and analysis on large multi-GPU and multi-node clusters.
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation, and refinement.
  • Support services before they become available through activities such as system design consulting, developing software tools, platforms, and frameworks, capacity management, and launch reviews.
  • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
  • Practice sustainable incident response and blameless postmortems.
  • Be part of an on-call rotation to support production systems.


What We Need To See:
  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience.
  • 5+ years of experience.
  • A track record showing a good balance between initiating your own projects, convincing others to collaborate with you, and collaborating well on projects initiated by others.
  • Background in infrastructure automation and distributed systems architecture focused on building tools to manage large-scale private or public cloud platforms in production.
  • Experience working with one or more of the following languages: Python, Go, C/C++, Java.
  • Comprehensive understanding of one or more of Linux, networking, storage, and container technologies.
  • Experience with public cloud, Infrastructure as Code (IaC), and Terraform.
  • Distributed systems experience.


Ways to Stand Out from the Crowd:
  • Interest in crafting, analyzing, and fixing large-scale distributed systems.
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
  • Capability to identify issues and improve code performance while automating routine tasks.
  • Experience in operating or handling large private and public cloud systems based on Kubernetes or Slurm.


Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 8, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers careers and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in the computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.
Learn more about NVIDIA Corporation
Size: 22,473 employees
Market Cap: $350.4 billion
Industry:
Net Income: $4.3 billion
Founded: 1993
5 Year Trend: +31.3%
Revenue: $16.6 billion
NASDAQ
