NVIDIA Corporation

Senior ML Platform Engineer

NVIDIA Corporation: $152K - $287K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
  • Strong proficiency in Infrastructure-as-Code tools (Ansible, Terraform) for managing production infrastructure.
  • SRE mindset with experience in diagnosing system-level issues and platform reliability.
  • Solid understanding of ML workflows from data preprocessing to deployment.
  • Proficiency in operating containerized workloads with Kubernetes and Docker.
  • Strong software engineering skills in Python or Go, focusing on automation and production-grade code.

Responsibilities

  • Design, build, and maintain core ML platform infrastructure as code using Ansible and Terraform.
  • Apply SRE principles to troubleshoot and resolve system issues for AI workloads.
  • Develop internal automation and tooling for ML workflow orchestration and resource scheduling.
  • Collaborate with ML researchers to streamline their experimentation needs.
  • Evolve and operate multi-cloud and hybrid environments with appropriate monitoring.
  • Participate in on-call rotation, supporting platform services and conducting root cause analysis.
  • Write maintainable code to contribute to the orchestration platform and automate processes.

Benefits

  • Eligible for equity and benefits.
  • Diverse work environment committed to equal opportunity.
  • Opportunity to work with cutting-edge technology in AI and ML.

Full Job Description

In this role, you will architect, build, and scale our high-performance ML infrastructure using modern Infrastructure-as-Code practices. Your primary focus will be on creating reliable, automated platforms that empower scientists and engineers to train and deploy the most advanced ML models on some of the world's most powerful GPU systems. Join our top team and apply your SRE and software engineering skills to craft robust, user-friendly platforms for seamless ML development.

What You'll Be Doing:
  • Design, build, and maintain our core ML platform infrastructure as code, primarily using Ansible and Terraform, ensuring reproducibility and scalability across large-scale, distributed GPU clusters.
  • Apply SRE principles to diagnose, troubleshoot, and resolve complex system issues across the entire stack, ensuring high availability and performance for critical AI workloads.
  • Develop robust internal automation and tooling for ML workflow orchestration, resource scheduling, and platform operations, with a strong focus on software engineering best practices.
  • Collaborate with ML researchers and applied scientists to understand infrastructure needs and build solutions that streamline their end-to-end experimentation.
  • Evolve and operate our multi-cloud and hybrid (on-prem + cloud) environments, implementing monitoring, alerting, and incident response protocols.
  • Participate in on-call rotation to provide support for platform services and infrastructure running critical ML jobs, driving root cause analysis and implementing preventative measures.
  • Write high-quality, maintainable code (Python, Go) to contribute to the core orchestration platform and automate manual processes.
  • Drive the adoption of modern GPU technologies (e.g., GB200, NVLink) and ensure smooth integration of next-generation hardware into ML pipelines.


What We Need To See:
  • BS/MS in Computer Science, Engineering, or equivalent experience.
  • 5+ years in software/platform engineering or SRE roles, including 3+ years focused on ML infrastructure or distributed compute systems.
  • Strong proficiency in Infrastructure-as-Code (IaC) tools, specifically Ansible and Terraform, with a proven track record of building and managing production infrastructure.
  • SRE-oriented mindset with extensive experience in diagnosing system-level issues, performance tuning, and ensuring platform reliability.
  • Solid understanding of ML workflows and lifecycle, from data preprocessing to deployment.
  • Proficiency in operating containerized workloads with Kubernetes and Docker.
  • Strong software engineering skills in languages such as Python or Go, with a focus on automation, tooling, and writing production-grade code.
  • Experience with Linux systems internals, networking, and performance tuning at scale.


Ways To Stand Out From The Crowd:
  • Experience building or operating ML platforms supporting frameworks like PyTorch or TensorFlow at scale.
  • Deep understanding of distributed training techniques (e.g., data/model parallelism, Horovod, NCCL).
  • Expertise with modern CI/CD methodologies and GitOps practices.
  • Passion for building developer-centric platforms with great UX and strong operational reliability.
  • Proven ability to contribute code to complex orchestration or automation platforms.


Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 152,000 USD - 241,500 USD for Level 3, and 184,000 USD - 287,500 USD for Level 4.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 9, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers careers and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.

Learn more about NVIDIA Corporation

  • Size: 22,473 employees
  • Market Cap: $350.4 billion
  • Industry:
  • Net Income: $4.3 billion
  • Founded: 1993
  • 5 Year Trend: +31.3%
  • Revenue: $16.6 billion
  • NASDAQ
