NVIDIA Corporation

Director, Engineering Operations and Site Reliability Engineering - Datacenter Server Systems

NVIDIA Corporation | $292K - $442K *
Information Technology
11 - 15 years of experience
Job Overview by Ladders

Qualifications

  • BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
  • 12+ years of experience in infrastructure, systems engineering, or datacenter operations
  • 7+ years of management experience in technical teams
  • Strong understanding of server systems, Linux, and large-scale infrastructure
  • Proven ability to drive collaboration across diverse technical teams

Responsibilities

  • Lead teams to maintain the reliability of NVIDIA's rack-scale server systems and clusters
  • Drive execution in fleet operations, including incident response and change management
  • Build automation tools and dashboards to improve system visibility and issue resolution
  • Collaborate with cross-functional teams for the deployment and debugging of complex systems
  • Establish feedback mechanisms to enhance product quality and development speed
  • Mentor and develop a high-performing technical team focused on ownership and automation

Benefits

  • Eligibility for equity opportunities
  • Comprehensive benefits package
  • Opportunities for continuous learning and professional development
  • Exposure to cutting-edge AI technologies
  • Work in a dynamic, innovative environment
Full Job Description

NVIDIA is seeking a strong technology leader to lead Engineering Operations and Site Reliability Engineering for our next-generation datacenter server systems. This role sits at the intersection of execution, reliability, automation, and large-scale system operations, where we keep NVIDIA's rack-scale systems healthy, observable, and highly available for internal engineering users. These systems bring together the full power of NVIDIA CPUs, GPUs, NVLink, InfiniBand/Spectrum-X networking, cluster management technologies, and our optimized AI/HPC software stack. We enable fast product development by ensuring large internal racks, clusters, and lab infrastructure are reliable, well-instrumented, and operated with scalable engineering practices.

This is a technical leadership role focused on execution excellence for large-scale internal datacenter systems. The ideal candidate has strong engineering judgment, experience operating complex distributed infrastructure, and the ability to build teams that combine focused operations with automation-first software engineering.

What you will be doing:

  • Lead teams that help us ensure NVIDIA's internal rack-scale server systems, clusters, and lab facilities remain available, healthy, and reliable.
  • Drive execution across fleet operations, incident response, roadmap planning, change management, operational readiness, and reliability metrics.
  • Build automation, telemetry, alerting, and dashboards that improve visibility and help teams resolve issues faster.
  • Partner with hardware, firmware, software, networking, validation, and infrastructure teams to deploy, sustain, and debug complex systems.
  • Create feedback loops into NPI and sustaining teams to improve product quality, serviceability, and development velocity.
  • Grow and mentor a high-performing technical team with a culture of ownership, learning, and automation-first execution.

What we need to see:

  • BS or MS in Computer Science, Electrical Engineering, Computer Engineering, or related field (or equivalent experience).
  • 12+ years of overall experience in infrastructure, systems engineering, reliability, datacenter operations, distributed systems, or related areas, including 7+ years of people management experience.
  • Strong understanding of server systems, Linux, cluster operations, high-speed networking, and large-scale infrastructure.
  • Experience operating complex systems with high availability expectations, including monitoring, incident management, automation, and fleet-health practices.
  • Proven track record of driving execution across multiple teams, priorities, and technical domains, including close partnership with hardware, firmware, software, networking, validation, and infrastructure organizations.
  • Clear written and verbal communication skills, including executive-level reporting on operational health, risks, and priorities.
  • Track record of building cohesive teams and developing technical leaders who improve reliability and execution.

Ways to stand out from the crowd:

  • Prior Director or Senior Manager experience leading infrastructure, reliability, platform engineering, or large-scale lab operations teams.
  • Experience operating GPU, AI, HPC, cloud, or hyperscale datacenter infrastructure.
  • Broad knowledge of rack-scale systems, including server management, networking, storage, power, thermal, and RAS concepts.
  • Experience building automation, telemetry, fleet health, or dashboarding systems that improve product quality, serviceability, or engineering velocity.

Do you enjoy making complex AI infrastructure reliable at scale while enabling engineering teams to move faster? Come join our datacenter server systems team and help build the reliable, token-efficient computing platforms driving NVIDIA's success in this exciting and rapidly growing field.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 292,000 USD - 442,750 USD. You will also be eligible for equity and benefits. Applications for this job will be accepted at least until June 30, 2026. This posting is for an existing vacancy. NVIDIA uses AI tools in its recruiting processes.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers careers and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in the computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.
Learn more about NVIDIA Corporation
Size: 22,473 employees
Market Cap: $350.4 billion
Industry:
Net Income: $4.3 billion
Founded: 1993
5 Year Trend: +31.3%
Revenue: $16.6 billion
Exchange: NASDAQ
