NVIDIA Corporation

Senior Systems Software Engineer, Observability and Telemetry Platform

NVIDIA Corporation | $184K - $356K
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • BS degree in Computer Science or a related field, or equivalent experience
  • 8+ years in infrastructure automation and distributed systems design
  • 5+ years delivering foundational infrastructure and observability platforms
  • Proficiency in Python, Go, Perl, or Ruby
  • Deep knowledge of Linux, networking, and containers

Responsibilities

  • Design and maintain the observability and telemetry platform focusing on performance and real-time monitoring
  • Enhance the entire lifecycle of services from design to operational refinement
  • Consult on system design, develop tools/platforms, and conduct launch reviews prior to service deployment
  • Monitor services post-launch for availability, latency, and overall health
  • Implement automation to sustainably scale systems and improve reliability
  • Conduct blameless postmortems and sustainable incident response
  • Participate in on-call rotations for production system support

Benefits

  • Eligible for equity
  • Comprehensive benefits package
  • Opportunities for mentorship and support in career growth
  • Collaborative and inclusive work environment
  • Focus on intellectual curiosity and problem-solving culture

Full Job Description

Site Reliability Engineering (SRE) at NVIDIA is an engineering discipline for designing, building and maintaining large-scale production systems with high efficiency and availability, using a combination of software and systems engineering practices. This is a highly specialized discipline that demands knowledge across different systems, networking, coding, databases, capacity management, continuous delivery and deployment, and open-source cloud-enabling technologies like Kubernetes and OpenStack. SRE at NVIDIA ensures that our internal and external facing GPU cloud services run with maximum reliability and uptime as promised to users, while enabling developers to make changes to existing systems through careful preparation and planning, keeping an eye on capacity, latency and performance. SRE is also a mindset and a set of engineering approaches to running better production systems and optimizations. Much of our software development focuses on eliminating manual work through automation, performance tuning and improving the efficiency of production systems.

Senior Systems Software Engineers (SRE) are responsible for the big picture of how our systems relate to each other; we use a breadth of tools and approaches to tackle a broad spectrum of problems. Practices such as limiting time spent on reactive operational work, blameless postmortems and proactive identification of potential outages factor into the iterative improvement that is key to both product quality and exciting, dynamic day-to-day work. The SRE culture of diversity, intellectual curiosity, problem solving and willingness is important to our success. Our organization brings together people with a wide variety of backgrounds, experiences and perspectives. We encourage them to collaborate, think big and take risks in a blame-free environment. We promote self-direction to work on relevant projects, while we also strive to build an environment that provides the support and mentorship needed to learn and grow.

What you'll be doing:
  • Design, implement and support operational and reliability aspects of a large-scale Observability & Telemetry collection platform, with a focus on performance at scale, real-time monitoring, logging and alerting
  • Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement
  • Support services before they go live through activities such as system design consulting, developing software tools, platforms and frameworks, capacity management and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Practice sustainable incident response and blameless postmortems
  • Be part of an on-call rotation to support production systems


What we need to see:
  • BS degree in Computer Science or a related technical field involving coding (e.g., physics or mathematics), or equivalent experience
  • 8+ years of experience with infrastructure automation and distributed systems design, including designing and developing tools for running large-scale private or public cloud systems in production
  • 5+ years of experience delivering foundational infrastructure and observability platforms
  • Experience in one or more of the following: Python, Go, Perl or Ruby
  • In-depth knowledge of Linux, networking and containers


Ways to stand out from the crowd:
  • Interest in crafting, analyzing and fixing large-scale distributed systems
  • Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive. Ability to debug and optimize code and automate routine tasks
  • Experience in using or running large private and public cloud systems based on Kubernetes, OpenStack and Docker. Experience running Grafana, OpenTelemetry, Prometheus, and similar observability-focused tools


Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 184,000 USD - 287,500 USD for Level 4, and 224,000 USD - 356,500 USD for Level 5.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until June 28, 2026.

This posting is for an existing vacancy.

NVIDIA uses AI tools in its recruiting processes.

About NVIDIA Corporation

Nvidia, a global leader in graphics, gaming, and AI technology, offers careers and internship opportunities for those passionate about driving innovation in the tech industry. You'll find a company committed to growth, teamwork, and leadership in the computer science and machine learning domains.

About Nvidia

A Pioneer in Technology and Innovation

Nvidia has cemented its reputation as a powerhouse in developing advanced graphics processing units (GPUs) and has significantly contributed to the gaming industry's evolution. Moreover, its foray into AI and machine learning has opened new frontiers in technology, making Nvidia a beacon of innovation and a desirable workplace for ambitious tech professionals.

Job Opportunities

Diverse Positions in a Dynamic Field

Nvidia is continuously on the lookout for talented individuals across various domains, including hardware and software engineering, product design, marketing, and sales. Employment opportunities at Nvidia are vast, catering to a wide range of expertise and career aspirations.

Employment in Hardware and Graphics

For those fascinated by the intricacies of hardware and graphics technology, Nvidia offers positions that sit at the forefront of gaming and computing advancements.

Growth in Machine Learning and AI

Nvidia's leadership in AI and machine learning has created numerous vacancies for specialists eager to contribute to groundbreaking projects.

Recruitment in Computer Science

With the constant demand for innovation, Nvidia's recruitment efforts focus on computer science experts capable of pushing the boundaries of what's possible.

Internship Program

Opening Doors to Future Innovators

Nvidia's internship program is designed to nurture the next generation of technology leaders, offering hands-on experience in a culture that celebrates creativity and teamwork.

Benefits and Culture

Interns at Nvidia enjoy a plethora of benefits, from competitive stipends to mentorship opportunities, all within an environment that values growth and learning.

Opportunities for Students

Whether you're an undergraduate, a master's student, or a Ph.D. candidate, Nvidia's internships provide a real-world glimpse into the tech industry, offering valuable experience in various technology fields.

Pathways to Full-Time Employment

Many interns have transitioned into full-time positions, marking the start of successful careers at Nvidia. The internship program is more than a stepping stone into the company; it’s an investment in the professional development of interns. The goal is to ensure that interns are well-equipped for future challenges.

Nvidia Careers: More Than Just a Job

Nvidia offers more than just a job to its employees; it provides a front-row seat on the journey into the future of technology. Nvidia stands as a pillar of innovation with its vast opportunities in hardware, graphics, gaming, machine learning, and computer science. Nvidia careers serve as a launching pad for talented workers who aim to redefine the technological landscape. Whether through full-time positions or internships, joining Nvidia means contributing to a legacy of breakthroughs and becoming part of a global community dedicated to pushing the boundaries of what's possible.

Learn more about NVIDIA Corporation

  • Size: 22,473 employees
  • Market Cap: $350.4 billion
  • Net Income: $4.3 billion
  • Founded: 1993
  • 5 Year Trend: +31.3%
  • Revenue: $16.6 billion
  • Stock exchange: NASDAQ
