Site Reliability Engineer - GPU in Santa Clara, CA

Industry: Telecommunications & Hardware

8 - 10 years

Posted 7 weeks ago

NVIDIA is looking for a Site Reliability Engineer to work in IPP (Infrastructure, Planning and Process), a global organization within NVIDIA. The group works with other NVIDIA Software groups, such as Graphics Processors, Mobile Processors, Deep Learning, Artificial Intelligence and Driverless Cars, to meet their infrastructure needs. Its cloud services run almost half a million automated jobs per day on thousands of servers, supporting the productivity of thousands of NVIDIA's software engineers worldwide.

The cloud hosts a heterogeneous mix of machines and devices running various operating systems (Windows/Linux/Android) on a multitude of hardware platforms, including both NVIDIA GPUs and Tegra processors.

Are you passionate about infrastructure and looking for complex, relevant problems to solve? Are you ready to build the next generation of cloud services, design creative solutions, and mine through data to uncover real problems and fix them? We are excited to have a fun-loving person like you.

What you'll be doing:


  • Work with NVIDIA hardware: install it in servers and provision it in our cloud.
  • Develop automation for cloud deployments.
  • Lead end-to-end servicing of systems.
  • Maintain the reliability and availability of the hardware.
  • Research new hardware technologies and work with multi-functional teams on their bring-up.

What we need to see:


  • An undergraduate or graduate degree in Computer Science or Software Engineering is required.
  • Strong background in installing and debugging servers, desktops and other engineering hardware.
  • Strong background in Linux administration and scripting.
  • Knowledge of deployment techniques and methodologies.
  • Experience with networking gear and storage elements.
  • 8+ years of experience working in a data center or a software lab.

Ways to stand out from the crowd:


  • Experience with virtualization technologies like VMs or containers.
  • Experience with configuration management tools used for provisioning.
  • Ability to design automation that works well with minimal operational support.