As a Systems Engineer on the Datacenter Engineering team, you will be responsible for maintaining and further developing the tools we use to monitor and deploy datacenter infrastructure on a global scale. Our team performs all the on- and off-premises datacenter work that supports the production and engineering efforts that make Tesla a world leader in self-driving EVs, energy storage, and solar power technology, including building some of the world's fastest supercomputers. Continuous deployment, monitoring, reporting, maintenance, improvement, and rapid turnaround on service requests from all over the organization are imperative to a successful production environment in the datacenter.
You’ll be the highly engaged, hands-on representative to the rest of the organization for a closely integrated, cross-functional, and versatile team that builds and operates all Tesla datacenter resources globally. As the need for data and compute keeps growing, both locally and in remote locations, datacenter engineering needs to follow suit and scale through more automated processes for deployment, monitoring, and alerting. You will be responsible for improving the precision of production system deployments by leveraging your solid background in systems and automation and the combined resources your team provides.
- Work with engineering teams to identify useful metrics to collect, and implement monitoring and alerting for them with existing monitoring solutions at the datacenter level.
- Ensure all datacenter resources are properly monitored and accessible, and that automated alerts are raised when things go wrong.
- Consolidate, maintain, and improve the monitoring dashboards that give an overview of datacenter status, using the APIs of the tools that gather the data.
- Respond to and document submitted support tickets relating to the functionality of various systems present in the datacenter.
- Develop automated tools and workflows that collect information users can apply directly to root cause analysis of reported issues.
- Other tasks as assigned
- MS in Computer Science, Electrical Engineering, or a related field, or a Bachelor’s degree with 5 additional years of equivalent experience
- 5+ years of experience with:
  - Computer deployment and operations (CPU / GPU)
  - Networking infrastructure deployment and operation
  - Linux operating system flavors (CentOS/RHEL, Ubuntu)
  - Systems monitoring and alerting (Ganglia, Telegraf, Splunk, etc.)
  - Programming and scripting (Python, Bash)
- 3+ years of experience with:
  - Storage systems (on-prem and/or in-cloud)
  - DCIM-type software for monitoring, alerting, and automation
- Working knowledge of datacenter, network, and compute deployments at scale
- Working knowledge of SNMP, rPDUs, UPS systems, BMS systems, etc.
- Demonstrable programming and/or scripting skills with Python and Bash
- Excellent time management and communication skills are absolute musts
- Ability to step up and take ownership to bring complex tasks to completion
Nice to have:
- Experience with multi-site hybrid (on-prem and in-cloud) software and hardware deployments
- Experience with automating workflows in JIRA (or similar tools)