Peloton is looking for a Site Reliability Engineer with a focus on Kubernetes operations to work with teams across the organization to help build and maintain a monitorable, performant, reliable, and highly scalable deployment platform. We are a growing team of engineers tackling the challenges of scaling Kubernetes to handle thousands of nodes and pods spread across many deployments.
The Kubernetes working group at Peloton works closely with development teams to ensure that the platform is robust and stable, and that it delivers the following:
- Automatic, fast autoscaling for live rides and special large events
- Hosting for critical infrastructure that ensures our members have the best experience possible, on tens of thousands of pods across multiple clusters
- A platform for machine learning (and other awesome workloads) so that we can be at the forefront of the industry
- Room for developers to move quickly and experiment, without getting in the way
What You'll Be Doing:
- Evangelize best practices for building and operating highly reliable systems
- Serve as subject matter expert in observability and monitoring
- Consult in system design to meet reliability and capacity requirements
- Automate everything, from infrastructure down to day-to-day tasks
- Conduct timely post-mortems of infrastructure incidents
- Assist with all aspects of operational security and compliance
- Seek out potential threats to security and reliability and advocate solutions
- Work with Amazon Web Services, Chef, Python, Ubuntu, Nginx, Jenkins, and Terraform
What We’re Looking For:
- Experience maintaining scalable and stable Kubernetes clusters.
- Knowledge of best practices for the observability and monitoring required to run Kubernetes at scale.
- Knowledge of best practices for securing a Kubernetes cluster and its deployments at scale.
- A passion for helping development teams make the transition to a container-native world.
- Experience with CI/CD systems such as Jenkins, ArgoCD, Harness, or Tekton.
- Experience deploying infrastructure using Infrastructure as Code tools such as Terraform or Pulumi.
- Judgment about when to triage and when to dive into a root-cause analysis.
- Passion for reliable, scalable, observable software, with a strong sense of ownership.
- Experience with a programming language such as Python, Golang, Java, or C.