About the Role:
The Compute Platform team is responsible for designing and operating the systems that orchestrate and scale batch and distributed workloads across our environments. You will work at the intersection of infrastructure, distributed systems, and developer experience, ensuring that complex workloads are reliable, efficient, and easy to run.
As a Senior Compute Platform Engineer, you will design and operate high-scale batch compute and workflow orchestration systems that support engineers across the company.
Responsibilities:
- Design and operate distributed systems for scheduling and executing large-scale batch workloads across Kubernetes clusters.
- Build and maintain compute platform abstractions.
- Optimize utilization of compute resources.
- Develop and improve multi-tenant scheduling strategies.
- Improve reliability and fault tolerance of large-scale distributed jobs and platform components.
- Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
- Contribute to platform tooling, automation, and CI/CD workflows.
Qualifications:
- 7+ years of experience building and operating distributed systems or infrastructure platforms.
- Strong experience with Kubernetes and container orchestration in production-grade environments.
- Proficiency in Go and Python.
- Experience designing and operating large-scale batch compute systems.
- Strong debugging and problem-solving skills in complex distributed systems.
- Ability to collaborate across teams and communicate technical concepts clearly.
- Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm.