Senior Compute Platform Engineer

Stack AV

$120K — $160K *
US-Anywhere
Remote in United States
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 7+ years of experience building and operating distributed systems or infrastructure platforms.
  • Strong experience with Kubernetes and container orchestration in production-grade environments.
  • Proficiency developing in Golang and Python.
  • Experience designing and operating large-scale batch compute systems.
  • Strong debugging and problem-solving skills in complex distributed systems.
  • Ability to collaborate across teams and communicate technical concepts clearly.
  • Experience with at least one batch scheduling system such as Kueue, Armada, Volcano, or Slurm.

Responsibilities

  • Design and operate distributed systems for scheduling and executing large-scale batch workloads across Kubernetes clusters.
  • Build and maintain compute platform abstractions.
  • Optimize utilization of compute resources.
  • Develop and improve multi-tenant scheduling strategies.
  • Improve reliability and fault tolerance of large-scale distributed jobs and platform components.
  • Collaborate with teams across the company to understand workload requirements and improve platform capabilities.
  • Contribute to platform tooling, automation, and CI/CD workflows.

Benefits

  • Comprehensive health insurance plans.
  • Generous retirement savings options.
  • Flexible work environment promoting work-life balance.
  • Opportunities for professional development and continuing education.
  • Access to advanced tools and technologies for personal growth.

Full Job Description

About the Role:

The Compute Platform team is responsible for designing and operating the systems that orchestrate and scale batch and distributed workloads across our environments. You will work at the intersection of infrastructure, distributed systems, and developer experience, ensuring that complex workloads are reliable, efficient, and easy to run.

As a Senior Compute Platform Engineer, you will design and operate the high-scale batch compute and workflow orchestration systems that engineers across the company rely on.
