CloudKitchens

Staff Machine Learning Infrastructure Engineer

CloudKitchens: $224K - $280K per year
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 8+ years of professional software engineering experience
  • Strong backend systems programming skills in Go, Python, or Java (Rust a plus)
  • Proficient with Kubernetes for cloud-agnostic environments
  • Experience with distributed ML compute frameworks like Ray
  • Hands-on experience with MLOps pipelines and model registries, e.g. MLflow
  • Experience managing high-throughput data pipelines with distributed data engines

Responsibilities

  • Design and implement machine learning infrastructure for large-scale distributed GPU training
  • Leverage distributed compute frameworks for managing concurrent ML training jobs
  • Integrate model management and experiment tracking tools for deep observability
  • Build and optimize data ingestion pipelines for petabyte-scale vehicle logs
  • Architect infrastructure for model validation and continuous integration testing
  • Collaborate with robotics engineers and ML researchers to streamline workflows

Benefits

  • Medical, Dental, Vision, Disability, and Life Insurance
  • Flexible Spending Account / Health Savings Account options
  • 401(k) plan
  • Equity options
  • Unlimited flexible time off and paid holidays
  • Paid parental leave
  • Pre-tax commuter benefit plan
  • Team lunches twice a week
Full Job Description
What you'll do

We are seeking a foundational Machine Learning Infrastructure Engineer to build the large-scale ML training infrastructure that powers our next-generation autonomous transport models. In this role, you will design the high-performance training pipelines and validation environments that enable our world-class robotics and ML researchers to iterate rapidly. You will own the challenge of scaling distributed GPU workloads to support a high volume of concurrent training runs across an expanding vehicle fleet. That means building a platform that can flexibly run on whatever GPU capacity is available, regardless of provider or environment, and in doing so directly accelerating innovation across the platform.
  • Training Infrastructure: Design, implement, and scale repeatable machine learning infrastructure utilizing Kubernetes to support large-scale distributed GPU training of novel neural networks.
  • Distributed Computing & Orchestration: Leverage distributed compute frameworks to efficiently manage and execute a high volume of complex ML training jobs concurrently across large GPU clusters.
  • Experiment Tracking & MLOps: Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance.
  • Data Engineering Pipelines: Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments.
  • Validation at Scale: Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free.
  • Cross-Functional Collaboration: Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle.


What we're looking for
  • 8+ years of professional software engineering experience.
  • Strong backend systems programming skills with proficiency in Go, Python, Java or similar (with familiarity or exposure to Rust considered a plus).
  • Proficiency with Kubernetes for container orchestration and building cloud-agnostic environments from scratch.
  • Experience implementing distributed ML compute frameworks (e.g., Ray) to coordinate large pools of GPUs for heavy, multi-node workloads.
  • Hands-on experience building MLOps pipelines, metadata tracking architectures, and model registries using platforms like MLflow.
  • Prior experience managing high-throughput data pipelines using modern distributed data engines to feed data-hungry neural network architectures.


What else you need to know

This role is based in our San Francisco office. Atoms is a company driven by invention and continuous change - we are constantly reimagining our industries, building new products, and refining how we operate. We do our best work together. That's why all of our office-based teams work onsite, five days a week.

The base salary range for this role is $224,000 - $280,000 per year.

Actual compensation will be determined on an individual basis and may vary depending on experience, skills, and qualifications.

Base salary is just one part of your total rewards package. You may also be eligible for equity awards and an annual performance-based bonus.

Benefits Summary (USA Full-Time Exempt Employees):
  • Medical, Dental, Vision, Disability, and Life Insurance
  • Flexible Spending Account / Health Savings Account Options
  • 401(k)
  • Equity
  • Sick Time, Unlimited Flexible Time Off, and Paid Holidays
  • Paid Parental Leave
  • Pre-Tax Commuter Benefit Plan
  • Team lunch in our SoMa office every Tuesday and Thursday

Benefits are subject to change at the company's discretion.
Atoms accepts applications on an ongoing basis.

Ready to join us as we serve those who serve others?


About CloudKitchens

CloudKitchens is a technology company that provides a platform for restaurants to operate delivery-only kitchens. The company's platform allows restaurants to expand their delivery reach without the need for additional physical locations, while also providing real-time data and analytics to optimize operations. CloudKitchens was founded in 2016 by Travis Kalanick, the co-founder of Uber, and is headquartered in Los Angeles, California.
Size: 1,000 employees
Founded: 2016
