Staff Machine Learning Infrastructure Engineer

CloudKitchens • $224K — $280K *

San Francisco, CA 94112 • In-Person

Information Technology

8 - 10 years of experience

Today

Job Overview by Ladders

Qualifications

8+ years of professional software engineering experience
Strong backend systems programming skills in Go, Python, or Java (Rust a plus)
Proficient with Kubernetes for cloud-agnostic environments
Experience with distributed ML compute frameworks like Ray
Hands-on experience with MLOps pipelines and model registries, e.g. MLflow
Experience managing high-throughput data pipelines with distributed data engines

Responsibilities

Design and implement machine learning infrastructure for large-scale distributed GPU training
Leverage distributed compute frameworks for managing concurrent ML training jobs
Integrate model management and experiment tracking tools for deep observability
Build and optimize data ingestion pipelines for petabyte-scale vehicle logs
Architect infrastructure for model validation and continuous integration testing
Collaborate with robotics engineers and ML researchers to streamline workflows

Benefits

Medical, Dental, Vision, Disability, and Life Insurance
Flexible Spending Account / Health Savings Account options
401(k) plan
Equity options
Unlimited flexible time off and paid holidays
Paid parental leave
Pre-tax commuter benefit plan
Team lunches twice a week

Full Job Description

What you'll do

We are seeking a foundational Machine Learning Infrastructure Engineer to design and build the large-scale ML training infrastructure that powers our next-generation autonomous transport models. In this role, you will design the high-performance training pipelines and validation environments that enable our world-class robotics and ML researchers to iterate rapidly. You will own the challenge of scaling distributed GPU workloads to support a high volume of concurrent training runs across an expanding vehicle fleet. This means building a platform that can flexibly run on whatever GPU capacity is available, regardless of provider or environment, directly accelerating innovation across the platform.

Training Infrastructure: Design, implement, and scale repeatable machine learning infrastructure utilizing Kubernetes to support large-scale distributed GPU training of novel neural networks.
Distributed Computing & Orchestration: Leverage distributed compute frameworks to efficiently manage and execute a high volume of complex ML training jobs concurrently across large GPU clusters.
Experiment Tracking & MLOps: Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance.
Data Engineering Pipelines: Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments.
Validation at Scale: Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free.
Cross-Functional Collaboration: Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle.

What we're looking for

8+ years of professional software engineering experience.
Strong backend systems programming skills, with proficiency in Go, Python, Java, or a similar language (familiarity with Rust is a plus).
Proficiency with Kubernetes for container orchestration and building cloud-agnostic environments from scratch.
Experience implementing distributed ML compute frameworks (e.g., Ray) to coordinate large pools of GPUs for heavy, multi-node workloads.
Hands-on experience building MLOps pipelines, metadata tracking architectures, and model registries using platforms like MLflow.
Prior experience managing high-throughput data pipelines using modern distributed data engines to feed data-hungry neural network architectures.

What else you need to know

This role is based in our San Francisco office. Atoms is a company driven by invention and continuous change: we are constantly reimagining our industries, building new products, and refining how we operate. We do our best work together. That's why all of our office-based teams work onsite, five days a week.

The base salary range for this role is $224,000 - $280,000 per year.

Actual compensation will be determined on an individual basis and may vary depending on experience, skills, and qualifications.

Base salary is just one part of your total rewards package. You may also be eligible for equity awards and an annual performance-based bonus.

Benefits Summary (USA Full-Time Exempt Employees):

Medical, Dental, Vision, Disability, and Life Insurance
Flexible Spending Account / Health Savings Account Options
401(k)
Equity
Sick Time, Unlimited Flexible Time Off, and Paid Holidays
Paid Parental Leave
Pre-Tax Commuter Benefit Plan
Team lunch in our SoMa office every Tuesday and Thursday

Benefits are subject to change at the company's discretion.
Atoms accepts applications on an ongoing basis.

Ready to join us as we serve those who serve others?

About CloudKitchens

CloudKitchens is a technology company that provides a platform for restaurants to operate delivery-only kitchens. The company's platform allows restaurants to expand their delivery reach without the need for additional physical locations, while also providing real-time data and analytics to optimize operations. CloudKitchens was founded in 2016 by Travis Kalanick, the co-founder of Uber, and is headquartered in Los Angeles, California.

Size

1,000 employees

Industry

Enterprise Technology

Founded

2016

* Ladders Estimates
