What you'll do
We are seeking a foundational Machine Learning Infrastructure Engineer to design and build the large-scale ML training infrastructure that powers our next-generation autonomous transport models. In this role, you will design the high-performance training pipelines and validation environments that enable our world-class robotics and ML researchers to iterate rapidly. You will own the challenge of scaling distributed GPU workloads to support a high volume of concurrent training runs across an expanding vehicle fleet, building a platform that can flexibly run on whatever GPU capacity is available, regardless of provider or environment, directly accelerating innovation across the platform.
- Training Infrastructure: Design, implement, and scale repeatable machine learning infrastructure utilizing Kubernetes to support large-scale distributed GPU training of novel neural networks.
- Distributed Computing & Orchestration: Leverage distributed compute frameworks to efficiently manage and execute a high volume of complex ML training jobs concurrently across large GPU clusters.
- Experiment Tracking & MLOps: Integrate advanced model management and experiment tracking tools to provide researchers with deep observability into training metrics and run performance.
- Data Engineering Pipelines: Build and optimize high-throughput data ingestion pipelines to seamlessly stream petabyte-scale multi-sensor vehicle logs into training environments.
- Validation at Scale: Architect robust infrastructure for autonomous model validation and continuous integration testing, ensuring new vehicle policy releases are entirely regression-free.
- Cross-Functional Collaboration: Partner closely with core robotics engineers and machine learning researchers to eliminate workflow bottlenecks and accelerate the deploy-to-vehicle lifecycle.
What we're looking for
- 8+ years of professional software engineering experience.
- Strong backend systems programming skills with proficiency in Go, Python, Java, or a similar language (familiarity with Rust is a plus).
- Proficiency with Kubernetes for container orchestration and building cloud-agnostic environments from scratch.
- Experience implementing distributed ML compute frameworks (e.g., Ray) to coordinate large pools of GPUs for heavy, multi-node workloads.
- Hands-on experience building MLOps pipelines, metadata tracking architectures, and model registries using platforms like MLflow.
- Prior experience managing high-throughput data pipelines using modern distributed data engines to feed data-hungry neural network architectures.
What else you need to know
This role is based in our San Francisco office. Atoms is a company driven by invention and continuous change: we are constantly reimagining our industries, building new products, and refining how we operate. We do our best work together. That's why all of our office-based teams work onsite, five days a week.
The base salary range for this role is $224,000 - $280,000 per year. Actual compensation will be determined on an individual basis and may vary depending on experience, skills, and qualifications.
Base salary is just one part of your total rewards package. You may also be eligible for equity awards and an annual performance-based bonus.
Benefits Summary (USA Full-Time Exempt Employees):
- Medical, Dental, Vision, Disability, and Life Insurance
- Flexible Spending Account / Health Savings Account Options
- 401(k)
- Equity
- Sick Time, Unlimited Flexible Time Off, and Paid Holidays
- Paid Parental Leave
- Pre-Tax Commuter Benefit Plan
- Team lunch in our SoMa office every Tuesday and Thursday
Benefits are subject to change at the company's discretion.
Atoms accepts applications on an ongoing basis.
Ready to join us as we serve those who serve others?