Machine Learning Engineer - ML Training Platform

Pluralis

• $130K — $180K *

San Francisco, CA 94112 - In-Person

Information Technology

5 - 7 years of experience

2 months ago

Job Overview by Ladders

Qualifications

5+ years in infrastructure/platform engineering with a strong focus on decentralized systems
Hands-on experience with Kubernetes, Docker, and GPU workloads
Expertise in infrastructure-as-code tools like Pulumi, Terraform, or CloudFormation
Proficient in Python with skills in asynchronous programming and observability tools
Deep understanding of distributed training frameworks and multi-cloud environments

Responsibilities

Design multi-cloud resource management systems for AWS, GCP, and Azure using infrastructure-as-code
Architect fault-tolerant infrastructure for distributed machine learning
Create systems that simulate real-world network conditions to keep data flowing efficiently during training
Manage dynamic scaling and state synchronization across extensive compute nodes
Integrate health monitoring and recovery strategies for model training failures

Benefits

Work in a pioneering startup environment with cutting-edge technology
Collaborate with a world-class team of ML researchers
Opportunity to shape the future of decentralized AI development
Support from top-tier investors, enhancing stability and growth
Contribute to a mission-driven organization focused on open access to AI resources

Full Job Description

Overview

We're looking for an ML Training Platform Engineer to architect, build, and scale the foundational infrastructure powering our decentralized ML training platform. You will own core systems spanning infrastructure orchestration, distributed compute, and services integration, enabling continuous experimentation and large-scale model training.

Responsibilities

Multi-Cloud Infrastructure: Design resource management systems provisioning and orchestrating compute across AWS, GCP, and Azure using infrastructure-as-code (Pulumi/Terraform). Handle dynamic scaling, state synchronization, and concurrent operations across hundreds of heterogeneous nodes.
Distributed Training Systems: Architect fault-tolerant infrastructure for distributed ML, covering GPU clusters, NVIDIA runtime, S3 checkpointing, large dataset management and streaming, health monitoring, and resilient retry strategies.
Real-World Networking: Build systems that simulate and handle real-world network conditions (bandwidth shaping, latency injection, packet loss) while managing dynamic node churn and ensuring efficient data flow across workers with heterogeneous connectivity. Our training runs on consumer nodes and non-co-located infrastructure, not in a datacenter.

What You'll Bring

Ideally, you'll have 5+ years of work experience, with depth in:

Infrastructure & Platform Engineering: Production experience with infrastructure-as-code (Pulumi/Terraform/CloudFormation) managing multi-cloud deployments, lifecycle orchestration, self-healing systems, Docker/Kubernetes (EKS), GPU workloads, and heterogeneous clusters at scale.
Distributed Systems & ML Infrastructure: Deep understanding of distributed training workflows, checkpointing, data sharding, model versioning, long-running job orchestration, decentralized networking (P2P, NAT traversal, traffic shaping), and real-world bandwidth constraints.
Systems Programming & Reliability: Strong Python engineering (asyncio, concurrency, retry logic, cloud SDKs, CLI tooling) with hands-on experience in observability, SRE practices, monitoring (Prometheus/Grafana), performance profiling, and incident response.

What we're looking for

Experience in a startup environment with an emphasis on microservices orchestration, or a big tech background
Deep understanding of multi-cloud infra & distributed training systems
A team player with high attention to detail
A strong passion to join the team

* Ladders Estimates
