Research Engineer - Training Platform

Rhoda AI

• $120K — $160K *

Mountain View, CA 94040 (In-Person)

Enterprise Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

5-7 years of software engineering experience in MLOps or ML platform engineering
Proficient with distributed training frameworks like PyTorch DDP or similar
Hands-on experience with experiment tracking and artifact management systems
Familiar with managing GPU cluster environments (Slurm, Kubernetes, etc.)
Strong skills in reliability engineering, including monitoring and failure recovery

Responsibilities

Build and maintain training orchestration systems for GPU clusters
Develop tools for job configuration, tracking, and reproducibility
Create observability infrastructure for monitoring training runs
Optimize the research iteration loop for efficiency
Manage job scheduling and cluster utilization to make efficient use of GPU compute
Build internal interfaces to accelerate researcher workflows
Collaborate with training systems, data infrastructure, and research teams to support their platform needs

Benefits

High visibility and direct feedback from users on the platform
Opportunity to influence training velocity and reliability across experiments
Work on systems that scale from today's models to future frontier training runs
Be part of a team that builds critical tools used by researchers and engineers

Full Job Description

We're looking for a Research Engineer to build and maintain the training platform that powers our model development: experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.

What You'll Do

Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters
Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management
Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection
Optimize and automate the research iteration loop from experiment launch to results analysis
Manage job scheduling and cluster utilization for efficient use of GPU compute
Build internal tooling and interfaces that help researchers move faster
Collaborate with training systems, data infrastructure, and research teams to support their platform needs

What We're Looking For

Strong software engineering skills with experience in MLOps or ML platform engineering
Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)
Experience building experiment tracking, reproducibility, and artifact management systems
Comfortable managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)
Strong reliability engineering instincts: monitoring, alerting, and failure recovery

Nice to Have (But Not Required)

Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)
Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)
Experience supporting large model training pipelines (LLMs, VLMs, or video models)
Understanding of parallelism strategies and how they affect training efficiency and debugging
Experience with cloud-based training infrastructure (AWS, GCP, or Azure)

Why This Role

Your platform is the daily tool every researcher and engineer uses to train models
Improvements to training velocity and reliability compound across every experiment the team runs
High visibility with direct feedback from researchers and ML engineers
Build systems that scale from today's models to future frontier training runs

* Ladders Estimates
