Research Engineer - Training Platform

Rhoda AI

$120K — $160K *
Enterprise Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Strong software engineering skills with experience in MLOps or ML platform engineering
  • Proficient with distributed training frameworks such as PyTorch DDP
  • Hands-on experience with experiment tracking and artifact management systems
  • Familiar with managing GPU cluster environments (Slurm, Kubernetes, etc.)
  • Strong skills in reliability engineering, including monitoring and failure recovery

Responsibilities

  • Build and maintain training orchestration systems for GPU clusters
  • Develop tools for job configuration, tracking, and reproducibility
  • Create observability infrastructure for monitoring training runs
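  • Optimize and automate the research iteration loop, from experiment launch to results analysis
  • Manage job scheduling and cluster utilization for efficient use of GPU compute
  • Build internal interfaces to accelerate researcher workflows
  • Collaborate with training systems, data infrastructure, and research teams on their platform needs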

Benefits

  • High visibility and direct feedback from users on the platform
  • Opportunity to influence training velocity and reliability across experiments
  • Work on systems that scale from today's models to future frontier training runs
  • Be part of a team that builds critical tools used by researchers and engineers

Full Job Description

We're looking for a Research Engineer to build and maintain the training platform that powers our model development: experiment orchestration, job management, observability, and the tooling that lets researchers move from idea to result as fast as possible.

What You'll Do
  • Build and maintain training orchestration systems for large-scale distributed model training across GPU clusters
  • Develop experiment management tooling: job configuration, tracking, reproducibility, and artifact management
  • Build observability infrastructure for training runs: loss curves, compute utilization, gradient statistics, and anomaly detection
  • Optimize and automate the research iteration loop from experiment launch to results analysis
  • Manage job scheduling and cluster utilization for efficient use of GPU compute
  • Build internal tooling and interfaces that help researchers move faster
  • Collaborate with training systems, data infrastructure, and research teams to support their platform needs

What We're Looking For
  • Strong software engineering skills with experience in MLOps or ML platform engineering
  • Familiarity with distributed training frameworks (PyTorch DDP, FSDP, DeepSpeed, Megatron, or similar)
  • Experience building experiment tracking, reproducibility, and artifact management systems
  • Comfortable managing and operating GPU cluster environments (Slurm, Kubernetes, or similar)
  • Strong reliability engineering instincts: monitoring, alerting, and failure recovery

Nice to Have (But Not Required)
  • Experience with training orchestration tools (Slurm, Ray, Kubernetes, or similar schedulers)
  • Familiarity with experiment tracking tools (Weights & Biases, MLflow, or custom solutions)
  • Experience supporting large model training pipelines (LLMs, VLMs, or video models)
  • Understanding of parallelism strategies and how they affect training efficiency and debugging
  • Experience with cloud-based training infrastructure (AWS, GCP, or Azure)

Why This Role
  • Your platform is the daily tool every researcher and engineer uses to train models
  • Improvements to training velocity and reliability compound across every experiment the team runs
  • High visibility with direct feedback from researchers and ML engineers
  • Build systems that scale from today's models to future frontier training runs
