Research Scientist / Engineer - Training Systems

Rhoda AI

$130K — $180K *
Consumer Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in large-scale distributed training performance enhancement
  • Hands-on experience with modern ML frameworks, primarily PyTorch
  • In-depth knowledge of data/tensor/pipeline parallelism and communication strategies
  • Strong system-level intuition for optimizing compute, communication, and memory efficiency
  • Highly skilled in debugging and measuring performance bottlenecks
  • Proactive ownership mentality in dynamic environments

Responsibilities

  • Own and enhance the performance of multimodal training systems end-to-end
  • Diagnose bottlenecks and implement performance improvements
  • Design and evolve parallelism strategies tailored to large-scale systems
  • Create performance metrics and tools to track and optimize training efficiency
  • Collaborate with researchers to transform model innovations into operational methodologies
  • Partner with infrastructure teams for enhanced cluster-level efficiency
  • Facilitate rapid iteration and experimentation within the research team

Benefits

  • Direct impact on research acceleration and iteration
  • Ownership of large-scale training performance affecting real-world applications
  • Opportunity to work in a small, elite team focused on significant improvements
  • High influence on efficiency measures that benefit the entire organization

Full Job Description

We're looking for a Staff / Principal ML Training Systems Engineer to own training systems performance end-to-end. You will define how our models train at scale, driving efficiency, scalability, and correctness across large-scale multimodal training. This is a core systems role, not infrastructure support. Your work directly determines how efficiently we use compute, how well models scale across thousands of GPUs, and how quickly research can iterate.

What You'll Do

Own training performance end-to-end
  • Diagnose and improve performance of large-scale multimodal training (vision, video, proprioception, actions, language)
  • Build systematic performance attribution: step-time decomposition (compute vs communication vs input pipeline), scaling curves across cluster sizes, and bottleneck identification and prioritization
  • Drive measurable gains in:
    • Distributed efficiency (comm/compute overlap, bucketization, topology-aware mapping, parallelism strategies)
    • Compute efficiency (kernel hotspots, operator fusion, attention optimization, framework/runtime overhead)
    • Memory efficiency (activation checkpointing, sequence packing/bucketing, fragmentation reduction)
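
As a rough illustration of the step-time decomposition mentioned above, a minimal sketch might split one measured training step into compute, exposed (non-overlapped) communication, and input-pipeline wait, then report which share dominates. The names `StepBreakdown` and `decompose_step` and the idea of passing pre-measured millisecond timings are assumptions made for this example; they do not describe any existing internal tool.

```python
from dataclasses import dataclass


@dataclass
class StepBreakdown:
    """Fractions of one training step's wall-clock time (hypothetical schema)."""
    compute_frac: float
    comm_frac: float
    input_frac: float
    other_frac: float

    def dominant(self) -> str:
        """Return the name of the component with the largest share."""
        parts = {
            "compute": self.compute_frac,
            "communication": self.comm_frac,
            "input_pipeline": self.input_frac,
            "other": self.other_frac,
        }
        return max(parts, key=parts.get)


def decompose_step(step_ms: float, compute_ms: float,
                   exposed_comm_ms: float, input_wait_ms: float) -> StepBreakdown:
    """Attribute a step's wall-clock time to its components.

    Assumes the caller has already measured each component, and that
    exposed_comm_ms counts only communication NOT hidden behind compute.
    """
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    # Whatever is left over (framework overhead, sync stalls) goes to "other".
    other_ms = max(step_ms - compute_ms - exposed_comm_ms - input_wait_ms, 0.0)
    return StepBreakdown(
        compute_frac=compute_ms / step_ms,
        comm_frac=exposed_comm_ms / step_ms,
        input_frac=input_wait_ms / step_ms,
        other_frac=other_ms / step_ms,
    )


if __name__ == "__main__":
    breakdown = decompose_step(1000.0, 600.0, 250.0, 100.0)
    print(breakdown, "-> dominant:", breakdown.dominant())
```

Tracking the same breakdown across cluster sizes is one way to see whether the communication share grows as a job scales out.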

Design training systems (not just tune them)
  • Define and evolve parallelism strategies: data / tensor / pipeline / sharding / hybrid approaches
  • Improve execution efficiency through communication scheduling and overlap, graph capture and execution optimization, and runtime-level improvements
  • Contribute to and extend training frameworks where needed


Make performance observable and measurable
  • Establish source-of-truth performance metrics: step-time breakdowns, MFU / throughput / scaling efficiency
  • Build tools to identify bottlenecks quickly, track performance across model families, and compare scaling behavior across configurations
  • Develop regression detection: microbenchmarks, performance baselines, and automated detection of efficiency regressions
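
To make the metrics above concrete, here is a small sketch of how MFU (model FLOPs utilization), scaling efficiency, and a simple baseline-based regression check could be computed. It assumes the common approximation of about 6 FLOPs per parameter per token for dense transformer training; the function names, the 3% tolerance, and the hardware numbers in the usage lines are hypothetical values chosen only for illustration.

```python
def mfu(tokens_per_sec: float, n_params: float,
        n_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOPs utilization: achieved model FLOP/s over peak hardware FLOP/s.

    Assumption: ~6 * n_params FLOPs per token (forward + backward) for a
    dense transformer; attention and multimodal encoders would need extra terms.
    """
    achieved_flops_per_sec = 6.0 * n_params * tokens_per_sec
    return achieved_flops_per_sec / (n_gpus * peak_flops_per_gpu)


def scaling_efficiency(base_gpus: int, base_throughput: float,
                       gpus: int, throughput: float) -> float:
    """Measured throughput divided by ideal linear scaling from a base run."""
    ideal_throughput = base_throughput * (gpus / base_gpus)
    return throughput / ideal_throughput


def is_regression(baseline: float, current: float,
                  tolerance: float = 0.03) -> bool:
    """Flag a metric that fell more than `tolerance` (relative) below baseline."""
    return current < baseline * (1.0 - tolerance)


if __name__ == "__main__":
    # Hypothetical numbers: 1B parameters, 16 GPUs, 1e15 peak FLOP/s each.
    print("MFU:", mfu(1_000_000, 1e9, 16, 1e15))
    print("Scaling efficiency:", scaling_efficiency(8, 100.0, 64, 720.0))
    print("Regression:", is_regression(baseline=0.40, current=0.38))
```

In practice the tolerance would likely be tuned to the run-to-run noise of each benchmark so that the check flags real regressions without firing on jitter.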


Partner deeply with researchers
  • Work side-by-side with research scientists and research engineers - no silos
  • Translate model innovations into scalable, efficient implementations
  • Advise on training tradeoffs for robotics world models: long-horizon sequences, rollout/evaluation cadence, multimodal and variable-length data


Collaborate on cluster-level efficiency
  • Work with infrastructure/SRE teams on utilization across large distributed jobs, the impact of network and collective performance on training, and topology-aware job placement and scaling behavior


What We're Looking For
  • Proven track record improving large-scale distributed training performance
  • Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
  • Strong understanding of data / tensor / pipeline parallelism, sharded training (FSDP / ZeRO-style), communication patterns and overlap strategies, and scaling behavior across large GPU clusters
  • Strong systems intuition - ability to reason across compute, communication, and memory bottlenecks
  • Exceptional debugging and measurement ability: turn "training is slow" into clear bottlenecks, experiments, and validated improvements
  • High ownership mindset and comfort in a fast-moving environment


Nice to Have (But Not Required)
  • GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
  • Experience with multimodal or video training (variable-length sequences, packing/bucketing)
  • Experience working on large-scale training frameworks or distributed runtimes
  • Familiarity with cluster topology, networking, and large-scale scheduling effects


Why This Role
  • Direct leverage on research velocity - every efficiency gain you make accelerates model iteration across the entire research team
  • Own the scalability and performance of large-scale multimodal training for real-world embodied intelligence, not static benchmarks
  • Improvements you make compound across every training run the company executes - high ownership, high impact, small elite team
