Research Scientist / Engineer - Training Systems

Rhoda AI

• $130K — $180K *

Mountain View, CA 94040 (In-Person)

Consumer Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

5-7 years of experience in large-scale distributed training performance enhancement
Hands-on experience with modern ML frameworks, primarily PyTorch
In-depth knowledge of data/tensor/pipeline parallelism and communication strategies
Strong system-level intuition for optimizing compute, communication, and memory efficiency
Highly skilled in debugging and measuring performance bottlenecks
Proactive ownership mentality in dynamic environments

Responsibilities

Own and enhance the performance of multimodal training systems end-to-end
Diagnose bottlenecks and implement performance improvements
Design and evolve parallelism strategies tailored to large-scale systems
Create performance metrics and tools to track and optimize training efficiency
Collaborate with researchers to transform model innovations into operational methodologies
Partner with infrastructure teams for enhanced cluster-level efficiency
Facilitate rapid iteration and experimentation within the research team

Benefits

Direct impact on research acceleration and iteration
Ownership of large-scale training performance affecting real-world applications
Opportunity to work in a small, elite team focused on significant improvements
High influence on efficiency measures that benefit the entire organization

Full Job Description

We're looking for a Staff / Principal ML Training Systems Engineer to own training systems performance end-to-end. You will define how our models train at scale, driving efficiency, scalability, and correctness across large-scale multimodal training. This is a core systems role, not infrastructure support. Your work directly determines how efficiently we use compute, how well models scale across thousands of GPUs, and how quickly research can iterate.

What You'll Do

Own training performance end-to-end

Diagnose and improve performance of large-scale multimodal training (vision, video, proprioception, actions, language)
Build systematic performance attribution: step-time decomposition (compute vs communication vs input pipeline), scaling curves across cluster sizes, and bottleneck identification and prioritization
Drive measurable gains in:
- Distributed efficiency (comm/compute overlap, bucketization, topology-aware mapping, parallelism strategies)
- Compute efficiency (kernel hotspots, operator fusion, attention optimization, framework/runtime overhead)
- Memory efficiency (activation checkpointing, sequence packing/bucketing, fragmentation reduction)

Design training systems (not just tune them)

Define and evolve parallelism strategies: data / tensor / pipeline / sharding / hybrid approaches
Improve execution efficiency through communication scheduling and overlap, graph capture and execution optimization, and runtime-level improvements
Contribute to and extend training frameworks where needed

Make performance observable and measurable

Establish source-of-truth performance metrics: step-time breakdowns, MFU / throughput / scaling efficiency
Build tools to identify bottlenecks quickly, track performance across model families, and compare scaling behavior across configurations
Develop regression detection: microbenchmarks, performance baselines, and automated detection of efficiency regressions

Partner deeply with researchers

Work side-by-side with research scientists and research engineers - no silos
Translate model innovations into scalable, efficient implementations
Advise on training tradeoffs for robotics world models: long-horizon sequences, rollout/evaluation cadence, multimodal and variable-length data

Collaborate on cluster-level efficiency

Work with infrastructure/SRE teams on utilization across large distributed jobs, the impact of network and collective performance on training, and topology-aware job placement and scaling behavior

What We're Looking For

Proven track record improving large-scale distributed training performance
Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
Strong understanding of data / tensor / pipeline parallelism, sharded training (FSDP / ZeRO-style), communication patterns and overlap strategies, and scaling behavior across large GPU clusters
Strong systems intuition - ability to reason across compute, communication, and memory bottlenecks
Exceptional debugging and measurement ability: turn "training is slow" into clear bottlenecks, experiments, and validated improvements
High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
Experience with multimodal or video training (variable-length sequences, packing/bucketing)
Experience working on large-scale training frameworks or distributed runtimes
Familiarity with cluster topology, networking, and large-scale scheduling effects

Why This Role

Direct leverage on research velocity - every efficiency gain you make accelerates model iteration across the entire research team
Own the scalability and performance of large-scale multimodal training for real-world embodied intelligence, not static benchmarks
Improvements you make compound across every training run the company executes - high ownership, high impact, small elite team

* Ladders Estimates
