About the role
This role is for strong infrastructure engineers who can build the systems layer for RL at scale: distributed rollouts, training orchestration, inference, evals, data pipelines, observability, and reliability. You will create the durable platform that enables researchers and applied ML engineers to run, debug, and reproduce large-scale RL experiments.
Responsibilities
- Build infrastructure for distributed RL training and inference across thousands of GPUs.
- Improve the reliability, debuggability, and throughput of RL experiments.
- Build interfaces that allow researchers and applied ML engineers to launch, inspect, compare, and reproduce experiments easily.
- Own infrastructure projects end to end, from architecture and implementation through deployment, documentation, and long-term maintenance.
- Identify and eliminate bottlenecks in training, rollout generation, eval execution, data movement, and cluster utilization.
- Maintain engineering standards for RL infrastructure, including testing, observability, versioning, and reproducibility.
Minimum Requirements
- Strong software engineering experience.
- Experience building infrastructure for LLM inference and/or RL training.
- Experience with GPU clusters, distributed training, model serving, or high-throughput inference systems.
- Familiarity with vLLM, SGLang, and modern LLM-RL training frameworks.
- Strong understanding of system reliability, observability, testing, debugging, and performance optimization.
- Ability to work closely with ML researchers and translate messy experimental workflows into durable infrastructure.
- Experience building tools, platforms, or services used by other technical users.
- Strong judgment around technical tradeoffs: when to prototype, when to harden, when to simplify, and when to redesign.
- Clear written and verbal communication, especially around system design, operational risks, and engineering tradeoffs.
Nice to have
- Experience supporting research teams or fast-moving ML teams.
- Experience at an organization with a high engineering bar, where reliability, ownership, and code quality were central.
- Evidence of strong independent technical work, such as open-source projects, infrastructure projects, competitions, or substantial systems built from scratch.
- Experience reducing operational complexity in systems that had become brittle, slow, or hard to debug.
Role-specific location policy
- This role is based in our San Francisco office; for exceptional candidates, we are willing to consider a hybrid arrangement.
Compensation
The expected salary range for this position is $300,000 - $500,000 USD.