About the role
RL has become the de facto method for post-training LLMs. We are limited by the sample efficiency of today's policy gradient algorithms, and we are looking for a talented researcher to weave together pre-LLM and post-LLM approaches to learning from experience.
Responsibilities
- Develop new RL algorithms for post-training language models.
- Adapt ideas from pre-LLM reinforcement learning, such as model-based RL, temporal abstraction, and value-based learning, to modern LLM and agentic settings.
- Establish empirical baselines and evaluation protocols for measuring sample efficiency, robustness, generalization, and reward exploitation in LLM RL.
- Analyze failure modes of RL-trained models, including reward hacking, mode collapse, over-optimization, exploration failures, and distribution shift.
- Collaborate with researchers working on environments, evals, interpretability, reward modeling, and infrastructure to turn algorithmic ideas into reliable training systems.
- Own and develop a research agenda within Vmax, from identifying promising directions to executing experiments and communicating results.
Minimum Requirements
- PhD or equivalent experience in machine learning, reinforcement learning, or a closely related field.
- Track record of research excellence, as demonstrated by publications, open source work, deployed AI systems, or other substantial technical contributions.
- Deep understanding of modern machine learning, especially reinforcement learning, representation learning, and large language models.
- Strong familiarity with LLM post-training methods.
- Experience designing and running rigorous ML experiments, including ablations, baselines, evaluation design, and failure analysis.
- Experience with large-scale ML infrastructure, distributed training, experiment tracking, data pipelines, and debugging unstable training runs.
- Expertise with Python and at least one major ML framework such as PyTorch or JAX.
- Ability to work independently on open-ended research problems and turn ambiguous ideas into concrete experimental programs.
Nice to have
- Experience developing new RL algorithms or improving existing ones in domains such as robotics, games, simulated control, language models, or agents.
- Experience with LLM pre-training.
- Strong understanding of reward modeling, verifiers, process supervision, outcome supervision, or automated evaluation systems.
- Demonstrated software engineering ability.
- Strong communication skills, especially the ability to explain algorithmic ideas, empirical results, and research implications to both technical and non-technical audiences.
Role-specific location policy
- This role is based in our San Francisco office; for exceptional candidates, we are willing to consider a hybrid arrangement.
Compensation
The expected salary range for this position is $300,000 - $500,000 USD.