About the role
LLMs are fantastically powerful and there is a rapidly growing corpus of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.
Responsibilities
- Develop methods for using mechanistic interpretability to extract useful training signals from the internal states of language models.
- Turn representations, features, circuits, and causal model behaviors into intrinsic rewards for reinforcement learning.
- Compare interpretability-derived rewards against human feedback, learned reward models, verifiers, and task-level outcome rewards.
- Design metrics and baselines for reward quality, including alignment with intended behavior, generalization across tasks, robustness, and resistance to reward hacking.
- Investigate how internal representations evolve during RL and post-training, and use these insights to improve training objectives.
- Develop infrastructure for reproducible, large-scale experiments on LLM agents, interpretability tools, and RL environments.
- Define and pursue a high-impact research agenda that advances Vmax's goal of open-ended learning beyond imitation of human expertise.
Minimum Requirements
- PhD or equivalent experience in machine learning, reinforcement learning, or a closely related field.
- Track record of research excellence, as demonstrated by publications, open source work, deployed AI systems, or other substantial technical contributions.
- Deep understanding of modern machine learning, especially reinforcement learning, representation learning, and large language models.
- Strong familiarity with LLM post-training methods.
- Experience designing and running rigorous ML experiments, including ablations, baselines, evaluation design, and failure analysis.
- Expertise with Python and at least one major ML framework such as PyTorch or JAX.
- Ability to work independently on open-ended research problems and turn ambiguous ideas into concrete experimental programs.
Nice to have
- Experience with mechanistic interpretability techniques such as activation patching, probing, sparse autoencoders, and feature attribution.
- Experience training or evaluating language-model agents in interactive, tool-using, or multi-step reasoning settings.
- Familiarity with scalable RL infrastructure, distributed training, experiment tracking, and large-scale evaluation pipelines.
- Experience developing reward models, verifiers, process supervision methods, or automated evaluation systems.
- Demonstrated software engineering ability, especially in research codebases that require reliability, reproducibility, and iteration speed.
- Ability to present technical results and their strategic implications to both research and non-research audiences.
Role specific location policy
- This role is based in our San Francisco office; for exceptional candidates, we are willing to consider a hybrid arrangement.
Compensation
The expected salary range for this position is $300,000 - $500,000 USD.