Member of Technical Staff - Mechanistic Interpretability

Vmax

$300K - $500K
Technical Services
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • PhD or equivalent experience in machine learning, reinforcement learning, or closely related field.
  • Proven research excellence through publications, open source projects, or deployed AI systems.
  • In-depth knowledge of modern machine learning, with a focus on reinforcement learning and large language models.
  • Strong familiarity with post-training methods for LLMs.
  • Experience in designing rigorous ML experiments and thorough evaluation methods.
  • Proficiency in Python and familiarity with ML frameworks like PyTorch or JAX.
  • Capable of independently tackling open-ended research challenges and defining experimental programs.

Responsibilities

  • Develop methods leveraging mechanistic interpretability to derive useful training signals.
  • Transform internal representations and causal behaviors into intrinsic rewards for reinforcement learning.
  • Evaluate interpretability-derived rewards against various forms of feedback and outcome evaluation.
  • Design metrics and baselines for assessing reward quality and resistance to reward manipulation.
  • Study the evolution of internal representations during RL, applying insights to enhance training objectives.
  • Create infrastructure for large-scale, reproducible experiments on LLM agents and interpretability tools.
  • Establish a high-impact research agenda to advance open-ended learning beyond human imitation.

Benefits

  • Opportunity to work in a cutting-edge field of AI with innovative research.
  • Access to a collaborative and engaging work environment in San Francisco.
  • Consideration for hybrid work arrangements for exceptional candidates.
  • Involvement in developing impactful AI solutions with substantial real-world implications.
Full Job Description
About the role

LLMs are fantastically powerful, and there is a rapidly growing body of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.
Responsibilities
  • Develop methods for using mechanistic interpretability to extract useful training signals from the internal states of language models.
  • Turn representations, features, circuits, and causal model behaviors into intrinsic rewards for reinforcement learning.
  • Compare interpretability-derived rewards against human feedback, learned reward models, verifiers, and task-level outcome rewards.
  • Design metrics and baselines for reward quality, including alignment with intended behavior, generalization across tasks, robustness, and resistance to reward hacking.
  • Investigate how internal representations evolve during RL and post-training, and use these insights to improve training objectives.
  • Develop infrastructure for reproducible, large-scale experiments on LLM agents, interpretability tools, and RL environments.
  • Define and pursue a high-impact research agenda that advances Vmax's goal of open-ended learning beyond imitation of human expertise.
Minimum Requirements
  • PhD or equivalent experience in machine learning, reinforcement learning, or a closely related field.
  • Track record of research excellence, as demonstrated by publications, open source work, deployed AI systems, or other substantial technical contributions.
  • Deep understanding of modern machine learning, especially reinforcement learning, representation learning, and large language models.
  • Strong familiarity with LLM post-training methods.
  • Experience designing and running rigorous ML experiments, including ablations, baselines, evaluation design, and failure analysis.
  • Expertise with Python and at least one major ML framework such as PyTorch or JAX.
  • Ability to work independently on open-ended research problems and turn ambiguous ideas into concrete experimental programs.
Nice to have
  • Experience with mechanistic interpretability techniques such as activation patching, probing, sparse autoencoders, and feature attribution.
  • Experience training or evaluating language-model agents in interactive, tool-using, or multi-step reasoning settings.
  • Familiarity with scalable RL infrastructure, distributed training, experiment tracking, and large-scale evaluation pipelines.
  • Experience developing reward models, verifiers, process supervision methods, or automated evaluation systems.
  • Demonstrated software engineering ability, especially in research codebases that require reliability, reproducibility, and iteration speed.
  • Ability to present technical results and their strategic implications to both research and non-research audiences.
Role-specific location policy
  • This role is based in our San Francisco office; for exceptional candidates we are willing to consider a hybrid arrangement.
Compensation

The expected salary range for this position is $300,000 - $500,000 USD.
