Member of Technical Staff - Mechanistic Interpretability

Vmax

$300K — $500K *

San Francisco, CA 94112 (In-Person)

Technical Services

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

PhD or equivalent experience in machine learning, reinforcement learning, or a closely related field.
Proven research excellence through publications, open source projects, or deployed AI systems.
In-depth knowledge of modern machine learning, with a focus on reinforcement learning and large language models.
Strong familiarity with post-training methods for LLMs.
Experience in designing rigorous ML experiments and thorough evaluation methods.
Proficiency in Python and familiarity with ML frameworks like PyTorch or JAX.
Capable of independently tackling open-ended research challenges and defining experimental programs.

Responsibilities

Develop methods leveraging mechanistic interpretability to derive useful training signals.
Transform internal representations and causal behaviors into intrinsic rewards for reinforcement learning.
Evaluate interpretability-derived rewards against various forms of feedback and outcome evaluation.
Design metrics and baselines for assessing reward quality and resistance to reward manipulation.
Study the evolution of internal representations during RL, applying insights to enhance training objectives.
Create infrastructure for large-scale, reproducible experiments on LLM agents and interpretability tools.
Establish a high-impact research agenda to advance open-ended learning beyond human imitation.

Benefits

Opportunity to work in a cutting-edge field of AI with innovative research.
Access to a collaborative and engaging work environment in San Francisco.
Consideration for hybrid work arrangements for exceptional candidates.
Involvement in developing impactful AI solutions with substantial real-world implications.

Full Job Description

About the role

LLMs are fantastically powerful, and there is a rapidly growing corpus of work devoted to understanding their internal representations and computations. We use the tools of mechanistic interpretability to enhance reinforcement learning by generating intrinsic rewards as a supplement or alternative to downstream human-generated verifiers.

Responsibilities

Develop methods for using mechanistic interpretability to extract useful training signals from the internal states of language models.
Turn representations, features, circuits, and causal model behaviors into intrinsic rewards for reinforcement learning.
Compare interpretability-derived rewards against human feedback, learned reward models, verifiers, and task-level outcome rewards.
Design metrics and baselines for reward quality, including alignment with intended behavior, generalization across tasks, robustness, and resistance to reward hacking.
Investigate how internal representations evolve during RL and post-training, and use these insights to improve training objectives.
Develop infrastructure for reproducible, large-scale experiments on LLM agents, interpretability tools, and RL environments.
Define and pursue a high-impact research agenda that advances Vmax's goal of open-ended learning beyond imitation of human expertise.

Minimum Requirements

PhD or equivalent experience in machine learning, reinforcement learning, or a closely related field.
Track record of research excellence, as demonstrated by publications, open source work, deployed AI systems, or other substantial technical contributions.
Deep understanding of modern machine learning, especially reinforcement learning, representation learning, and large language models.
Strong familiarity with LLM post-training methods.
Experience designing and running rigorous ML experiments, including ablations, baselines, evaluation design, and failure analysis.
Expertise with Python and at least one major ML framework such as PyTorch or JAX.
Ability to work independently on open-ended research problems and turn ambiguous ideas into concrete experimental programs.

Nice to have

Experience with mechanistic interpretability techniques such as activation patching, probing, sparse autoencoders, and feature attribution.
Experience training or evaluating language-model agents in interactive, tool-using, or multi-step reasoning settings.
Familiarity with scalable RL infrastructure, distributed training, experiment tracking, and large-scale evaluation pipelines.
Experience developing reward models, verifiers, process supervision methods, or automated evaluation systems.
Demonstrated software engineering ability, especially in research codebases that require reliability, reproducibility, and iteration speed.
Ability to present technical results and their strategic implications to both research and non-research audiences.

Role specific location policy

This role is based in our San Francisco office; for exceptional candidates we are willing to consider a hybrid arrangement.

Compensation

The expected salary range for this position is $300,000 - $500,000 USD.

* Ladders Estimates
