The Role
We are looking for a research engineer to build the evaluation infrastructure that tells us whether our models are getting better in ways we care about. You'll own the frameworks, pipelines, and tooling that measure model behavior across capabilities. Example areas you might work on include, but are not limited to:
- Design and build evaluation frameworks that measure model capabilities along realistic axes, beyond standard benchmarks.
- Build automated eval pipelines and regression-detection systems that run continuously and surface signal quickly.
- Develop agent-assisted workflows that help humans inspect model behavior efficiently.
- Instrument training runs with observability tooling so researchers can understand what's changing in model behavior, and why.
- Partner with post-training and RL teams to close the loop between eval signal and training decisions.
If you're excited about the hard problem of knowing whether a frontier AI system is actually improving, we'd love to hear from you.
We offer a base salary of $350,000-$500,000 USD and a meaningful equity grant, depending on experience and background, along with competitive benefits.