The Role
We are looking for research engineers to help build the post-training stack for frontier reasoning models.
This role sits at the point where model capability, training dynamics, data, verification, and infrastructure all meet. You will design and run the experiments that turn a strong base model into a model that can solve difficult tasks reliably: choosing training objectives, shaping data mixtures, building verifiers, debugging reward signals, scaling runs, and understanding why a recipe works or fails.
Researchers are also expected to have strong engineering skills. The best work here will involve both: forming hypotheses about training behavior, implementing them in real systems, running large-scale experiments, reading the resulting traces carefully, and turning the lessons into the next training run.
Some areas you may work on include:
- Post-training recipes: Develop and iterate on RL, SFT, and distillation recipes. Understand how choices in objectives, data mixtures, hyperparameters, rollout generation, and filtering affect efficiency, stability, capability, and final model behavior.
- Scaling RL: Make post-training work at larger scales: more tokens, longer trajectories, larger models, more steps, and larger compute budgets. This includes identifying the bottlenecks that appear only when an approach leaves the small-run regime.
- Long-horizon reasoning: Train models on tasks where success depends on many intermediate decisions. Develop methods for assigning useful feedback across long trajectories, where sparse rewards, credit assignment, exploration, and verification all become harder.
- Off-policy and asynchronous training: Work on training regimes where data is generated by older policies, different policies, or partially filtered policies. Build intuition and tooling for when off-policy data helps, when it hurts, and how to control the resulting instabilities.
- Verification and reward quality: Build robust verification pipelines for tasks where correctness can be checked automatically or semi-automatically. Detect and reduce reward hacking, false positives, brittle verifiers, and other failure modes that make RL look better than it really is.
- Multi-task post-training: Scale recipes across different task families and domains. Study the tradeoffs between specialization and generality, and design training mixtures that improve all capabilities together.
- Experiment analysis and debugging: Develop a deep empirical understanding of training runs. Diagnose regressions, separate real improvements from noise, design better ablations, and build the probes and analyses needed to make post-training less opaque.
- End-to-end execution: Work closely with systems, infrastructure, and data teams to get experiments from idea to production-scale runs. This includes making training pipelines reliable, ensuring data and verifier quality, and turning successful experiments into repeatable and scalable recipes.
If you're excited about building the infrastructure that makes frontier RL research possible at scale, we'd love to hear from you.
We offer a base salary of $350,000-$500,000 USD and a meaningful equity grant, depending on experience and background, along with competitive benefits.