Member of Technical Staff, Post-Training, RL

Mirendil

$350K - $500K
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience in research engineering or a related field.
  • Strong background in reinforcement learning (RL) and machine learning techniques.
  • Experience with large-scale experimentation and debugging in model training.
  • Proficient in designing experimental protocols and analyzing complex data sets.
  • Solid engineering skills with a focus on system integration and scalability.

Responsibilities

  • Design experiments to enhance model reliability for complex tasks.
  • Develop and refine post-training recipes using techniques such as RL and distillation.
  • Scale RL processes to handle larger models and data sets.
  • Create methods for effective long-horizon reasoning in tasks requiring multiple decisions.
  • Establish verification pipelines to ensure quality and mitigate errors in reward systems.
  • Explore multi-task training strategies to balance specialization against general abilities.
  • Work with cross-functional teams to bring research experiments to production.

Benefits

  • Competitive health insurance plans.
  • Retirement savings plans with company matching.
  • Flexible working hours and remote work options.
  • Generous paid time off for personal and family needs.
  • Opportunities for professional development and career advancement.

Full Job Description

The Role

We are looking for research engineers to help build the post-training stack for frontier reasoning models.

This role sits at the point where model capability, training dynamics, data, verification, and infrastructure all meet. You will design and run the experiments that turn a strong base model into a model that can solve difficult tasks reliably: choosing training objectives, shaping data mixtures, building verifiers, debugging reward signals, scaling runs, and understanding why a recipe works or fails.

Researchers are also expected to have strong engineering skills. The best work here combines research and engineering: forming hypotheses about training behavior, implementing them in real systems, running large-scale experiments, reading the resulting traces carefully, and turning the lessons into the next training run.

Some areas you may work on include:
  • Post-training recipes: Develop and iterate on RL, SFT, and distillation recipes. Understand how choices in objectives, data mixtures, hyperparameters, rollout generation, and filtering affect efficiency, stability, capability, and final model behavior.
  • Scaling RL: Make post-training work at larger scales: more tokens, longer trajectories, larger models, more steps, and larger compute budgets. This includes identifying the bottlenecks that appear only when an approach leaves the small-run regime.
  • Long-horizon reasoning: Train models on tasks where success depends on many intermediate decisions. Develop methods for assigning useful feedback across long trajectories, where sparse rewards, credit assignment, exploration, and verification all become harder.
  • Off-policy and asynchronous training: Work on training regimes where data is generated by older policies, different policies, or partially filtered policies. Build intuition and tooling for when off-policy data helps, when it hurts, and how to control the resulting instabilities.
  • Verification and reward quality: Build robust verification pipelines for tasks where correctness can be checked automatically or semi-automatically. Detect and reduce reward hacking, false positives, brittle verifiers, and other failure modes that make RL look better than it really is.
  • Multi-task post-training: Scale recipes across different task families and domains. Study the tradeoffs between specialization and generality, and design training mixtures that improve all capabilities together.
  • Experiment analysis and debugging: Develop a deep empirical understanding of training runs. Diagnose regressions, separate real improvements from noise, design better ablations, and build the probes and analyses needed to make post-training less opaque.
  • End-to-end execution: Work closely with systems, infrastructure, and data teams to get experiments from idea to production-scale runs. This includes making training pipelines reliable, ensuring data and verifier quality, and turning successful experiments into repeatable and scalable recipes.


If you're excited about building the infrastructure that makes frontier RL research possible at scale, we'd love to hear from you.

We offer a base salary of $350,000-$500,000 USD and a meaningful equity grant, depending on experience and background, along with competitive benefits.
