Member of Technical Staff - RL Infrastructure

Vmax

• $300K — $500K *

San Francisco, CA 94112, In-Person

Technical Services

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

Strong software engineering experience
Experience building infrastructure for LLM inference and/or RL training
Proficiency with GPU clusters and distributed training systems
Familiarity with vLLM, SGLang, and modern LLM-RL frameworks
Understanding of system reliability and observability
Ability to collaborate with ML researchers to improve workflows
Experience creating tools for technical users

Responsibilities

Build infrastructure for distributed RL training and inference across thousands of GPUs
Enhance reliability, debuggability, and throughput of RL experiments
Create user-friendly interfaces for experiment management
Oversee infrastructure projects from design to long-term maintenance
Identify and eliminate performance bottlenecks in training and data processes
Maintain high engineering standards for RL infrastructure

Benefits

San Francisco office-based role, with a hybrid arrangement considered for exceptional candidates
Engagement with fast-paced ML teams
Opportunity to work with cutting-edge RL technology
An environment with a high engineering bar that promotes ownership and quality
Support for independent technical projects and open-source contributions

Full Job Description

About the role

This role is for strong infrastructure engineers who can build the systems layer for RL at scale: distributed rollouts, training orchestration, inference, evals, data pipelines, observability, and reliability. You will create the durable platform that enables researchers and applied ML engineers to run, debug, and reproduce large-scale RL experiments.

Responsibilities

Build infrastructure for distributed RL training and inference across thousands of GPUs.
Improve the reliability, debuggability, and throughput of RL experiments.
Build interfaces that allow researchers and applied ML engineers to launch, inspect, compare, and reproduce experiments easily.
Own infrastructure projects end to end, from architecture and implementation through deployment, documentation, and long-term maintenance.
Identify and eliminate bottlenecks in training, rollout generation, eval execution, data movement, and cluster utilization.
Maintain engineering standards for RL infrastructure, including testing, observability, versioning, and reproducibility.

Minimum Requirements

Strong software engineering experience.
Experience building infrastructure for LLM inference and/or RL training.
Experience with GPU clusters, distributed training, model serving, or high-throughput inference systems.
Familiarity with vLLM, SGLang, and modern LLM-RL training frameworks.
Strong understanding of system reliability, observability, testing, debugging, and performance optimization.
Ability to work closely with ML researchers and translate messy experimental workflows into durable infrastructure.
Experience building tools, platforms, or services used by other technical users.
Strong judgment around technical tradeoffs: when to prototype, when to harden, when to simplify, and when to redesign.
Clear written and verbal communication, especially around system design, operational risks, and engineering tradeoffs.

Nice to have

Experience supporting research teams or fast-moving ML teams.
Experience at an organization with a high engineering bar, where reliability, ownership, and code quality were central.
Evidence of strong independent technical work, such as open-source projects, infrastructure projects, competitions, or substantial systems built from scratch.
Experience reducing operational complexity in systems that had become brittle, slow, or hard to debug.

Role specific location policy

This role is based in our San Francisco office. For exceptional candidates, we are willing to consider a hybrid arrangement.

Compensation

The expected salary range for this position is $300,000 to $500,000 USD.

* Ladders Estimates
