Software Engineer, RL Training Infra

OpenAI • $130K — $180K *

San Francisco, CA 94112 • In-Person

Information Technology

Less than 5 years of experience

1 week ago

Job Overview by Ladders

Qualifications

5-7 years of experience in machine learning infrastructure or related fields
Strong debugging skills essential for troubleshooting complex systems
Ability to learn quickly across various technology layers
Experience in reinforcement learning or related ML infrastructure
Excellent communication and ownership attributes

Responsibilities

Ensure smooth operation of large-scale RL training systems by addressing urgent technical issues
Troubleshoot and fix problems in training, inference, and orchestration systems
Enhance the reliability of training runs and improve overall system efficiency
Support researchers in developing complex integrations like memory or multi-agent capabilities
Transform recurring operational challenges into systematic solutions
Collaborate closely with research and infrastructure teams on model timelines
Adapt quickly to ambiguous situations and take ownership of projects

Benefits

Opportunity to work with leading-edge AI technologies
Collaborative and high-impact work environment
Access to professional development and learning opportunities
Ability to influence the direction of frontier model training
Work within a team that values speed, reliability, and innovation

Full Job Description

About the Team

The Post-Training Frontiers team creates the frontier agents OpenAI ships to the world. We do the reinforcement learning training for the agentic models we ship in Codex, ChatGPT, and the API (from o1 to 5.5).

Our role consists of (1) shepherding all integrations that should go into the final RL run and deciding what can make it in, (2) babysitting and scaling the final run, and (3) building the research and infra for horizontal integrations, such as improving function calling, factuality, multi-agent capabilities, memory, calibrated thinking, etc.

About the Role

This role focuses on keeping our frontier RL training runs fast, reliable, and unblocked. You will work across engineering and infrastructure problems as they emerge, from scaling and orchestration issues to inference bottlenecks, numerical problems, and hardware failures, as well as supporting large horizontal integrations in the big run, like multi-agent capabilities or memory. This is a role for a strong generalist who quickly learns anything needed for the task, has high attention to detail, debugs deeply, and is motivated by fixing the highest-impact problem in front of the team.

In this role, you will:

- Keep large-scale RL training runs moving by jumping into the most urgent engineering and infrastructure problems.

- Debug issues across training systems, inference, orchestration, scaling, and distributed infrastructure.

- Solve hard technical problems at the boundary between research and engineering: scaling experiments, improving training reliability, debugging distributed systems, reducing latency and cost, and making new capabilities robust under real workloads.

- Improve reliability and efficiency for RL training runs.

- Help researchers who are developing infra-heavy integrations, such as multi-agent capabilities or memory.

- Turn recurring operational issues into better tools, systems, processes, or abstractions.

- Work closely with research, infrastructure, and partner teams during tight model run timelines.

- Become useful quickly in messy, ambiguous areas where ownership matters more than a perfectly scoped project.

- Debug failures that cut across model behavior, training data, RL systems, evaluation infrastructure, serving systems, and agent harnesses, then turn those failures into hypotheses, fixes, and durable improvements.

You might thrive in this role if you:

- Want to train and ship our frontier models and ensure we make agents genuinely useful for developers, enterprises, researchers, and everyday users.

- Are a strong generalist engineer with experience in some layer of ML infrastructure.

- Have worked on RL, inference, scaling, training systems, orchestration, or adjacent ML infrastructure.

- Learn extremely quickly and are comfortable operating across unfamiliar layers.

- Are a strong debugger with high ownership, low ego, and excellent communication.

- Can land in a messy area with tight timelines, become useful quickly, and gradually raise the quality of the whole system.

- Are energized by fast-moving environments where reliability, speed, and judgment matter.

- Like building load-bearing systems and processes when that is what the team needs, even if the work is not glamorous.

Nice to have:

- Experience supporting large-scale model training, async RL systems, or high-throughput ML infrastructure.

- Experience debugging distributed systems across GPUs, networking, orchestration, or inference stacks.

- Background in performance optimization, scaling, or production-critical infrastructure.

- Experience working directly with researchers or fast-moving model teams.

About OpenAI

OpenAI is an artificial intelligence research laboratory consisting of the for-profit corporation OpenAI LP and its parent company, the non-profit OpenAI Inc. The company was founded in 2015 by a group of technology leaders, including Elon Musk, Sam Altman, Greg Brockman, Ilya Sutskever, and John Schulman. OpenAI's mission is to develop and promote friendly AI for the betterment of humanity. The company has developed a number of cutting-edge AI technologies, including GPT-3, a language processing system that can generate human-like text. OpenAI has received funding from a number of high-profile investors, including LinkedIn co-founder Reid Hoffman and venture capitalist Peter Thiel.

Size: 100 employees

Industry: Information Technology

Founded: 2015

* Ladders Estimates
