ML Systems Engineer

Periodic Labs

• $300K — $400K *

Menlo Park, CA 94025 (In-Person)

Information Technology

Less than 5 years of experience

1 month ago


Job Overview by Ladders

Qualifications

5-7 years of experience in large-scale ML infrastructure
Strong background in low-level systems programming and kernel optimization
Experience with GPU cluster scheduling and orchestration using Ray, Slurm, or Kubernetes
Proficient in writing and optimizing CUDA kernels and communication primitives
Adept in profiling and benchmarking distributed ML systems
Experience in checkpoint management and direct cloud storage integration
Familiarity with open source ML infrastructure projects and research co-design with ML researchers

Responsibilities

Build scheduling systems for GPU clusters to maximize performance
Create profiling tools to identify bottlenecks in training and inference
Implement checkpoint streaming to mitigate I/O bottlenecks during training
Benchmark RL training configurations to find the best-performing setups across model sizes and hardware
Design zero-copy RDMA weight synchronization for training and inference
Develop sandbox environments for rapid execution of model actions
Contribute to and engage with open source ML communities

Benefits

Visa sponsorship available
Flexible location options within the San Francisco Bay Area
Opportunity to work at the intersection of infrastructure and research
Gain direct influence over scientific discovery processes
Engage with cutting-edge technologies and open-source projects

Full Job Description

About the Role

You will own the systems layer that makes our frontier model training and inference fast, efficient, and tightly coupled to the RL feedback loop that drives scientific discovery.

This is neither a pure infrastructure role nor a pure research role - it sits exactly at their intersection. You will go deep into the stack (scheduling, kernels, RDMA, weight synchronization, and communication primitives) while working shoulder-to-shoulder with researchers to co-design algorithms and infrastructure.

The RL loop is central to how Periodic Labs works. Models propose experiments, experiments generate data, and data feeds back into training. The speed and reliability of that loop are a direct multiplier on the pace of scientific discovery. You will own the infrastructure that makes it fast.

What You'll Do

Orchestration

Build rack- and topology-aware scheduling for GB-series GPUs across Ray, Slurm, and Kubernetes, minimizing latency and maximizing utilization across heterogeneous cluster configurations

Training & Inference

Build online and offline profilers that surface bottlenecks across the training and inference stack and translate findings into actionable optimizations
Implement direct S3 checkpoint streaming to eliminate I/O bottlenecks in large-scale training runs
Run methodical benchmarking to identify optimal RL training configurations across model sizes, batch strategies, and hardware topologies
Write and optimize communication and GPU kernels to extract maximum throughput from the hardware

RL Loop

Design and implement zero-copy RDMA weight synchronization between training and inference to keep the RL loop tight and low-latency
Build fast sandbox execution environments that allow rapid rollout of model-generated actions and return of rewards without blocking the training pipeline

Open Source Collaboration

Engage directly with the SGLang, Megatron, and Ray communities - contributing upstream, influencing roadmaps, and pulling in improvements that benefit Periodic Labs' workloads

Research Co-Design

Work closely with RL and pretraining researchers to co-design algorithms and infrastructure - you will shape what is possible at the research level by knowing what is achievable at the systems level, and vice versa

The net result: high-throughput, fault-tolerant training and inference systems tightly coupled with a low-latency RL feedback loop that accelerates scientific discovery at every turn.

You Might Thrive in This Role if You Have Experience With

Large-scale inference infrastructure: load balancing, traffic shifting, scheduling, and serving architecture at production scale
Low-level systems programming: RDMA, NVLink, kernel-level work, and network stack optimization
GPU cluster scheduling and orchestration across Ray, Slurm, or Kubernetes, with awareness of rack topology and hardware locality
Writing and optimizing CUDA kernels, communication primitives, or distributed training collective operations
Profiling and benchmarking distributed ML systems to identify and eliminate bottlenecks across compute, memory, and network
Checkpoint management and streaming at scale, including direct cloud storage integration
Building or contributing to open source ML infrastructure projects (e.g., SGLang, Megatron-LM, vLLM, Ray)
Working directly with ML researchers on algorithm-infrastructure co-design - you understand the research well enough to make systems decisions that serve it

Why This Role Is Critical to Our Mission

The pace of scientific discovery at Periodic Labs is directly governed by the speed of our RL loop. Our models learn by doing: they generate hypotheses, run experiments, receive graded results, and train on the outcomes. Every inefficiency in that cycle - every idle GPU, every blocked weight sync, every slow rollout - compounds into slower science. Right now, our researchers are running frontier-scale RL on thousands of GPUs across Megatron and SGLang/vLLM, and the infrastructure constraints are real and active. Trainer idle time, node pressure, weight sync reliability, and time-to-first-batch are not abstract concerns - they are daily rate limiters on what our researchers can explore and how fast they can learn.

We are building out internal inference platforms with OSS libraries such as SGLang and vLLM, using prefill-decode disaggregation to optimize throughput, working to compress our node footprint so more researchers can run experiments in parallel, and designing a more modular RL infrastructure that decouples inference replicas from training jobs. These are not future roadmap items - they are problems being worked on today, by researchers who should be focused on the science. The person in this role will take those problems off their plate and own them end-to-end, with the technical depth and judgment to make the right architectural calls without being told what to do.

There is also a deeper reason this role matters. Periodic Labs' scientific tasks - XRD phase identification, crystal structure prediction, synthesis planning - have unusually long and expensive verification loops compared to math or code benchmarks. A model rollout that requires running a Rietveld refinement or executing a DFT calculation is fundamentally different from one that checks a unit test. That asymmetry means inference throughput, sandbox execution speed, and RL loop latency have outsized leverage on our research velocity in ways they simply do not at other labs. Getting this infrastructure right is not a supporting function - it is a primary research accelerant.

The person who fills this role will work at the center of everything: tightly coupled to the research team, directly influencing what science gets done and how fast, and building systems that no one else in the world is building for exactly this problem. That is a rare opportunity, and we are looking for someone who recognizes it.

Mechanics

Minimum education: Bachelor's degree or an equivalent combination of education and training or experience

Location: Our lab is located in Menlo Park. We prefer folks to be located in Menlo Park or San Francisco, but we can be flexible depending on the role.

Compensation: The annual compensation range for this role is $300,000-$400,000

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

* Ladders Estimates




