Applied Research Engineer - Training Infra

Snorkel AI • $150K — $180K *

US-Anywhere • Remote in United States

Technical Services

Less than 5 years of experience

2 months ago

Job Overview by Ladders

Qualifications

5-7 years of experience in managing GPU clusters on cloud platforms like AWS
Proficiency with orchestration tools such as Kubernetes or Slurm
Strong knowledge of distributed training concepts and optimization techniques
Experience with ML experiment tracking and versioning tools
Solid Python programming skills and software engineering principles
Ability to thrive in fast-paced, ambiguous environments
Familiarity with post-training workflows like supervised fine-tuning is a plus

Responsibilities

Manage GPU cluster infrastructure for efficient distributed model training
Build and maintain job orchestration systems for training and evaluation
Integrate and stabilize ML training frameworks at scale
Establish and oversee experiment tracking and model artifact management
Monitor cluster health and optimize resource utilization
Collaborate with research scientists to unblock infrastructure challenges

Benefits

Flexible remote work options
Opportunities for professional growth and skill development
Supportive environment for career advancement
Participation in decision-making processes
Comprehensive diversity and inclusion initiatives

Full Job Description

THE ROLE

As an Applied Research Engineer at Snorkel AI, you will own the infrastructure that powers our model training and evaluation work. This is a hands-on role where you will build and operate GPU cluster infrastructure, training pipelines, and the tooling that allows our research and engineering teams to run experiments reliably and at scale. You will work closely with research scientists and engineers, translating training requirements into robust, reproducible systems and proactively removing infrastructure blockers before they slow down the work that matters most.

Snorkel AI operates in a fast-paced, high-impact environment. We are looking for someone who takes pride in operational excellence, loves solving complex distributed systems problems, and thrives when given real ownership.

Location: Redwood City or San Francisco, or remote

MAIN RESPONSIBILITIES

Set up and manage GPU cluster infrastructure on major cloud providers (e.g., AWS HyperPod) for distributed model training, including networking, provisioning, and cost tracking.
Build and operate job orchestration and scheduling systems (e.g., Kubernetes, Slurm, or cloud-native equivalents) to reliably launch and manage training, rollout, and evaluation jobs across multi-node clusters.
Integrate and maintain ML training frameworks and post-training pipelines, ensuring they run stably and reproducibly at scale.
Set up and maintain experiment tracking, dataset versioning, and model artifact management to support fast iteration.
Monitor and optimize cluster health, inter-node communication, and resource utilization; implement fault tolerance and auto-recovery so long-running jobs survive node failures.
Work closely with research scientists and ML engineers to understand requirements, unblock experiments, and evolve infrastructure as our training workload needs change.

PREFERRED QUALIFICATIONS

Hands-on experience managing GPU clusters on major cloud providers, including provisioning, network configuration, and cost management.
Experience with distributed compute orchestration tools such as Kubernetes, Slurm, or equivalent cluster management systems.
Working knowledge of distributed training concepts: parallelism strategies, memory optimization techniques, and inter-node communication.
Experience setting up, managing, and integrating ML experiment tracking and data/model versioning tools.
Strong Python proficiency and solid software engineering fundamentals such as version control, modular design, and automation.
Ability to work in a fast-moving, iterative environment and take end-to-end ownership of ambiguous infrastructure problems.
Hands-on experience with post-training workflows such as supervised fine-tuning (SFT) or reinforcement learning (RLHF, GRPO, or similar) is a strong plus, but not required.

The salary range is $150,000.00 - $180,000.00.

This role is a great fit for engineers who love building reliable systems close to the frontier of AI research. We welcome applicants from a wide range of backgrounds, whether your experience comes from industry, research labs, or direct hands-on work with distributed infrastructure at scale.

About Snorkel AI

Snorkel AI is an artificial intelligence company that provides a platform for building and managing machine learning models. The company was founded in 2019 and is headquartered in San Francisco, California. Snorkel AI's platform is designed to make it easier for developers and data scientists to create and manage machine learning models, using a technique called programmatic labeling. The company's platform is used by a number of large enterprises, including Intel, Google, and Microsoft. Snorkel AI has raised over $50 million in funding to date.

Size

50 employees

Industry

Information Technology

Founded

2019

* Ladders Estimates
