ML Infrastructure Engineer

Sygaldry

• $130K — $180K *

San Francisco, CA 94112 (In-Person)

Information Technology

Less than 5 years of experience

1 month ago


Job Overview by Ladders

Qualifications

5-7 years of experience in a systems-oriented role, focusing on research computing or developer tooling.
Proficiency in managing multi-cloud environments (AWS, GCP) and GPU compute resources.
Strong background in Python and familiarity with ML frameworks such as PyTorch or JAX.
Experience with CI/CD pipeline development and management (e.g., GitHub Actions, containers).
Track record of supporting research teams with diverse computational needs.

Responsibilities

Build compute abstractions to support GPU and CPU workloads across various frameworks.
Establish experiment tracking and reproducibility systems for research efforts.
Develop tools that simplify cloud compute interactions for researchers.
Scale experimentation from small-scale to large-scale production environments.
Design orchestration for multi-cloud job routing based on resource availability and cost.
Manage cloud expenditure and optimize resource usage effectively.
Create and maintain CI/CD pipelines for research workflows including data processing.

Benefits

Visa sponsorship available to attract top talent.
Competitive salary with equity options to ensure investment in employee success.
Comprehensive health coverage for employees and dependents to promote well-being.
Social and team-building activities to foster connections among employees.
Unlimited PTO policy encouraging work-life balance and time for personal recharge.

Full Job Description

About the Role

Our AI & Algorithms team is growing fast: research scientists, applied mathematicians, and quantum algorithm researchers developing the algorithms that will accelerate and transform AI. They need compute infrastructure that stays out of their way: GPU access that's reliable, experiments that are reproducible, and workloads that scale without requiring each researcher to become a cloud expert.

You'll build and manage the compute platform this team runs on. The workloads are diverse (quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation) and run across multiple cloud providers and on-prem GPU servers. You own the full stack, from cloud provider configuration to the Python APIs that researchers use to launch jobs.

What You'll Work On

Research Computing & Developer Experience

Build compute abstractions that handle the team's diverse workloads: GPU-accelerated simulation, distributed training, high-throughput CPU jobs, and interactive analysis -- across PyTorch, JAX, and scientific computing frameworks
Stand up experiment tracking and reproducibility infrastructure
Create developer tooling that makes cloud compute feel local: environment setup, job submission, monitoring, and artifact management
Scale experiments from single-GPU prototyping to multi-node production runs

Multi-Cloud GPU Orchestration

Design multi-provider workload orchestration: route jobs based on cost, availability, and capability
Manage and optimize spend across cloud providers -- track credit balances, burn rates, and expiration dates
Configure hybrid local + cloud workflows as on-prem GPU infrastructure comes online
Coordinate with our infrastructure engineer on cloud administration and security

Pipeline Infrastructure

Build CI/CD pipelines for research workloads: automated testing, evaluation benchmarks, artifact management
Create data generation and preprocessing pipelines at the throughput the team's simulators demand
Set up monitoring, alerting, and cost dashboards that surface problems before researchers hit them

You May Be a Good Fit If You

Think in systems: you see how compute, storage, networking, and cost interact
Care about developer experience: you've felt the pain of bad research infrastructure
Are pragmatic about tooling: right tool for the job, no over-engineering
Take ownership: you want to own a critical function with autonomy
Write things down: you document decisions and create runbooks

Strong Candidates May Have

Deep AWS experience (EC2, S3, IAM, CloudFormation or Terraform)
GPU compute management (instance types, spot strategies, multi-GPU, distributed training)
Python-based ML and scientific computing tooling (PyTorch, JAX)
GCP and/or Modal experience
MLOps or research computing platforms (MLflow, W&B, Kubeflow, or HPC job schedulers)
CI/CD pipeline management (GitHub Actions, containers)
Hybrid cloud / on-prem GPU cluster management
Experience supporting research teams with heterogeneous computing needs

Culture & Benefits

Visa Sponsorship - We know what it takes to make top talent thrive here. We're open to supporting visas whenever possible.
Compensation - We value your contribution and invest in your future with a competitive salary and meaningful equity.
Benefits - Your well-being matters. We provide company-sponsored health coverage to give you and your family peace of mind.
Connection - Whether it's a company offsite or casual crew socials, we make time to connect, recharge, and have fun together.
Time Off - We trust you to take the time you need. Unlimited PTO so you can rest, recharge, and come back ready to make an impact.

We encourage you to apply even if you do not believe you meet every single qualification. If you don't think this role is right for you, but you believe that you would have something meaningful to contribute to our mission, please reach out at [redacted]

* Ladders Estimates
