ML Infrastructure Engineer

Sygaldry

$130K — $180K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in a systems-oriented role, focusing on research computing or developer tooling.
  • Proficiency in managing multi-cloud environments (AWS, GCP) and GPU compute resources.
  • Strong background in Python and familiarity with ML frameworks such as PyTorch or JAX.
  • Experience with CI/CD pipeline development and management (e.g., GitHub Actions, containers).
  • Track record of supporting research teams with diverse computational needs.

Responsibilities

  • Build compute abstractions to support GPU and CPU workloads across various frameworks.
  • Establish experiment tracking and reproducibility systems for research efforts.
  • Develop tools that simplify cloud compute interactions for researchers.
  • Scale experimentation from small-scale to large-scale production environments.
  • Design orchestration for multi-cloud job routing based on resource availability and cost.
  • Manage cloud expenditure and optimize resource usage effectively.
  • Create and maintain CI/CD pipelines for research workflows including data processing.

Benefits

  • Visa sponsorship available to attract top talent.
  • Competitive salary with equity options to ensure investment in employee success.
  • Comprehensive health coverage for employees and dependents to promote well-being.
  • Social and team-building activities to foster connections among employees.
  • Unlimited PTO policy encouraging work-life balance and time for personal recharge.

Full Job Description

About the Role

Our AI & Algorithms team is growing fast: research scientists, applied mathematicians, and quantum algorithm researchers developing the algorithms that will accelerate and transform AI. They need compute infrastructure that stays out of their way: GPU access that's reliable, experiments that are reproducible, and workloads that scale without requiring each researcher to become a cloud expert.

You'll build and manage the compute platform this team runs on. The workloads are diverse -- quantum circuit simulation, large-scale numerical optimization, model training, tensor network contractions, and high-throughput data generation -- and they run across multiple cloud providers and on-prem GPU servers. You own the full stack, from cloud provider configuration to the Python APIs that researchers use to launch jobs.

What You'll Work On

Research Computing & Developer Experience
  • Build compute abstractions that handle the team's diverse workloads: GPU-accelerated simulation, distributed training, high-throughput CPU jobs, and interactive analysis -- across PyTorch, JAX, and scientific computing frameworks
  • Stand up experiment tracking and reproducibility infrastructure
  • Create developer tooling that makes cloud compute feel local: environment setup, job submission, monitoring, and artifact management
  • Scale experiments from single-GPU prototyping to multi-node production runs

Multi-Cloud GPU Orchestration
  • Design multi-provider workload orchestration: route jobs based on cost, availability, and capability
  • Manage and optimize spend across cloud providers -- track credit balances, burn rates, and expiration dates
  • Configure hybrid local + cloud workflows as on-prem GPU infrastructure comes online
  • Coordinate with our infrastructure engineer on cloud administration and security

Pipeline Infrastructure
  • Build CI/CD pipelines for research workloads: automated testing, evaluation benchmarks, artifact management
  • Create data generation and preprocessing pipelines at the throughput the team's simulators demand
  • Set up monitoring, alerting, and cost dashboards that surface problems before researchers hit them

You May Be a Good Fit If You
  • Think in systems: you see how compute, storage, networking, and cost interact
  • Care about developer experience: you've felt the pain of bad research infrastructure
  • Are pragmatic about tooling: right tool for the job, no over-engineering
  • Take ownership: you want to own a critical function with autonomy
  • Write things down: you document decisions and create runbooks

Strong Candidates May Have
  • Deep AWS experience (EC2, S3, IAM, CloudFormation or Terraform)
  • GPU compute management (instance types, spot strategies, multi-GPU, distributed training)
  • Python-based ML and scientific computing tooling (PyTorch, JAX)
  • GCP and/or Modal experience
  • MLOps or research computing platforms (MLflow, W&B, Kubeflow, or HPC job schedulers)
  • CI/CD pipeline management (GitHub Actions, containers)
  • Hybrid cloud / on-prem GPU cluster management
  • Experience supporting research teams with heterogeneous computing needs

Culture & Benefits
  • Visa Sponsorship - We know what it takes to make top talent thrive here. We're open to supporting visas whenever possible.
  • Compensation - We value your contribution and invest in your future with a competitive salary and meaningful equity.
  • Benefits - Your well-being matters. We provide company-sponsored health coverage to give you and your family peace of mind.
  • Connection - Whether it's a company offsite or casual crew socials, we make time to connect, recharge, and have fun together.
  • Time Off - We trust you to take the time you need. Unlimited PTO so you can rest, recharge, and come back ready to make an impact.

We encourage you to apply even if you do not believe you meet every single qualification. If you don't think this role is right for you, but you believe that you would have something meaningful to contribute to our mission, please reach out at [redacted]
