Staff SRE, AI Infrastructure

Andromeda Cluster, Inc

• $130K — $180K *

San Francisco, CA 94112 • In-Person

Information Technology

Less than 5 years of experience

1 month ago

Job Overview by Ladders

Qualifications

5+ years operating large-scale GPU infrastructure
Demonstrated track record as a senior SRE in critical environments
Deep knowledge of NVIDIA hardware and GPU systems
Experience with high-performance networking technologies
Familiarity with distributed training processes and tools
Proficient in Go, Python, or Rust for production engineering
Expertise in Linux and systems internals

Responsibilities

Lead highest-priority incident responses on customer training runs
Manage daily operations and health of GPU fleets
Develop observability and health systems for infrastructure
Define and implement on-call practices and protocols
Serve as the senior reliability voice in technical discussions with customers
Collaborate closely with engineering on feature and reliability improvements
Influence hardware design and build-out processes

Benefits

Remote work flexibility within North America
Opportunity to work at the forefront of AI infrastructure
Significant autonomy and ownership over decisions
Direct impact on customer AI training success
Mentorship opportunities to guide fellow engineers
Engagement in a culture of continuous learning and improvement

Full Job Description

Staff SRE, AI Infrastructure

Location: North America Remote / San Francisco • Full-Time

The Role

We're hiring a Staff SRE to own the reliability of Andromeda's infrastructure end to end - from a node being racked and joined to a cluster, through the schedulers and control planes that place jobs on it, up to the customer-facing surface where a training run either succeeds or doesn't.

We're looking for someone with multiple years of hands-on experience operating GPU infrastructure at scale. You read NVIDIA release notes the day they drop. You have war stories about NCCL, fabric topology choices, and what it takes to keep a multi-thousand-GPU run healthy. You move comfortably from a kernel-level perf trace to a customer incident bridge in the same hour, and you write the postmortem yourself.

What You'll Own

Highest-Priority Incident Leadership: Carry the pager. When a top-customer training run degrades or a multi-cluster incident hits, you're the engineer who walks the stack from PyTorch → NCCL → driver → fabric → hardware until the answer is found. You lead the response, write the postmortem, and ship the systemic fix.
Production Operations of GPU Fleets: Own the day-to-day health of thousands of GPUs across providers and generations. Node lifecycle, burn-in, validation, draining, repair workflows, firmware rollouts, driver upgrades - the unglamorous work that decides whether the platform actually holds up.
Observability & Health Systems: Build and own the telemetry, GPU health checks, fabric monitoring, and automated remediation that let us catch a degraded NVLink or a flaky transceiver before a customer does. Tooling doesn't stay on your laptop; you ship it.
On-Call Practice: Define how on-call works at Andromeda - rotations, escalation, runbooks, incident command, blameless review. As the team grows, you set the bar.
Customer-Facing Technical Presence: Be the senior reliability voice in the room with sophisticated AI infra customers and providers. Run incident reviews with a customer's principal engineer. Scope demanding workloads. Sit in on architecture deep-dives and deal cycles where reliability credibility closes the room.
Partnership with Engineering: Work shoulder-to-shoulder with the product team. You design with SLOs, error budgets, and failure modes in mind; they ship features; together you close the loop on every systemic issue. Translate customer pain into actionable priorities for product teams.
Hardware & Buildout Influence: Partner with providers and DC teams on physical design - rack and pod layout, power and cooling envelopes, network topology, burn-in and validation - to keep failure modes out of production before they arrive.
Mentorship as a Daily Practice: Spend real time every day making other engineers better. Incident reviews, pairing on diagnosis, written guidance, hiring.

What We're Looking For

Years in This Space, Not Months: Multiple years building and operating large-scale GPU infrastructure as your primary job. You came up through this industry.
Staff-Level SRE Track Record: A clear history of owning the reliability of load-bearing infrastructure. You've been the senior engineer a team relies on when production is on fire and the failure mode is in a layer no one's touched yet.
GPU Systems Obsession: Deep, hands-on with NVIDIA H100/H200/B200/GB200 (or equivalent) at scale. You understand memory hierarchies, ECC and SBE/DBE behavior, thermal envelopes, NVLink and NVSwitch topology, and hardware failure modes from direct production experience. You also have opinions about what's coming next and why.
High-Performance Networking, in Production: Real production experience with InfiniBand, RoCE, and NVLink fabrics for distributed training. You can diagnose a slow all-reduce, find a degraded link in a fat-tree, reason about congestion control, and design topology for the workloads it'll actually carry.
Distributed Training Internals: Working knowledge of how large training jobs actually run - NCCL, CUDA, PyTorch distributed, FSDP, DeepSpeed, Megatron, and modern checkpointing/recovery patterns. When a 1,000+ GPU job stalls, you know where to look first.
Production-Grade Engineering: Strong Go, Python, or Rust. You build production tooling, controllers, and automation - not throwaway scripts. Comfortable in Kubernetes-with-GPUs (device plugins, topology-aware scheduling, multi-cluster) and/or Slurm/HPC schedulers. Terraform/Helm/Ansible is table stakes.
Linux & Systems Internals: Expert-level: kernel tuning, NVIDIA driver and CUDA toolkit lifecycle, cgroups/namespaces, perf and BPF, firmware management.
On-Call Composure: Comfortable being the senior engineer on a P0 bridge with the customer on the line and the provider listening. You triage calmly, decide fast, and document afterward.
Customer Presence: Comfortable being the senior technical voice in a room with sophisticated AI infra customers, providers, and prospects. You can run an incident review with a customer's principal engineer, then walk into a deal review and frame the same content for a CTO buying compute.

Strong Candidates May Have

Built or significantly contributed to a custom GPU health system, fleet manager, fabric controller, or on-call/incident tooling in production.
Distributed storage depth (VAST, Weka, Lustre, GPFS) and a clear opinion on checkpoint I/O patterns at scale.
Profiling and diagnosis of distributed training - MFU work, straggler mitigation, collective tuning across multi-thousand-GPU runs.
Experience as the senior SRE partner in enterprise relationships for AI infrastructure or HPC.
Open-source contributions in the GPU/AI infra stack (NCCL, Kubernetes scheduler plugins, GPU operators, DCGM tooling, etc.).
Public talks, writing, or community presence in the GPU/AI infra industry.

Why You'll Love It Here

This is the role where one engineer's reliability decisions show up in every customer's training run. You'll have significant autonomy and the leverage of working on infrastructure that the most ambitious AI labs in the world depend on - staying as hands-on as you want in the code, in the room with customers, and on the bridge when it matters.

* Ladders Estimates
