Infrastructure / Cluster Engineer

Gimlet Labs

$130K–$180K *

San Francisco, CA 94112 (In-Person)

Information Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

5-7 years of experience in infrastructure, cluster engineering, or distributed systems
Deep understanding of Linux systems including performance debugging and kernel issues
Hands-on expertise with Kubernetes, Slurm, or similar orchestration tools
Strong automation skills using Terraform, Ansible, or Python
Familiarity with GPU/accelerator infrastructure and relevant software stacks
Experience with high-performance networking technologies like InfiniBand or high-speed Ethernet
Ability to navigate and thrive in a fast-paced startup environment

Responsibilities

Design, deploy, and operate large-scale clusters for AI inference
Automate provisioning, configuration, upgrades, and lifecycle management
Scale heterogeneous bare-metal provisioning systems across datacenters
Debug production issues in Linux, networking, storage, and orchestration layers
Build networking infrastructure with a focus on performance and RDMA
Develop observability metrics for cluster health and performance
Enhance reliability and recovery of multi-node production systems
Collaborate with teams to support high-throughput AI workloads

Benefits

Opportunity to work with cutting-edge AI infrastructure technology
Hands-on role in a dynamic startup environment
Collaborative culture with cross-functional teamwork
Professional growth in emerging areas of heterogeneous computing
Focus on operational excellence and scalable infrastructure practices

Full Job Description

About this Role

We are looking for an Infrastructure / Cluster Engineer to design, build, and operate the cluster infrastructure behind Gimlet's heterogeneous inference cloud. Unlike traditional cloud platforms built around a single hardware ecosystem, Gimlet's infrastructure spans multiple accelerator vendors and architectures. Infrastructure engineers play a key role in bringing new hardware platforms online, building the operational abstractions that make heterogeneous infrastructure manageable at scale, and ensuring new silicon can serve production workloads reliably from day one.

This role is highly hands-on. You will work across bare metal, Linux, Kubernetes or cluster schedulers, high-speed networking, observability, provisioning, and incident response. You will partner closely with distributed systems, runtime, compiler, and hardware teams to ensure Gimlet's infrastructure can support demanding AI workloads at production scale.

What you will work on

Design, deploy, and operate large-scale CPU, GPU, and accelerator clusters powering production AI inference.
Build automation for provisioning, configuration, upgrades, validation, and lifecycle management.
Design and scale provisioning systems for heterogeneous bare-metal infrastructure across multiple datacenters and hardware vendors.
Operate cluster scheduling, resource allocation, isolation, quotas, and utilization systems.
Debug complex production issues across Linux, networking, storage, drivers, firmware, and orchestration layers.
Build and operate high-performance networking infrastructure, including RDMA-enabled environments and accelerator interconnects.
Build observability for cluster health, capacity, performance, failures, and workload behavior.
Improve reliability, availability, and recovery across multi-node production systems.
Work with distributed systems and runtime teams to support low-latency, high-throughput inference workloads.
Evaluate and integrate new hardware platforms, accelerators, networking technologies, and datacenter designs.
Create runbooks, operational standards, and incident response practices as the fleet scales.

You may be a good fit if you have

Experience in infrastructure, cluster engineering, platform engineering, SRE, HPC, or distributed systems.
Deep Linux systems experience, including debugging performance, networking, storage, processes, and kernel-level issues.
Experience operating Kubernetes, Slurm, Nomad, or similar orchestration and scheduling systems.
Strong automation skills using tools such as Terraform, Ansible, Helm, Python, Go, or equivalent.
Experience with GPU or accelerator infrastructure, including drivers, firmware, CUDA/ROCm stacks, or hardware validation.
Familiarity with high-performance networking such as InfiniBand, RoCE, high-speed Ethernet, or datacenter fabrics.
Strong operational judgment: you know how to build systems that are observable, recoverable, and boring in production.
Comfort working in a fast-moving startup environment with high ownership and ambiguity.

Strong candidates may also have

Experience building or operating AI inference, training, HPC, or neocloud infrastructure.
Experience with bare-metal provisioning, PXE/iPXE, image pipelines, BIOS/firmware management, or rack bring-up.
Experience with multi-tenant cluster isolation, quota systems, fair scheduling, or usage accounting.
Experience debugging distributed workload performance across compute, memory, network, and storage bottlenecks.
Experience building observability platforms using technologies such as Prometheus, OpenTelemetry, Grafana, or similar tooling.
Familiarity with heterogeneous hardware environments across NVIDIA, AMD, Intel, ARM, or emerging accelerators.

* Ladders Estimates
