Senior / Staff Site Reliability Engineer, Compute

Fluidstack

$150K - $200K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years in compute-heavy Site Reliability Engineering (SRE), kernel or virtualization engineering.
  • Mastery of Linux internals including scheduler, memory, and drivers.
  • Production experience with KVM, Xen, QEMU, VMware, or similar hypervisors.
  • Proficient in C, Go, or Rust, with strong Infrastructure as Code (IaC) and CI/CD skills.
  • Familiarity with SmartNICs/DPUs and kernel-bypass networking.
  • Demonstrated ability to scale high-throughput compute or HPC platforms.

Responsibilities

  • Super-charge virtualization by tuning hypervisors, kernel subsystems, and NUMA layouts.
  • Deploy and optimize new CPU/GPU/DPU nodes and validate SmartNIC off-loads.
  • Automate observability with performance telemetry and incident response bots.
  • Lead root-cause analyses of crashes and performance regressions, providing insights to inform configurations.
  • Collaborate closely with silicon and Linux teams to debug drivers and improve I/O paths.
  • Continuously improve system performance through chaos engineering and actionable SLIs/SLOs.

Benefits

  • Competitive total compensation package including cash and equity.
  • Retirement or pension plan aligned with local standards.
  • Comprehensive health, dental, and vision insurance.
  • Generous PTO policy in accordance with local norms.

Full Job Description

About Fluidstack

Fluidstack is building GPU supercomputers for top AI labs, governments, and enterprises. Our customers include Mistral, Poolside, Black Forest Labs, Meta, and more.

Our team is small, highly motivated, and focused on providing a world-class supercomputing experience. We put our customers first in everything we do, working hard not just to win the sale, but to win repeat business and customer referrals.

We hold ourselves and each other to high standards. We expect you to care deeply about the work you do, the products you build, and the experience our customers have in every interaction with us.

You must work hard, take ownership from inception to delivery, and approach every problem with an open mind and a positive attitude. We value effectiveness, competence, and a growth mindset.

About the Role

Our Senior / Staff Site Reliability Engineers (Compute) are the backbone of Fluidstack's platform. You'll use deep systems expertise and software engineering to keep our bare-metal and virtualised compute fleet fast, reliable and cost-efficient at petabyte scale.

Focus
  • Super-charge virtualisation. Tune hypervisors (KVM/QEMU), kernel subsystems and NUMA layouts to squeeze microseconds off tail latency for AI & HPC jobs.
  • Deploy & optimise at scale. Roll out new CPU/GPU/DPU nodes, validate SmartNIC and BlueField off-loads and harden workload isolation.
  • Automate observability. Build kernel-to-orchestrator telemetry, incident-response bots and performance dashboards.
  • Root-cause the gnarly stuff. Lead crash-dump and kexec/kdump analyses and investigate performance regressions; turn insights into upstream patches and config templates.
  • Drive kernel & hardware collaboration. Pair with silicon and Linux teams to debug drivers, accelerate I/O paths and integrate emerging compute hardware (TPUs, DPUs).
  • Continuously improve. Inject chaos, run game-days and codify post-mortem learnings into SLIs/SLOs that matter to customers.


About you
  • 5+ yrs in compute-heavy SRE, kernel or virtualisation engineering.
  • Mastery of Linux internals (scheduler, memory, drivers) and system-level debugging.
  • Production experience with KVM, Xen, QEMU, VMware or similar hypervisors.
  • Fluency in C, Go or Rust; solid Infra-as-Code & CI/CD chops.
  • Familiarity with SmartNICs/DPUs and kernel-bypass networking.
  • Proven track record scaling high-throughput compute or HPC platforms.


Benefits
  • Competitive total compensation package (cash + equity).
  • Retirement or pension plan, in line with local norms.
  • Health, dental, and vision insurance.
  • Generous PTO policy, in line with local norms.
