HPC Engineer

Periodic Labs

$350K - $450K *

Menlo Park, CA 94025

In-Person

Information Technology

Less than 5 years of experience

1 month ago

Job Overview by Ladders

Qualifications

5-7 years of experience with large-scale HPC or GPU clusters
Strong knowledge of high-speed interconnects like InfiniBand and RoCE
Hands-on experience with parallel and distributed storage systems
Proficient with workload managers and schedulers such as Slurm and Kubernetes
Expertise in Linux systems administration including performance tuning
Familiar with infrastructure automation tools like Ansible and Terraform
Experience in GPU computing environments including CUDA and MPI

Responsibilities

Design, deploy, and operate large-scale GPU and CPU clusters for research
Manage and optimize high-speed interconnects and parallel filesystems
Own workload scheduling and resource management for efficiency and productivity
Implement and maintain automated provisioning and configuration management
Monitor cluster performance and resolve bottlenecks proactively
Collaborate with research teams to optimize computational workloads
Establish standards for HPC operations and incident response

Benefits

Offers visa sponsorship for qualified candidates
Flexible working locations with a preference for Menlo Park or San Francisco
Dynamic environment supporting cutting-edge AI and scientific research
Opportunities for collaboration with pioneering researchers and engineers
Access to advanced computing technologies and tools

Full Job Description

About the Role

As HPC Engineer at Periodic Labs, you will design, build, and operate the high-performance computing infrastructure that powers our AI and scientific research. Our models demand extreme compute at scale - large GPU and CPU clusters, high-speed interconnects, low-latency parallel storage, and workload schedulers that make every cycle count. You will work directly with researchers and infrastructure engineers to ensure our compute environment is fast, reliable, and optimized for scientific discovery at the frontier.

This is a deeply hands-on role. You will architect and tune systems, automate provisioning, diagnose performance bottlenecks, and design for resilience at scale. You'll partner with research and ML teams to understand their workloads and shape an HPC environment that removes friction and accelerates science.

What You'll Do

Design, deploy, and operate large-scale GPU and CPU clusters for AI training, scientific simulation, and research workloads
Manage and optimize high-speed interconnect fabrics (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS, WEKA, or equivalent) for maximum throughput and minimum latency
Own workload scheduling and resource management using Slurm, Kubernetes, or similar systems - tuning for throughput, fairness, and researcher productivity
Implement and maintain automated cluster provisioning, configuration management, and lifecycle tooling using Ansible, Terraform, or custom orchestration
Monitor cluster health, performance, and utilization; build dashboards and alerting to proactively identify and resolve bottlenecks
Partner with research and ML engineering teams to profile workloads, diagnose performance issues, and tune hardware and software stacks for specific computational demands
Design and implement backup, disaster recovery, and fault-tolerance strategies for research data and compute infrastructure
Evaluate and integrate new hardware (GPUs, accelerators, networking) and software technologies as the field evolves
Establish standards and runbooks for HPC operations, capacity planning, and incident response
Collaborate with security and infrastructure teams to implement access controls, network segmentation, and compliance controls appropriate for a research environment

You Will Thrive in This Role If You Have

Experience designing and operating large-scale HPC or GPU clusters in research, cloud, or enterprise environments
Deep knowledge of high-speed interconnects such as InfiniBand (HDR/NDR) or RoCE, including fabric management, tuning, and troubleshooting
Hands-on experience with parallel and distributed storage systems (Lustre, GPFS, WEKA, BeeGFS, or similar) - configuration, performance tuning, and capacity management
Experience with workload managers and schedulers such as Slurm, PBS Pro, LSF, or Kubernetes-based HPC orchestration
Linux systems administration at scale, including kernel tuning, NUMA optimization, CPU and memory affinity, and GPU driver management
Infrastructure automation using Ansible, Terraform, or equivalent - you treat infrastructure as code
Experience with GPU computing environments including CUDA, NCCL, MPI, and multi-node distributed training or simulation setups
Performance profiling, benchmarking, and tuning of computational workloads across CPU, GPU, memory, network, and storage
Experience with monitoring and observability tooling (Prometheus, Grafana, or equivalent) in large, heterogeneous compute environments
Ability to collaborate with researchers or data scientists to understand workload requirements and translate them into infrastructure decisions

Especially Strong Candidates May Also Have

Experience operating GPU clusters for large-scale AI or ML training workloads such as multi-node transformer training
Familiarity with AI accelerators beyond GPUs, such as TPUs, Trainium, or custom ASIC environments
Experience in mixed on-prem and cloud HPC environments, including burst-to-cloud or hybrid scheduling patterns
Background in scientific computing domains such as computational chemistry, physics simulation, or bioinformatics
Experience with containerized HPC environments (Singularity/Apptainer, Docker, or container-aware schedulers)
Knowledge of network security, access control, and compliance requirements for regulated research data
Contributions to open-source HPC tooling or published work on HPC system design or performance

Mechanics

Minimum education: Bachelor's degree or an equivalent combination of education and training or experience

Location: Our lab is in Menlo Park. We prefer candidates based in Menlo Park or San Francisco, but can be flexible depending on the role.

Compensation: The annual base compensation range for this role is $350,000-$450,000

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

* Ladders Estimates
