HPC Engineer

Institute of Foundation Models

• $150K — $300K *

Sunnyvale, CA 94087

In-Person

Information Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

Bachelor's degree in a relevant technical field (Computer Science, Engineering, etc.)
2+ years of experience in Linux systems administration or related fields
Strong troubleshooting skills in Linux environments
Proficiency in scripting with Python or Bash
Experience with cloud platforms (AWS, Azure, GCP) and GPU infrastructure preferred

Responsibilities

Monitor health and performance of large-scale GPU clusters
Respond to and triage incidents as they arise
Provide research support and troubleshoot job failures
Execute runbooks and recovery procedures effectively
Validate cluster deployments and upgrades
Track infrastructure utilization and operational metrics
Develop automation tools and enhance monitoring processes

Benefits

Comprehensive medical, dental, and vision benefits
Bonus structure
401K retirement plan
Generous paid time off and sick leave
Paid parental leave
Employee assistance program
Life insurance and disability coverage

Full Job Description

Position Summary
This role provides operational coverage during Abu Dhabi overnight hours and serves as a primary point of contact for infrastructure monitoring, incident triage, researcher support, and production operations.

Responsibilities
• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.

Education

Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.

Experience
• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.

Preferred Qualifications
• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.

Salary Range

$150,000 - $300,000 a year

The posted salary range represents the company's good faith estimate of the compensation for this position upon hire. The actual compensation offered may vary within this range depending on individual qualifications, including but not limited to relevant skills, experience, education, certifications, geographic location, and specific business needs.

Benefits Include

• Comprehensive medical, dental, and vision benefits
• Bonus
• 401K Plan
• Generous paid time off, sick leave, and holidays
• Paid Parental Leave
• Employee Assistance Program
• Life insurance and disability

* Ladders Estimates
