HPC Engineer

Institute of Foundation Models

$150K — $300K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in relevant technical field (Computer Science, Engineering, etc.)
  • 2+ years of experience in Linux systems administration or related fields
  • Strong troubleshooting skills in Linux environments
  • Proficiency in scripting with Python or Bash
  • Experience with cloud platforms (AWS, Azure, GCP) and GPU infrastructure preferred

Responsibilities

  • Monitor health and performance of large-scale GPU clusters
  • Respond to and triage incidents as they arise
  • Provide research support and troubleshoot job failures
  • Execute runbooks and recovery procedures effectively
  • Validate cluster deployments and upgrades
  • Track infrastructure utilization and operational metrics
  • Develop automation tools and enhance monitoring processes

Benefits

  • Comprehensive medical, dental, and vision benefits
  • Bonus structure
  • 401K retirement plan
  • Generous paid time off and sick leave
  • Paid parental leave
  • Employee assistance program
  • Life insurance and disability coverage
Full Job Description
Position Summary
This role provides operational coverage during Abu Dhabi overnight hours and serves as a primary point of contact for infrastructure monitoring, incident triage, researcher support, and production operations.

Responsibilities
• Monitor health, performance, and availability of large-scale GPU clusters.
• Respond to incidents and perform first-level triage.
• Support researchers and troubleshoot job failures.
• Execute operational runbooks and recovery procedures.
• Validate cluster deployments, upgrades, and maintenance activities.
• Track infrastructure utilization and operational metrics.
• Develop automation and monitoring tools.
• Contribute to documentation and reporting.

Education

Bachelor's degree in Computer Science, Computer Engineering, Software Engineering, Information Technology, Electrical Engineering, Mathematics, Physics, or related disciplines.

Experience
• 2+ years in Linux systems administration, SRE, DevOps, cloud operations, HPC, or infrastructure operations.
• Strong Linux troubleshooting skills.
• Experience with scripting using Python or Bash.

Preferred Qualifications
• Slurm.
• GPU infrastructure.
• AWS, Azure, or GCP.
• Grafana, Prometheus, Datadog, or similar tools.
• Containers and Kubernetes.
• AI/ML infrastructure exposure.
• Research computing environments.

$150,000 - $300,000 a year

Salary Range

The posted salary range represents the company's good faith estimate of the compensation for this position upon hire. The actual compensation offered may vary within this range depending on individual qualifications, including but not limited to relevant skills, experience, education, certifications, geographic location, and specific business needs.

Benefits Include

*Comprehensive medical, dental, and vision benefits

*Bonus

*401K Plan

*Generous paid time off, sick leave and holidays

*Paid Parental Leave

*Employee Assistance Program

*Life insurance and disability

Similar Jobs

More Jobs at Institute of Foundation Models

More Information Technology Jobs

Find similar HPC Engineer jobs: