KLA Tencor

Staff HPC Engineer

KLA Tencor: $162K - $284K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • Extensive experience with Linux systems engineering in large-scale compute environments.
  • Solid understanding of distributed systems and cloud infrastructure.
  • Deep knowledge of HPC schedulers (Slurm preferred), MPI stacks, and parallel computing models.
  • Strong understanding of high-speed interconnects (InfiniBand, RoCE) and distributed storage systems.
  • Proficiency in scripting languages (Python, Go, Bash) and automation frameworks.
  • Experience with GPUs (NVIDIA CUDA, MIG, NVLink) and accelerator-based computing.
  • Familiarity with containerization (Singularity/Apptainer, Docker) in HPC contexts.

Responsibilities

  • Design and implement HPC clusters, including compute, storage, networking, and job-scheduling components.
  • Evaluate and integrate new technologies (GPUs, accelerators, interconnects, filesystems).
  • Develop automation for cluster provisioning, configuration, and lifecycle management.
  • Architect solutions for large-scale parallel workloads, AI/ML pipelines, and data-intensive applications.
  • Profile and tune applications for CPU, GPU, memory, and I/O performance.
  • Maintain and monitor HPC clusters, job schedulers (Slurm, PBS, LSF), and distributed filesystems (Lustre, GPFS, BeeGFS).
  • Build and maintain CI/CD pipelines for HPC-related software and infrastructure.

Benefits

  • Medical, dental, vision, and life insurance coverage.
  • 401(K) with company matching.
  • Employee stock purchase program (ESPP).
  • Tuition reimbursement and student debt assistance.
  • Development and career growth opportunities.
  • Employee assistance program (EAP) for wellness.
  • Paid time off and paid company holidays.
Full Job Description
Job Description/Preferred Qualifications

The Staff HPC Engineer designs, builds, optimizes, and supports large-scale compute environments used for scientific computing, AI/ML workloads, simulation, and data-intensive research. This role blends systems engineering, performance tuning, cluster architecture, and hands-on troubleshooting. The engineer partners with researchers, developers, and IT teams to deliver reliable, scalable, and high-performance compute infrastructure.

Key Responsibilities:

HPC Architecture & Engineering:
  • Design and implement HPC clusters, including compute, storage, networking, and job-scheduling components.
  • Evaluate and integrate new technologies (GPUs, accelerators, interconnects, filesystems).
  • Develop automation for cluster provisioning, configuration, and lifecycle management.
  • Architect solutions for large-scale parallel workloads, AI/ML pipelines, and data-intensive applications.


Performance Optimization:
  • Profile and tune applications for CPU, GPU, memory, and I/O performance.
  • Optimize MPI, OpenMP, CUDA, and other parallel programming frameworks.
  • Benchmark hardware and software stacks to guide procurement and architecture decisions.


Operations & Reliability:
  • Maintain and monitor HPC clusters, job schedulers (Slurm, PBS, LSF), and distributed filesystems (Lustre, GPFS, BeeGFS).
  • Troubleshoot complex system issues across compute, storage, and network layers.
  • Implement security best practices, patching, and compliance controls.
  • Ensure high availability and efficient resource utilization.


Automation & DevOps:
  • Build and maintain CI/CD pipelines for HPC-related software and infrastructure.
  • Use tools such as Ansible, Terraform, Kubernetes, or custom scripts to automate workflows.
  • Develop monitoring and observability solutions (Prometheus, Grafana, ELK, etc.).


Collaboration & Leadership:
  • Work closely with researchers, data scientists, and engineering teams to support workload optimization.
  • Provide technical leadership, mentorship, and guidance to junior engineers.
  • Document architectures, procedures, and best practices.
  • Participate in capacity planning and long-term HPC strategy.


Required Qualifications:
  • Extensive experience with Linux systems engineering in large-scale compute environments.
  • Solid understanding of distributed systems and cloud infrastructure.
  • Deep knowledge of HPC schedulers (Slurm preferred), MPI stacks, and parallel computing models.
  • Strong understanding of high-speed interconnects (InfiniBand, RoCE) and distributed storage systems.
  • Proficiency in scripting languages (Python, Go, Bash) and automation frameworks.
  • Experience with GPUs (NVIDIA CUDA, MIG, NVLink) and accelerator-based computing.
  • Familiarity with containerization (Singularity/Apptainer, Docker) in HPC contexts.
  • Strong troubleshooting skills across hardware, OS, and application layers.
  • Understanding of networking fundamentals (TCP/IP, DNS, load balancing).
  • Background in high-availability and distributed systems at scale.


Soft Skills:
  • Excellent communication and cross-functional collaboration.
  • Ability to translate research needs into technical solutions.
  • Strong ownership mindset and ability to lead complex initiatives.


Minimum Qualifications

Doctorate (academic) degree with 8 years of related work experience; Master's degree with 12 years of related work experience; or Bachelor's degree with 15 years of related work experience.

Base Pay Range: $162,700.00 - $284,700.00 Annually

Primary Location: USA-CA-Milpitas-KLA

KLA's total rewards package for employees may also include participation in performance incentive programs and eligibility for additional benefits including but not limited to: medical, dental, vision, life, and other voluntary benefits, 401(K) including company matching, employee stock purchase program (ESPP), student debt assistance, tuition reimbursement program, development and career growth opportunities and programs, financial planning benefits, wellness benefits including an employee assistance program (EAP), paid time off and paid company holidays, and family care and bonding leave.

Interns are eligible for some of the benefits listed. Our pay ranges are determined by role, level, and location. The range displayed reflects the pay for this position in the primary location identified in this posting. Actual pay depends on several factors, including state minimum wage rates, location, job-related skills, experience, and relevant education level or training. We are committed to complying with all applicable federal and state minimum wage requirements. If applicable, your recruiter can share more about the specific pay range for your preferred location during the hiring process.

About KLA Tencor

KLA Corporation is a global capital equipment company that provides process control solutions for semiconductor and related industries. The Company's products are also used in a number of other high-technology industries, including the packaging, light-emitting diode (LED), power device and compound semiconductor markets. Its products and services are used by bare wafer, integrated circuit (IC), lithography reticle (reticle or mask) and disk manufacturers around the world. The Company's inspection and metrology products and related offerings are categorized in various groups, including Chip Manufacturing, Wafer Manufacturing, Reticle Manufacturing, LED, Power Device and Compound Semiconductor Manufacturing, Data Storage Media/Head Manufacturing, Microelectromechanical Systems (MEMS) Manufacturing, and General Purpose/Lab Applications.
Size: 11,300 employees
Market Cap: $52 billion
Industry:
Net Income: $1.3 billion
Founded: 1997
5 Year Trend: +21.5%
Revenue: $6 billion
NASDAQ
