KLA Tencor

HPC Systems Engineer

$105K - $180K
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science, Engineering, or equivalent experience
  • 3+ years of hands-on Linux systems administration experience
  • Experience with HPC or large-scale compute environments
  • Practical knowledge of at least one HPC scheduler (SLURM, LSF, PBS, or similar)
  • Strong troubleshooting skills in Linux (processes, memory, I/O, networking)

Responsibilities

  • Operate and maintain a large-scale Linux-based HPC cluster for internal R&D
  • Manage compute nodes and supporting infrastructure in a multi-tenant environment
  • Monitor cluster health and respond to incidents
  • Configure and support HPC job schedulers
  • Optimize scheduler policies for throughput and fairness
  • Install and maintain Linux operating systems across compute nodes
  • Support high-throughput workloads across CPU and GPU resources

Benefits

  • Medical, dental, and vision insurance
  • 401(K) plan with company matching
  • Employee stock purchase program (ESPP)
  • Tuition reimbursement and student debt assistance
  • Wellness benefits including an employee assistance program (EAP)

Full Job Description

Group/Division

With over 40 years of semiconductor process control experience, KLA is relied on by chipmakers around the globe to ensure that their fabs ramp next-generation devices to volume production quickly and cost-effectively. Enabling the movement towards advanced chip design, KLA's Global Products Group (GPG), which is responsible for creating all of KLA's metrology and inspection products, is looking for the best and the brightest research scientists, software engineers, application development engineers, and senior product technology process engineers.

Central Engineering is KLA's largest engineering organization, comprising 9 Centers-of-Excellence (CoEs) in various disciplines applied across all product groups in the company. These CoEs include Handling & Automation, Precision Motion Control, Sensors & Image Acquisition, Platform Design, and Packaging Engineering, among others. Talent includes over 500 engineers across global centers in Israel, China, India, and the US. Each CoE contributes not just talent and deliverables per discipline toward product programs, but also subject matter expertise, best practices, roadmaps, specialized facilities, apparatus, models, and analytics. These differentiate KLA not only in WHAT we do, but also in HOW we do it.

Job Description/Preferred Qualifications

We're looking for an HPC Systems Engineer to help power the compute infrastructure behind our R&D innovation! In this role, you'll support and evolve a high-performance Linux cluster used for physics modeling, simulation, algorithm development, and machine-learning workloads, enabling hundreds of engineers to do their best work every day. You'll play a key role in driving the reliability, performance, and scalability of a shared, mission-critical HPC environment, partnering closely with infrastructure, DevOps, and application teams to keep the platform fast, resilient, and ready for the most demanding computational challenges!

Key Responsibilities:
HPC Platform Operations
• Operate and maintain a large-scale Linux-based HPC cluster used for internal R&D workloads
• Manage compute nodes, login nodes, and supporting infrastructure in a multi-tenant environment
• Monitor cluster health, performance, and capacity; respond to incidents and degradations
Scheduler & Workload Management
• Configure, tune, and support HPC job schedulers (e.g., SLURM, LSF, PBS, or equivalent)
• Assist users with job submission issues, resource requests, and queue optimization
• Help optimize scheduler policies to balance throughput, fairness, and utilization
Linux Systems Engineering
• Install, configure, and maintain Linux operating systems across compute and service nodes
• Manage OS updates, kernel changes, drivers (including GPU drivers where applicable), and system hardening
• Troubleshoot complex Linux performance, networking, storage, and process-level issues
Performance & Scaling
• Support high-throughput and parallel workloads across CPU and GPU resources
• Diagnose performance bottlenecks across compute, storage, network, and scheduler layers
• Assist with scaling activities such as node expansions, reprovisioning, and hardware refreshes
Automation & Reliability
• Use automation and configuration management tools to ensure consistency across the cluster
• Contribute to scripting and tooling for node provisioning, validation, and lifecycle management
• Participate in on-call or escalation rotations as required to support a production R&D platform
Collaboration & User Support
• Partner with internal engineering teams to understand workload requirements and usage patterns
• Provide guidance and best practices for running workloads efficiently on shared HPC systems
• Contribute to internal documentation and operational runbooks

Required Qualifications:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 3+ years of hands-on Linux systems administration experience
  • Direct experience working with HPC or large-scale compute environments
  • Practical experience with at least one HPC scheduler (SLURM, LSF, PBS, or similar)
  • Strong Linux troubleshooting skills (processes, memory, I/O, networking, performance analysis)
  • Comfort working in CLI-driven, production infrastructure environments

Preferred:
  • Experience supporting GPU-accelerated workloads (CUDA, drivers, GPU scheduling concepts)
  • Familiarity with parallel computing or scientific/engineering workloads
  • Experience with cluster storage systems (e.g., Lustre, BeeGFS, NFS, or high-performance NAS/SAN)
  • Exposure to automation tools (Ansible, scripting, Infrastructure-as-Code concepts)
  • Familiarity with containers in HPC contexts (Singularity / Apptainer, rootless containers)
  • Experience supporting internal developer or research communities


Minimum Qualifications

Doctorate (Academic) Degree and 0 years related work experience; Master's Level Degree and related work experience of 3 years; Bachelor's Level Degree and related work experience of 5 years

Base Pay Range: $105,900.00 - $180,000.00 Annually

Primary Location: USA-MI-Ann Arbor-KLA

KLA's total rewards package for employees may also include participation in performance incentive programs and eligibility for additional benefits including but not limited to: medical, dental, vision, life, and other voluntary benefits, 401(K) including company matching, employee stock purchase program (ESPP), student debt assistance, tuition reimbursement program, development and career growth opportunities and programs, financial planning benefits, wellness benefits including an employee assistance program (EAP), paid time off and paid company holidays, and family care and bonding leave.

Interns are eligible for some of the benefits listed. Our pay ranges are determined by role, level, and location. The range displayed reflects the pay for this position in the primary location identified in this posting. Actual pay depends on several factors, including state minimum wage rates, location, job-related skills, experience, and relevant education level or training. We are committed to complying with all applicable federal and state minimum wage requirements. If applicable, your recruiter can share more about the specific pay range for your preferred location during the hiring process.

About KLA Tencor

KLA Corporation is a global capital equipment company that provides process control solutions for semiconductor and related industries. The Company's products are also used in a number of other high technology industries, including the packaging, light emitting diode (LED), power device and compound semiconductor markets. Its products and services are used by bare wafer, integrated circuit (IC), lithography reticle (reticle or mask) and disk manufacturers around the world. The Company's inspection and metrology products and related offerings are categorized in various groups, including Chip Manufacturing, Wafer Manufacturing, Reticle Manufacturing, LED, Power Device and Compound Semiconductor Manufacturing, Data Storage Media/Head Manufacturing, Microelectromechanical Systems (MEMS) Manufacturing, and General Purpose/Lab Applications.

Size: 11,300 employees
Market Cap: $52 billion
Net Income: $1.3 billion
Founded: 1997
5 Year Trend: +21.5%
Revenue: $6 billion
NASDAQ