Lead Systems Engineer (HPC)

Princeton University

$135K - $150K *

Princeton, NJ 08540

In-Person

Technical Services

8 - 10 years of experience

Today

Job Overview by Ladders

Qualifications

10+ years of experience managing advanced research computing systems
Strong expertise in Linux system administration, installation, and troubleshooting
Advanced scripting skills in languages like bash, Python, and/or Perl
Proficient in managing networking within HPC environments
Experience with job scheduling and management using SLURM in large-scale computing
Strong oral and written communication skills
Ability to solve complex infrastructure problems and collaborate across teams
Education: Bachelor's degree in a related field or equivalent experience

Responsibilities

Design, maintain, troubleshoot, and refine HPC/AI cluster infrastructure
Collaborate with Advanced Data and Storage Management to align filesystem and data management with cluster designs
Develop networks to support AI-driven computing workloads
Establish best practices for cluster management supporting AI workloads
Create documentation for users and technical staff
Enhance monitoring infrastructure and protocols for research systems
Plan and implement scheduled maintenance of operations including off-hours tasks

Benefits

Eligible for benefits
Standard workweek of 36.25 hours
No overtime eligibility
Probationary period of 180 days

Full Job Description

Overview

The Lead Systems Engineer for High Performance Computing (HPC) and Artificial Intelligence (AI) works as part of the Advanced Systems team within Research Computing that supports the hardware and system-level software on the University's centralized high-performance computing and other research computing systems. The Lead Systems Engineer is responsible for engaging with faculty, researchers, vendors, and other information technology (IT) staff to specify, design, install, and administer research computing systems while also providing insight into trends and technologies supporting the advancement of AI research. The Lead Systems Engineer is also expected to stay attuned to trends in computational research and will be asked to evaluate, pilot, and implement systems that advance Princeton's HPC and AI technologies and enhance Research Computing services. The Lead Systems Engineer serves as an expert for HPC and AI hardware and software and helps researchers troubleshoot system-level problems with software, data, and job submission. This position requires working closely with colleagues at all levels of technical understanding in the Office of Information Technology (OIT) and University academic departments to provide timely and creative support for research computing. The Lead Systems Engineer is required to work well both on teams and independently, and will be asked to lead initiatives within Advanced Systems, requiring only general supervision.

On-call rotation is a mandatory facet of this role, requiring infrequent off-hour and weekend duty.

Responsibilities

Operations:

Design, maintain, troubleshoot, and refine advanced HPC/AI cluster infrastructure including high-performance interconnects, cluster schedulers, and configuration management across research systems.
Partner with colleagues in Advanced Data and Storage Management to align designs for scratch filesystems and data management with cluster designs.
Develop data-transfer pathways and networks to support AI-driven computing workloads.
Establish and maintain best practices for cluster management and usage to support AI-driven workloads.
Develop documentation for users and technical staff that can be used by the larger community.
Develop, enhance, and expand monitoring infrastructure and related protocols for research computing systems.
Plan and implement scheduled maintenance of operations, including during off hours.
Perform other tasks as assigned.

Technical Leadership:

Define and drive the institutional technical strategy for advanced AI and data-intensive HPC.
Bring creativity, foresight, and mature professional judgment in anticipating and solving novel and complex problems, in determining project objectives and requirements, and in developing standards and governance for all research computing platforms.
Leveraging expertise in AI technologies, identify, evaluate, and pilot researcher-facing systems that enable the acceleration of research using AI.
Lead the implementation and expand adoption of modern, automation-driven infrastructure and cluster management practices.
Promote institution-wide collaboration as the community expert advising and working with faculty, researchers and vendors on emerging trends and challenges in AI-enabled research computing.
Cultivate a collaborative, knowledge-sharing environment by providing technical mentorship to systems specialists and analysts and sharing designs and operational expertise across data systems and HPC/AI infrastructure.
Contribute to the strategic vision for HPC/AI systems; advise senior leadership and stakeholders on strategic investments, risks, and opportunities related to research infrastructure.

Troubleshooting and Problem Resolution:

Monitor HPC clusters, networks, and storage systems for abnormalities, and resolve issues.
Analyze and solve problems in Linux and HPC/AI computing environments with software, data, and job submissions.
Use scripting and programming tools to troubleshoot issues.

Qualifications

Essential Qualifications:

10+ years of strong experience managing advanced research computing systems.
Strong expertise with Linux system administration, installation, and troubleshooting.
Advanced experience writing scripts in languages such as bash, Python and/or Perl.
Proficient in managing networking in HPC environments.
Strong experience managing software in an advanced research computing environment.
Experience supporting scheduling and managing jobs (SLURM) in large-scale computing environments.
Strong oral and written communication skills, with the ability to proactively engage peers and communicate effectively across a diverse stakeholder community.
Strong ability to solve complex system and infrastructure problems, and share expertise with colleagues at all levels.
Demonstrated ability to collaborate across teams to solve systems and infrastructure challenges, aligning day-to-day operational needs with longer-term technical and organizational goals as technologies evolve.
When provided access to personal, proprietary and/or otherwise confidential data, maintain such data in the strictest confidence and follow procedures to ensure the privacy, security, and proper use of data.
Education: Bachelor's degree in a related field or equivalent experience.

Preferred Qualifications:

Experience working in academic and research settings.
Experience supporting AI-driven research in open and secure computing environments.
Familiarity using and administering data-transfer technologies such as Globus that facilitate the transfer of large datasets.
Experience using and supporting parallel file systems that are commonly used in HPC/AI systems.
Experience supporting unstructured data in HPC/AI environments.

Standard Weekly Hours

36.25

Eligible for Overtime

No

Benefits Eligible

Yes

Probationary Period

180 days

Essential Services Personnel (see policy for detail)

No

Physical Capacity Exam Required

No

Valid Driver's License Required

No

Experience Level

Director

Salary Range

$135,000 to $150,000

* Ladders Estimates