The National Energy Research Scientific Computing Center (NERSC) is seeking a System Infrastructure / Platform Engineer to help build and manage HPC systems and Linux-based infrastructure. NERSC operates some of the world's largest supercomputers, supporting thousands of researchers tackling major scientific challenges.
In this role, you will manage high-performance computing environments, including HPC systems, containers, virtual machines, and core infrastructure services. You'll work with cutting-edge technologies such as CPU/GPU clusters, parallel storage, high-speed networking, Slurm, and Kubernetes, balancing innovation with reliability, performance, and security at scale.
Collaborating with engineers, researchers, vendors, and open-source communities, you will help develop scalable solutions that advance scientific discovery and the future of HPC. If you have Linux experience and an interest in science, and you enjoy fast-paced collaborative environments, NERSC would love to hear from you.
What You Will Do if hired at a Level 3:
- Build and manage Linux systems and storage infrastructure
- Troubleshoot complex technical issues with team members
- Install, upgrade, and secure systems and services
- Develop and maintain scripts and automation tools
- Participate in a 24/7 on-call rotation
- Lead small projects, upgrades, and service rollouts
- Collaborate with vendors to improve technologies and user experience
- Support reliable operations of NERSC's Perlmutter supercomputer and Spin Kubernetes platform
- Develop and integrate services across NERSC and DOE facilities, including the upcoming Doudna supercomputer
- Present technical work to the HPC community at conferences and industry events
Additional Responsibilities if hired at a Level 4:
- Solve complex technical problems with independent judgment
- Develop team strategies and project plans
- Provide technical leadership and mentorship
- Lead system improvements for performance, reliability, and security
- Evaluate emerging HPC technologies and capabilities
- Represent NERSC in HPC and DOE technical communities and advocacy groups
What is Required to be hired at a Level 3:
- Typically, 8+ years of related experience with a Bachelor's degree; alternatively, 6+ years with a Master's degree; or equivalent career experience
- 4+ years of experience managing large-scale Linux-based system deployments in a high-performance computing, cloud computing, or hyper-scale environment
- Mastery of Linux concepts and operations (processes, networking, system logs, performance)
- Proficiency with bash and Python scripting
- Experience with some or all of our key technologies:
  - containers (such as Docker or Kubernetes)
  - virtualization (such as Proxmox or VMware)
  - cloud-based deployment (such as AWS, Azure, or GCP)
  - identity and access management
  - database administration, tuning, and troubleshooting
  - storage system technologies (such as iSCSI and NAS appliances)
  - parallel filesystems (such as Lustre, GPFS, or VAST)
  - high-speed networking/interconnect (such as InfiniBand, Slingshot, or RoCE)
  - advanced performance analysis and debugging tools (such as strace, lsof, eBPF, or gdb)
  - DevOps tools (such as GitLab or Jira) and processes (such as issues, merge requests, and API/automation)
- Familiarity with automated provisioning systems (such as Chef, Foreman, or Terraform)
- Familiarity with configuration management systems (such as Ansible or Puppet)
- Working knowledge of Linux system engineering and security practices
- Ability to resolve complex issues creatively and effectively, and to derive technical solutions in a collaborative environment that meet end-user needs
- Demonstrated ability to work both independently and collaboratively on large projects, and to contribute to an active and respectful intellectual environment
- Creative, positive, and collaborative work style
- Excellent oral and written communication skills
Additional Requirements to be hired at a Level 4:
- Typically, 12+ years of related experience with a Bachelor's degree; alternatively, 8+ years with a Master's degree; or equivalent career experience
- Proven ability to lead troubleshooting and resolution of high-impact incidents in complex, large-scale environments
- Demonstrated leadership in cross-team collaboration and mentoring
- Experience in software engineering, Linux systems programming, or complex scripting
- Experience managing one or more of the following:
  - data center networking (TCP/IP, Ethernet, BGP, ECMP)
  - batch workload managers (such as Slurm), including installation, configuration, routine operations, job lifecycle concepts, and troubleshooting common failure modes
  - Cray/HPE HPC ecosystems (e.g., CSM/COS, Slingshot interconnect, and related components)
- Ability to lead and coordinate projects with traditional or Agile methodologies (such as Scrum or Kanban)
- Ability to analyze and resolve significant and unique issues requiring evaluation of multiple intangible factors
- Ability to exercise independent judgment in methods, techniques and evaluation criteria for obtaining results
Additional information:
- Applications will be accepted until the job posting is removed.
- Appointment type: This is a full-time career appointment, exempt from overtime pay (paid monthly).
- Salary range:
  - Level 3: The expected salary for this position is $156,864 - $191,724, which falls within the full salary range of $139,440 - $235,308, depending on the candidate's skills, knowledge, and abilities, including education, certifications, and years of experience.
  - Level 4: The expected salary for this position is $178,644 - $218,364, which falls within the full salary range of $158,808 - $267,996, depending on the candidate's skills, knowledge, and abilities, including education, certifications, and years of experience.
- Background check: This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
- Work modality: This position requires substantial on-site presence, but is eligible for a flexible work mode, and hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.
- Multi-level Posting: This position will be hired at a level commensurate with the business needs and the skills, knowledge, and abilities of the successful candidate.
- Export Control Access: This position will involve access to hardware, commodities, and technical information subject to export control regulations including, but not limited to, the Export Administration Regulations ("EAR") and/or International Traffic in Arms Regulations ("ITAR"). Accordingly, any hiring decision may depend in part on Berkeley Lab's ability to obtain or rely on federal government authorizations as required, if you are not a U.S. citizen, lawful permanent resident of the U.S. ("green card holder"), asylee, refugee, or other qualifying protected individual as defined by 8 U.S.C. 1324b(a)(3).
Want to learn more about working at Berkeley Lab? Please visit: careers.lbl.gov