System Infrastructure / Platform Engineer, HPC Technology Department

LBL • $156K — $191K *

San Francisco, CA 94112 • In-Person

Information Technology

8 - 10 years of experience

3 days ago

Job Overview by Ladders

Qualifications

8+ years of related experience with a Bachelor's degree, or 6+ years with a Master's degree, or equivalent experience
4+ years managing large-scale Linux-based systems in HPC or cloud environments
Mastery of Linux operations including processes, networking, and performance tuning
Proficiency in bash and Python scripting
Familiarity with key technologies such as Docker, Kubernetes, and cloud deployments
Working knowledge of Linux system engineering and security practices
Excellent communication skills and ability to work collaboratively

Responsibilities

Build and manage Linux systems and storage infrastructure
Troubleshoot complex technical issues with team members
Develop and maintain scripts and automation tools
Lead small projects, upgrades, and rollouts
Support operations of the Perlmutter supercomputer and Kubernetes platform
Collaborate with vendors to improve technologies and user experience
Present technical work at conferences and industry events

Benefits

Full-time career appointment with the potential for hybrid schedules
Opportunity to work with some of the world's largest supercomputers
Collaboration with engineers and researchers in a dynamic environment
Involvement in cutting-edge technologies and advancements in HPC
Support for continuous development through mentoring and teamwork

Full Job Description

The National Energy Research Scientific Computing Center (NERSC) is seeking a System Infrastructure / Platform Engineer to help build and manage HPC systems and Linux-based infrastructure. NERSC operates some of the world's largest supercomputers, supporting thousands of researchers tackling major scientific challenges.

In this role, you will manage high-performance computing environments, including HPC systems, containers, virtual machines, and core infrastructure services. You'll work with cutting-edge technologies such as CPU/GPU clusters, parallel storage, high-speed networking, Slurm, and Kubernetes, balancing innovation with reliability, performance, and security at scale.

Collaborating with engineers, researchers, vendors, and open-source communities, you will help develop scalable solutions that advance scientific discovery and the future of HPC. If you have Linux experience, an interest in science, and enjoy fast-paced collaborative environments, NERSC would love to hear from you.

What You Will Do if hired at a Level 3:

Build and manage Linux systems and storage infrastructure
Troubleshoot complex technical issues with team members
Install, upgrade, and secure systems and services
Develop and maintain scripts and automation tools
Participate in a 24/7 on-call rotation
Lead small projects, upgrades, and service rollouts
Collaborate with vendors to improve technologies and user experience
Support reliable operations of NERSC's Perlmutter supercomputer and Spin Kubernetes platform
Develop and integrate services across NERSC and DOE facilities, including the upcoming Doudna supercomputer
Present technical work to the HPC community at conferences and industry events

Additional Responsibilities if hired at a Level 4:

Solve complex technical problems with independent judgment
Develop team strategies and project plans
Provide technical leadership and mentorship
Lead system improvements for performance, reliability, and security
Evaluate emerging HPC technologies and capabilities
Represent NERSC in HPC and DOE technical communities and advocacy groups

What is Required to be hired at a Level 3:

Typically, 8+ years of related experience with a Bachelor's degree; alternatively, 6+ years with a Master's degree; or equivalent career experience
4+ years of experience managing large-scale Linux-based system deployments in a high-performance computing, cloud computing, or hyper-scale environment
Mastery of Linux concepts and operations (processes, networking, system logs, performance)
Proficiency with bash and Python scripting
Experience with some or all of our key technologies:
- containers (such as Docker or Kubernetes)
- virtualization (such as Proxmox or VMware)
- cloud-based deployment (such as AWS, Azure or GCP)
- identity and access management
- database administration, tuning, and troubleshooting
- storage systems technologies (such as iSCSI and NAS appliances)
- parallel filesystems (such as Lustre, GPFS, or VAST)
- high-speed networking/interconnect (such as InfiniBand, Slingshot, or RoCE)
- advanced performance analysis and debugging tools (such as strace, lsof, eBPF, or gdb)
- DevOps tools (such as GitLab or Jira) and processes (such as issues, merge requests, and API/automation)
Familiarity with automated provisioning systems (such as Chef, Foreman, or Terraform)
Familiarity with configuration management systems (such as Ansible or Puppet)
Working knowledge of Linux system engineering and security practices
Ability to resolve complex issues in creative and effective ways and derive technical solutions in a collaborative environment to meet end user requirements or needs
Demonstrated ability to work independently as well as collaboratively in large projects, and contribute to an active and respectful intellectual environment
Creative, positive, and collaborative work style
Excellent oral and written communication skills

Additional Requirements to be hired at a Level 4:

Typically, 12+ years of related experience with a Bachelor's degree; alternatively, 8+ years with a Master's degree; or equivalent career experience
Proven ability to lead troubleshooting and resolution of high-impact incidents in complex, large-scale environments
Demonstrated leadership in cross-team collaboration and mentoring
Experience in software engineering, Linux systems programming, or complex scripting
Experience managing one or more of the following:
- data center networking (TCP/IP, Ethernet, BGP, ECMP)
- batch workload managers (such as Slurm), including installation, configuration, routine operations, job lifecycle concepts, and troubleshooting common failure modes
- Cray/HPE HPC ecosystems (e.g., CSM/COS, Slingshot interconnect, and related components)
Ability to lead and coordinate projects with traditional or Agile methodologies (such as Scrum or Kanban)
Ability to analyze and resolve significant and unique issues requiring evaluation of multiple intangible factors
Ability to exercise independent judgment in methods, techniques and evaluation criteria for obtaining results

Additional information:

Applications will be accepted until the job posting is removed.
Appointment type: This is a full-time, career appointment, exempt (monthly paid) from overtime pay.
Salary range:
- Level 3: The expected salary for this position is $156,864 - $191,724, which fits into the full salary range of $139,440 - $235,308 depending upon the candidate's skills, knowledge, and abilities. This includes education, certifications, and years of experience.
- Level 4: The expected salary for this position is $178,644 - $218,364, which fits into the full salary range of $158,808 - $267,996 depending upon the candidate's skills, knowledge, and abilities. This includes education, certifications, and years of experience.
Background check: This position is subject to a background check. Any convictions will be evaluated to determine if they directly relate to the responsibilities and requirements of the position. Having a conviction history will not automatically disqualify an applicant from being considered for employment.
Work modality: This position requires substantial on-site presence, but is eligible for a flexible work mode, and hybrid schedules may be considered. Hybrid work is a combination of performing work on-site at Lawrence Berkeley National Lab, 1 Cyclotron Road, Berkeley, CA and some telework. Individuals working a hybrid schedule must reside within 150 miles of Berkeley Lab. Work schedules are dependent on business needs. In rare cases, full-time telework or remote work modes may be considered.
Multi-level Posting: This position will be hired at a level commensurate with the business needs and the skills, knowledge, and abilities of the successful candidate.
Export Control Access: This position will involve access to hardware, commodities, and technical information subject to export control regulations including, but not limited to, the Export Administration Regulations ("EAR") and/or International Traffic in Arms Regulations ("ITAR"). Accordingly, any hiring decision may depend in part on Berkeley Lab's ability to obtain or rely on federal government authorizations as required, if you are not a U.S. citizen, lawful permanent resident of the U.S. ("green card holder"), asylee, refugee, or other qualifying protected individual as defined by 8 U.S.C. 1324b(a)(3).

Want to learn more about working at Berkeley Lab? Please visit: careers.lbl.gov

Industry

Business Services

* Ladders Estimates
