Oak Ridge National Laboratory

HPC Infrastructure Platform Engineer

$100K - $130K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in computer science or related field with 5 years of Linux and Kubernetes experience, or a master's degree with 4 years of experience.
  • Excellent communication and teamwork skills.
  • Strong experience with Kubernetes platform tools.
  • Solid understanding of Linux systems and common network protocols.
  • Proficiency in programming/scripting with Python and bash.

Responsibilities

  • Deploy and manage HPC-scale services in a Linux environment, focusing on Red Hat and Rocky Linux.
  • Build and maintain internal platforms for reliable application deployment and monitoring.
  • Design automated workflow patterns and CI/CD pipelines.
  • Support identity and access management services using LDAP and PingFederate.
  • Lead small infrastructure projects and mentor junior staff.

Benefits

  • Participation in ongoing training and developmental opportunities.
  • Opportunity to work on cutting-edge HPC technologies.
  • Supportive team culture encouraging knowledge sharing.
  • Potential for career progression within a leading computational sciences organization.

Full Job Description

Requisition Id 16521

Overview:

The High-Performance Computing Systems Section within the National Center for Computational Sciences (NCCS) is seeking an HPC Infrastructure Platform Engineer to join the HPC Infrastructure group. The preferred candidate will possess commensurate knowledge, skills and abilities in addition to relevant education, certifications, experience and demonstrated ability to work as a member of a team.

Major Duties/Responsibilities:

Linux Administration:
  • Deploy, configure and manage HPC-scale services in a Linux environment, primarily Red Hat and Rocky Linux
  • Perform regular patches, updates and backups
  • Monitor systems using tools like Nagios and Grafana
  • Respond to and assist in troubleshooting issues


Kubernetes Administration:
  • Build and maintain foundational internal platforms and tools to enable the HPC Infrastructure team to reliably deploy, monitor and scale applications
  • Design standardized and automated workflow patterns, and build and maintain CI/CD pipelines
  • Offer self-service, excellent documentation and assistance to HPC Infrastructure group members for efficient consumption of platform services
  • Develop, maintain and review high quality code for internal tools using programming languages such as Python, Golang, or Rust


Identity Management and Security:
  • Deploy, configure and support identity and access management services using LDAP and PingFederate
  • Maintain and enable secure access for human users and automated workloads in Kubernetes


Virtualization and Automation:
  • Deploy and manage resources in the NCCS VMware environment
  • Identify potential automation targets and lead efforts to automate processes
  • Define policies and procedures for automation and configuration management for the team and organization as a whole


Project Management and Leadership:
  • Lead small Infrastructure projects through the project lifecycle
  • Mentor and train junior staff, creating training documentation, holding knowledge sharing sessions, and fostering skill growth throughout the team
  • Propose and implement improvements to existing Infrastructure systems as well as new systems, processes and procedures


Basic Qualifications:
  • Bachelor's degree in computer science or closely related field and a minimum of 5 years of experience in Linux systems and Kubernetes platform administration, or a master's degree and a minimum of 4 years of experience in Linux systems and Kubernetes platform administration
  • An equivalent combination of education and experience will be considered


Preferred Qualifications:
  • Excellent interpersonal/communication skills and the ability to work within a team
  • Strong experience designing, building and maintaining Kubernetes platform tools
  • Strong working knowledge of Linux system fundamentals and common network protocols
  • Programming and scripting skills in common languages such as Python and bash
  • Understanding of versioning and code review tools like GitHub and GitLab
  • Experience implementing and supporting highly-available systems and services
  • Experience with configuration management tools such as Puppet or Ansible
  • Experience deploying and maintaining virtual environments using VMware
  • Experience deploying, maintaining and troubleshooting a variety of infrastructure services such as OpenLDAP, DNS, DHCP, etc.
  • Ability to plan, prioritize and complete assigned projects with minimal supervision


Special Requirements:
  • This position requires the ability to obtain and maintain a clearance from the Department of Energy. As such, this position is a Workplace Substance Abuse Program (WSAP) testing designated position. WSAP positions require passing a pre-placement drug test and participation in an ongoing random drug testing program.


This position will remain open for a minimum of 5 days, after which it will close when a qualified candidate is identified and/or hired.

We accept Word (.doc, .docx), Adobe (unsecured .pdf), Rich Text Format (.rtf), and HTML (.htm, .html) up to 5MB in size. Resumes from third party vendors will not be accepted; these resumes will be deleted and the candidates submitted will not be considered for employment.



About Oak Ridge National Laboratory

Oak Ridge National Laboratory (ORNL) is a science and technology national laboratory managed for the United States Department of Energy (DOE) by UT-Battelle. ORNL is the largest science and energy national laboratory in the Department of Energy system by size and by annual budget. ORNL conducts research and development activities in a variety of scientific and technical disciplines. ORNL's scientific programs focus on materials, neutron science, energy, high-performance computing, systems biology and national security. ORNL partners with other national laboratories, universities and industry to solve complex problems and transfer knowledge and technology. ORNL is home to several of the world's most powerful supercomputers, including Summit, the world's most powerful supercomputer as of November 2018.
Size: 5,000 employees
Founded: 1943
