HPC System Integration Software Architect (Scientist 3/4) in Los Alamos, New Mexico
What You Will Do
The High Performance Computing (HPC) Division at Los Alamos National Laboratory provides scientific computing resources consisting of some of the largest HPC systems in the world. The Systems team within the HPC Design Group (HPC-DES) is responsible for defining the technical direction for, and for evaluating, developing, and deploying, the tools and system software ultimately used in production support of LANL’s HPC resources. These resources currently include a large (19K+ node) Cray system called Trinity as well as numerous large commodity cluster systems.
This position will be filled at either the Scientist 3 or Scientist 4 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.
You will work closely with other DES Systems team members as well as more production-focused team members in other groups in the HPC Division. Projects typically involve collaborations inside and outside of the Laboratory, in line with the Laboratory’s history of leadership in HPC. Some non-standard working hours may occasionally be required.
We seek candidates who want to make significant contributions that impact the HPC technical direction at LANL and ultimately across the DOE and the nation.
Scientist 3 ($96,600 - $161,300)
The successful candidate will be required to:
Identify current and future challenges faced by large-scale HPC applications, and work toward production HPC system solutions. In particular, this individual will help design, develop, deploy, and support system software to overcome these challenges. Areas of interaction include distributed systems, data-aware scheduling, resource allocation, workflow management, user tools, and the software environment, with a focus on delivering next-generation programming and runtime user environments.
Set direction, goals, milestones, and deliverables for project tasks and establish associated scope, schedule and budgets. Assist in the preparation of progress reports to sponsors.
Contribute to multi-lab and cross organization proposals for funding both internally and externally to the laboratory.
Serve as the Principal Investigator for a targeted area of research.
Present results of work locally and at conferences and workshops.
Support system software investigations along with application performance and stability optimization and testing within the HPC integrated open and secure network infrastructure.
Assist in the design of data intensive solutions for the wider HPC environment and provide input into the design and specification of new venues that utilize custom system software.
Provide Tier 3 support to system admin staff and help desk staff on various HPC production systems, when required by user feature requests, bugs, or security vulnerabilities that cannot be resolved by production teams.
Enhance technical and professional expertise of other staff through active mentoring and training.
Contribute to peer review of the work of others across organizations or disciplines within the laboratory.
Scientist 4 ($116,900 - $197,000)
In addition to the duties mentioned above, the Scientist 4 will be required to:
Lead proposals for both internal and external funding for self and others via responses to competitive requests for proposals.
Contribute to peer review of the work of others across organizations and disciplines nationally, including participation on HPC-related conference and workshop committees.
Participate in national review boards for DOE in subject area of expertise.
Acquire internal/external funding for self and others via responses to competitive requests for proposals and developed collaborations.
Work closely with high-level project leads and program managers to ensure their projects are successful.
Assist in defining specifications and RFPs for new HPC systems.
What You Need
Minimum Job Requirements:
Demonstrated record of accomplishment and expertise in high performance and large-scale systems integration, acceptance, and productizing of new HPC systems.
A record of technical leadership in software activities within a system integration or production environment.
Knowledge and experience with HPC system software environment specification, acquisition, deployment, and production readiness.
Practical experience at the advanced level in programming using C, C++ and/or Fortran.
Significant knowledge and expertise with typical Linux build systems such as GNU Make and CMake.
Significant knowledge and expertise with typical HPC scheduling software such as Slurm, PBS Pro, or LSF.
Experience in elements specific to system integration of large and complex high performance computing systems.
Practical experience at the advanced level in scripting with languages such as Bash or other shells, Perl, or Python.
Good oral and written communication skills.
Record of maintaining state-of-the-art technical expertise and knowledge within discipline and development of new skills in related disciplines.
Demonstrated ability to work within a team environment.
Working knowledge of networking concepts and practices.
Additional Job Requirements for Scientist 4:
In addition to the Job Requirements outlined above, qualification at the Scientist 4 level requires:
Demonstrated senior technical leadership that brings organizations, teams, and individuals together around a common goal to create an efficient, cost-effective, performance-based solution to a particular problem or need.
Experience with tools and methods for optimization and debugging in a highly parallel environment.
Demonstrated capability to understand the complete picture of an end-to-end solution for large, complex systems, including facilities, archive, storage, networks (cluster fabric, data center, and campus), and clusters.
Exhibited knowledge and experience in working with equipment vendors on specifications and requirements of large-scale scientific system procurements addressing system architecture, reliability, performance, tuning, debugging, configuration, maintenance and support.
Demonstrated industry leadership and expertise in an area of high performance computing.
Demonstrated ability to initiate large-scale projects to solve technology challenges.
Demonstrated in-depth experience with Lustre or GPFS.
Experience with containerization technologies such as Singularity, Docker, and/or Kubernetes.
Thorough understanding of software engineering principles.
Extensive Linux system administration experience.
Knowledge and experience using or supporting scientific computing and mathematics libraries.
Experience with various Linux packaging and management tools such as RPM, APT, and Environment Modules.
Programming in a parallel computing environment with MPI, threads, or both.
Familiarity with concepts in program decomposition and parallel programming models.
Knowledge of or experience with hardware and software security practices.
Practical experience with proprietary interconnects such as the Cray Aries or Gemini network or other proprietary networks.
Experience deploying software-defined networks (SDN/NFV).
Practical experience with OpenHPC.
Practical experience with power aware computing and scheduling.
Experience in anticipating needs for hardware and software environments.
Ability to create reliable, repeatable procedures for production use.
Practical experience with and advanced knowledge of Ethernet switches, routing, TCP/IP, and configuration of NICs and routers.
Practical experience with and advanced knowledge of system interconnects, especially InfiniBand, including how to configure them on hosts and switches.
Practical experience in taking a large cluster and making its OS and software production quality (i.e., how to “harden” a Linux system).
Practical experience with Slurm.
Practical experience in more than one advanced HPC subject area (e.g., data-aware computing, data-intensive supercomputing, parallel file systems, operating systems, message passing libraries, threading models, and resilience of these systems at scale).
Demonstrated experience in formulating and presenting results to technical audiences and readerships.
Experience managing computers in a DOE or DOD classified environment.
Active DOE Q Clearance.
Req ID: IRC63457