HPC Monitoring Team (Scientist 2/3)

Los Alamos National Laboratory

Los Alamos, NM

5 - 7 years

Posted 238 days ago

HPC Monitoring Team (Scientist 2/3) in Los Alamos, New Mexico

What You Will Do

This position will be filled at either a Scientist 2 or 3 level, depending on the skills of the selected candidate. Additional job responsibilities (outlined below) will be assigned if the candidate is hired at the higher level.

The High-Performance Computing Division (HPC) provides production high performance computing systems services to the Laboratory. The High Performance Computing Systems group has responsibility for the broad range of HPC platforms and infrastructure deployed within Laboratory HPC Data Centers.

The High Performance Computing Environments group (HPC-ENV) invites applicants for a Scientist 2 or 3 position to join the Monitoring, Security and Data Analytics team and strengthen our HPC monitoring and analysis efforts. We seek candidates who want to make significant contributions to our long-term efforts in larger-scale cluster monitoring, continuous security monitoring and job-based power monitoring. Team member duties include: system administration of RHEL servers; setting up appropriate monitoring and alerts for new HPC clusters and infrastructure, including networks and file systems; diagnosing system operational problems and implementing solutions; and communicating and collaborating with other teams, groups and sites. The selected candidate will participate in a regularly scheduled rotation of on-call support of production systems. In addition, some non-standard working hours may occasionally be required.

HPC-ENV has the main responsibility of managing how users interact with the HPC systems at LANL. Some of the teams in this group include (1) Consulting and User Services, responsible for direct interaction and problem resolution with the users; (2) Parallel Runtimes and Environments, responsible for installing and maintaining the software and user environments on the HPC clusters; (3) Application Readiness, working to optimize user code for new HPC platforms and technologies; (4) Monitoring, Security and Data Analytics, responsible for collecting, analyzing and displaying HPC system information to administrators and users. Projects typically involve collaborations inside and outside of the Laboratory, in line with the Laboratory’s history of leadership in HPC.

The Monitoring, Security and Data Analytics team within HPC-ENV is responsible for monitoring everything within the HPC Datacenters, including Facilities, Clusters, File Systems, Networking and Support Servers. Monitoring data, sensor information and system logs are collected using syslog, polling scripts, IPMI and several other mechanisms. Monitoring data is transported throughout our extensive monitoring infrastructure using syslog and AMQP. Splunk serves as our main analysis, display and alerting tool for administrators. Grafana, backed by Elasticsearch and OpenTSDB, runs on our dedicated Data Analytics Cluster for our larger analysis and machine learning projects.

Scientist 2 ($87,800 - $144,800)

The successful candidate will perform the full spectrum of UNIX/Linux computing environment administration, including but not limited to:

  • Assist in the setup, administration and maintenance of dozens of RHEL servers using a configuration management system

  • Administer several monitoring software systems including Splunk, RabbitMQ, LDMS and Grafana

  • Identify and fix system, server and network security issues

  • Actively look for problems in the Datacenters by monitoring logs and alerting systems

  • Implement monitoring dashboards and alerts for new HPC Clusters, File Systems or Networks

  • Work independently as well as under the supervision and guidance of senior HPC administrators to provide technical assistance in problem solving and day-to-day operation and monitoring of various HPC systems

  • Steadily increase responsibilities and knowledge of our environment and HPC systems

  • Participate in periodic on-call responsibilities as assigned

  • Participate in process improvement and deep multi-system problem isolation and resolution in coordination with administrators of other HPC subsystems

  • Propose and implement solutions when presented with problems in our HPC environment

  • Experience using and maintaining databases

  • Experience managing web documentation sites, allowing subject-matter experts to easily add new documentation while creating an easy-to-navigate, unified experience for the end user

Scientist 3 ($96,600 - $161,300)

In addition to the duties outlined above, the Scientist 3 will be required to:

  • Work as a technical leader to implement solutions to current problems and future deficiencies in our HPC environment in conjunction with junior and senior administrators and technical members of other HPC teams

  • Proactively examine our HPC environment and propose projects to make it better

  • Communicate the strategies and successes of HPC Division to national peers and participate in national strategic partnerships

  • Implement active network security monitoring using Bro and NetFlow analysis

  • Deploy advanced analytics tools or machine learning techniques on monitoring data for use in our production environment

  • Knowledge of several database systems and experience architecting database solutions

  • Experience with content management frameworks like Drupal

What You Need

Minimum Job Requirements:

  • Strong interpersonal and communication skills

  • Broad knowledge of administration of production Linux computer systems, utilities, and tools, including experience building, configuring, and administering production Linux computer systems

  • Knowledge of syslog configuration

  • Knowledge of different database systems

  • Understanding of how to monitor logs from multiple systems and correlate events

  • Demonstrated scripting (e.g., in Bash, Perl, Python, or similar scripting languages) and programming experience

  • Ability to mentor and lead individual junior team members and students

  • Working knowledge of networking concepts and practices

  • Experience working in a production computing environment, preferably with HPC systems or at large scale

  • Knowledge of or experience with hardware and software security practices

  • Ability to write papers and present results to peers locally or at conferences

Additional Job Requirements for Scientist 3:

In addition to the Job Requirements outlined above, qualification at the Scientist 3 level requires:

  • Broad knowledge of production system management topics, including networking, programming, file systems, operating systems, and configuration management, with depth in one or more areas

  • Experience leading and mentoring teams, students, or junior team members

  • Experience initiating, designing, and leading projects

  • Experience interacting with vendors and colleagues within the industry, including presenting technical results and practices to peers locally and at conferences

  • Experience deploying database solutions

  • Knowledge of statistics, data analytics, or similar fields

  • Knowledge of the NIST 800-53 standards

  • Experience implementing computer and network security features

  • Knowledge of HPC facilities systems including monitoring and alerting

Desired Skills:

  • Experience working in a production HPC environment

  • Experience diagnosing system software problems

  • Knowledge of one or more monitoring tools (Splunk, Ganglia, LDMS, etc.)

  • Experience configuring syslog

  • Experience with data collection and transport (syslog, IPMI, AMQP)

  • Knowledge of data storage and databases

  • Experience hardening servers for security

  • Knowledge of data-driven web-based user interfaces, web servers (Apache, Tomcat, etc.), and content management systems

  • Knowledge of resource management and job scheduling software (SLURM, Moab, etc.)

  • Experience with networking and file systems in an HPC environment

  • Experience with parallel filesystems (Lustre, GPFS, etc.)

  • Experience with archive solutions (HPSS, TSM, etc.)

  • Experience with data movement tools

  • Experience working with ticket tracking systems

  • Experience with multiple Linux distributions

  • Experience modifying Unix/Linux operating systems

  • Experience managing computers in a DOE or DOD classified environment

  • Active DOE Q Clearance


Typical educational requirement is a bachelor’s, master’s, or doctorate degree in science from an accredited college or university and a minimum of five years of experience in the HPC field, or an equivalent combination of education and experience.

Req ID: IRC61028.