Are you passionate about your work and dream of utilizing state-of-the-art facilities to explore solutions? Do you want to join a dynamic team that seeks to revolutionize the field of High Performance Computing (HPC) analysis and operations?
We are seeking a computer science R&D professional to join a team developing new software and new operational analytics for HPC architectures.
You will enjoy innovating and collaborating with a team researching and developing HPC Monitoring, Performance Analysis, and Response solutions to provide advanced, data-focused operations and efficient utilization. The team authors the open-source, R&D 100 Award-winning Lightweight Distributed Metric Service (LDMS), which is used for monitoring several of the largest HPC systems in the world.
On any given day you may be called upon to:
- Design and develop software for extreme-scale data collection and analysis to assess system and application performance
- Develop and deploy analysis techniques to detect and classify operational conditions that bottleneck user application performance
- Develop data presentations and automated response techniques to enable more efficient computing based on analysis outcomes
- Work with internal and external organizations operating large-scale HPC systems to deploy monitoring solutions and utilize them for performance understanding
- Publish and present research results at peer-reviewed conferences
Qualifications We Require
- MS plus 2 years of experience, or a PhD, in a relevant STEM discipline
- 5 years of experience programming in C, C++, and/or Python
- Experience programming in Unix/Linux environments
- A record of peer-reviewed publication of results and/or external presentations at scientific conferences
- Ability to obtain and maintain a DOE Q clearance
Qualifications We Desire
- Experience developing in Jupyter Notebooks and with NumPy
- Experience using and/or developing statistical data analysis and/or machine learning techniques (e.g. PCA, scikit-learn, TensorFlow) for significantly sized datasets
- Experience developing large-scale codes in a multi-developer, open-source software environment
- Experience developing middleware for HPC systems, including consideration of resilience, memory, scalability, and CPU footprint
- Experience conducting performance analysis studies of software and applications on HPC system architectures, particularly for advanced processors and/or networks
- Familiarity with building and running applications in HPC system environments
- Experience as a system administrator in Unix/Linux environments
- Experience with HPC monitoring technologies, such as LDMS, Elasticsearch, Kafka, and Logstash
- Experience developing unit and regression tests and running such tests within frameworks such as Jenkins
- Current DOE Q security clearance