The Controls and Data Systems Division within the Technology and Innovation Directorate at SLAC is dedicated to providing world-class technology and IT services for all Experimental and Observational Data systems at the laboratory. Our Scientific Computing Systems department supports the lab’s science mission through high performance computing and data management systems for scientific research (KAVLI, Rubin, SuperCDMS, ATLAS) and user facilities (CryoEM, LCLS, FACET, SSRL).
We are looking for a talented Scientific Computing Specialist to lead the team responsible for designing, developing and maintaining the scientific computing infrastructure of the Rubin US data facility at SLAC (USDF). The Vera C. Rubin Observatory will conduct a 10-year Legacy Survey of Space and Time (LSST) starting in early 2024. The LSST will deliver a 500 petabyte set of images, to be processed at the Rubin data facilities to provide a database of trillions of measurements of billions of astronomical objects, which will address some of the most pressing questions about the structure and evolution of the universe and the objects in it.
You must be a pragmatic and skillful engineer who will blend hardware and software solutions to deliver the best outcomes for our scientists. A broad understanding of technology and of the impact of design decisions is important, as is the ability to communicate clearly both within our team and with our partners. We encourage free-thinking, open dialog and the opportunity to explore and implement new technologies and ideas, and we expect you to drive projects both on your own and in conjunction with other teams and scientists. In particular, your team will likely include computing professionals at other DOE laboratories, and will be embedded in the larger Infrastructure and Support Team within Rubin Observatory’s Data Production Department, where you will interact with scientists and engineers across the world.
Your specific responsibilities include:
- Develop professional relationships with Rubin’s various scientific and engineering teams to help identify requirements and either match, adapt, or create scientific services to benefit the observatory.
- Engage in, support, improve and evolve the whole lifecycle of Rubin’s scientific computing services portfolio, from inception and design through deployment, operation and sunset.
- Explore and test emerging technologies and technical developments to address scientific needs and services.
- Provide project management, coordination and engineering for significant scientific computing projects.
- Support day-to-day operations and troubleshooting of Rubin’s scientific computing services.
- Gather data, perform analysis and help troubleshoot issues across Rubin’s scientific services portfolio.
- Provide documentation, monitoring and reporting for Rubin’s entire scientific computing portfolio, and gather feedback to implement service improvements.
To be successful in this position you will bring:
- Bachelor's degree in physics, computer science, or a related field and 10 years of relevant experience in information technology, systems administration, or high-performance computing; or a combination of education and relevant experience in the form of a master's or doctoral degree in physics, computer science, or a related field and eight years of relevant experience.
- Ability to work effectively in a team environment and lead cross-functional teams.
- In-depth understanding of high-performance computing systems, storage, and networking.
- Familiarity with complex scientific workflows and rich data science ecosystems.
- Extensive experience with Python, including the Python-based data science ecosystem (NumPy, pandas, Parquet, Dask).
- Production experience with container technologies (Docker, Singularity, Shifter) and ecosystems (Kubernetes, ArgoCD).
- Understanding of infrastructure-as-code technologies, and proficiency with HPC cluster software and general management tools (Slurm, Chef, Ansible).
- Proven ability to identify and resolve service and performance issues across all layers of the infrastructure.
- Ability and willingness to learn and establish best practices.
- Expertise with system management, hardware benchmarking, monitoring, and open-source software.
- Excellent organizational and communication skills.
In addition, preferred requirements include:
- Familiarity with shared- and distributed-memory parallelism (OpenMP, MPI) and GPU accelerators.
- Experience with interactive analysis technologies (Jupyter, MATLAB, Spark).
- Experience with machine learning technologies (PyTorch, TensorFlow, etc.).
- Knowledge of HPC storage principles and file systems (XFS, ZFS, Lustre, GPFS, S3).
- Experience with low-latency/high-bandwidth networks (including InfiniBand and 100GbE+).
SLAC employee competencies:
- Effective Decisions: Uses job knowledge and solid judgment to make quality decisions in a timely manner.
- Self-Development: Pursues a variety of venues and opportunities to continue learning and developing.
- Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes.
- Initiative: Pursues work and interactions proactively with optimism, positive energy, and motivation to move things forward.
- Adaptability: Flexes as needed when change occurs, maintains an open outlook while adjusting and accommodating changes.
- Communication: Ensures effective information flow to various audiences and creates and delivers clear, appropriate written, spoken, presented messages.
- Relationships: Builds relationships to foster trust, collaboration, and a positive climate to achieve common goals.
Physical requirements and working conditions:
- Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of the job.