University of Chicago

HPC Systems Administrator

$93K - $110K
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in a related field.
  • 5-7 years of relevant work experience.
  • Proficient in Linux system administration in distributed environments.
  • Knowledge of system security protocols and best practices.
  • Familiar with HPC clusters and GPU infrastructure management.

Responsibilities

  • Design, deploy, and administer CPU/GPU HPC clusters including storage and interconnects.
  • Monitor and tune GPU nodes for optimal performance.
  • Assist in developing and implementing monitoring and observability tools.
  • Enforce security procedures and maintain system documentation for compliance.
  • Troubleshoot operational issues and coordinate with vendors for resolutions.
  • Implement secure backup and disaster-recovery capabilities for systems.
  • Maintain and update documentation for hardware and software configurations.

Benefits

  • Comprehensive health insurance options.
  • Retirement plans with employer contributions.
  • Generous paid time off policy.
  • Access to various professional development resources.

Full Job Description

Department
Provost Research Computing Center

Job Summary
The University of Chicago is seeking a highly qualified HPC Systems Security Engineer to join the HPC Systems and Operations team that builds and manages RCC's HPC infrastructure. The individual in this position will be involved in the operation, maintenance, security, and compliance of large-scale complex HPC systems primarily used for research.

This position designs automated, scalable, and rapidly deployable solutions to infrastructure development and server configuration. Works independently to install, configure, and maintain operating systems. Uses best practices and systems knowledge to monitor and alert systems, utility software, and firewalls. Guides maintenance for production servers as well as Windows and Linux servers.

This is a hybrid position requiring at least 3 days working onsite.

Responsibilities
  • Designs, deploys, configures, and administers CPU/GPU HPC clusters, including management and compute nodes, storage infrastructure, interconnects such as InfiniBand, and physical infrastructure in the datacenter and related systems.
  • Monitors, configures, maintains and tunes GPU nodes for optimal performance and utilization following state-of-the-art practices for the required workloads.
  • Assists with the development and implementation of monitoring and observability tools and infrastructure, including the collection and aggregation of metrics and the development of dashboards.
  • Develops, maintains, and enforces security procedures and system documentation for operational and compliance purposes.
  • Tunes, secures, and maintains the HPC job scheduling environment, including fair-sharing, accounting, and policy enforcement.
  • Troubleshoots and resolves operational and performance issues across HPC hardware and software stacks. Coordinates with hardware and software vendors to address defects, vulnerabilities, and performance issues. Assists the Computational Scientists team with user support and helpdesk tickets, including elevated support for security-protected environments.
  • Assists with the implementation and maintenance of secure and reliable backup, archival, disaster-recovery, and restore capabilities for systems and research data.
  • Performs vulnerability scanning, patch management, and system and firmware updates across the infrastructure.
  • Maintains complex systems and network administration functions. Works with moderate guidance to administer simple systems and assists in the administration of larger systems.
  • Maintains all supporting documentation for comprehensive operating system, hardware and software configuration. Monitors primary responses for information technology related security incidents and violations. Keeps current with new security and network monitoring technologies, applicable laws and regulations.
  • Plans and installs necessary patches and upgrades for servers and their associated storage, network, communications, and peripheral sub-systems. Installs and maintains an appropriate level of intrusion detection, monitoring, and auditing software as required.
  • Tracks compliance and maintains documentation for hardware, software, and service inventories for management reports.
  • Performs other related work as needed.


Minimum Qualifications

Education:
Minimum requirements include a college or university degree in a related field.

Work Experience:
Minimum requirements include knowledge and skills developed through 5-7 years of work experience in a related job discipline.

Certifications:

---

Preferred Qualifications

Experience:
  • Linux system administration experience in a large, distributed computing environment.
  • Demonstrated experience and knowledge of system security and best practices.


Technical Skills and Knowledge:
  • Knowledge of Linux administration, preferably RHEL/Rocky.
  • Administration of GPU infrastructure, such as tuning, driver updates, performance monitoring, etc.
  • Solid skills in scripting with Python or Bash.
  • Installing, configuring, and managing job schedulers, such as Slurm, Torque, PBS, and LSF.
  • Automation tools such as Ansible, Puppet, Chef, Salt.
  • Provisioning tools, including xCAT, Confluent, and Warewulf.
  • Implementing monitoring tools, such as CheckMK, Zabbix, Nagios, Prometheus, and Grafana.
  • Working with, documenting, and enforcing controls required to protect controlled unclassified information, such as those in NIST 800-53, NIST 800-171, NIST SP 800-223, and FIPS.
  • Knowledge of and practical experience with at least one distributed storage system, such as Storage Scale, Lustre, Gluster, BeeGFS, or Ceph.
  • Working knowledge of InfiniBand concepts.
  • Writing precise and concise documentation and standard operating procedures.


Preferred Competencies
  • Understand and translate researchers' scientific goals into computational requirements.
  • Work well with faculty and researchers.
  • Identify and gain expertise in appropriate new technologies and/or software tools.
  • Function as part of an interactive team while demonstrating self-initiative to achieve project goals and the Research Computing Center's mission.
  • Strong analytical skills and problem-solving ability.


Application Documents
  • Resume/CV (required)
  • Cover Letter (preferred)


When applying, the document(s) MUST be uploaded via the My Experience page, in the section titled Application Documents of the application.

Job Family
Information Technology

Role Impact
Individual Contributor

Scheduled Weekly Hours
37.5

Drug Test Required
No

Health Screen Required
No

Motor Vehicle Record Inquiry Required
No

Pay Rate Type
Salary

FLSA Status
Exempt

Pay Range
$93,500.00 - $110,000.00
The included pay rate or range represents the University's good faith estimate of the possible compensation offer for this role at the time of posting.

Benefits Eligible
Yes
The University of Chicago offers a wide range of benefits programs and resources for eligible employees, including health, retirement, and paid time off. Information about the benefit offerings can be found in the Benefits Guidebook.
