IT043 High Performance Computing (HPC) and Storage System Administrator

ADNET Systems, Inc.

$100K — $130K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Expert Linux System Administration with command-line proficiency and scripting skills in Bash and Python.
  • Hands-on experience with high-performance file systems like IBM Spectrum Scale or Lustre.
  • Familiarity with HPC resource management tools, particularly Slurm.
  • Solid understanding of systems security frameworks and patch management processes.
  • Experience with Agile methodologies and tools like Jira and GitLab.
  • Proficiency in managing GPU-accelerated computing environments and libraries.
  • A Master's degree with 5+ years in relevant operational roles.

Responsibilities

  • Perform daily operations of large-scale supercomputing clusters, ensuring optimal availability and performance.
  • Deploy, configure, and maintain high-performance storage solutions and parallel file systems.
  • Manage and optimize job scheduling and cluster management software.
  • Implement security patches and compliance measures without affecting system stability.
  • Conduct preventative maintenance and diagnostics for hardware and software.
  • Provide tiered technical support for complex user inquiries and workflow optimizations.
  • Administer GPU computing systems, ensuring performance and driver management.

Benefits

  • Annual Leave and Sick Leave
  • Paid Holidays
  • Performance Bonuses
  • Medical, Dental, and Vision Insurance
  • 401K Plan with Company Matching
  • Tuition Reimbursement
  • Flexible payroll options such as Direct Deposit
Full Job Description

This job description is for a High Performance Computing and Storage System Administrator to support the operations of the Integrated Modeling Computing Center (IMCC), formerly known as the NASA Center for Climate Simulation (NCCS). The IMCC will directly support the Integrated Modeling Virtual Institute (IMVI) to meet NASA's Earth science modeling needs. The core duties, responsibilities, and required technical skills are described below. Ideal candidates should have excellent communication and problem-solving skills and the ability to work efficiently within a high-performing team environment.

Core Duties & Responsibilities:
  • Full Operational Management: Perform day-to-day operations and management of large-scale supercomputing clusters to meet required availability and performance, including, but not limited to, integration, provisioning, software stack deployment, updates, hardware and software maintenance, and decommissioning.
  • High-Performance Storage Administration: Deploy, tune, configure, maintain, and operate massive parallel file systems.
  • Workload and Schedule Management: Manage, configure, optimize, and troubleshoot cluster management and job scheduling software.
  • Security, Patches, and Compliance: Proactively implement security updates, coordinate systematic Operating System kernel patches, and mitigate vulnerabilities across computing and storage environments without compromising system stability.
  • Preventative and Corrective Maintenance: Coordinate vendor-supported maintenance schedules, conduct hardware and software diagnostics, and participate in rapid-response resolution during service degradations or system blackouts.
  • User Support: Provide specialized, tiered technical assistance ranging from software provisioning and workflow optimization to advanced, expert-level troubleshooting for complex research challenges.
  • GPU System Administration: Provision, configure, and maintain GPU-accelerated computing systems, including driver management, library configuration, and performance optimization for workload acceleration.


Required Technical Skills and Qualifications:
  • Expert Linux System Administration: Advanced, production-level expertise in enterprise Linux distributions (RHEL, Rocky Linux, AlmaLinux, or Ubuntu Server), including expert-level command-line proficiency, kernel tuning, and automation scripting (Bash, Python).
  • Parallel File Systems Architecture: Hands-on experience in the design, deployment, scaling, and/or optimization of high-performance file systems. Experience in deploying, configuring, and operating IBM Spectrum Scale and/or Lustre.
  • Scheduling Proficiency: Working familiarity with HPC resource management, including experience with Slurm.
  • Systems Security Alignment: Robust foundation in core security frameworks, including firewalls, identity management (LDAP/Active Directory), access control lists (ACLs), SSH hardening, and continuous patch management cycles.
  • Agile Methodologies: Experience operating within modern Agile frameworks (Scrum, Kanban), leveraging iterative workflows, participating in sprint reviews, and utilizing collaborative project boards (Jira, GitLab) to track milestones.
  • GPU Accelerator Management: Proficiency in configuring and maintaining GPU-accelerated computing environments, including driver installation/management, CUDA or similar library configuration, and performance tuning for accelerated workloads.
  • An MS degree and 5+ years' experience in relevant work areas.
  • US Citizenship required.
  • Ability to obtain and maintain a Tier 1 or Tier 2 Investigation through NASA.


Some features of our compensation plans and environment perks include:
  • Annual Leave/Sick Leave
  • Military and Family Emergency Leave
  • Paid Holidays
  • Performance Bonuses
  • Medical, Dental and Vision Plans
  • Direct Deposit Payroll
  • 401K Plan with Company Matching
  • Tuition Reimbursement
  • Swag bags
