Senior HPC DevOps Engineer

Joint Activities

• $146K — $234K *

College Park, MD 20740 (In-Person)

Information Technology

11 - 15 years of experience

3 weeks ago

Job Overview by Ladders

Qualifications

12+ years of experience with a BS in computer science, IT, or a related technical field; an MS with 10 years; or a Ph.D. with 8 years. Four additional years of experience are required in lieu of a Bachelor’s degree.
7+ years of experience in Linux systems, SRE, or DevOps, specifically in HPC or large-scale compute environments.
3+ years of experience using Ansible for automation at scale, focusing on roles, idempotency, and secrets management.
Strong knowledge of Linux hardening and compliance techniques.
Demonstrated experience managing or automating clustered compute environments.
Hands-on experience with container technology in Linux environments.
Familiar with incident response and automating common remediation tasks.
Must hold at least one active technical certification in systems engineering, information security, networking, system administration, virtualization, IT systems management, or project management.
Active TS/SCI security clearance with a current polygraph.

Responsibilities

Own and manage automation workflows for HPC/AI compute clusters.
Enforce desired state across cluster services and detect deviations through code-driven configuration.
Build and maintain automated node onboarding workflows for OS configuration and readiness checks.
Implement rolling maintenance and patch automation to ensure compliance with SLAs.
Ensure logging and observability of automated workflows for effective incident response.
Automate responses to common incidents using standardized runbooks and management interfaces.
Maintain versioned operational documentation alongside automation efforts.

Benefits

Potential for an increased sign-on bonus based on eligibility and terms discussed during recruitment.

Full Job Description

Responsibilities

Peraton Labs is seeking a polygraph-cleared Senior HPC DevOps Engineer to own the operations and automation lifecycle for an existing HPC/AI compute cluster (Linux). You will work closely with Peraton team members, as well as directly with our Maryland-based customer, in a fast-paced environment at a customer site. In this role you will codify repeatable operations in Ansible and drive execution through an enterprise automation controller to enforce desired state, detect drift, accelerate node onboarding, and streamline incident response via runbook automation integrated with monitoring and ITSM.

This position requires full-time on-site work at a customer site near College Park, MD.

Key responsibilities may include:

Automation ownership: Own and manage automation workflows, including job templates, inventories, credentials, RBAC configurations, execution environments, and promotion across environments.
Desired-state and drift detection: Enforce desired state across cluster services via code-driven configuration; implement drift detection and alert on deviations; reconcile runtime state vs configured state.
Compute node onboarding (Bare-metal/VM): Build and maintain an automated node bootstrap workflow that installs/configures the OS, applies security and performance baselines, enrolls nodes into the scheduler and shared storage ecosystem, validates hardware and service readiness (CPU, network, accelerator, storage mounts), and reports pass/fail results.
Patch & vulnerability response: Implement rolling maintenance and patch automation to meet defined vulnerability response SLAs. Maintain version-controlled container build definitions and integrate image scanning into the build/release lifecycle.
Logging & observability: Ensure automation and operational workflows emit auditable logs to centralized analytics and integrate with metrics/alerting to enable reliable incident response, proactive detection, and safe auto-remediation.
Incident/problem management: Automate responses to common incidents (hung nodes, storage performance alarms, image vulnerabilities, hardware failures) leveraging out-of-band hardware management interfaces and standardized runbooks.
Docs-as-code: Keep runbooks and operational documentation versioned alongside automation and publish operator guidance to the organization's documentation platform.

*This position may be eligible for an increased sign-on bonus. Eligibility, bonus amount, and applicable terms and conditions will be discussed during the recruiting process*

Qualifications

Required qualifications

12+ years of experience with a BS in computer science, IT, or a related technical field; an MS with 10 years of experience; or a Ph.D. with 8 years of experience. Four additional years of experience are required in lieu of a Bachelor’s degree, for a total of 16 years of experience.
7+ years in Linux systems / SRE / DevOps, including production cluster operations in an HPC or large-scale compute environment.
3+ years of experience building and operating Ansible automation at scale (roles/collections, idempotency, inventories, secrets).
Strong Linux hardening & compliance fundamentals (SELinux/AppArmor, SSH key automation, baseline config management).
Demonstrated experience operating or automating clustered compute environments (HPC, large Linux farms, or similar).
Hands-on experience with container tooling in Linux environments, including image lifecycle/versioning.
Familiarity with incident response and runbook-driven operations; ability to automate common remediations.
Strong Git workflow and documentation practices.
Must hold at least one active/current technical certification from the following:
- Systems engineering (e.g., INCOSE)
- Information security (e.g., CISSP)
- Networking (e.g., CCNA)
- System Administration (e.g., RHCE, MCSE)
- Virtualization (e.g., VCP)
- IT systems management (e.g., ITIL)
- Project management (e.g., PMP, Agile)
Active TS/SCI security clearance with a current polygraph is required.

Preferred qualifications

Bare-metal provisioning experience (PXE/iPXE, Kickstart/Preseed, Foreman/MAAS) and hardware OOB management.
CI/testing for automation, and promotion pipelines for playbooks.
Experience with tuned performance profiles, HPC performance troubleshooting, and GPU node health validation.


Target Salary Range: $146,000 - $234,000. This represents the typical salary range for this position. Salary is determined by various factors, including but not limited to, the scope and responsibilities of the position, the individual’s experience, education, knowledge, skills, and competencies, as well as geographic location and business and contract considerations. Depending on the position, employees may be eligible for overtime, shift differential, and a discretionary bonus in addition to base pay.

* Ladders Estimates