ECS

Release/Incident Operations Engineer

$90K to $130K *
Aerospace & Defense
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Current Secret security clearance with ability to obtain Top Secret (TS) clearance
  • 3+ years in release engineering, incident operations, or platform support roles in federal or classified settings
  • Experience managing deployment governance and incident triage for AI/ML pipelines
  • Hands-on knowledge of Kubernetes, GitLab CI, VMware, Prometheus, Grafana, and Elastic Stack
  • DoW 8570/8140-compliant IAT Level II certification
  • Strong problem-solving and decision-making skills

Responsibilities

  • Coordinate release operations for AI and machine learning models across multi-enclave environments
  • Direct change-window execution and rollback readiness for model updates
  • Conduct incident triage by analyzing telemetry and initiating stabilization actions
  • Execute root-cause analysis for incidents and document corrective actions
  • Maintain operational readiness for model serving in collaboration with cross-domain teams
  • Produce critical deliverables including release plans and incident reports
  • Support Tier-4 incident response actions to uphold service-level agreements

Benefits

  • Involvement in a critical AI initiative for the Department of War
  • Opportunity to work in a high-security and advanced technological environment
  • Engagement with a wide array of federal and military stakeholders
  • Potential for professional growth in a cutting-edge technology domain

Full Job Description

Everforth ECS is seeking a Release/Incident Operations Engineer to work in the National Capital Region, covering the Pentagon, Falls Church, and Fairfax. Please note: this position is contingent upon contract award.

The War Data Platform (WDP) is a key initiative within the U.S. Department of War's (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence (AI) on the battlefield. The WDP extends to Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.

The Release/Incident Operations Engineer coordinates release operations and incident triage support for AI and machine learning model-serving pipelines across WDP Core Integration's full multi-enclave environment, ensuring deployment consistency and operational continuity in direct support of DoW missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership. This role is central to sustaining mission-ready AI model-serving performance across all classification levels through disciplined release governance, root-cause analysis, and proactive operational risk management.

Responsibilities

• Coordinates release operations for artificial intelligence and machine learning model serving across War Data Platform (WDP) Core Integration environments supporting Department of War missions, Joint Staff analysts, Combatant Command elements, and Senior Executive Service leadership.
• Directs change-window execution, rollback readiness activities, and deployment governance for model-runtime updates, serving endpoints, and pipeline modifications.
• Conducts incident triage support by analyzing telemetry, reviewing service health indicators, and initiating stabilization actions across Kubernetes clusters, VMware environments, GitLab Continuous Integration pipelines, Prometheus metrics, Grafana dashboards, and Elastic Stack observability tooling.
• Executes root-cause analysis activities for serving incidents by collecting operational evidence, reconstructing failure sequences, validating remediation steps, and documenting corrective actions aligned with mission assurance requirements.
• Maintains operational readiness for model serving by coordinating with Platform One, Cloud One, multi-national engineering teams, and cross-service mission partners to align release activities with enclave-specific constraints, cross-domain deployment architectures, and security requirements.
• Produces mission-critical deliverables including release plans, rollback packages, incident triage reports, root-cause analysis documentation, operational risk assessments, and service restoration summaries.
• Strengthens program value by advancing deployment consistency, reducing mission risk, and reinforcing operational continuity across all enclaves.
• Supports Tier-4 incident response actions to maintain service-level agreements and sustain mission performance for enterprise artificial intelligence model-serving capabilities.
• Performs other duties as assigned.

Qualifications

• Current Secret security clearance with the ability to obtain and maintain a Top Secret (TS) security clearance with Sensitive Compartmented Information (SCI).
• 3 or more years of experience in release engineering, incident operations, or platform support roles within a federal government or classified environment, including demonstrated hands-on responsibility for change-window execution, deployment governance, rollback readiness, and incident triage for AI/ML model-serving pipelines or equivalent enterprise cloud-hosted services across multi-enclave or multi-classification environments.
• Hands-on experience applying enterprise observability and container orchestration tooling, including Kubernetes, GitLab CI, VMware, Prometheus, Grafana, and Elastic Stack, to diagnose serving failures, analyze pipeline telemetry, execute root-cause analysis, and coordinate stabilization activities across Unclassified, Secret, and Top Secret network environments.
• Active DoW 8570/8140-compliant IAT Level II certification, such as CompTIA Security+ CE, CompTIA CySA+, CompTIA Cloud+, Cisco CCNA Security, GIAC GSEC, GIAC GCED, or ISC2 SSCP, as required for access to DoW information systems.
• Strong problem-solving and decision-making capabilities, with a proven ability to weigh the relative costs and benefits of potential actions and identify the most appropriate solution.

About ECS

ECS is a leading provider of digital solutions and services to the federal government. The company was founded in 2001 by Roy Kapani and has since grown to become a trusted partner to a wide range of government agencies. ECS offers a broad range of services, including cloud computing, cybersecurity, and artificial intelligence. The company has been recognized for its innovative solutions and has won numerous awards, including the AWS Public Sector Partner of the Year award.
Size
2,000 employees