ECS

Senior ML Observability Engineer

$120K - $160K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • Current Secret security clearance with the ability to obtain and maintain a Top Secret/SCI clearance
  • 10+ years experience in systems engineering, platform operations, or ML/AI infrastructure roles
  • Hands-on experience with observability pipelines using tools like OpenTelemetry and Grafana
  • Experience in cross-domain environments including NIPRNet, SIPRNet, and JWICS
  • CompTIA Cloud+ certification or equivalent, demonstrating foundational knowledge of cloud infrastructure
  • Strong problem-solving and decision-making skills
  • Excellent interpersonal and communication abilities

Responsibilities

  • Design and govern AI observability architectures across various security enclaves
  • Develop telemetry pipelines and metrics for model performance and operational readiness
  • Integrate observability into data pipelines and model-deployment workflows
  • Configure instrumentation using platforms such as Prometheus and Splunk
  • Conduct observability readiness reviews and collaborate on cybersecurity measures
  • Maintain observability consistency across multi-enclave environments
  • Produce standards and reports to enhance model reliability and mission assurance

Benefits

  • Opportunity to contribute to mission-critical AI initiatives
  • Collaboration with top-tier defense and intelligence personnel
  • Engagement with cutting-edge technology in a multi-enclave framework
  • Professional development in a high-security environment
  • Exposure to unique operational data-handling challenges

Full Job Description

ECS is seeking a Senior ML Observability Engineer to work in the National Capital Region, covering the Pentagon, Falls Church, and Fairfax. Please note: This position is contingent upon contract award.

The War Data Platform (WDP) is a key initiative within the U.S. Department of War's (DoW) AI-First strategy introduced in early 2026. The WDP focuses on operational warfighting data and aims to accelerate the deployment of artificial intelligence (AI) on the battlefield. The WDP extends to Unclassified, Secret, and Top Secret environments, and supports collaboration between Combatant Commands, Joint Staff directorates, Senior Executive Service leaders, and operational analysts.

The Senior ML Observability Engineer architects and governs the instrumentation and telemetry infrastructure needed to ensure production AI and machine learning models deployed across WDP's multi-enclave environment perform reliably and securely at mission scale. This role is essential to maintaining real-time visibility into model behavior, pipeline execution, and cross-domain access interactions in direct support of Combatant Command and Joint Staff decision-making needs.
• Designs, implements, and governs observability and instrumentation architectures supporting AI and machine learning model-serving operations across Unclassified, Secret, and Top Secret enclaves within the War Data Platform (WDP) Core Integration enterprise.
• Develops semantic conventions, runtime instrumentation patterns, and telemetry pipelines that generate latency metrics, error signatures, throughput indicators, model-specific performance signals, and operational readiness measurements for deployed models and serving surfaces.
• Integrates observability capabilities into existing data pipelines, model-deployment workflows, API access patterns, and serving runtime frameworks to provide mission-relevant monitoring aligned with Combatant Command and Joint Staff decision-support needs.
• Configures and validates instrumentation using platforms such as OpenTelemetry, Prometheus, Grafana, Elastic, Splunk, Amazon CloudWatch, and service mesh telemetry components to deliver real-time visibility into model behavior, cross-domain access interactions, and pipeline execution characteristics.
• Conducts observability readiness reviews, supports test and evaluation gates, and collaborates with cybersecurity personnel to embed anomaly-detection signals aligned with Zero Trust and DoW cyber standards.
• Works with serving engineers, pipeline engineers, platform teams, and external provider integration engineers to maintain observability consistency across enclaves and resolve domain-specific telemetry constraints.
• Produces observability standards, instrumentation specifications, dashboards, alerting configurations, and performance analysis reports that strengthen reliability, accelerate incident response, and reinforce mission assurance for production model access across all security networks.
• Performs other duties as assigned.

Qualifications

• Current Secret security clearance with the ability to obtain and maintain a Top Secret (TS) security clearance with Sensitive Compartmented Information (SCI).
• 10 or more years of progressive experience in systems engineering, platform operations, or ML/AI infrastructure roles, with a demonstrated focus on observability, telemetry, and monitoring in classified or federal government cloud environments.
• Hands-on experience designing and implementing observability pipelines using industry-standard tooling such as OpenTelemetry, Prometheus, Grafana, Elastic, Splunk, or Amazon CloudWatch, including instrumentation of AI/ML model-serving runtimes and data pipelines.
• Experience operating across multi-enclave environments, including NIPRNet, SIPRNet, and JWICS, with demonstrated ability to adapt telemetry and observability architectures to cross-domain constraints and multi-level security requirements.
• CompTIA Cloud+ certification or equivalent, demonstrating foundational knowledge of cloud infrastructure, security, and operational monitoring standards.
• Strong problem-solving and decision-making capabilities, with a proven ability to weigh the relative costs and benefits of potential actions and identify the most appropriate solution.
• Highly developed interpersonal and oral/written communication skills, with the ability to effectively and professionally interact with a diverse set of stakeholders (from peers to end-users to executive management).

About ECS

ECS is a leading provider of digital solutions and services to the federal government. The company was founded in 2001 by Roy Kapani and has since grown to become a trusted partner to a wide range of government agencies. ECS offers a broad range of services, including cloud computing, cybersecurity, and artificial intelligence. The company has been recognized for its innovative solutions and has won numerous awards, including the AWS Public Sector Partner of the Year award.
Size
2,000 employees
