Principal MLOps Engineer

Raft

• $150K — $200K *

Boston, MA 02115

In-Person

Enterprise Technology

5 - 7 years of experience

Reposted Yesterday

Job Overview by Ladders

Qualifications

7+ years in software engineering, platform engineering, DevOps, MLOps, or related roles
5+ years of experience with Docker and Kubernetes
Strong experience with enterprise cloud infrastructure (AWS, Azure)
Experience managing GPU-enabled Kubernetes clusters
Solid understanding of CI/CD and Agile/Scrum methods
Ability to work independently and collaboratively
Must be able to obtain a Top Secret clearance

Responsibilities

Design and maintain scalable MLOps infrastructure for production ML systems
Lead the maturation of Raft's internal ML platform
Deploy and manage ML workloads on Kubernetes
Support model serving and infrastructure for various ML use cases
Build and maintain CI/CD workflows for ML services
Collaborate with teams to transition models to operational deployment
Enhance reliability and security of ML infrastructure

Benefits

Fully covered healthcare, dental, and vision
401(k) plan with company match
Take-as-you-need PTO + 11 paid holidays
Education and training benefits
Annual budget for tech/gadgets
Remote, hybrid, and flexible work options
Team off-site events
Generous referral bonuses

Full Job Description

This is a U.S.-based position. All of the programs we support require U.S. citizenship to be eligible for employment. All work must be conducted within the continental U.S.

We're looking for an experienced Principal MLOps Engineer to support our customers and join our passionate team of high-impact problem solvers.

About the role:

Raft is building mission-critical AI and data platforms for the Department of Defense (DoD). Our systems ingest and process massive volumes of real-time data from hundreds of sensors and operational sources, transform that data into usable intelligence, and deliver it to operators through mission applications and common operational pictures that support time-sensitive decision-making.

Our platform operates at scale, processing billions of events per day with low-latency data pipelines and cloud-native infrastructure. As Raft expands its AI capabilities, we are investing in a more mature end-to-end machine learning platform to support model development, evaluation, deployment, monitoring, and lifecycle management across both cloud and constrained operational environments.

In this role, you will help design, deploy, and mature Raft's ML platform and MLOps infrastructure. You will work across Kubernetes-based deployment environments, GPU-enabled infrastructure, model serving systems, CI/CD pipelines, and secure production operations to enable rapid and reliable delivery of machine learning capabilities. This role is ideal for someone who understands both the infrastructure needed to run ML systems in production and the practical needs of ML engineers building and deploying models.

What you'll do:
Design, build, and maintain secure, scalable MLOps infrastructure and deployment pipelines for production ML systems
Help mature Raft's internal ML platform and model lifecycle capabilities, including model packaging, registry/catalog workflows, deployment, monitoring, and operational support
Deploy and manage machine learning workloads on Kubernetes, including GPU-enabled clusters
Support model serving and inference infrastructure for a range of ML use cases, including traditional ML, computer vision, speech/audio, and LLM-based systems
Build and maintain CI/CD workflows for ML services, model artifacts, and platform components
Partner closely with ML engineers, software engineers, and product teams to move models from experimentation to reliable operational deployment
Improve observability, reliability, security, and maintainability across ML infrastructure and services
Help evaluate and standardize runtime patterns, serving frameworks, and deployment architectures for production ML workloads
Contribute to infrastructure decisions across edge, on-prem, and cloud-hosted deployment environments
Support compliance-driven deployment practices and secure software supply chain requirements in defense environments
Get hands-on with customers at the most forward-leaning places in the Department of War

What we are looking for:
7+ years of relevant hands-on experience in software engineering, platform engineering, DevOps, MLOps, or related technical roles
5+ years of experience with Docker and Kubernetes in production environments
5+ years of experience supporting enterprise cloud infrastructure or applications in AWS, Azure, or similar environments
Strong experience provisioning, operating, and troubleshooting Kubernetes clusters in production
Experience building and maintaining machine learning platforms, infrastructure, or pipelines used by engineering or data science teams
Practical experience deploying machine learning workloads on Kubernetes
Experience managing clusters or workloads that use GPUs
Strong understanding of Helm and Kubernetes deployment patterns
Strong scripting or programming skills, preferably in Python
Experience with modern software engineering practices including Git, CI/CD, DevOps, and Agile/Scrum workflows
Strong troubleshooting, systems thinking, and communication skills
Ability to work independently and collaboratively in a fast-moving environment
Ability to obtain and maintain a Top Secret clearance
Ability to obtain Security+ certification within the first 90 days of employment

Highly preferred:
Experience with ML model serving and inference platforms such as Triton Inference Server, KServe, Ray Serve, vLLM, or similar technologies
Experience with secure and compliant deployment practices in regulated or government environments
Experience with Kubernetes-based ML platforms such as Kubeflow
Familiarity with service mesh technologies such as Istio
Experience provisioning and debugging complex CI/CD systems
Experience with infrastructure as code tools such as Terraform
Familiarity with software supply chain security, container hardening, vulnerability management, and runtime scanning
Experience supporting ML systems across multiple deployment environments, including cloud, on-prem, and edge
Background working with machine learning engineers on model training, evaluation, packaging, and release workflows
Familiarity with storage and artifact systems used in ML platforms, such as S3-compatible object stores, registries, and metadata/catalog systems

What success looks like:
You help Raft stand up a more mature and repeatable ML platform for deploying and managing models in production
ML engineers can move faster because deployment, serving, and platform workflows are clearer, more reliable, and easier to use
Model deployments become more secure, observable, and supportable across real-world mission environments
The organization gains stronger infrastructure for model lifecycle management, including deployment standards, runtime patterns, and platform ownership

Clearance Requirements:
Ability to obtain and maintain a Top Secret clearance

Work Type:
Remote in the DMV; McLean, VA; Boston, MA; San Antonio, TX; Colorado Springs, CO; Tampa, FL; or Honolulu, HI locations only
May require up to 40% travel

Salary Range: $150,000.00 - $200,000.00

What we will offer you:
Highly competitive salary
Fully covered healthcare, dental, and vision
401(k) and company match
Take-as-you-need PTO + 11 paid holidays
Education & training benefits
Annual budget for your tech/gadget needs
Monthly box of yummy snacks to eat while doing meaningful work
Remote, hybrid, and flexible work options
Team off-site in fun places!
Generous referral bonuses
And more!

* Ladders Estimates
