The Role
We are looking for an AI Infrastructure / Platform Engineer to work on the foundational systems that power our data science and AI platform.
You will work across the infrastructure layer beneath our ML and AI workflows: data pipelines, orchestration, compute provisioning, model serving, and observability. You will also play a key role in operationalizing our agentic AI platform, ensuring agents are hosted, monitored, and integrated into production-grade systems.
What You'll Do
Data Pipelines & Orchestration
• Design, build, and maintain production data pipelines that ingest, transform, and deliver structured and unstructured data to downstream ML workflows.
• Own and extend our Prefect-based orchestration layer, including flow scheduling, error handling, retry logic, and human-in-the-loop (HITL) suspend/resume patterns.
• Build and maintain feature stores, data contracts, and promotion workflows that ensure data quality and traceability from raw ingestion through model consumption.
• Collaborate with data scientists to operationalize experimental workflows into reliable, repeatable pipelines.
ML/AI Infrastructure & Deployment
• Build and maintain scalable infrastructure for model training, retraining, and inference (batch and real-time), including GPU compute provisioning and container orchestration.
• Implement and manage model serving infrastructure, including containerized endpoints, API gateways, and self-serve deployment frameworks for the data science team.
• Deploy and manage monitoring systems that track model health, data drift, prediction consumption, and pipeline reliability.
• Ensure all deployed systems are highly available, resilient, and well-documented with clear data lineage and runbooks.
Agentic AI Platform & Tooling
• Support the buildout and operationalization of agentic AI workflows, including agent hosting, lifecycle management, and integration with Model Context Protocol (MCP) servers.
• Build shared tooling and infrastructure that enables data scientists to develop, test, and deploy agents with minimal friction.
• Design and implement evaluation frameworks and quality standards for AI agents, including automated benchmarking, regression testing, and production-readiness criteria.
• Ensure observability and reliability across agent execution environments, including logging, tracing, and performance monitoring.
DevOps & Platform Engineering
• Deploy, configure, and maintain shared AI platform services (e.g., observability tools, memory layers, evaluation platforms) as containerized workloads on Azure, with end-to-end ownership of networking, access, and connectivity between services.
• Manage Azure cloud infrastructure, including container registries, managed identities, Key Vault secrets, storage backends, and virtual network configurations.
• Maintain CI/CD pipelines, branch protection policies, and release management workflows across data science repositories.
• Continuously evaluate and adopt tools and technologies that improve platform reliability, developer experience, and team velocity.
What We're Looking For
Required
• 3+ years of experience in data engineering, MLOps, or ML infrastructure roles, with a clear track record of building and maintaining production data and ML pipelines.
• Strong proficiency in Python and SQL, with hands-on experience building ETL/ELT pipelines and data transformation workflows.
• Experience with workflow orchestration tools (Prefect, Airflow, Dagster, or similar) in production environments.
• Solid understanding of containerization and cloud infrastructure: Docker, Kubernetes, and at least one major cloud provider (Azure preferred).
• Hands-on experience deploying and operating containerized services in cloud environments, including configuring networking, load balancing, and service-to-service connectivity.
• Experience with model serving and deployment patterns (batch inference, real-time APIs, feature stores).
• Familiarity with monitoring and observability tooling for pipelines and deployed models (data drift detection, health metrics, alerting).
• Strong documentation habits and the ability to communicate technical architecture clearly to diverse stakeholders.
Preferred
• Experience with Azure services: Container Apps, ACI, ACR, Blob Storage, Key Vault, Managed Identities, VNets.
• Familiarity with Prefect (especially cloud-managed work pools, result backends, and HITL patterns).
• Experience with dbt, Snowflake, or similar data transformation and warehousing tools.
• Exposure to LLM serving infrastructure and agentic workflow frameworks (e.g., MCP, LangChain, or similar).
• Experience standing up and maintaining third-party AI/ML platform tools (e.g., Langfuse, MLflow, or similar observability and evaluation platforms).
• Experience managing internal Python package distribution (private PyPI, Artifactory, or similar).
• Familiarity with Git-based release management, branch protection, and CI/CD for data science repos.
Benefits
• Comprehensive health, dental, and vision insurance.
• Retirement savings plan with company match.
• Hybrid/flexible work arrangements and a supportive work environment.
Culture
• Demonstrate a strong bias for action and execute quickly with limited guidance.
• Take full ownership of outcomes and drive problems to resolution.
• Approach challenges with a solutions-first mindset and deliver measurable results.
• Maintain composure under pressure while keeping momentum and focus.
• Simplify complex issues into clear, actionable steps that move the work forward.
Base Salary Range: $140,000 to $200,000 per year