About the Team
With deep domain expertise, advanced technical capabilities, and a proven track record of successful collaborations, the AI Enablement & Machine Learning team at CNN is accelerating our digital transformation through strategic applications of machine learning and AI technologies.
The ML Platform group within ML Foundations builds and maintains the infrastructure, deployment tooling, and observability that enable CNN's Machine Learning and AI Systems teams to move from prototype to production with velocity and confidence. We support diverse model architectures - two-tower, bandits, LLM-based systems, and traditional ML - across rapid experimentation and scaled production deployment.
Our vision is that CNN's ML and AI teams operate with velocity and confidence, supported by infrastructure that handles everything from rapid experimentation to scaled production deployment across diverse model types and use cases.
Your New Role...
As a Staff Software Engineer on ML Platform, you will work across teams to design, build, and operate the infrastructure foundations that power model training, serving, experimentation, and observability for our Machine Learning and AI Systems teams.
You will partner with ML engineers, data engineers, and AI Systems engineers to understand production needs, build reliable infrastructure, and deliver tooling that accelerates the team.
Key challenges you will tackle:
Outerbounds Migration: Own end-to-end migration of CNN's ML training and orchestration workloads to Outerbounds-managed Metaflow, with zero production disruption and a clear path to self-service for ML practitioners.
Observability and Cost Attribution: Establish comprehensive observability across all ML products, AI applications, and shared infrastructure - including the monitoring, alerting, and diagnostic tooling engineers need to operate production systems reliably, and cost attribution that lets us understand and govern ML/AI spend by team, product, and use case.
Model and Application Registry: Design and operate a unified registry and versioning system for ML models and AI applications, providing lineage, reproducibility, and a clean handoff between development and production.
Feature Resolution Framework: Evolve our existing feature resolution framework from a loosely coupled set of pipelines into a real platform - one that lets ML and AI Systems engineers configure content types and events to intercept, register feature-generation APIs, and land resolved features as durable data products in the feature store. The features this framework produces - ML- and LLM-generated alike - power everything from analytics to training to inference to user-facing rendering.
What You'll Do
- Design and own infrastructure, deployment tooling, and developer experience for ML and AI Systems teams
- Lead architectural decisions across orchestration, serving, observability, and experimentation infrastructure
- Build self-service tooling that lets ML practitioners move from prototype to production without platform team dependencies
- Establish engineering standards for ML/AI infrastructure, including reliability, cost governance, and operational excellence
- Review designs and code, mentor engineers, and lead cross-team initiatives
- Partner with Data Platform on infrastructure coordination and data access patterns
- Communicate effectively across audiences - technical documentation, design reviews, and stakeholder interactions
The Essentials
- 8+ years building production infrastructure or platform systems, with a Bachelor's degree in Computer Science, Information Technology, or a related technical field (or 6+ years with a Master's degree)
- Deep expertise in distributed systems, with a track record of shipping highly available, low-latency infrastructure
- Strong proficiency in Python and at least one of Go, Java, or C++
- Expertise with cloud infrastructure and IaC, especially AWS and Terraform
- Experience with ML or data infrastructure - orchestration, serving, deployment tooling, observability, or experimentation frameworks
- Proven track record of leading complex platform projects from concept to production - knowing when to own decisions, when to rally the right people for alignment, and when to escalate
- Collaborative mindset, understanding that great platform work depends on deep partnership with the teams you serve
- A passion for helping CNN's engineering organization grow through mentorship, talent acquisition, and professional development
The Nice to Haves
- Experience with Metaflow, SageMaker, or comparable ML orchestration platforms
- Experience with model registries, feature stores, or experimentation frameworks
- Background in cost governance, FinOps, or multi-tenant infrastructure
- Practical experience supporting LLM-based or GenAI production systems
- Prior experience working closely with machine learning engineers
How We Get Things Done...
This last bit is probably the most important! Here at WBD, our guiding principles are the core values by which we operate and are central to how we get things done. You can find them at www.wbd.com/guiding-principles/ along with some insights from the team on what they mean and how they show up in their day to day. We hope they resonate with you and look forward to discussing them during your interview.