Job Location: Dallas, TX
Summary
The Staff Data Engineer, MLOps leads the design, build, and optimization of Hershey's machine learning operations platform, enabling data science and AI teams to develop, deploy, monitor, and govern ML models at enterprise scale. Sitting within Platform Engineering, this role owns the infrastructure, tooling, and automation that move models from experimentation to production with speed and confidence.
This is a foundational role: you will define and build Hershey's MLOps capability from the ground up, shaping the platform, establishing engineering standards, and growing the team as the function matures. Whether your background is in ML engineering, data platform engineering, or DevOps with ML exposure, we're looking for someone who can bridge the gap between data science and production infrastructure on Azure Cloud and Databricks.
Major Duties & Responsibilities
1. ML Platform Engineering & Infrastructure
- Design and maintain the end-to-end MLOps platform on Azure and Databricks: model training infrastructure, feature stores, experiment tracking, model registries, and serving endpoints.
- Build and optimize CI/CD pipelines for automated model training, validation, packaging, and deployment across environments.
2. Model Deployment, Monitoring & Lifecycle Management
- Implement model serving patterns (batch, real-time, edge) with blue-green and canary deployment strategies for safe rollouts.
- Build monitoring frameworks for data drift, concept drift, and prediction quality; automate alerting and retraining triggers.
3. Governance, Reproducibility & Responsible AI
- Enforce ML governance: model versioning, experiment lineage, artifact management, approval workflows, and audit trails.
- Embed responsible AI practices including explainability tooling, bias detection, and documentation standards.
4. Infrastructure as Code & Cost Optimization
- Author IaC (Terraform/Bicep) for Azure ML workspaces, Databricks clusters, networking, and compute; optimize costs through autoscaling, spot instances, and GPU scheduling.
5. Collaboration & Enablement
- Partner with Data Scientists to productionize models; develop self-service templates and documentation for platform onboarding; mentor junior engineers.
Required Knowledge, Skills, and Abilities
- MLOps & ML Engineering: Experience taking ML models from experimentation to production, including training automation, model packaging, deployment, and monitoring. Our environment uses MLflow, Databricks Model Serving, and Azure Machine Learning.
- Cloud & Platforms: Strong hands-on experience with Azure Cloud and Databricks. Familiarity with services such as Azure ML, AKS, Azure DevOps, Data Factory, Unity Catalog, Workflows, and Model Registry.
- Programming & Development: Strong Python and SQL; experience with ML frameworks (PyTorch, Scikit-learn, XGBoost); comfort building APIs and writing modular, testable code.
- Collaboration & Communication: Proven ability to partner across Data Science, Architecture, and business teams; experience mentoring engineers and driving technical standards.
Preferred Skills
- CI/CD & IaC: ML-specific CI/CD pipelines (Azure DevOps, GitHub Actions); Terraform or Bicep for infrastructure provisioning.
- Containerization & Orchestration: Experience with Docker and Kubernetes for model serving and workload management.
- Monitoring & Observability: Drift detection, prediction quality tracking, and observability tooling (Evidently AI, Azure Monitor, Grafana).
- Certifications: Azure Data Engineer (DP-203), Azure AI Engineer (AI-102), or Databricks ML Professional.
Experience & Education
- Bachelor's degree in Computer Science, Engineering, Data Science, or related field; Master's preferred.
- 5-10 years in software, ML, data platform, or infrastructure engineering, including 3+ years building or operating ML pipelines, model serving infrastructure, or ML platform tooling.
- Hands-on experience with Azure and Databricks in a production ML context.
#LI-Onsite