Job Title: MLOps Platform Engineer (SageMaker)
Location(s): Onsite
Job Summary: This position is with the Enterprise Analytical Data & Integration Team. The ideal candidate will have extensive experience in cloud infrastructure or ML platform operations, with a specific focus on AWS and Amazon SageMaker. The role involves designing, building, and operationalizing an enterprise ML platform on AWS SageMaker Unified Studio.
Key Responsibilities:
- Set up SageMaker Unified Studio platform - domain configuration, project provisioning, persona-based roles, and multi-environment promotion workflows.
- Build MLOps pipelines using SageMaker Pipelines - data extraction from Snowflake, preprocessing, training, evaluation, and model registration.
- Manage SageMaker Model Registry - cross-account model promotion, versioning, immutability, and lineage tracking.
- Configure MLflow experiment tracking - auto-logging of parameters, metrics, and artifacts.
- Set up identity and access management - Okta SSO, SailPoint entitlements, persona-based execution roles, service roles for pipelines.
- Build model serving - real-time SageMaker endpoints and batch prediction workflows.
- Set up model monitoring - data drift, model drift, performance degradation detection.
- Configure data catalog - searchable datasets, access-level visibility, access-request workflows, lineage.
- Own platform operations - observability (CloudWatch, Datadog), logging, custom images, instance availability.
Required Qualifications:
- 10-15 years of software engineering experience focused on cloud infrastructure or ML platform operations.
- 5 years of hands-on experience with AWS, including deep expertise in Amazon SageMaker (Studio, Pipelines, Model Registry, Endpoints, Feature Store).
- 3 years of experience building and operating production MLOps pipelines - training, versioning, deployment, monitoring, rollback.
- Experience with SageMaker Unified Studio or Studio Classic - domain/project setup, blueprints, multi-tenant configuration.
- Infrastructure-as-Code with Terraform, CDK, or CloudFormation.
- IAM design for ML platforms - execution roles, service roles, cross-account access, Lake Formation, SSO/SAML.
- MLflow or equivalent experiment tracking.
- SageMaker Pipelines or similar workflow orchestration (Airflow, Step Functions).
- Model serving - real-time endpoints, batch transform, auto-scaling, endpoint monitoring.
- Snowflake as a data source for ML pipelines.
- Kubernetes (EKS) and container orchestration.
- Networking and security - VPC, security groups, private endpoints, cross-account connectivity.