Team Name: Battle.net & Online Products
Job Title: Senior Site Reliability Engineer, Data & Analytics
Requisition ID: R027436
Job Description:
This Senior Site Reliability Engineer role is on our Data & Analytics team, partnering with data, analytics, ML, and platform engineering to improve the reliability, scalability, and performance of large-scale data platforms, analytics pipelines, ML training pipelines, and inference services.
In addition to core SRE responsibilities, this role will build operational and automation tooling that reduces toil, speeds up issue resolution, and improves engineering velocity. This includes contributing to internal platform services such as shared tooling, data integrations, and access-control patterns used across Blizzard.
The ideal candidate is a production-minded SRE or platform engineer who is comfortable operating critical systems, writing software, and building tools that improve engineering efficiency without compromising reliability.
This role is open to candidates based in Irvine, CA or Albany, NY (hybrid or on-site), as well as fully remote candidates.
Responsibilities
- Participate in an on-call rotation and drive incidents to resolution
- Lead blameless postmortems and identify systemic reliability improvements
- Partner with data, ML, and platform teams to improve batch, streaming, training, and inference workloads
- Support ML training pipelines and inference services, including GPU workloads
- Help define how data and ML services run on Kubernetes
- Design and build automation and operational tooling (e.g., workflows, diagnostic tooling, runbooks) to reduce on-call burden
- Build and evolve centralized platform services, including shared tooling, data integrations, and access controls
- Diagnose and resolve reliability, performance, and cost issues across distributed systems
- Champion automation, documentation, and practices that reduce toil
- Maintain infrastructure using Terraform and infrastructure-as-code principles
- Improve CI/CD and GitOps workflows (Jenkins, GitHub Actions, ArgoCD)
- Operate and improve containerized services on Kubernetes
- Define and measure reliability using SLIs, SLOs, and error budgets
- Run load tests, capacity modeling, and production validation
- Build internal tools and paved paths that help teams operate safely and efficiently
Minimum Requirements
- Experience operating reliable, distributed systems in SRE, platform, or similar roles
- Experience with data, analytics, ML, or large-scale distributed workloads
- Strong knowledge of Linux, containers, Kubernetes, and cloud infrastructure
- Experience building automation or internal tools (Python, Go, shell, etc.)
- Experience with infrastructure-as-code (e.g., Terraform)
- Experience with CI/CD or GitOps systems (e.g., Jenkins, GitHub Actions, ArgoCD)
- Familiarity with observability (metrics, logs, traces, alerting, incident response)
- Solid understanding of SRE concepts (SLIs, SLOs, error budgets, postmortems)
- Experience using modern development and automation practices to improve reliability and efficiency
- Experience building internal tooling, automation, or developer productivity systems
- Strong communication skills with technical and cross-functional partners
Bonus Points
- Experience with data and ML systems (training pipelines, model serving, GPU workloads)
- Experience with distributed systems and messaging (Kafka, Pub/Sub)
- Experience working in Kubernetes-based environments
- Familiarity with observability tools (Prometheus, Grafana)
- Experience operating systems in cloud environments (GCP, AWS)
Rewards
We provide a suite of benefits that promote physical, emotional and financial well-being for 'Every World' - we've got our employees covered! Subject to eligibility requirements, the Company offers comprehensive benefits including:
- Medical, dental, vision, health savings account or health reimbursement account, healthcare spending accounts, dependent care spending accounts, life and AD&D insurance, disability insurance;
- 401(k) with Company match, tuition reimbursement, charitable donation matching;
- Paid holidays and vacation, paid sick time, floating holidays, compassion and bereavement leaves, parental leave;
- Mental health & wellbeing programs, fitness programs, free and discounted games, and a variety of other voluntary benefit programs like supplemental life & disability, legal service, ID protection, rental insurance, and others;
- If the Company requires that you move geographic locations for the job, then you may also be eligible for relocation assistance.
Eligibility to participate in these benefits may vary for part-time and temporary full-time employees and interns with the Company. You can learn more by visiting https://www.benefitsforeveryworld.com/.
In the U.S., the standard base pay range for this role is $101,000.00 - $186,754.00 Annual. These values reflect the expected base pay range of new hires across all U.S. locations. Ultimately, your specific range and offer will be based on several factors, including relevant experience, performance, and work location. Your Talent Professional can share this role's range details for your local geography during the hiring process. In addition to a competitive base pay, employees in this role may be eligible for incentive compensation. Incentive compensation is not guaranteed. While we strive to provide competitive offers to successful candidates, new hire compensation is negotiable.