Job Description
Site Reliability Engineer, Team Lead (Player-Coach)
Location: Candidates local to Austin, TX or Cranberry Woods, PA (U.S.) are preferred, but fully remote within the mainland U.S. is an option.
Department: Global Cloud Operations
Reports to: Vice President, Global Cloud Operations
Visa Sponsorship: Visa sponsorship is not offered for this position. Must be a U.S. citizen or permanent resident.
What You'll Do
Purpose: Establish and operate Omnicell's Site Reliability Engineering function, balancing hands-on engineering with practice design, coaching, and cross-functional leadership.
Primary Impact
You will ensure Omnicell's Tier-1 cloud services are observable, resilient, and dependable, so hospitals, pharmacies, and clinicians can rely on our platform without interruption.
Reliability Practice & Operating Model
- Define and publish SLIs, SLOs, and error budgets for the top 5-10 Tier-1 customer-facing services in partnership with Product and Engineering.
- Design Omnicell's incident command structure, including severity definitions, declaration criteria, war-room protocols, stakeholder communications, and post-incident review standards.
- Establish and operationalize a sustainable on-call model, including fair rotations, paging discipline, escalation paths, and coordination with managed service partners (IBM, HCL).
- Partner with the VP to migrate the interim incident response RACI - currently held by matrixed individuals across IT, Engineering, Support, and Enterprise Security - into a durable SRE-owned model.
- Select and stand up the primary observability platform, preferring extension of existing Omnicell contracts (Datadog, IBM/Instana, Prometheus/Grafana, OpenTelemetry, or other tooling already in use) over net-new procurement. Define the instrumentation standards all new services must meet.
- Develop and track operational KPIs (e.g., MTTR, SLO attainment, change-failure rate, incident recurrence, cost per workload) and present reliability insights and roadmaps in executive Cloud Ops reviews.
Hands-On Engineering & Incident Leadership
- Instrument Tier-1 services directly, building dashboards, alerts, and runbooks yourself.
- Participate in on-call rotations and command Sev-1 and Sev-2 incidents, leading blameless postmortems and driving corrective actions to completion.
- Contribute production code and infrastructure-as-code (Terraform preferred) to the platform. Oversee the design and evolution of the CI/CD pipelines. The current stack is Codefresh, TeamCity, GitHub Actions, and Octopus Deploy, and we are consolidating over time.
- Administer and scale our Kubernetes platform, including secure and compliant cluster configurations. Working knowledge of Docker, Helm, and Service Mesh (Istio or Linkerd) expected.
- Plan and execute chaos and failover exercises to validate real-world resilience.
AI-Driven Operations
- Architect Omnicell's AIOps strategy, evaluating ML-based anomaly detection, alert correlation, automated root-cause analysis, and LLM-assisted runbooks.
- Make disciplined build-versus-buy decisions and integrate AI tooling only where it delivers measurable reliability gains.
- Ensure AI-assisted operations meet auditability, explainability, and compliance requirements (HIPAA, SOC 2).
Coaching & Team Building
- Serve as formal coach to an Engineer III SRE, pairing on incidents, reviewing design proposals, and supporting growth toward senior levels.
- Plan and lead the next 2-4 SRE hires, including role definitions, interview loops, and hiring decisions.
- Represent SRE in architecture reviews, launch readiness assessments, and cross-functional reliability discussions.
What Success Looks Like in the First Six Months
Concrete outcomes this role will be evaluated against in the first half-year. These are drawn from the Cloud Ops 90-day plan and its extension into the following quarter.
- Month 1: SLOs drafted for the top 5 Tier-1 services with Product sign-off. Severity rubric published. First live tabletop Sev-1 run against the interim RACI.
- Month 2: Observability platform selection finalized. Instrumentation standard published. Engineer III SRE hired and onboarded.
- Month 3: On-call rotation live. First real Sev-1 commanded under the new structure with a blameless postmortem completed and follow-ups tracked.
- Months 4-6: Error budget policy in effect for the first 3 services. First incident review at executive level. Interview loop running for the next SRE hires. Initial AIOps evaluation and pilot scope defined.
Who You Are
- Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent experience.
- 7+ years of experience in software or platform engineering, with at least 4 of those in an SRE, DevOps, or platform reliability role.
- At least 2 years of formal technical leadership, tech-lead, or staff-level experience with mentorship responsibilities.
Preferred Qualifications
- Proven experience leading SRE, DevOps, or platform engineering teams in a cloud-native production environment, with demonstrated experience building a practice from zero or near-zero: you have set SLOs, defined incident command, and introduced error-budget thinking to an organization that did not have it.
- Deep hands-on expertise with at least one major public cloud (AWS, Azure, or GCP), including networking, IAM, and managed services.
- Strong background in CI/CD pipeline design and management (familiarity with Codefresh, GitHub Actions, Jenkins, TeamCity, or equivalent).
- Experience implementing Infrastructure as Code using Terraform (preferred), Chef, Puppet, or similar tools.
- Proficiency in Python or another object-oriented programming language for automation, tooling, and production services.
- Experience administering and scaling Kubernetes clusters, including secure and compliant platform configurations. Working knowledge of Docker, Helm, and Service Mesh technologies (Istio, Linkerd).
- Hands-on experience designing modern observability platforms using tools such as Datadog, Prometheus, Grafana, OpenTelemetry, Elasticsearch/Kibana, or equivalent, with an opinion about what a good telemetry stack looks like.
- Familiarity with integrating AI/ML-based anomaly detection, alerting, or LLM-assisted triage pipelines - or strong conviction about where AIOps should and should not be applied in a regulated environment.
- Real incident command experience for customer-impacting Sev-1 events, with blameless postmortem practice and documented follow-up discipline.
- Ability to coach and mentor, with direct evidence of growing junior and mid-level engineers. You will eventually have 1 direct report.
- Comfort operating in a regulated environment where reliability and compliance (HIPAA, SOC 2) are inseparable.
How You'll Elevate at Omnicell
At Omnicell, success is defined by both outcomes and behaviors. In this role, you will:
- Collaborate: Partner deeply with Product, Platform Engineering, Support, Security, and managed service providers to align reliability with business priorities.
- Inspire: Lead by example during high-stakes incidents and influence teams toward a culture of ownership, learning, and resilience.
- Develop: Invest in the growth of your SRE peers through coaching, pairing, and thoughtful technical leadership.
- Execute: Set clear priorities, make informed trade-offs, and deliver durable reliability improvements.
- Impact: Shape how Omnicell operates for years to come by defining the standards, tools, and practices of our SRE function.
Leadership Imperatives (Player-Coach Role)
This role will eventually have one less-senior Site Reliability Engineer as a direct report. You are expected to demonstrate Omnicell's leadership expectations by:
- Modeling a growth mindset and continuous learning.
- Acting as a talent activator through formal coaching and mentorship.
- Being an impact maker who connects reliability investment to business and patient outcomes.
- Serving as a change champion as Omnicell transitions to cloud-first operations.