Aya Healthcare

Manager, Site Reliability Engineering

$230K - $255K
US-Anywhere (Remote in United States)
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years in Site Reliability Engineering, DevOps, or related roles.
  • 4+ years of direct people management experience.
  • Demonstrated ownership of reliability for customer-facing SaaS at scale.
  • Deep Azure experience, especially with AKS and networking.
  • Production-grade experience with observability tools like Datadog.
  • Hands-on experience integrating AI into operational workflows.
  • Incident command experience for high-severity incidents.
  • Instinct for operating within HIPAA and other compliance constraints.

Responsibilities

  • Lead, mentor, and grow a high-performing SRE team.
  • Set operating cadence for team activities, including standups and incident reviews.
  • Build a culture of blameless learning and technical depth.
  • Own the reliability strategy for customer-facing products.
  • Lead major incident response and ensure systemic fixes are implemented.
  • Champion proactive reliability through chaos engineering and capacity testing.
  • Build AIOps practices to reduce mean time to detect and respond.

Benefits

  • Free premium medical, dental, life, and vision insurance.
  • Generous 401(k) match.
  • Paid sick leave that accrues at a rate of one hour for every 30 hours worked.
  • Unlimited DTO (Discretionary Time Off).
  • Company-sponsored virtual events and team-building activities.
  • Daily virtual yoga, meditation, or boot camp classes.

Full Job Description

We're looking for a highly experienced Manager, Site Reliability Engineering to lead the team behind one of healthcare's most relied-on workforce platforms. In this leadership role, you'll guide and grow a team of engineers driving product and platform reliability - ensuring an exceptional experience for the clinicians, clients, and internal teams who depend on us every day. You'll shape our reliability architecture, lead complex operational initiatives, and drive the adoption of AI-native operations (AIOps) and automation to eliminate toil and advance performance. You'll own measurable business outcomes across uptime, customer trust, and platform efficiency, and lead with the radical ownership Aya expects of every leader.

Responsibilities:
  • Lead and grow the SRE team
    • Lead, mentor, and grow a team of high-performing Site Reliability Engineers across hiring, performance management, career development, and on-call rotation health.
    • Set the operating cadence for the team - standups, incident reviews, SLO/error-budget reviews, post-incident learning, and capacity planning.
    • Build a culture of blameless learning, technical depth, customer empathy, and disciplined ownership.
    • Partner closely with DevSecOps, Security Engineering, DRE, Incident & Change Management, and product engineering leadership to remove cross-team friction.
  • Drive reliability, performance, and availability
    • Own the reliability strategy for customer-facing products and internal platforms - defining SLOs, SLIs, and error budgets in partnership with product and engineering leadership, and operationalizing them in the release process.
    • Lead major incident response as senior incident commander for severity-1 events; institutionalize blameless post-incident reviews and ensure systemic fixes ship.
    • Champion proactive reliability - chaos engineering, game days, failure-mode analysis, capacity and load testing - well before incidents force the conversation.
    • Manage software release support and 24/7 on-call escalation rotations across the platform surface area, with humane on-call load and clear escalation paths.
  • Operational intelligence and AI-native operations
    • Build the AIOps practice - anomaly detection, predictive alerting, intelligent correlation, and automated triage - to drive measurable reductions in MTTD and MTTR.
    • Operationalize AI-assisted workflows for incident summarization, runbook generation, log and trace analysis, change risk scoring, and post-incident narrative drafting.
    • Pilot and scale agentic remediation where appropriate, with strict guardrails, audit trails, and human-in-the-loop controls suitable for a HIPAA-regulated environment.
    • Evolve the observability platform (Datadog metrics, logs, traces, RUM, synthetics, CI Visibility) so engineering teams can operate their own services with confidence and clear ownership.
  • Platform efficiency and stakeholder trust
    • Treat reliability as a product with a roadmap, measurable outcomes, and an executive-credible narrative - not as overhead.
    • Drive platform unit economics by partnering with FinOps and platform leadership on cost-to-serve, right-sizing, capacity efficiency, and waste elimination.
    • Communicate outcomes to executive, product, and customer-facing stakeholders in plain language tied to clinician and client experience.
    • Uphold HIPAA, PHI, and security obligations across every reliability decision, change, and tool selection.

Required Qualifications:
  • 10+ years in a combination of Site Reliability Engineering, DevOps, Platform Engineering, or related production-operations roles.
  • 4+ years of direct people management experience - hiring, performance management, career development, and running remote on-call teams.
  • Demonstrated ownership of reliability outcomes for customer-facing SaaS at meaningful scale - defining and operationalizing SLOs/SLIs/error budgets and using them to drive engineering prioritization.
  • Deep Azure experience - 3+ years operating production workloads on Azure, with hands-on depth in AKS, networking, identity, and platform services. Equivalent depth in AWS or GCP will be considered.
  • Modern observability fluency - production-grade experience with Datadog (or equivalent: New Relic, Dynatrace, AppDynamics) across metrics, logs, traces, RUM, and synthetics.
  • AI in operations - hands-on experience integrating AI/LLM-assisted tooling into operational workflows (incident summarization, runbook generation, log analysis, anomaly triage, change risk scoring).
  • Incident command experience - proven ability to lead severity-1 incidents end-to-end, run blameless reviews, and convert lessons into systemic improvements.
  • Regulated-environment instinct - operates with HIPAA, PHI, SOC 2, or comparable compliance constraints as a default mindset, not an afterthought.
  • Executive-grade communication - translates reliability work into business outcomes for executive, product, and customer-facing audiences.
  • Bachelor's degree in Computer Science, Information Technology, Engineering, or related field - or an equivalent combination of education, training, and experience.

Preferred Qualifications:
  • Cloudflare at the edge - production experience with Cloudflare CDN, WAF, Workers, Access (ZTNA), Tunnel, Turnstile, and certificate management.
  • IaC at scale - Terragrunt and Terraform in a multi-environment, policy-gated pipeline; experience evolving IaC from "it works" to "it scales safely."
  • CI/CD maturity - GitHub Actions with OIDC/workload identity federation, OPA/Conftest policy-as-code, progressive delivery, and DORA-metric instrumentation.
  • Container platform depth - Kubernetes/AKS in production, including Helm, ingress, service mesh, autoscaling, and node lifecycle.
  • ITSM integration - ServiceNow for change, incident, and problem management; experience tying observability and CI data into ITSM workflows.
  • Identity ecosystem - operating in an Okta / Entra ID / M365 identity environment, including PIM, conditional access, and service-principal hygiene.
  • Chaos and resilience engineering - running game days, fault injection, and resilience exercises as a routine practice.
  • FinOps fluency - cost-to-serve, right-sizing, capacity efficiency, and unit-economics work in cloud environments.
  • Agile delivery - Scrum/Kanban delivery with Jira; comfortable operating in a quarterly planning + continuous-delivery cadence.

What We Offer:
  • Free premium medical, dental, life and vision insurance
  • Generous 401(k) match
  • Aya also offers other benefits to those who are eligible and where required by applicable law, including reimbursements and discretionary bonuses
  • Aya provides paid sick leave in accordance with all applicable state, federal, and local laws. Aya's general sick leave policy is that employees accrue one hour of paid sick leave for every 30 hours worked. However, to the extent any provisions of the statement above conflict with any applicable paid sick leave laws, the applicable paid sick leave laws are controlling
  • Celebrations! We hit our goals and reward ourselves.
  • Company-sponsored virtual events, happy hours and team-building activities are always on the horizon - plus, you get a special treat on your birthday!
  • Unlimited DTO - we believe in time off!
  • Virtual yoga, meditation or boot camp classes offered daily

Compensation: Aya reasonably anticipates the pay scale for this position to be an annual salary of $230,000 to $255,000.

The pay scale for this position may vary if an applicant possesses experience outside of what Aya reasonably anticipates for this position. Bonuses are subject to the role and your manager's discretion.

About Aya Healthcare

Aya Healthcare is a leading provider of travel nurse staffing and workforce solutions to hospitals and healthcare facilities across the United States. The company was founded in 2001 and is headquartered in San Diego, California. Aya Healthcare's mission is to provide exceptional healthcare staffing services and solutions to healthcare providers and facilities, while also providing career opportunities and support to healthcare professionals. The company has been recognized for its outstanding workplace culture and has received numerous awards for its commitment to employee satisfaction and engagement. Aya Healthcare is committed to delivering high-quality healthcare staffing services and solutions that meet the needs of its clients and the communities they serve.
Size: 5,000 employees
Net Income: $100 million
Founded: 2001
5 Year Trend: +50%
Revenue: $2 billion
