Senior Site Reliability Champion

Vanguard Group, Inc.

$100K — $140K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Experience with modern observability tools like Splunk, Honeycomb, or CloudWatch.
  • Strong grasp of reliability metrics including SLIs, SLOs, and SLAs.
  • Familiarity with alert design and predictive alerting methodologies.
  • Proficiency in Python automation and RPA platforms such as Blue Prism or UiPath.
  • Knowledge of resilience practices including chaos engineering and failure analysis techniques.

Responsibilities

  • Evaluate applications and vendors for resiliency and operational risk.
  • Design processes that uphold enterprise standards for resiliency and reliability.
  • Lead post-incident reviews to analyze high-severity incidents.
  • Collaborate with teams to identify and fix reliability risks proactively.
  • Develop and promote new standards and frameworks across subdivisions.
  • Troubleshoot production issues and implement long-term fixes.
  • Participate in on-call rotation to ensure production stability.

Benefits

  • Contribute to strategic initiatives enhancing operational maturity.
  • Engage actively in reliability engineering communities for shared learning.

Full Job Description

Core Responsibilities:

  • Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.

  • Design and implement processes that enforce enterprise resiliency and reliability standards.

  • Lead blameless post‑incident reviews for high‑severity incidents or incidents spanning multiple complex product families.

  • Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.

  • Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions, ensuring consistent adoption.

  • Troubleshoot complex production issues and implement durable solutions that prevent recurrence.

  • Participate in a periodic on‑call rotation to support production stability.

  • Evaluate and onboard resiliency and reliability tooling.

  • Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.

  • Contribute to strategic initiatives that advance Vanguard’s operational maturity and resiliency posture.

Qualifications | Technical Skills:

  • Observability Platforms: Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.

  • Reliability Metrics: Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.

  • Monitoring & Alerting: Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.

  • Automation & Resilience Engineering: Experience with automation and resilience practices such as Python-based automation, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).

Special Factors

Sponsorship

Vanguard is not offering visa sponsorship for this position.
