Senior Site Reliability Champion

Vanguard Group, Inc.

$120K — $150K *
Wayne, PA 19087
In-Person
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Experience with observability and monitoring tools (e.g., Splunk, Honeycomb, CloudWatch).
  • Strong understanding of SLIs, SLOs, and SLAs, including dashboarding practices.
  • Proficient in alert design and predictive alerting methodologies.
  • Familiarity with Python automation and resilience engineering practices.
  • Knowledge of RPA platforms such as Blue Prism or UiPath, and of chaos engineering techniques.

Responsibilities

  • Evaluate applications and vendors for resiliency and operational risk.
  • Design processes to enforce enterprise resiliency standards.
  • Lead post-incident reviews for high-severity incidents.
  • Collaborate with teams to identify reliability risks proactively.
  • Develop and promote new standards across departments.
  • Troubleshoot and resolve complex production issues.
  • Participate in on-call rotation for production support.
  • Onboard resiliency and reliability tools to improve operations.
  • Engage in reliability engineering communities for knowledge sharing.
  • Contribute to strategic initiatives enhancing operational maturity.

Benefits

  • Opportunities for professional development and growth.
  • Access to innovative tools and technologies.
  • Collaborative work environment fostering learning.
  • Participation in communities of practice for professional engagement.

Full Job Description

Core Responsibilities:
  • Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.
  • Design and implement processes that enforce enterprise resiliency and reliability standards.
  • Lead blameless post-incident reviews for high-severity incidents or incidents spanning multiple complex product families.
  • Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.
  • Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions, ensuring consistent adoption.
  • Troubleshoot complex production issues and implement durable solutions that prevent recurrence.
  • Participate in a periodic on-call rotation to support production stability.
  • Evaluate and onboard resiliency and reliability tooling.
  • Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.
  • Contribute to strategic initiatives that advance Vanguard's operational maturity and resiliency posture.


Qualifications | Technical Skills:
  • Observability Platforms: Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.
  • Reliability Metrics: Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.
  • Monitoring & Alerting: Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.
  • Automation & Resilience Engineering: Experience with automation and resilience practices such as Python-based automation, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).

Special Factors

Sponsorship
Vanguard is not offering visa sponsorship for this position.
