Senior Site Reliability Champion

Vanguard Group, Inc.

$110K — $140K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in reliability engineering and operational risk assessment.
  • Proficient with observability tools such as Splunk or CloudWatch.
  • Deep understanding of reliability metrics like SLIs, SLOs, and SLAs.
  • Hands-on experience with monitoring methodologies, including anomaly detection.
  • Familiarity with automation tools and techniques for resilience engineering.

Responsibilities

  • Assess applications and vendors for operational risk and reliability.
  • Implement processes ensuring enterprise resiliency standards.
  • Conduct post-incident reviews for significant incidents.
  • Collaborate with teams to address reliability risks proactively.
  • Develop and promote standards and tools across departments.
  • Resolve complex production issues and establish lasting solutions.
  • Engage in on-call rotation to maintain production stability.

Benefits

  • Comprehensive health and wellness programs.
  • Opportunities for professional development and career advancement.
  • Flexible work arrangements to support work-life balance.
  • Collaborative and innovative work environment.

Full Job Description

Core Responsibilities:
  • Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.
  • Design and implement processes that enforce enterprise resiliency and reliability standards.
  • Lead blameless post-incident reviews for high-severity incidents or incidents spanning multiple complex product families.
  • Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.
  • Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions, ensuring consistent adoption.
  • Troubleshoot complex production issues and implement durable solutions that prevent recurrence.
  • Participate in a periodic on-call rotation to support production stability.
  • Evaluate and onboard resiliency and reliability tooling.
  • Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.
  • Contribute to strategic initiatives that advance Vanguard's operational maturity and resiliency posture.


Qualifications | Technical Skills:
  • Observability Platforms: Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.
  • Reliability Metrics: Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.
  • Monitoring & Alerting: Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.
  • Automation & Resilience Engineering: Experience with automation and resilience practices such as Python-based automation, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).

Special Factors

Sponsorship
Vanguard is not offering visa sponsorship for this position.
