Core Responsibilities:
- Evaluate applications, platforms, and vendors to assess resiliency, reliability, and operational risk.
- Design and implement processes that enforce enterprise resiliency and reliability standards.
- Lead blameless post-incident reviews for high-severity incidents or incidents spanning multiple complex product families.
- Partner with product and platform teams to proactively identify and remediate reliability risks before they impact clients.
- Develop, communicate, and evangelize new standards, tools, and frameworks across subdivisions, ensuring consistent adoption.
- Troubleshoot complex production issues and implement durable solutions that prevent recurrence.
- Participate in a periodic on-call rotation to support production stability.
- Evaluate and onboard resiliency and reliability tooling.
- Actively participate in reliability engineering and resilience communities of practice, contributing to shared learning and enterprise consistency.
- Contribute to strategic initiatives that advance Vanguard's operational maturity and resiliency posture.
Qualifications | Technical Skills:
- Observability Platforms: Experience with modern observability and monitoring tools, such as Splunk, Honeycomb, CloudWatch, Dynatrace, or AppDynamics.
- Reliability Metrics: Strong understanding of SLIs, SLOs, and SLAs, including dashboarding and reporting practices.
- Monitoring & Alerting: Experience with alert design, anomaly detection, predictive alerting, and synthetic monitoring using structured methodologies.
- Automation & Resilience Engineering: Experience with automation and resilience practices such as Python-based automation, RPA platforms (e.g., Blue Prism, UiPath), chaos engineering, and failure analysis techniques (e.g., FMEA).
Special Factors:
- Sponsorship: Vanguard is not offering visa sponsorship for this position.