Lead Software Engineer (Site Reliability)

Vanguard Group, Inc.

$110K — $140K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Deep knowledge of Java or JavaScript with practical experience in distributed systems.
  • Strong problem-solving and analytical skills, especially in debugging and optimization.
  • Hands-on experience with AWS services and cloud infrastructure.
  • Proficiency in designing scalable and secure system architectures.
  • Working knowledge of Python or similar scripting languages.
  • Expertise in resiliency engineering techniques across platforms and applications.
  • Experience in troubleshooting complex production issues.

Responsibilities

  • Enhance resiliency engineering practices across various platforms and applications.
  • Detect, troubleshoot, and resolve incidents effectively.
  • Automate incident response and infrastructure management processes.
  • Develop and support OpenTelemetry integrations for multiple application platforms and languages.
  • Contribute to architectural decisions and implement scalable solutions.

Benefits

  • Collaborative and innovative work culture focused on problem-solving.
  • Opportunity to impact operational strategies and system designs.
  • Professional growth opportunities in a leading financial services company.
  • Access to cutting-edge technologies and practices in software engineering.
Full Job Description
Shape the Future of Observability at Vanguard

We are seeking an experienced engineer with broad, end-to-end software development experience, including operating applications in a microservices environment in production at scale. This role goes beyond feature implementation: it requires someone who can design, build, and support resilient systems from the ground up.

As a Senior Reliability Engineer at Vanguard, you will play a critical role in solving impactful operational problems. You are curious and take a proactive approach to identifying problems and making improvements. You balance innovative thinking with pragmatism and understand the long-term impacts of technical decisions. You communicate complex ideas clearly and collaborate effectively to deliver scalable solutions.

Core Responsibilities
  • Improve resiliency engineering practices across platforms and applications, including resilient application design patterns, system observability, and deployment strategies.
  • Detect, troubleshoot, and resolve incidents.
  • Develop automation for incident response and infrastructure management.
  • Develop and support OpenTelemetry integrations for multiple application platforms (browser, ECS, Lambda, etc.) and languages (JavaScript, Java).
  • Contribute to architectural decisions and support implementation of solutions.


Skills and Qualifications
  • Deep knowledge of Java or JavaScript. Practical experience developing and operating software in distributed systems environments.
  • Problem-solving and analytical thinking: ability to diagnose complex issues and propose efficient solutions. Strong debugging and optimization skills for performance and scalability.
  • Cloud platforms: hands-on experience with AWS services and cloud infrastructure.
  • System architecture and design: ability to design scalable, secure, and maintainable systems.
  • Working knowledge of Python (or similar scripting language).
  • Strong knowledge of resiliency engineering techniques for both platforms and applications.
  • Experience troubleshooting complex production issues and implementing effective mitigations.
  • Familiarity with the OpenTelemetry specification and core APIs.


Special Factors

Sponsorship
Vanguard is not offering visa sponsorship for this position.
