Senior Platform Reliability Engineer

Grow Therapy

$182K - $250K
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 6+ years of experience operating and improving reliability of production systems at scale.
  • Hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform.
  • Experience defining or working with SLOs/SLAs and understanding error budgets.
  • Proficient with modern observability tooling (specifically DataDog) for actionable monitoring systems.
  • Ability to identify patterns across teams and design scalable solutions.

Responsibilities

  • Establish frameworks for SLOs/SLAs, error budgets, and operational readiness across teams.
  • Identify gaps in metrics, logging, and tracing for improved observability.
  • Develop incident response practices from detection to post-incident learning.
  • Build tooling and frameworks to facilitate self-service compliance with reliability standards.
  • Educate and influence engineering teams on adopting reliability practices.

Benefits

  • Comprehensive health coverage including medical, dental, vision, life, and disability insurance.
  • Up to 18 weeks paid parental leave and a new child stipend.
  • 401(k) program and equity opportunities for financial wellness.
  • Stipends for home office setup and ongoing meal support.
  • Flexible PTO plus 12 paid holidays and a full winter break week.
  • Annual stipends for personal and professional development.
  • Access to therapy, flexible self-care hours, and wellness app memberships.
  • Additional perks like pet insurance discounts and commuter benefits.
Full Job Description
About the Role

We're hiring a Senior Platform Reliability Engineer to help define and scale reliability as a first-class capability at Grow. In this role you'll operate horizontally across the organization, shaping how reliability is understood, measured, and built into the developer experience.

You'll work closely with other members of the platform team as well as our product engineering teams to establish standards around observability, SLOs/SLAs, and incident response, while also helping translate those standards into self-service tooling and "golden paths" that make it easy for teams to adopt them.

This is a high-impact, highly autonomous role where you'll drive both cultural and technical change, ultimately enabling teams to independently build and operate reliable systems at scale.

What You'll Work On

You'll help us establish and scale reliability as a discipline at Grow by:
  • Defining Reliability Standards: Establishing frameworks for SLOs/SLAs, error budgets, and operational readiness; helping teams understand what to measure and why it matters.
  • Improving Observability & Measurement: Identifying gaps in metrics, logging, and tracing; ensuring services are measurable, debuggable, and aligned with reliability goals.
  • Evolving Incident Response: Developing and improving incident response practices, from detection to post-incident learning, and helping teams build sustainable on-call and escalation patterns.
  • Enabling Self-Service Reliability: Partnering with the platform team to build tooling and abstractions (e.g., service scorecards, dashboards, templates, golden paths) that make it easy for teams to adopt and stay compliant with reliability standards.
  • Driving Adoption Across Teams: Working cross-functionally to educate, influence, and guide engineering teams, scaling reliability practices through a combination of clear standards, strong communication, and developer-friendly systems.

Who You Are
  • Experienced in production systems: You have 6+ years of experience operating and improving reliability of production systems at scale.
  • Strong foundation in cloud and infrastructure: You have hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform.
  • Deep understanding of reliability principles: You've defined or worked with SLOs/SLAs, understand error budgets, and have experience improving reliability through measurement and iteration.
  • Observability expertise: You've worked with modern observability tooling (we use DataDog) and understand how to build actionable monitoring systems across metrics, logs, and traces.
  • Systems thinker: You're able to zoom out, identify patterns across teams and services, and design solutions that scale beyond a single system.
  • Impact-oriented: You focus on outcomes over output and care deeply about improving real reliability outcomes, not just adding processes.
  • Strong communicator and influencer: You can drive change across teams without direct authority, balancing pragmatism with long-term vision.
  • Self-directed: You thrive in ambiguous environments and are comfortable defining problems, proposing solutions, and executing independently.
  • Team player: You collaborate well, communicate with empathy, and enjoy mentoring and learning from others.

Bonus Points
  • You've helped introduce or scale reliability practices in a growing organization.
  • You've built internal tooling or platforms used by multiple teams.
  • You have experience designing service-level scorecards or compliance/reporting systems.
  • You've worked with both SaaS (e.g., DataDog) and self-managed observability stacks.
  • You were previously a product engineer and bring empathy for developer experience.
  • You have experience with database reliability and performance (we use PostgreSQL).

Why This Role Is Exciting

This is a rare opportunity to define what reliability looks like at a growing, scaling engineering organization, and to do it in a way that actually sticks.

You won't just be responding to incidents or working within a single team. You'll be shaping how reliability is measured, enforced, and experienced across the entire company. You'll work alongside your teammates to turn best practices into intuitive, self-service systems that engineers rely on every day.

Your work will directly improve system reliability, reduce incidents, and enable teams to move faster with confidence, ultimately making reliability a built-in property of how we build software at Grow.

Role Details
  • Employment Type: Full Time, Exempt
  • Base Compensation: The base compensation range for this position is $182,000-$250,000 USD Annually.

This is a hybrid role with the expectation to work onsite from our San Francisco, NYC, or Seattle hub location three days per week (Tuesday, Wednesday, and Thursday) and travel 2-3 times per year (e.g., company and department offsites).

The base compensation for this role will vary depending on several factors, including relevant experience, qualifications, and the candidate's working location.

Full Time Employee Benefits:
  • Comprehensive Health Coverage: Medical, dental, and vision insurance, plus life and disability coverage.
  • Parental Leave & Family Support: Up to 18 weeks paid leave and a new child stipend.
  • Financial Wellness: 401(k) program and equity opportunities.
  • Meals & Home Office Support: Stipends for home office setup and ongoing funds for meals, with tailored perks for both remote and in-office employees.
  • Time Off to Recharge: Flexible PTO, 12 paid holidays, and a full winter break week.
  • Wellness & Development: Annual stipends to put towards personal & professional growth.
  • Mental & Physical Health Support: No-cost access to therapy through the Grow platform, weekly flexible hours for self-care ("Mental Health Mornings/Afternoons"), and memberships to leading wellness apps (such as One Medical, Headspace, and Talkspace).
  • Extra Perks: Pet insurance discounts, commuter benefits, and global travel assistance.


Research shows that some groups hesitate to apply unless they meet every qualification. If you're excited about this role but don't check every box, we encourage you to apply. At Grow, we value diverse experiences, transferable skills, and the unique strengths each person brings.
