Senior Platform Reliability Engineer

Grow Therapy

• $182K — $250K *

San Francisco, CA 94112 (In-Person)

Information Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

6+ years of experience operating and improving reliability of production systems at scale.
Hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform.
Experience defining or working with SLOs/SLAs and understanding error budgets.
Proficient with modern observability tooling (specifically DataDog) for actionable monitoring systems.
Ability to identify patterns across teams and design scalable solutions.

Responsibilities

Establish frameworks for SLOs/SLAs, error budgets, and operational readiness across teams.
Identify gaps in metrics, logging, and tracing for improved observability.
Develop incident response practices from detection to post-incident learning.
Build tooling and frameworks to facilitate self-service compliance with reliability standards.
Educate and influence engineering teams on adopting reliability practices.

Benefits

Comprehensive health coverage including medical, dental, vision, life, and disability insurance.
Up to 18 weeks paid parental leave and a new child stipend.
401(k) program and equity opportunities for financial wellness.
Stipends for home office setup and ongoing meal support.
Flexible PTO plus 12 paid holidays and a full winter break week.
Annual stipends for personal and professional development.
Access to therapy, flexible self-care hours, and wellness app memberships.
Additional perks like pet insurance discounts and commuter benefits.

Full Job Description

About the Role

We're hiring a Senior Platform Reliability Engineer to help define and scale reliability as a first-class capability at Grow. In this role you'll operate horizontally across the organization, shaping how reliability is understood, measured, and built into the developer experience.

You'll work closely with other members of the platform team as well as our product engineering teams to establish standards around observability, SLOs/SLAs, and incident response, while also helping translate those standards into self-service tooling and "golden paths" that make it easy for teams to adopt them.

This is a high-impact, highly autonomous role where you'll drive both cultural and technical change, ultimately enabling teams to independently build and operate reliable systems at scale.

What You'll Work On

You'll help us establish and scale reliability as a discipline at Grow by:

Defining Reliability Standards: Establishing frameworks for SLOs/SLAs, error budgets, and operational readiness; helping teams understand what to measure and why it matters.
Improving Observability & Measurement: Identifying gaps in metrics, logging, and tracing; ensuring services are measurable, debuggable, and aligned with reliability goals.
Evolving Incident Response: Developing and improving incident response practices, from detection to post-incident learning, and helping teams build sustainable on-call and escalation patterns.
Enabling Self-Service Reliability: Partnering with the platform team to build tooling and abstractions (e.g., service scorecards, dashboards, templates, golden paths) that make it easy for teams to adopt and stay compliant with reliability standards.
Driving Adoption Across Teams: Working cross-functionally to educate, influence, and guide engineering teams, scaling reliability practices through a combination of clear standards, strong communication, and developer-friendly systems.

Who You Are

Experienced in production systems: You have 6+ years of experience operating and improving reliability of production systems at scale.
Strong foundation in cloud and infrastructure: You have hands-on experience with AWS, Kubernetes (e.g., EKS), and infrastructure as code tools like Terraform.
Deep understanding of reliability principles: You've defined or worked with SLOs/SLAs, understand error budgets, and have experience improving reliability through measurement and iteration.
Observability expertise: You've worked with modern observability tooling (we use DataDog) and understand how to build actionable monitoring systems across metrics, logs, and traces.
Systems thinker: You're able to zoom out, identify patterns across teams and services, and design solutions that scale beyond a single system.
Impact-oriented: You focus on outcomes over output and care deeply about improving real reliability outcomes, not just adding processes.
Strong communicator and influencer: You can drive change across teams without direct authority, balancing pragmatism with long-term vision.
Self-directed: You thrive in ambiguous environments and are comfortable defining problems, proposing solutions, and executing independently.
Team player: You collaborate well, communicate with empathy, and enjoy mentoring and learning from others.

Bonus Points

You've helped introduce or scale reliability practices in a growing organization.
You've built internal tooling or platforms used by multiple teams.
You have experience designing service-level scorecards or compliance/reporting systems.
You've worked with both SaaS (e.g., DataDog) and self-managed observability stacks.
You were previously a product engineer and bring empathy for developer experience.
You have experience with database reliability and performance (we use PostgreSQL).

Why This Role Is Exciting

This is a rare opportunity to define what reliability looks like at a growing, scaling engineering organization, and to do it in a way that actually sticks.

You won't just be responding to incidents or working within a single team. You'll be shaping how reliability is measured, enforced, and experienced across the entire company. You'll work alongside your teammates to turn best practices into intuitive, self-service systems that engineers rely on every day.

Your work will directly improve system reliability, reduce incidents, and enable teams to move faster with confidence, ultimately making reliability a built-in property of how we build software at Grow.

Role Details

Employment Type: Full Time, Exempt
Base Compensation: The base compensation range for this position is $182,000-$250,000 USD Annually.

This is a hybrid role with the expectation to work onsite from our San Francisco, NYC, or Seattle hub location three days per week (Tuesday, Wednesday, and Thursday) and travel 2-3 times per year (e.g., company and department offsites).

The base compensation for this role will vary depending on several factors, including relevant experience, qualifications, and the candidate's working location.

Full Time Employee Benefits:

Comprehensive Health Coverage: Medical, dental, and vision insurance, plus life and disability coverage.
Parental Leave & Family Support: Up to 18 weeks paid leave and a new child stipend.
Financial Wellness: 401(k) program and equity opportunities.
Meals & Home Office Support: Stipends for home office setup and ongoing funds for meals, with tailored perks for both remote and in-office employees.
Time Off to Recharge: Flexible PTO, 12 paid holidays, and a full winter break week.
Wellness & Development: Annual stipends to put towards personal & professional growth.
Mental & Physical Health Support: No-cost access to therapy through the Grow platform, weekly flexible hours for self-care ("Mental Health Mornings/Afternoons") and memberships to leading wellness apps (such as One Medical, Headspace, and Talkspace).
Extra Perks: Pet insurance discounts, commuter benefits, and global travel assistance.

Research shows that some groups hesitate to apply unless they meet every qualification. If you're excited about this role but don't check every box, we encourage you to apply. At Grow, we value diverse experiences, transferable skills, and the unique strengths each person brings.

* Ladders Estimates
