Member of Technical Staff, Site Reliability Engineer

Vapi

• $130K — $180K *

San Francisco, CA 94112

In-Person

Information Technology

Less than 5 years of experience

1 month ago

Job Overview by Ladders

Qualifications

5+ years of experience in incident management and postmortems at scale
Proficiency with SLOs and error budgets using monitoring tools like Chronosphere, Prometheus, or Grafana
Experience in capacity planning and load testing for production systems
Solid understanding of Kubernetes production operations
Familiarity with autoscaling patterns, particularly KEDA and custom metrics

Responsibilities

Join the oncall rotation and analyze patterns from past stability-gap incidents
Establish SLOs for call-completion processes
Implement SLO-based alerting using Chronosphere/Prometheus
Conduct load tests to evaluate provider rate limits
Develop a platform service in Go or TypeScript to improve call completion

Benefits

Competitive salary with equity ownership
Comprehensive health coverage including medical, dental, and vision
Quarterly team off-sites for bonding activities
Flexible time off policy allowing employees to take what they need
Catered meals, transportation perks, gym access, and $10k annual learning budget

Full Job Description

Most phone systems trap callers in menus and scripts. Vapi is the platform for deploying voice agents that know your business and can listen, adapt, and resolve in minutes.

The numbers: 1 billion calls. 1 million developers. 10x enterprise ARR growth
The customers: Amazon Ring, ServiceTitan, New York Life, Intuit, Kavak, and thousands more, from YC startups to the Fortune 500
The news: a $50M Series B led by Peak XV Partners, with Bessemer Venture Partners, Kleiner Perkins, M12 (Microsoft's Venture Fund), Y Combinator, and our earlier backers. Total raised: $72M

Why We're Hiring This Role:

99.99% call completion is the number this role drives. Vapi runs live phone calls - a p99 spike means callers drop. We've had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
This is not a bash-and-YAML role. You'll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling. Capacity planning, load testing, and KEDA-based autoscaling for Vapi's wscaler and workerpool-cron-scaler are on your plate.

What You'll Do:

30 Day: Join the oncall rotation. Walk the 15 stability-gap incidents and turn the patterns into a prioritized reliability backlog. Define the first set of SLOs for the call-completion path.
60 Day: Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services. Run the first proper load test against provider rate limits and per-org concurrency. Tune autoscaling for wscaler / workerpool-cron-scaler.
90 Day: Ship a real platform service - capacity forecaster, auto-remediation, or oncall tooling - in Go or TypeScript. Own the postmortem process. Drive a measurable improvement in p99 call completion or MTTR.

Who You Are:

Must-haves

You've run incident command and postmortem discipline at scale on a real oncall rotation.
You've operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
You've done capacity planning and load testing for production systems with real users.
You're fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
You know backpressure and autoscaling patterns - KEDA, custom metrics scaling.

Nice-to-haves

You ship code, not just scripts. You can build platform services in Go or TypeScript (matches Vapi's cluster-manager, database-health, wscaler, incidentManager).
Real-time / latency-sensitive product background where degraded means a dropped call, not a slow dashboard.

Tech stack you'll work in

Languages: Go and TypeScript (you ship code, not just scripts), Bash.
Observability: Chronosphere, Prometheus, Grafana, Datadog, OpenTelemetry.
Orchestration: Kubernetes on EKS - production ops (HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown, pod crash diagnosis).
Autoscaling and backpressure: KEDA, custom metrics scaling (matches Vapi's wscaler and workerpool-cron-scaler).
Load testing: script-based load testing, provider rate-limit auditing, per-org concurrency auditing.
Vapi services you'll touch or build: cluster-manager, database-health, wscaler, incidentManager.

Where you likely come from

A real-time / latency-sensitive product (Discord, Zoom, Mux, Twitch, Twilio, LiveKit, Cloudflare, a trading firm, a gaming backend), or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X, Meta) who misses being hands-on.
Weak fit: SRE from analytics or CRM backends where "degraded" means a slow dashboard, not a dropped call. Anyone uncomfortable reading or writing code.

Why Vapi:

Generational impact: Build the human interface for every business
Ownership culture: 70% of the company are previous founders
Kind team: The founders, Jordan and Nikhil, are Canadians
Tier-1 Investors: YC, KP seed, Bessemer Series A

What We Offer:

Real stake: We offer a competitive salary and excellent equity ownership
Comprehensive health coverage: medical, dental, and vision plans
Team love: We love hanging out, and we do quarterly off-sites
Flexible time off: take what you need
More: catered meals, transportation, gym, and a $10k annual L&D budget

* Ladders Estimates
