Most phone systems trap callers in menus and scripts. Vapi is the platform for deploying voice agents that know your business and can listen, adapt, and resolve in minutes.
- The numbers: 1 billion calls. 1 million developers. 10x enterprise ARR growth
- The customers: Amazon Ring, ServiceTitan, New York Life, Intuit, Kavak, and thousands more, from YC startups to the Fortune 500
- The news: a $50M Series B led by Peak XV Partners, with Bessemer Venture Partners, Kleiner Perkins, M12 (Microsoft's Venture Fund), Y Combinator, and our earlier backers. Total raised: $72M
Why We're Hiring This Role:
- 99.99% call completion is the number this role drives. Vapi runs live phone calls - a p99 spike means callers drop. We've had 15 stability-gap outages worth learning from, and we need someone who runs incident command, owns SLOs and error budgets, and builds the reliability culture from scratch.
- This is not a bash-and-YAML role. You'll ship code (Go or TypeScript) for services that monitor and manage the platform: auto-remediation, capacity forecasters, oncall tooling. Capacity planning, load testing, and KEDA-based autoscaling for Vapi's wscaler and workerpool-cron-scaler are on your plate.
What You'll Do:
- 30 Day: Join the oncall rotation. Walk the 15 stability-gap incidents and turn the patterns into a prioritized reliability backlog. Define the first set of SLOs for the call-completion path.
- 60 Day: Stand up error budgets and SLO-based alerting in Chronosphere/Prometheus for the highest-impact services. Run the first proper load test against provider rate limits and per-org concurrency. Tune autoscaling for wscaler / workerpool-cron-scaler.
- 90 Day: Ship a real platform service - capacity forecaster, auto-remediation, or oncall tooling - in Go or TypeScript. Own the postmortem process. Drive a measurable improvement in p99 call completion or MTTR.
Who You Are:
Must-haves
- You've run incident command and postmortem discipline at scale on a real oncall rotation.
- You've operated SLOs and error budgets in Chronosphere, Prometheus, Grafana, or Datadog.
- You've done capacity planning and load testing for production systems with real users.
- You're fluent in Kubernetes production ops: pod crash diagnosis, HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown.
- You know backpressure and autoscaling patterns - KEDA, custom metrics scaling.
Nice-to-haves
- You ship code, not just scripts. You can build platform services in Go or TypeScript (matches Vapi's cluster-manager, database-health, wscaler, incidentManager).
- Real-time / latency-sensitive product background where degraded means a dropped call, not a slow dashboard.
Tech stack you'll work in
- Languages: Go and TypeScript (you ship code, not just scripts), Bash.
- Observability: Chronosphere, Prometheus, Grafana, Datadog, OpenTelemetry.
- Orchestration: Kubernetes on EKS - production ops (HPA/VPA tuning, PodDisruptionBudgets, graceful shutdown, pod crash diagnosis).
- Autoscaling and backpressure: KEDA, custom metrics scaling (matches Vapi's wscaler and workerpool-cron-scaler).
- Load testing: script-based load testing, provider rate-limit auditing, per-org concurrency auditing.
- Vapi services you'll touch or build: cluster-manager, database-health, wscaler, incidentManager.
Where you likely come from
- A real-time / latency-sensitive product (Discord, Zoom, Mux, Twitch, Twilio, LiveKit, Cloudflare, a trading firm, a gaming backend), or a FAANG SRE / Production Engineer (Google, Uber, Twitter/X, Meta) who misses being hands-on.
- Weak fit: SRE from analytics or CRM backends where "degraded" means a slow dashboard, not a dropped call. Anyone uncomfortable reading or writing code.
Why Vapi:
- Generational impact: Build the human interface for every business
- Ownership culture: 70% of the company are previous founders
- Kind team: The founders, Jordan and Nikhil, are Canadians
- Tier-1 Investors: YC, KP seed, Bessemer Series A
What We Offer:
- Real stake: We offer a competitive salary and excellent equity ownership
- Comprehensive health coverage: medical, dental, and vision plans
- Team love: We love hanging out, and we do quarterly off-sites
- Flexible time off: take what you need
- More: catered meals, transportation, gym, and a $10k annual L&D budget