Senior Site Reliability Engineer

David AI

$130K — $180K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years in Site Reliability, Infrastructure, or Platform Engineering for large-scale SaaS or cloud systems.
  • Hands-on experience with security best practices in production systems and cloud infrastructure.
  • Strong background in building reliable and scalable systems.
  • Experience with AWS, Terraform, containers and orchestration (e.g., Kubernetes), and cloud networking basics.
  • Proficient in observability tooling (e.g., Prometheus, Grafana, Datadog).
  • Effective collaborator in fast-paced, cross-functional teams.
  • Bachelor's degree in Computer Science or related field, or equivalent practical experience.

Responsibilities

  • Own the observability stack, managing monitoring, alerting, logging, and tracing.
  • Partner with product and platform engineering teams to build resilient systems from inception.
  • Design and implement secure, scalable cloud infrastructure on AWS using Terraform.
  • Lead enhancements in CI/CD processes and incident response practices to boost efficiency.
  • Define and evolve SRE practices, influencing reliability culture and standards across the organization.

Benefits

  • Unlimited PTO.
  • Comprehensive health, dental, and vision coverage with 100% coverage for most plans.
  • FSA & HSA access.
  • 401k access.
  • Meals twice daily through DoorDash and office snacks.
  • Unlimited company-sponsored Barry's classes.
Full Job Description
About this role

As a Senior Site Reliability Engineer at David AI, you will shape and build the foundation for reliability, observability, and scalability across David AI's infrastructure. Working closely with our engineering and product teams, you'll help ensure our systems are resilient, efficient, and designed to scale as the company grows.

In this role, you will
  • Own David AI's observability stack, including monitoring, alerting, logging, and tracing, to provide engineers with clear visibility into system health, reliability, and performance.
  • Partner closely with product and platform engineering teams to design systems that are scalable, resilient, and reliable from day one, not as an afterthought.
  • Design and implement secure, scalable cloud infrastructure across AWS using Terraform and modern DevOps tooling to support rapid product and research iteration.
  • Lead improvements across deployment pipelines, CI/CD systems, and incident response processes to reduce downtime, improve operational efficiency, and strengthen engineering velocity.
  • Define and evolve the foundation of SRE practices at David AI, influencing reliability culture, tooling standards, operational excellence, and best practices across the engineering organization.
Your background looks like
  • 5+ years of experience in Site Reliability, Infrastructure, or Platform Engineering supporting large-scale SaaS or cloud systems.
  • Hands-on experience applying security best practices in production systems and cloud infrastructure.
  • Strong experience building and running reliable, highly available, and scalable systems.
  • Hands-on experience with AWS, Terraform, containers and orchestration (like Kubernetes), and cloud networking basics.
  • Experience implementing and maintaining observability tooling across monitoring, logging, alerting, and tracing (e.g., Prometheus, Grafana, Datadog, or similar).
  • Comfortable working in fast-paced teams and collaborating closely with product, ML, and engineering teams.
  • Bachelor's degree in Computer Science or related field, or equivalent practical experience.
Bonus points if you have
  • Past experience in an early-stage startup environment, especially defining SRE culture and tooling from scratch.
  • Familiarity with incident management automation or self-healing infrastructure patterns.
Some technologies we work with

Next.js, TypeScript, TailwindCSS, Node.js, tRPC, PostgreSQL, AWS, Temporal, WebRTC, FFmpeg.

Benefits
  • Unlimited PTO.
  • Top-notch health, dental, and vision coverage with 100% coverage for most plans.
  • FSA & HSA access.
  • 401k access.
  • Meals 2x daily through DoorDash + snacks and beverages available at the office.
  • Unlimited company-sponsored Barry's classes.
