Senior Site Reliability Engineer

Drata • $166K — $225K *

San Francisco, CA 94112 • Hybrid

Information Technology

5 - 7 years of experience

1 month ago

Job Overview by Ladders

Qualifications

6+ years in Site Reliability Engineering or Cloud Engineering
Proficient in Terraform, Docker, Git, and Linux
Hands-on experience using Datadog for monitoring and alerting
Strong experience in automation with Python and/or Bash
Familiar with CI/CD automation using GitHub Actions
Solid understanding of observability concepts for production systems
Experience with container orchestration technologies like AWS ECS Fargate or Kubernetes

Responsibilities

Engage in reliability architecture discussions to identify risks early
Lead Production Readiness Reviews to ensure critical reliability standards are met
Build reusable reliability resources like SLO templates and observability checklists
Automate operational requests to eliminate repetitive tasks
Design shared platform infrastructure to enhance reliability organization-wide
Participate in on-call rotation and lead incident responses
Contribute to the evolution of SRE standards and practices

Benefits

Employee stock equity for shared success
100% employer-paid medical, dental, and vision coverage
Comprehensive financial benefits including 401(k) and life insurance
Paid parental leave and fertility support services
Annual stipends for professional and personal development
Flexible vacation policy and paid holidays

Full Job Description

Job Summary:

Drata's SRE team operates as both a central engineering function and an embedded reliability practice. You'll be part of a close-knit SRE team where you grow your career, shape standards, and collaborate with peers - while also serving as the dedicated reliability partner for one of Drata's product engineering teams across the full lifecycle of their work.

This is a highly technical role at the intersection of software engineering and systems engineering. The best SREs at Drata are engineers first: they solve problems by building solutions, not by executing manual processes. Automation is a core value, and nowhere is that more visible than in how we approach reliability.

Our infrastructure runs on AWS across multiple accounts, defined entirely in Terraform. You'll work across a modern cloud-native stack to help Drata scale reliably for a rapidly growing customer base.

What you'll do:

Reliability Architecture for Your Product Team

You are the reliability expert for your aligned product team. You engage early - during architecture reviews and design discussions - to surface risks before they become incidents.

Lead Production Readiness Reviews (PRRs) before new services launch, with the authority to flag gaps and gate launches when critical reliability standards aren't met
Partner with product engineering leads and staff engineers to define SLOs and SLIs for critical services, turning reliability from a vague goal into a measurable commitment
Participate in team planning and architecture reviews to provide proactive reliability guidance
Build reusable artifacts - SLO templates, observability checklists, alerting standards, reference dashboards - that raise the reliability floor across the team, not just the services you touch directly

Eliminating Toil Through Engineering

You handle operational needs from your product team, but your job isn't to be a help desk. Your goal is to make each request the last of its kind. When an engineer needs something, your priority is: automate it so anyone can do it → document it so the team can self-serve → execute it manually only as a last resort.

Build and maintain Datadog monitors, dashboards, and alert routing - enforcing infrastructure-as-code standards via Terraform so those resources are owned, versioned, and auditable
Handle infrastructure requests: ECS task management, secret rotations, Terraform changes, capacity adjustments
Identify repeated manual work and convert it into self-service tooling or runbooks
Audit existing services for reliability anti-patterns and surface top risks before they cause incidents

Central SRE Platform Work

Beyond your product team, you contribute to cross-cutting infrastructure, tooling, and standards that benefit every team at Drata. Recent examples include automated Datadog governance workflows, dynamic AWS account provisioning, and disaster recovery exercises.

Design and build shared platform infrastructure - reusable Terraform modules, standardized observability stacks, service templates - so reliability improvements compound across the organization
Participate in the on-call rotation and lead incident response when needed; conduct thorough post-incident reviews to drive lasting fixes
Design and manage CI/CD pipelines using GitHub Actions
Contribute to evolving SRE standards, tooling, and practices across the organization

What you'll bring:

6+ years of experience in Site Reliability Engineering, Cloud Engineering, or building and maintaining scalable, resilient services
Robust knowledge of cloud computing technologies: Terraform, Docker, Git, and Linux
Hands-on experience with Datadog for monitoring, alerting, dashboards, SLO tracking, and distributed tracing
Experience building software systems as a software engineer
Experience developing tooling and automation in Python and/or Bash
Experience with CI/CD pipeline automation, specifically GitHub Actions
Experience with disaster recovery practices and incident management
Strong understanding of observability concepts - monitoring, logging, distributed tracing, and metrics - and how to apply them to production systems
Experience with container orchestration and deployment technologies including AWS ECS Fargate and/or Kubernetes
Experience working with relational databases (MySQL proficiency is a plus)
Ability to take ownership of problems and act on them independently in a constantly evolving environment

Nice to Have:

Experience with AIOps - using AI/ML-based tooling for anomaly detection, predictive alerting, or automated incident triage
Familiarity with the reliability characteristics of AI/ML-backed services (e.g., LLM inference latency, non-determinism, prompt pipeline observability)
Experience with the JavaScript/Node.js ecosystem
Certified Kubernetes Administrator (CKA) certification
Familiarity with compliance frameworks like SOC 2, ISO 27001, or NIST

AI Experience (required - at least one of the following):

Hands-on experience using AI-assisted development tools (e.g., GitHub Copilot, Cursor, or similar) to accelerate automation, scripting, or infrastructure work
Demonstrated use of AI/AIOps capabilities for reliability tasks - anomaly detection, incident triage, runbook generation, or alert noise reduction
Familiarity with the operational characteristics of AI/ML-backed services and what it means to make them observable and reliable in production
Demonstrated passion for AI through personal projects, contributions, or continuous learning in the context of infrastructure or reliability engineering

How we support you:

At Drata, our people are our strongest advantage, and we prove it with support that exceeds industry standards. Our total rewards package is designed to power your well-being, accelerate your growth, and keep your work-life balance thriving.

Explore how we invest in your Life at Drata.

Shared Success: We provide stock equity to ensure that as the company grows, you share directly in that success. Equity gives every employee a sense of ownership and the opportunity to celebrate our wins together, because your contributions don't just support our progress; they help drive our collective success.
Health & Wellness: Up to 100% employer-paid premiums for medical, dental, and vision coverage for employees and their dependents, along with comprehensive wellness benefits and healthcare concierge services designed to support your needs beyond traditional insurance.
Financial Well-being: A comprehensive suite of financial benefits, including a 401(k) plan, company-paid life and disability insurance, tax-advantaged spending accounts, and a range of discounted voluntary offerings to help you customize and strengthen your overall financial position.
Family Support: We want to support you in life's most important moments, so we offer a paid Parental Leave policy after six months of employment. You also receive access to Kindbody fertility and family-building benefits and dedicated leave specialists who help guide you through the entire process.
Growth & Development: Generous annual stipends for both professional and personal development, empowering you to invest in your continued growth. You'll also have access to a wide range of internal learning opportunities, ensuring you can build new skills, deepen your expertise, and advance your career with confidence.
Time Off & Flexibility: We believe that to do your best work, you should get the time you need for rest, rejuvenation and recovery. Drata offers a flexible vacation policy, paid holidays, and other perks to recharge.

This role will receive a competitive base salary, benefits, and stock, typically in the form of Restricted Stock Units (RSUs). The applicable salary range for this role is: $166,900 - $225,900.

A variety of factors are considered when determining someone's leveling and compensation, including a candidate's professional background and experience. These ranges may be modified in the future, and final offer amounts may vary from the amounts listed above.

About Drata

Drata is a security and compliance automation platform that continuously monitors, manages, and reports on compliance. It provides a single pane of glass for companies to manage their security and compliance posture. Drata's platform automates the collection of evidence, streamlines workflows, and provides real-time visibility into compliance status. The company was founded in 2020 by Adam Markowitz, Daniel Marashlian, and Troy Markowitz.

Size

50 employees

Industry

Information Technology

Founded

2020

* Ladders Estimates
