Software Engineer, Site Reliability

Hebbia

$160K — $300K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years in software development with a focus on production systems
  • Proficiency in a systems or backend language (Go, Python, C++, or Rust)
  • Experience as a Production Engineer, SRE, or infrastructure-oriented software engineer
  • Solid understanding of distributed systems
  • Expertise in container orchestration and in debugging complex production failures
  • Knowledge of OS-level concepts
  • Fluent in cloud platforms, preferably AWS
  • Experience with observability stacks
  • Strong expertise in CI/CD pipelines and enhancing developer velocity
  • Background in companies with SRE culture is advantageous

Responsibilities

  • Own and oversee critical production services from design to incident response
  • Profile, benchmark, and improve performance to remove bottlenecks
  • Lead incident response and translate post-mortem findings into code improvements
  • Design observability frameworks and custom instrumentation for production
  • Define and enforce SLOs and maintain accountability within engineering teams
  • Plan capacity and ensure cost efficiency within infrastructure
  • Develop robust internal platforms held to high engineering standards
  • Continuously improve CI/CD systems for safe and rapid deployment
  • Collaborate with product engineering teams on reliability from inception
  • Support infrastructure security through automated compliance and threat modeling

Benefits

  • Unlimited PTO
  • Comprehensive Medical, Dental, and Vision insurance
  • 401K plan
  • Catered lunches and dinner credits for late work
  • Generous parental leave (3 months for non-birthing parent, 4 months for birthing parent)
  • $15K lifetime fertility benefits
  • Competitive equity grant with significant upside potential

Full Job Description

The Role

We are looking for a Site Reliability Engineer who thinks like a software engineer first. You will own critical production systems end-to-end, designing, building, and improving them rather than simply operating them. You will write production-quality code that keeps the platform reliable at scale, embed with product engineering teams to influence architecture from the start, and build the internal tooling that every engineer at Hebbia depends on. This is not a ticket-driven ops role. You will spend most of your time writing code: instrumenting services, eliminating performance bottlenecks, building deployment platforms, and translating incident post-mortems into lasting architectural improvements.

Responsibilities
  • Own critical production services end-to-end, from design and code review through deployment,
    operation, and incident response
  • Profile, benchmark, and rewrite hot paths to eliminate bottlenecks as Hebbia scales
  • Lead incident response and drive post-mortem culture, translating findings into code changes and
    architectural improvements rather than runbooks
  • Design and build observability frameworks from scratch, writing custom instrumentation, alerting
    logic, and debugging tooling that surfaces production issues before customers feel them
  • Define and enforce SLOs across platform services and build the feedback loops that keep
    engineering teams accountable to them
  • Own capacity planning and cost efficiency: model growth, right-size infrastructure, and write
    automation that prevents over-provisioning and resource exhaustion
  • Build robust, well-tested internal platforms and deployment tooling held to the same engineering
    standards as customer-facing code
  • Own and continuously improve CI/CD systems so engineering teams can ship safely and quickly
  • Embed with product engineering teams as a peer software engineer, contributing directly to
    production codebases and co-designing systems for reliability from the start
  • Partner on infrastructure security through threat modeling, hardening, and automated compliance
    tooling


Who You Are
  • 5+ years of software development with a track record of writing, shipping, and maintaining production services, not just operating infrastructure
  • Production-grade proficiency in at least one systems or backend language: Go, Python, C++, or Rust
  • Proven experience as a Production Engineer, SRE, or software engineer with a deep infrastructure focus, comfortable owning services end-to-end across the full stack
  • Deep understanding of distributed systems
  • Container orchestration expertise and hands-on experience debugging complex distributed failures in production
  • Working knowledge of OS-level concepts
  • Cloud platform fluency (AWS preferred)
  • Experience in building and maintaining observability stacks
  • Strong CI/CD pipeline expertise and a track record of improving developer velocity without sacrificing safety
  • Background at a company with a Production Engineering or software-focused SRE culture is a strong plus
  • Experience building platforms for AI/ML workloads or high-throughput document processing pipelines is a plus


Compensation

The salary range for this role is $160,000 to $300,000. This range may be inclusive of several career levels at Hebbia and will be narrowed during the interview process based on the candidate's experience and qualifications. Adjustments outside of this range may be considered for candidates whose qualifications significantly differ from those outlined in the job description.

Life @ Hebbia

PTO: Unlimited

Insurance: Medical + Dental + Vision + 401K

Eats: Catered lunch daily + DoorDash dinner credit if you ever need to stay late

Parental leave policy: 3 months for non-birthing parent, 4 months for birthing parent

Fertility benefits: $15k lifetime benefit

New hire equity grant: competitive equity package with unmatched upside potential

#LI-Onsite
