Senior Site Reliability Engineer

Blitzy

$160K - $180K
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
  • Strong proficiency in at least one major cloud platform (AWS preferred), with Kubernetes and container orchestration experience at scale.
  • Hands-on experience with infrastructure-as-code tools like Terraform or Pulumi.
  • Proven track record in designing and maintaining high-availability, distributed systems.
  • Deep expertise in observability tools and incident management practices.
  • Strong scripting skills in Python, Go, Bash, or similar languages.
  • Excellent communication skills for collaboration across engineering teams.

Responsibilities

  • Design, build, and operate scalable infrastructure across cloud environments (AWS, GCP, or Azure).
  • Define and enforce SLOs, SLAs, and error budgets, leading blameless postmortems.
  • Build and maintain CI/CD pipelines, automation, and deployment infrastructure.
  • Own observability by maintaining logging, metrics, tracing, and alerting stacks.
  • Collaborate with software engineering to embed reliability practices into development.
  • Drive capacity planning, performance benchmarking, and cost optimization.
  • Champion security best practices within infrastructure and deployment layers.

Benefits

  • Opportunity to work in a fast-paced, high-impact environment.
  • Direct influence over architectural decisions on a new platform.
  • Collaboration with world-class engineers and a dynamic team culture.
  • Potential for professional growth as a founding member of the SRE team.
Full Job Description
Location: Cambridge, MA (In-Office)

Compensation: $160,000 - $180,000 + equity eligibility based on performance

The Role

As a Senior Site Reliability Engineer at Blitzy's Cambridge headquarters, you will be the backbone of our platform's reliability, scalability, and operational excellence. You'll work at the intersection of software engineering and infrastructure, ensuring our AI-powered development platform remains highly available and performant as we scale rapidly. This is a high-impact, hands-on role for an engineer who thrives in a fast-moving environment and takes deep ownership of the systems they build.

What Success Looks Like
  • In 30 days: You have a deep understanding of Blitzy's infrastructure architecture, have identified key reliability risks, and are actively contributing to on-call rotations.
  • In 90 days: You have shipped meaningful improvements to observability, incident response workflows, and deployment pipelines that measurably reduce MTTR and increase system uptime.
  • In 6 months: You have driven at least one major reliability initiative from inception to production, established SLO/SLA frameworks for critical services, and are a trusted technical voice shaping our infrastructure roadmap.


Areas of Ownership
  • Design, build, and operate scalable, fault-tolerant infrastructure across cloud environments (AWS, GCP, or Azure).
  • Define and enforce SLOs, SLAs, and error budgets; lead blameless postmortems and drive systemic improvements.
  • Build and maintain robust CI/CD pipelines, release automation, and deployment infrastructure.
  • Own observability: design and maintain logging, metrics, tracing, and alerting stacks (e.g., Prometheus, Grafana, Datadog, OpenTelemetry).
  • Partner closely with software engineering teams to embed reliability practices into the development lifecycle.
  • Drive capacity planning, performance benchmarking, and cost optimization across our infrastructure.
  • Champion security best practices within the infrastructure and deployment layers.


Required Experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering roles.
  • Strong proficiency in at least one major cloud platform (AWS preferred); experience with Kubernetes and container orchestration at scale.
  • Hands-on experience with infrastructure-as-code tools (Terraform, Pulumi, or equivalent).
  • Proven track record designing and maintaining high-availability, distributed systems.
  • Deep expertise in observability tooling, incident management, and on-call practices.
  • Strong scripting and automation skills (Python, Go, Bash, or similar).
  • Excellent communication skills with the ability to collaborate across engineering teams and present technical findings to leadership.


What Makes You Stand Out
  • Experience supporting AI/ML workloads or GPU-accelerated infrastructure.
  • Prior experience in a high-growth startup environment where you wore multiple hats.
  • Familiarity with eBPF, service mesh technologies (Istio, Linkerd), or advanced networking.
  • Contributions to open-source SRE/DevOps tooling or communities.
  • Experience building global, multi-region infrastructure with strict latency and availability requirements.


What Makes This Role Different

You won't be maintaining legacy systems or fighting fires in a sprawling monolith. At Blitzy, you're building reliability into a greenfield AI platform that is redefining how the world creates software. You'll have direct influence over architectural decisions, work side-by-side with world-class engineers, and see the tangible impact of your work as we scale to serve Fortune 500 customers. As a founding member of the SRE team, you'll help shape the culture and technical standards of a team that will grow with the company.
