Site Reliability Engineer

Gem.com • $135K — $160K *

New York, NY 10025 • In-Person

Healthcare

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

5+ years in SRE, DevOps, or Platform roles managing production environments at scale.
Expertise in AWS services: EKS, EC2, RDS, S3.
Proficiency in infrastructure as code with Terraform.
Solid experience with observability tools like Datadog and CI/CD systems.
Strong coding skills in languages such as Python, Bash, Ruby, or Go.
Experience with AI-assisted tooling and agentic workflows preferred.
A focus on HIPAA compliance and high-availability systems.

Responsibilities

Design and maintain production Kubernetes (EKS) clusters for enterprise-grade availability.
Automate infrastructure management entirely through Terraform to eliminate manual configuration.
Optimize AWS resources for performance and cost-efficiency.
Explore and implement AI-assisted workflows for operational automation.
Develop rapid and safe deployment pipelines using GitHub Actions or Semaphore.
Enhance observability with metrics, traces, and logs in Datadog.
Lead incident response efforts and facilitate blameless postmortems to reduce recovery time.

Benefits

Medical, dental, and vision insurance coverage.
Unlimited paid time off (PTO).
401(k) plan with company match.
Stock options and bonuses.

Full Job Description

About the Role

As a Site Reliability Engineer, you will own and evolve the infrastructure powering healthcare experiences for millions of patients. This role bridges the gap between traditional infrastructure excellence and the future of AI-driven operations. You will act as a primary architect for our AWS and Kubernetes (EKS) environment, ensuring the platform is resilient, scalable, and compliant while exploring how agentic workflows can modernize SRE practices.

What You'll Do

As a Site Reliability Engineer, you will be a steward of Fabric's production integrity, leading the strategy for infrastructure automation, observability, and system resilience. Your primary responsibilities include:

Infrastructure & Kubernetes Orchestration
- Designing, deploying, and maintaining production Kubernetes (EKS) clusters to ensure enterprise-grade availability for our users.
- Eliminating manual configuration by building and managing a scalable infrastructure state entirely through Terraform.
- Optimizing the AWS footprint (specifically EC2, RDS, and S3) to balance high performance with cost-efficiency and reliability.

AI-Assisted Operations & Automation
- Exploring and deploying agentic workflows for AI-assisted runbooks that automate complex operational decisions and repetitive tasks.
- Building and evolving deployment pipelines using GitHub Actions or Semaphore to ensure delivery is both rapid and safe.
- Focusing on toil reduction by developing internal tools that replace manual operational work with intelligent, autonomous systems.

Observability & Incident Management
- Driving the evolution of the observability stack in Datadog by implementing the sophisticated metrics, traces, and logs needed to meet SLOs.
- Leading incident response efforts and facilitating the blameless postmortems that help systematically reduce recovery time (MTTR).
- Defining and monitoring the SLIs and SLOs that ensure the platform consistently meets rigorous healthcare performance standards.

Compliance & Collaboration
- Ensuring every piece of infrastructure remains fully compliant with HIPAA and other critical healthcare regulatory requirements.
- Mentoring engineers across the company on reliability best practices and contributing a clinical-safety perspective to cross-functional design reviews.

Why You Might Be a Good Fit

You are a deeply proficient engineer who excels at the intersection of cloud infrastructure, automation, and system design.
You possess a meticulous approach to observability and a passion for finding the "root cause" rather than just applying a patch.
You enjoy exploring the "next frontier" of SRE, including how AI and agentic tools can make operations more efficient.
You thrive in fast-paced environments where technical rigor is balanced with pragmatism and clinical-grade safety.

This Might Not Be The Right Fit If...

You prefer working on static infrastructure rather than evolving systems through code and automation.
You are uncomfortable with the "agile" pace of tech-driven platform development or integrating AI tools into your daily workflow.
You prefer a siloed role that does not involve active participation in incident response or collaborative postmortems.

Your Qualifications

5+ years of experience in SRE, DevOps, or Platform roles managing production environments at scale.
Expert technical depth in AWS (EKS, EC2, RDS, S3) and production-grade Kubernetes management.
Proficiency with modern tooling including Terraform (IaC), Datadog (Observability), and CI/CD systems.
Strong coding and scripting skills in Python, Bash, Ruby, or Go.
Preferred experience building agentic workflows or AI-assisted tooling to drive operational efficiency.
A "rigor-first" mindset with a dedication to HIPAA-compliant, high-availability architecture.

The national pay range for this role is $135,000.00 - $160,000.00 per year. Actual compensation will be determined by factors such as the candidate's geographic market, experience, skills, and qualifications. Certain roles may also be eligible for additional compensation and a comprehensive benefits package that includes medical, dental, and vision coverage, unlimited PTO, a 401(k) plan, stock options, and bonuses. If your compensation requirement is greater than our posted range, please still consider applying; a determination can be made based on unique qualifications. Expected compensation ranges for this role may change over time.

About Gem.com

Industry

Enterprise Technology

Founded

2013

* Ladders Estimates
