Site Reliability / Infrastructure Engineer

Medal

$120K — $160K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in site reliability engineering or related fields
  • Deep understanding of GCP services, particularly Kubernetes and managed databases
  • Extensive experience with database scaling, sharding, and optimization in production
  • Proficient in Terraform and infrastructure-as-code practices
  • Hands-on experience with Elasticsearch cluster management
  • Strong incident response skills and experience in postmortem analysis
  • Fluency in CI/CD processes and tools, specifically GitHub Actions

Responsibilities

  • Own reliability across GCP infrastructure focusing on availability and latency
  • Lead incident response processes including on-call duties and postmortem analysis
  • Architect and implement database scaling strategies for MySQL and PostgreSQL
  • Collaborate with product teams to design infrastructure that meets growing feature demands
  • Manage and enhance Terraform and Kubernetes configurations
  • Oversee Elasticsearch cluster management, including performance tuning and capacity planning
  • Develop and maintain observability tools like metrics and alerting systems

Benefits

  • Competitive salary and equity opportunities
  • Comprehensive health insurance including medical, dental, and vision
  • 401(k) retirement savings plan
  • Wellness programs including fitness memberships and mental health resources
  • Paid parental leave with additional fertility support
  • Generous paid time off policy
  • Daily meal provisions and commuter benefits available on-site
  • Stipends for professional development and continuous learning

Full Job Description

The Role

Medal's infrastructure handles billions of clips, video ingestion pipelines, and social features at a scale most engineers never get to touch. We're looking for an SRE who cares deeply about reliability and scalability.

The work centers on reliability, incident response, scaling, and making sure our infrastructure keeps up with our growth. You'll own the on-call rotation, drive postmortems, and work directly with engineering teams to meet their infra needs.

The right person probably came through startups and scale-ups. You've been in the room when things broke at 2am, you've scaled databases under pressure, and you know the difference between a durable fix and a patch that buys you a week.

Key Responsibilities
  • Own reliability across our GCP infrastructure: Kubernetes clusters, managed services, and data pipelines, driving measurable improvements to availability and latency
  • Lead incident response end-to-end: on-call rotations, runbooks, postmortems, and the follow-through that makes sure the same thing doesn't happen twice
  • Architect and execute database scaling strategies (sharding, replication, query optimization, and capacity planning) across MySQL and Postgres at meaningful scale
  • Partner with product engineering to translate feature requirements into infrastructure designs that hold up as we grow
  • Manage and evolve our Terraform-managed GCP environment and Kubernetes cluster configurations
  • Own our Elasticsearch cluster end-to-end: capacity planning, sharding strategy, index lifecycle management, version upgrades, and performance tuning at production scale
  • Build and maintain observability across the stack: metrics, dashboards, alerting, and tracing
  • Continuously improve the reliability of CI/CD and delivery pipelines built on GitHub Actions
  • Harden IAM, secrets management, and network segmentation as part of normal infra hygiene


About You
  • You've worked at startups and are comfortable in an environment of rapid growth where scaling up is a priority
  • You have great judgment: you know the difference between a durable, sustainable fix and a patch that buys you a week
  • You have deep, hands-on experience scaling and sharding relational databases in production environments
  • You know GCP maybe a little too well: Kubernetes, VPC, IAM, Cloud Logging, and the managed services ecosystem
  • You are fluent in Terraform and have owned real infrastructure-as-code at scale
  • You've operated Elasticsearch in production and know how to keep a cluster healthy
  • You have strong incident response instincts: you can work a P0 calmly, communicate clearly under pressure, and run a postmortem that prevents recurrence
  • You've worked with GitHub Actions in a production CI/CD environment
  • You have excellent communication skills (this is crucial!): you can flag issues clearly and quickly during incidents, and you can lead and write actionable postmortems


Our Stack

Google Cloud Platform

Terraform, Salt, GitHub Actions

Java, Redis, RabbitMQ, Elasticsearch, BigQuery, and Kubernetes for the backend

Electron+React

C# and C++ for native Windows recording and more

Swift for iOS, Kotlin for Android

Benefits
  • Competitive salary and meaningful equity
  • Comprehensive medical, dental, and vision coverage
  • 401(k)
  • Wellness and fitness perks including a Wellhub membership and mental health resources
  • Paid parental leave, fertility and maternal health benefits
  • Generous PTO policy
  • Daily meals and commuter benefits at our NYC HQ in Flatiron
  • Learning and development stipend

Benefits vary by country and employment type.
