Site Reliability Engineer

LiteLLM AI Gateway

$120K — $160K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 1-4 years of experience with Python services in production at scale
  • Proficiency in debugging OOMs, memory leaks, and connection pool issues
  • Strong understanding of PostgreSQL, including optimization and connection pooling
  • Experience with Kubernetes, including pod restarts, resource limits, and health probes
  • Familiarity with Prometheus and Grafana for monitoring
  • Desire to engage with open source projects and user communities
  • Ability to work effectively in small team environments

Responsibilities

  • Own the reliability and performance of the LiteLLM proxy in production
  • Fix out-of-memory (OOM) issues in Kubernetes deployments
  • Resolve database connection problems under load
  • Address race conditions and deadlocks in code
  • Optimize performance of key database operations
  • Enhance Redis and cache reliability
  • Implement effective production monitoring and alerting

Benefits

  • Work directly with the CEO and CTO
  • Engage in critical projects with high impact on users
  • Opportunity for hands-on problem solving in production environments
  • Contribute to the reliability of a critical tech stack for AI applications
  • Experience a culture that values open source and direct user engagement

Full Job Description

What you will be working on

Skills: Python, FastAPI, PostgreSQL, Redis, Kubernetes, Prometheus, performance profiling

As the SRE, you'll own the reliability and performance of the LiteLLM proxy in production. Our users run LiteLLM as a critical gateway handling millions of LLM requests, so when it goes down, their entire AI stack goes down. You'll work directly with the CEO and CTO on critical projects including:
  • Fixing OOM issues - e.g. Prisma Query Engine unable to recover from OOMKill in K8s deployments, unbounded in-memory buffers in spend log transactions
  • Solving database connection problems - e.g. database query limits getting reached under load, spend logs loading extremely slowly, Prisma connection pool exhaustion
  • Fixing race conditions and deadlocks - e.g. max_parallel_requests deadlocking API keys after provider timeouts (counter never released, Redis reset required), PodLockManager releasing another pod's lock, in-memory cache increment race conditions
  • Performance optimization - e.g. update_database() doing 7 deep copies per request in the spend tracking hot path, health check fan-out overloading startup
  • Improving Redis/cache reliability - e.g. budget limiter reading stale Redis data, cache sync issues between in-memory and Redis layers
  • Production monitoring - making Prometheus metrics accurate (fixing missing/inf budget metrics), adding alerting, improving observability for multi-pod deployments
  • Making the proxy self-healing - graceful degradation when DB/Redis is temporarily unavailable, connection retry logic, proper health checks
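
To make one of the failure modes above concrete (a max_parallel_requests counter that is never released after a provider timeout), here is a minimal sketch of the usual remedy: release the slot in a finally block so it is freed on success, timeout, and cancellation alike. It assumes a single-process, in-memory counter; ParallelRequestLimiter, run, slow_provider, fast_provider, and demo are hypothetical names for illustration and are not LiteLLM's actual code.

```python
import asyncio


class ParallelRequestLimiter:
    """Hypothetical in-memory limiter; not LiteLLM's implementation."""

    def __init__(self, max_parallel: int) -> None:
        self.max_parallel = max_parallel
        self.in_flight = 0

    async def run(self, call, timeout: float):
        # Check and increment happen with no await in between,
        # so they are atomic within a single event loop.
        if self.in_flight >= self.max_parallel:
            raise RuntimeError("max parallel requests reached")
        self.in_flight += 1
        try:
            return await asyncio.wait_for(call(), timeout)
        finally:
            # Always release the slot, even if the call timed out
            # or the task was cancelled.
            self.in_flight -= 1


async def slow_provider() -> str:
    # Stand-in for a provider call that exceeds the timeout.
    await asyncio.sleep(1)
    return "ok"


async def fast_provider() -> str:
    # Stand-in for a provider call that returns immediately.
    return "ok"


async def demo():
    limiter = ParallelRequestLimiter(max_parallel=1)
    try:
        await limiter.run(slow_provider, timeout=0.01)
    except asyncio.TimeoutError:
        pass  # the slot should already be released here
    # With only one slot, this call succeeds only if the
    # timed-out call gave its slot back.
    return limiter.in_flight, await limiter.run(fast_provider, timeout=1)
```

Running asyncio.run(demo()) exercises the timeout path followed by a normal call. A multi-pod deployment backed by Redis would need the same guarantee plus an expiry on the counter, which this sketch does not attempt.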

What is our tech stack

The tech stack includes Python, FastAPI, Redis, Postgres, Prisma ORM, Kubernetes, Prometheus, and Docker.

Who we are looking for
  • 1-4 years of experience running Python services in production at scale
  • Experience debugging OOMs, memory leaks, connection pool issues, and race conditions
  • Comfortable with PostgreSQL (query optimization, connection pooling, PgBouncer) and Redis
  • Kubernetes experience - you've dealt with pod restarts, resource limits, health probes, and multi-replica coordination
  • Familiarity with Prometheus/Grafana for monitoring and alerting
  • Passion for open source and user engagement
  • Strong work ethic and ability to thrive in small teams
  • Eagerness to talk to users and help solve real problems - our GitHub issues are full of production debugging sessions and you'd be jumping into those directly
