More About the Role
We are seeking service-focused SREs to ensure reliable delivery of projects (and food!). The Delivery, E-Commerce, and Restaurant SRE teams are looking for qualified candidates. SREs in these verticals embed with their Software Engineering counterparts and co-own critical application and service designs. We ensure that the quality bar for the software we’re building is high, that Service Level Objectives (SLOs) are rigorously designed and achievable, and that we have all the telemetry we need to make informed decisions about how to scale our services.
The Impact You Will Make
- SREs in the “Runtime Engineering” org embed within the Software Engineering verticals. You will co-own critical production service designs to ensure a high bar of reliability is achievable and measurable.
- You’ll drive reliability and observability improvements in the services of the vertical in which you are embedded. Using SLOs and other telemetry data, you’ll help your team make informed decisions about where reliability challenges may exist, and help design and build solutions to address them.
- You’ll build and improve internal tools and automation software to make maintaining production services easier and safer.
- You’ll champion and lead reliability-focused practices such as Failure Analysis, Load and Capacity Planning, Service Reviews, Architecture Designs, and Incident Postmortems.
- You’ll be a subject matter expert on how the platform operates and a contact point for software engineers.
What You Bring to the Table
- Software Engineering experience: Python, Java, Go, or a similar object-oriented language
- Microservice Architecture and Application design experience
- Distributed monitoring experience: SLOs, metrics, tracing, etc.
- Working knowledge of Cloud technologies (AWS, GCP, Compute/Containers, Storage, etc.)
- Technical writing, documentation, and communication skills
- Experience with high-traffic web-based services
About Our Tech
- Tooling / Automation Code: Python
- Service Code: Java (Spring/Guice), REST / RPC
- Monitoring: Datadog, Splunk, Lightstep
- Cloud (AWS) Technologies: EC2, S3, ElastiCache (Redis/Memcached), Kinesis, Lambda, etc.
- Data Tech: Cassandra, Elasticsearch, Redis, Memcached, Kafka
- Principles: Always hot+hot (N+1 datacenters, external providers, etc.), cache all the things, secure from the start, load / unit / functional tests for everything, measure everything (metrics).