Site Reliability Engineer - Monitoring & Observability

OfferUp

Bellevue, WA

Industry: Technology


Experience: 5 - 7 years

Posted 24 days ago

Responsibilities


  • Work with other SREs to build a comprehensive set of tools that monitor our production infrastructure and detect issues before users do.
  • Enhance the observability of our systems to reduce the time it takes to determine why an issue happened.
  • Work with other engineering teams to build resilient, operable, self-healing services
  • Participate in reasonable on-call rotations with the rest of Engineering
  • Practice sustainable incident response and blameless postmortems
  • Mentor SREs on standard methodology for everything from monitoring to troubleshooting complex code issues

Requirements


  • Previous experience architecting, building, and deploying monitoring and observability systems, preferably with statsd/Datadog, Prometheus, and Sumo Logic.
  • Solid understanding of systems and application design, including the operational trade-offs of various designs.
  • Minimum of 5 years managing servers at scale, preferably in AWS
  • Ability to lead technical teams through design and implementation across an organization
  • Reasonably deep knowledge of Linux and internet technologies
  • Practical knowledge of service design, such as messaging protocols and behavior, caching strategies, and software design practices.

Nice to have

  • Experience with distributed tracing and the Cloud Native Computing Foundation technology stack.
  • Previous experience driving adoption of new systems across engineering teams
  • Contributions to open source projects
  • An active interest in serverless computing and containerization

Our team

  • Collaborates and works as a team
  • Avoids doing things twice
  • Solves hard problems for tomorrow, not just for today
  • Stays positive and prefers fixing problems to complaining about them
  • Investigates, considers and adopts new technology where it makes sense
  • Doesn’t tolerate brilliant jerks