- Work with other SREs to build a comprehensive set of tools that monitor our production infrastructure and detect issues before users do.
- Enhance the observability of our systems to reduce the time it takes to understand why an issue happened.
- Work with other engineering teams to build resilient, operable, self-healing services
- Participate in reasonable on-call rotations with the rest of Engineering
- Practice sustainable incident response and blameless postmortems
- Mentor SREs on standard methodology for everything from monitoring to troubleshooting complex code issues
- Previous experience architecting, building and deploying monitoring and observability systems, preferably with statsd/Datadog, Prometheus, and SumoLogic.
- Solid understanding of systems and application design, including the operational trade-offs of various designs.
- At least 5 years of experience managing servers at scale, preferably in AWS
- Ability to lead technical teams through design and implementation across an organization
- Reasonably deep knowledge of Linux and internet technologies
- Practical knowledge of service design, including messaging protocols and their behavior, caching strategies, and software design practices.
Nice to have
- Experience with distributed tracing and the Cloud Native Computing Foundation technology stack.
- Previous experience driving adoption of new systems across engineering teams
- Contribution to open source projects
- An active interest in serverless computing and containerization
- Collaborates and works as a team
- Avoids doing things twice
- Solves hard problems for tomorrow, not just for today
- Stays positive and prefers fixing problems to complaining about them
- Investigates, considers and adopts new technology where it makes sense
- Doesn’t tolerate brilliant jerks