At Olo, Site Reliability Engineering is a discipline that combines software and systems engineering to build and run web-scale, distributed, fault-tolerant and performant systems. As an SRE, you will ensure that Olo's internal and external applications have reliability and uptime appropriate to end users' needs, maintain a feedback loop focused on improvement, and keep a watchful eye on capacity and performance.
What You’ll Be Doing
- Take ownership of the entire reliability process, from observability and SLIs/SLOs to incident response to postmortems and follow-up actions.
- Work to define standards and best practices and help drive those into each team.
- Help us implement and tailor our incident response tools in order to minimize outage durations.
- Brainstorm, define, and build collaborative monitoring solutions with members across multiple product teams.
- Contribute insights across teams to help us improve or re-architect existing systems to support scale, performance and extensibility.
- Constantly re-evaluate our observability tooling to improve architecture, knowledge models, user experience, performance and stability.
- Analyze and mature our processes around Incident Response, Observability, Postmortems and Predictive Monitoring.
- Maintain production services by measuring and monitoring availability, latency and overall system health.
- Influence an engineering culture of reliability, observability, and availability.
- Coach and mentor engineering teams through game days, SRE boot camps and other training and feedback channels.
What We’ll Expect From You
- Strong experience with monitoring systems like Datadog, Sumo Logic, Raygun, New Relic or similar.
- Fluency in at least one Incident Management tool such as FireHydrant, OpsGenie, PagerDuty, VictorOps or similar.
- Some past experience with build and deploy tools such as Jenkins, TeamCity, Octopus or CircleCI.
- You've been in the trenches building highly scalable, efficient, and resilient systems.
- Prior hands-on software development experience highly desired.
- Self-starter who can take high-level direction and organize the work needed to achieve its objectives.
- Highly motivated individual with a curiosity to learn as you grow.
- Legally able to work in the U.S.
- Willing to roll up your sleeves, work hard and be scrappy!
Nice to Have
- Experience with Ansible, Terraform or other Infrastructure-as-Code tools.
- Experience with containers and container orchestration frameworks.
- Expertise in guiding Incident Response, in terms of both process and tooling.