What You’ll Do:
As a Lead Site Reliability Engineer (SRE), you will be responsible for the availability, automation, performance, efficiency, scaling, monitoring, and emergency response of operating systems. You use your deep understanding of platforms, architecture, people, systems, and processes to establish and continuously improve SLIs and SLOs for uptime, performance, deployment, monitoring, and troubleshooting. You are interested in setting direction and leading the day-to-day processes that shape our vision for reliability.
Your Day to Day:
- Maintain and support the product and data systems: proactively monitor events, investigate issues, analyze solutions, and drive problems through to resolution.
- Define requirements and develop tools and reporting as needed by projects and operations.
- Work with product teams to define application hardening requirements and identify opportunities for chaos engineering.
- Use operational tools and monitoring platforms to gain in-depth knowledge, understanding, and ongoing monitoring of system availability, performance, and capacity.
- Work with business partners to establish Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
- Implement an alerting strategy that makes alerts actionable and unique.
- Follow through to ensure issues are resolved to satisfaction.
- Drive continuous improvement and innovation within the team.
- A sense of ownership, initiative, and drive.
- Bachelor's degree or higher with previous experience in a technical support role.
- Experience with Java or .NET application development.
- Experience with SQL Server 2005/2008/2012/2016.
- Experience with browser-related technologies.
- Experience with Linux and Windows.
- Knowledge of monitoring tools and strategy.
- Experience running incident postmortems.
- Solid understanding of automated deployment processes.
- You have been working in technology for 3+ years.