As part of our growing team, you will be dedicated to improving and scaling the reliability of our end-to-end infrastructure. We dive deep into complex operational challenges, including software, systems, automation, and process analysis. We empower our team to be self-directed and self-motivated in their work. If you'd thrive in that environment, and our core values resonate with you -- build trust, question assumptions, and validate direction -- you'll fit right in!
What you will do:
- Lead projects from proposal through postmortem, assessing vague problems, proposing high-impact solutions, and executing them against a set of success criteria.
- Develop effective tools, alerts, and responses to identify and address reliability risks.
- Work closely with search engineers to triage production issues and determine appropriate remediation, including code changes and performance considerations.
- Participate in our on-call rotation; triage and address reliability issues that come up in production.
- Help determine the future technical direction of our deployment, with a focus on improving reliability and performance.
What we are looking for:
- 7+ years of experience tackling reliability challenges of large-scale deployments and high-traffic, distributed systems
- Experience with production troubleshooting across distributed systems, code, storage, networking, and operating systems
- Moderate-to-advanced programming experience, preferably in a high-level language like Perl or Python
- Experience participating in a 24x7 on-call rotation for a large-scale deployment
- Experience configuring and troubleshooting Linux and NGINX
- Strong organizational skills; you have an eye for detail and are not afraid to use it!
- Effective project management skills; you have successfully launched projects from inception to production
- Strong communication skills; you clearly articulate your recommendations and decisions, both verbally and in writing
- Comfortable providing feedback to an array of stakeholders, both internal and external