More About the Role
GrubHub is looking for an experienced SRE who specializes in managing large, critical data persistence platforms, including Cassandra and Elasticsearch, on AWS. The GrubHub platform supports high-volume applications in a container-based microservice architecture running in multiple AWS regions in fully Active/Active mode. The entire platform is powered by a very large multi-datacenter Cassandra infrastructure for persistence, and by Elasticsearch for indexing and for scaling the search and content experience. You will work with a team of passionate and skilled engineers responsible for automating, scaling, tuning, and troubleshooting Elasticsearch and Cassandra databases. You will also collaborate with a diverse group of engineers across the organization to design and engineer solutions.
The Impact You Will Make
- Manage large, critical Cassandra and Elasticsearch clusters supporting millions of transactions per day
- Build systems to automate all build and maintenance tasks using Ansible and Python
- Develop self-service tools that allow engineers to manage and provision resources in line with GrubHub best practices and standards
- Monitor cluster availability, read/write latencies, and other key performance metrics to proactively identify SLO misses and help mitigate issues
- Evaluate new technologies, tools, and software versions; test, plan, and develop roadmaps
- Tune Cassandra and Elasticsearch databases to optimize throughput and read/write latencies
- Participate in a 24x7 on-call rotation with the rest of the team for rapid incident response
- Implement DR strategies, including backup and recovery techniques with minimal downtime
- Work with other engineers to manage how our data persistence layer integrates and performs with the GrubHub platform
- Proactively monitor and scale Elasticsearch/Cassandra clusters to handle growth in traffic
What You Bring to the Table
- Experience developing backend applications in Python or Java
- Experience managing, working with, or developing for large Elasticsearch clusters in highly available 24x7 production environments
- Experience automating the maintenance of infrastructure using Python and Ansible or similar tools
- Strong experience managing automated cloud infrastructures on AWS or other major cloud providers
- Experience managing large Cassandra clusters in production is a strong plus
- Experience working with Docker is a plus
- Ability to quickly learn new concepts and technologies and adapt to changing needs