Moovweb is looking for a senior Site Reliability Engineering Manager to join our Platform team! Moovweb provides a high-speed, scalable platform for hosting modern, dynamic e-commerce websites. Every day, our platform handles a massive amount of traffic for many mid-size to large retailers such as 1-800-Flowers, Pep Boys, and United Airlines. We are heavy users of Amazon Web Services, Kubernetes (EKS), and Node.js. Have a passion for speed, scalability, security, automation, and reliability? You could be a great fit for Moovweb!
- At least 2 years of experience building, running and scaling production Kubernetes architectures, preferably on Amazon EKS
- At least 2 years of experience working with Amazon Web Services, including S3, EC2, CloudFormation, EKS, Route 53, RDS and more
- 2-4 years of experience working with production Node.js web services
- Experience managing a team
The ideal candidate will have...
- Knowledge of the tenets of SRE and best practices for security, performance, reliability/durability and disaster recovery
- Extensive experience working with highly scalable, globally distributed systems
- Strong knowledge of high-performance networking on AWS, Internet protocols, and CDN configuration (preferably Fastly)
- The ability to build and scale large-scale platform architectures, with a strong understanding of common scaling pitfalls and potential tradeoffs
- Experience monitoring large environments using tools like Sumo Logic, Datadog, CloudWatch, etc.
- Passion for good, usable documentation, and appreciation for how it allows a widely distributed team to function
- Experience designing libraries and tooling to facilitate smooth CI/CD pipelines
- Strong understanding of security best practices
- Comfort working with critical, customer-facing issues and the ability to prioritize quickly when escalations happen
- Passion for making things better and faster!
In this role, you will...
- Be on call at least one week out of every month for services that the SRE team owns; help triage, then coordinate and resolve escalations as they arise.
- Collaborate with other Moovweb engineering teams to support projects before they go live through activities such as app design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
- Design and develop automated solutions that improve the performance and reliability of the Moovweb Platform.
- Improve logging, metrics, telemetry, and monitoring to help us ensure high availability and reduce mean time to resolution (MTTR).
- Improve the company's SRE processes, and help build a culture around them.
- Experience with Varnish and the VCL language
- Experience with Ruby and Ruby on Rails
- Experience with React
- Experience with AWS Lambda and Serverless