Site Reliability Engineering Manager in San Francisco, CA

$80K - $100K (Ladders Estimates)

Moovweb

San Francisco, CA 94102

Industry: Enterprise Technology

Less than 5 years

Posted 51 days ago

Moovweb is looking for a senior Site Reliability Engineering Manager to join our Platform team! Moovweb provides a high-speed, scalable platform for hosting modern, dynamic e-commerce websites. Every day, our platform handles a massive amount of traffic for many mid-size to large retailers such as 1-800-Flowers, Pep Boys, and United Airlines. We are heavy users of Amazon Web Services, Kubernetes (EKS), and Node.js. Do you have a passion for speed, scalability, security, automation, and reliability? You could be a great fit for Moovweb!


Requirements:

  • At least 2 years of experience building, running and scaling production Kubernetes architectures, preferably on Amazon EKS
  • At least 2 years of experience working with Amazon Web Services — including S3, EC2, CloudFormation, EKS, Route 53, RDS and more
  • At least 2 years, ideally 4, of experience working with production Node.js web services
  • Experience managing a team


The ideal candidate will have...

  • Knowledge of the tenets of SRE and best practices related to security, performance, reliability/durability, and disaster recovery
  • Extensive experience working with highly scalable, globally distributed systems
  • Strong knowledge of high-performance networking on AWS, Internet protocols, and CDN configuration (preferably Fastly)
  • The ability to build and scale large-scale platform architectures, with a strong understanding of common scaling pitfalls and potential tradeoffs
  • Experience monitoring large environments using tools like Sumo Logic, Datadog, CloudWatch, etc.
  • Passion for good, usable documentation, and appreciation for how it allows a widely distributed team to function
  • Experience designing libraries and tooling to facilitate smooth CI/CD pipelines
  • Strong understanding of security best practices
  • Comfort working on critical, customer-facing issues and the ability to prioritize quickly when escalations happen
  • Passion for making things better and faster!


Responsibilities:

  • Be on-call at least one week out of every month for services that the SRE team owns; help triage, then coordinate/resolve escalations as they arise.
  • Collaborate with other Moovweb engineering teams to support projects before they go live through activities such as app design consulting, developing software platforms and frameworks, capacity planning, and launch reviews.
  • Design and develop automated solutions that improve the performance and reliability of the Moovweb Platform.
  • Improve logging, metrics, telemetry, and monitoring to help us ensure high availability and reduce mean time to resolution (MTTR).
  • Improve the company's SRE processes, and help build a culture around them.


Bonus points:

  • Experience with Varnish and the VCL language
  • Experience with Ruby and Ruby on Rails
  • Experience with React
  • Experience with AWS Lambda and Serverless


Valid Through: 2019-10-18