Site Reliability Engineering Manager

RMS

Newark, CA

Industry: Technology

5 - 7 years

Posted 44 days ago

RMS is seeking a Site Reliability Engineering Manager to lead a US-based SRE team. You will play a critical and visible role in delivering and supporting our next generation RMS(one) platform.

We are building a new platform that:

  • is a highly scalable, cloud-based SaaS offering that helps our clients understand and manage risk
  • is based on Linux, Java, and open source technologies, and leverages the latest advances in database tools, vector processing, hardware-based acceleration techniques, and geographic visualization tools
  • uses a unique Big Data approach designed to scale to massive sizes over time, together with large-scale distributed data processing technology

About you:

You are driven by professional curiosity and a desire to develop a deep understanding of the services and the technologies they depend upon.

You are passionate about automation and can demonstrate practical knowledge of various aspects of distributed service design, such as messaging protocols, caching strategies, persistence technologies, and queuing.

You can understand and explain how product architecture decisions affect a product's ability to run as a distributed system.

You are deeply technical and have the ability to dig in and get your hands dirty when needed.

About the Role:

  • You will manage a kickass team of SREs, providing vision and leadership
  • Foster the adoption of software and systems engineering approaches within the team and mentor junior SREs in their growth into mature SREs
  • Manage the work and priorities of the team to reduce toil and establish a healthy balance between toil and development work
  • Partner with our extremely talented development teams to help them build reliable and scalable services, and resolve any production issues as quickly as possible
  • Champion service reliability, observability, and supportability
  • Manage incident response and resolution, service restoration, and incident prevention
  • Identify gaps in processes, skills, tooling, and technology choices, and work with upper management to drive improvements within the organization
  • Lead by example, care for your team, and establish credibility through the quality of your own and your team's technical execution
  • Stay abreast of the latest SRE methodologies, and skillfully adopt the appropriate ones
  • Be a change agent, with the ability to skillfully and strategically implement the SRE vision
  • Recruit, mentor, retain, and grow top-notch talent

Requirements

  • 6+ years of related experience in a hands-on technical role such as SRE, systems administration, or development engineering
  • 2+ years of experience leading and managing a team of engineers
  • Knowledge of cloud computing patterns
  • Experience supporting and deploying platforms/cloud applications on AWS
  • Experience with container and container management technologies, such as Docker and Kubernetes
  • Good understanding of microservices concepts/architecture and design patterns
  • Experience with Big Data and analytics technologies
  • Experience coding in Java, Python, and shell
  • DevOps skills with Jenkins, Terraform, Ansible
  • Experience and knowledge of systems monitoring and logging
  • Experience working with developers to instrument applications
  • Knowledge of infrastructure security and compliance
  • Familiarity with ITIL-based incident, problem, and change management