Overview & Responsibilities
As a Site Reliability Engineer, you will work with other SREs, Engineers, Developers and our support & operations teams to ensure maximum performance, reliability and automation of our Managed Kubernetes deployments and infrastructure on top of Azure / AWS.
We recognize that manual approaches to operations do not scale, and we have a dedicated Site Reliability Engineering team to tackle the significant problems of managing many discrete Private Cloud and Public Cloud Kubernetes deployments, with multiple offerings and form factors, at scale worldwide.
Our Site Reliability Engineer is someone who is familiar with both software and systems engineering, with a desire not just to resolve a problem but to prevent it from recurring. You should have excellent written and verbal communication skills, and you should be comfortable operating in a fast-paced environment.
You will be working with many new and cutting-edge technologies, such as Kubernetes, Docker & LXC containers, software-defined networking, security tools, and other Cloud Native Computing Foundation projects, as well as our extended platform support for Managed Kubernetes on top of Azure / AWS.
In addition to resolving issues and automating fixes internally and downstream, you will submit patches upstream whenever a problem is better served by a fix in the upstream open source code, improving the operational and reliability aspects of those projects.
Design and architect new operational solutions, and maintain existing ones, for managing our customer environments and infrastructure across data centers and technologies, with the specific goal of increasing the automation, repeatability, and consistency of operational tasks.
Implement and maintain monitoring and alerting solutions that help discover failures in a timely fashion, while working with engineers to identify root causes and fix issues.
Provide basic to intermediate network administration and troubleshooting.
Handle day-to-day operational management, including incident response and event and problem management activities, alongside our service delivery and engineering teams.
Participate in on-call rotation duties.
Engage in and improve the whole lifecycle of services, from inception and design through deployment, operation and refinement.
Support services & deployments before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity.
Practice sustainable incident response and blameless postmortems.
Experience in one or more of the following: Microsoft Azure, Amazon Web Services.
Experience with Kubernetes and Docker/container runtimes is a must.
Experience in one or more of the following is a must: Python, Go, and cross-platform scripting.
Experience with algorithms, data structures, complexity analysis and software design.
Experience with Linux systems administration and tuning.
Experience with automation tools such as Docker, Jenkins, Ansible, and Terraform.
Understanding of, and hands-on experience implementing, containerized systems.
Comfort with collaboration, open communication and remote teams.
Interest in designing, analyzing and troubleshooting large-scale distributed systems.
Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
Ability to debug and optimize code and automate routine tasks.
Treat infrastructure and automation as code and as critical engineering tasks.