The Site Reliability Team’s job is to keep the platform that our 1,400+ customers use running smoothly and efficiently. We build powerful automation that impacts everything from development and testing through to production deployment, scaling, monitoring, and alerting. Put another way, we eliminate work through automation. We have fun leveraging cutting-edge technologies such as Terraform, Kubernetes, Docker, Istio, Jenkins, and Spinnaker.
Help us scale our business to meet the needs of our growing customer base and develop new products on the Namely platform. You'll be a critical part of our growing company, working on a cross-functional team to implement best practices in technology, architecture, and process. You’ll have the chance to work in an open and collaborative environment, shape Namely’s engineering culture, and have ample opportunities to grow and accelerate your career.
We know SRE involves different responsibilities and requires slightly different skill sets at every organization. At Namely, the SRE team evaluates candidates across three core competencies: Coding, Cloud Infrastructure, and Kubernetes. We do not expect candidates to be experts in all three, but rather to have a high level of skill and experience in two of the three and a mid-level understanding of the third.
- We have internal tools which abstract certain concepts from engineers, enabling them to deliver value faster. These are primarily written in Go, with a few in Python.
- We have scripts and internal tools that the SRE team uses to reduce toil and to ensure that our processes produce consistent outcomes.
- We run everything on AWS, across multiple accounts, with the majority of the resources captured in Terraform.
- We work with engineering leadership to define standard patterns that can be adopted across the organization.
- The SRE team provides Kubernetes in a Platform as a Service (PaaS) model to engineers at Namely.
- We’re using EKS for the Kubernetes control plane.
- We provision open-source controllers and some closed-source tools to provide developers with a PaaS. Examples include, but are not limited to, Istio, Kubecost, SignalFX, Prometheus, LogDNA, Jaeger, Gatekeeper, and Kiali.
- Maintain and update Namely’s documentation so that developers understand these components of the platform, their responsibilities, and the relevant how-tos.
- Design and build the tools, frameworks, systems, and processes that Namely engineers use to build, integrate, deploy, scale, and manage their software.
- Automate tasks across the full CI/CD lifecycle to create an efficient developer experience and reduce manual toil.
- Scale solutions from proofs-of-concept to full production systems.
- Collaborate effectively with other engineers on the SRE team and in the larger engineering organization.
- Promote and implement best practices in observability (monitoring, tracing, alerting, logging) and high-availability software engineering.
- Participate in an on-call rotation to mitigate site disruption.
- Minimize the risk of reliability failures affecting durability, availability, performance, and correctness.
- 3+ years in SRE or DevOps roles, with a focus on tooling, automation, and distributed systems development.
- 5+ years of overall software industry experience.
- A desire to stay on the cutting edge of infrastructure and automation technologies.
- Strong software development skills in at least one programming language. The Engineering team primarily uses Go and .NET Core, while the SRE team uses Go, Python, and Bash.
- Production experience with infrastructure frameworks like Docker, Terraform, and Kubernetes.
- Production experience with AWS and Linux environments.