We’re looking for a DevOps or Site Reliability Engineer (SRE) who can help us design, build, and maintain high-performance, scalable, reliable services. You will work with our software engineers to teach and enable them to build and run performant, scalable, observable, and reliable applications. We believe every engineering team at Upwave should be responsible for the software they build, and you’ll play an indispensable role in providing the tools, practices, and expertise to make that possible and sustainable.
Your responsibilities will include:
- Designing, building, and maintaining the core cloud infrastructure used by all of Upwave’s engineers
- Developing and promoting conventions on production readiness and operational excellence
- Debugging production issues across services and levels of the stack
- Partnering with engineering teams to ensure their applications and pipelines meet production standards
- Continuously tinkering with our processes, tooling, automation, and documentation to remove friction and improve overall system reliability
- Participating in design reviews and production reviews for new features, products, or pieces of infrastructure
- Carrying forward our efforts toward expressing as much of our infrastructure “as code” as we can
- Educating our entire team on the ideals and practices of DevOps
What we're looking for:
- You know how to design large-scale systems that can handle billions of events per month and are reliable, scalable, secure, efficient, maintainable, extensible, and elegant.
- You deeply understand the power and promise of cloud infrastructure, and you have enough experience building in the cloud to know where the pitfalls are and how to avoid them.
- You have material experience with containers, plus at least one configuration management system (we use Terraform) and one container orchestration platform (we use Kubernetes).
- You are deeply familiar with the cloud, DevOps, and reliability ecosystems, and you continuously invest in keeping your knowledge fresh and current. No one has used every tool, but you should have used many of them extensively in production. You should be able to intelligently discuss the tradeoffs inherent in different tools and solutions, e.g. when to use Kubernetes and when not to.
- You understand how computers work and how they talk to each other. You’re comfortable popping a UNIX shell and getting into the weeds to understand exactly why one service can send packets to another server but can’t get packets back.
- People tend to look to you as a leader and respect your expertise, even in roles where you don't have formal authority. You have experience mentoring junior teammates, and you understand that healthy human systems are essential to developing and maintaining healthy technical systems.
- You thrive on the energy of operating in a fast-paced, ever-changing startup atmosphere. You are a self-starter and you love working independently within a dynamic team.
- We don’t have a formal requirement around years of experience. The typical candidate who’s reached the necessary level for this role has more than 8 years of professional experience, including more than 4 years working specifically in DevOps or Site Reliability. But we care more about what you’ve accomplished than about how many years you’ve spent doing it.
- Experience with AWS (in particular EKS, EMR, Athena, Glue, Kinesis, Route 53, IAM, S3).
- Experience with Docker.
- Experience with Terraform.
- Well-informed opinions about the best way to set up CI/CD.
- Experience working with a microservice architecture.
- Experience having primary responsibility for infrastructure/DevOps/reliability on a large-scale production system.
- Experience with marketing or advertising technology.