Site Reliability Engineer

Appboy

New York, NY

Posted 269 days ago

This job is no longer available.


The Site Reliability Engineering (SRE) team at Braze provides guidance, expertise, mentorship and education to the entire Engineering organization on how to build, test, monitor and deploy massively scalable applications. As a member of this team, you will develop a deep, fundamental understanding of how the applications you are responsible for interact with the underlying infrastructure, and of how to translate that understanding into more efficient, scalable application code. You will also define the standards for "production" by working with Engineering teams to establish and implement testing frameworks for both the applications and the infrastructure they run on. These standards will be critical for defining the applications' Service Level Objectives (internal and external) and for meeting those objectives. To be successful on this team, you will need to move seamlessly between system administration and writing the code that impacts those systems, with the goal of providing reliability and uptime at a massive scale.

The primary responsibilities of an SRE are to:

• Define and enable standards for configuration, monitoring, reliability, and performance
• Evolve services and educate engineers to create a culture of reliability and velocity
• Support and improve services from inception through development and production by planning for scale and reliability
• Solve live performance and reliability issues and prevent their recurrence
• Scale services sustainably through the development of internal tools and automation
• Pair with other SREs and DevOps engineers to plan for future capacity and infrastructure needs
• Practice sustainable incident response and blameless postmortems


Qualifications:

• Comfortable working in a highly collaborative environment
• Strong communication skills
• Interest in designing, optimizing and troubleshooting large-scale services
• Conviction and curiosity that fuel a knack for troubleshooting hard problems
• Ability to learn rapidly in high-stress situations and implement changes based on what you learn
• Strong familiarity with containers and container orchestration (Kubernetes, ECS, etc.)
• Experience using automation (Chef, Puppet, etc.) to make services more sustainable
• Experience in developing, debugging and optimizing code (Java, Python, Go, Perl or Ruby)