Full Job Description
The Sr Site Reliability Engineer I develops and implements Site Reliability Engineering (SRE) strategies, ensures real-time system observability, promotes best practices and automation, and collaborates with cross-functional teams to enhance system reliability and customer experiences while mentoring junior engineers.
Responsibilities
Mentors junior Site Reliability Engineers and a cross-functional team of colleagues, fostering a culture of excellence and innovation
Provides guidance and support to junior engineers, fostering professional growth and development within the team, ensuring adherence to best practices in Site Reliability Engineering
Manages and oversees collaboration with Software Engineering teams to design, develop, and implement advanced features that enhance system resilience, scalability, and performance, proactively identifying and resolving complex system bottlenecks and failure points
Leads the development and refinement of sophisticated automation tools and frameworks, including advanced infrastructure as code (IaC) practices, to streamline complex operational workflows, deployment processes, and infrastructure management, significantly reducing manual intervention and ensuring high system efficiency
Actively engages in and influences high-level architectural design discussions, ensuring that advanced reliability, scalability, and performance considerations are deeply integrated into strategic decision-making processes, and driving the adoption of innovative solutions
Designs, executes, and oversees comprehensive chaos engineering experiments and advanced resiliency testing, analyzing results to implement improvements that strengthen system robustness and recovery capabilities, and mentors colleagues in these practices
Leads the development, optimization, and maintenance of comprehensive disaster recovery plans and business continuity strategies, ensuring systems can recover quickly and effectively from complex and unexpected disruptions
Advocates for and implements advanced observability practices, including error budgeting, service-level objectives (SLOs), and service-level indicators (SLIs), contributing to a culture of continuous improvement and reliability, and mentoring colleagues in these practices
Collaborates with cross-functional teams to enhance customer journeys, ensuring seamless and reliable technology experiences by addressing potential reliability and performance issues proactively, and leading initiatives to improve overall system reliability
Collaborates and co-creates effectively with teams in product and the business to align technology initiatives with business objectives
Qualifications
Education Qualifications:
Bachelor's degree in Computer Science, Information Technology, or Engineering, and/or comparable experience; advanced degree preferred
8+ years of experience in software engineering and application development with strong proficiency in Java/J2EE, Python, Kotlin, Spring Boot, SQL, and NoSQL
Knowledge of a modern observability stack: Splunk, Elasticsearch, Prometheus, Grafana
Knowledge of containerization technologies (e.g., Kubernetes, Docker) and microservices architecture
Knowledge of observability tools and methodologies, including experience with logging, monitoring, tracing, and performance analysis platforms
Knowledge of cloud-based Site Reliability Engineering (SRE) practices and experience with public cloud platforms such as AWS, Azure, or Google Cloud
Work Experience:
Experience in software development or technology operations, with a focus on Site Reliability Engineering
Experience with Linux/Unix systems, object-oriented programming languages (e.g., Java), scripting languages (e.g., Python, Bash), and cloud platforms (e.g., AWS, Azure, GCP)
Licenses and Certifications:
Advanced certification in Site Reliability Engineering (SRE) or a related field is a plus