Design, implement, and maintain highly resilient and secure infrastructure for our SaaS platform using AWS services, including API Gateway, Lambda, Aurora Serverless, OpenSearch Serverless, and Secrets Manager, together with FusionAuth
Ensure best-in-class security of the application using AWS security services such as WAF, Shield, and GuardDuty, and implement industry-leading security practices
Develop, implement, and maintain robust monitoring and alerting solutions to ensure the reliability and performance of our SaaS platform, using tools such as CloudWatch, Prometheus, and Grafana
Facilitate and drive incident response, triage, and resolution, as well as retrospective and root cause analysis, to maintain the reliability and uptime of our platform
Lead incident post-mortems and retrospectives to surface reliability improvements and drive them to completion
Implement strategies to increase system resilience and performance through on-call rotation and process optimization
Apply a strong understanding of SRE principles, including error budgets, SLOs, SLIs, and SLAs, and identify and establish them for the team
Build and maintain infrastructure as code using Terraform
Provide input and expertise for system architecture and feature development
Engage and collaborate with stakeholders, including Product, Development, QA, Customer Success, and others, to ensure work is properly defined, prioritized, and executed, including improvements and future initiatives
Educate and guide Engineering teams on best practices for reliability, resiliency, and security
Participate in the Agile Development lifecycle, helping us stay realistic about our goals and flexible in our execution
Foster a culture of group collaboration while also working effectively on your own
Requirements:
Prior SRE experience supporting a cloud-native SaaS platform with AWS
Bachelor's degree in Computer Science, Software Engineering, or a related field (or equivalent work experience)
AWS Solutions Architect and/or AWS DevOps Professional Certifications
A self-starter with strong communication skills, written and verbal, and prior experience thriving in a distributed work environment
5+ years of hands-on experience in site reliability engineering roles
Expert knowledge of AWS services, specifically API Gateway, Lambda, Aurora Serverless, OpenSearch Serverless, and Secrets Manager, as well as FusionAuth
Expertise in AWS security services, including WAF, Shield, and GuardDuty, and a deep understanding of cloud security practices
Strong experience with monitoring and alerting tools such as CloudWatch, Prometheus, Grafana, or similar
Proven ability to design and implement effective monitoring strategies to ensure system reliability and performance
Willingness and availability to participate in a 24x7x365 on-call rotation, ensuring prompt and effective responses to business-critical alerts outside of regular working hours
Extensive experience with Terraform for infrastructure as code
Experience building, securing, and maintaining a multi-tenant SaaS application
Experience with identity providers (IdPs) such as FusionAuth, Okta, Auth0, or similar
Strong understanding of information security principles and practices