As a Principal Site Reliability Engineer, you will work with the LifeLock applications to build, maintain, and support a highly available application infrastructure stack running on AWS. This team strives to continuously improve our environment by ensuring the highest levels of security and reliability while automating as many deployment, quality assurance, and maintenance processes as possible. To accomplish this, you will work closely with our Product/Engineering and TechOps teams to understand how the Business Applications use the infrastructure and to implement monitoring and alerting for these applications.
What you will be doing:
- Ensure high availability, security, and performance of production and development environments for our customer-facing applications that use the following technologies: NodeJS, Java/Tomcat, Activiti BPM, Apache NiFi, and MongoDB.
- Automate configuration management tasks, deployment of product releases, and provisioning of AWS services in development, staging, and production environments.
- Optimize performance and restore applications to full health as required.
- Work with software development teams to shape the architecture, design, and implementation of new and existing systems to enhance their reliability, performance, efficiency, and scalability.
- Ensure all key service metrics are measured and monitored, and that alerts are raised when needed.
- Drive efficiencies in systems and processes: capacity planning, configuration management, performance tuning, monitoring and root cause analysis.
- Promote a DevOps culture by actively building relationships with other technical and business teams.
- Participate in a rotating 24x7 on-call support schedule for production systems.
What you bring:
- BS/BA degree in Computer Science or Information Systems, or an equivalent combination of education and experience.
- 4+ years of experience with Linux installation and patch administration.
- 2+ years of experience with configuration management, CI/CD, and infrastructure-as-code tools and techniques such as Jenkins, Groovy, Chef, Puppet, Ansible, CloudFormation, or Terraform.
- 2+ years of experience building, testing, deploying, and operating highly scalable and resilient cloud-based infrastructure hosting solutions that use NodeJS, Java/Tomcat, Apache NiFi, Activiti BPM, and MongoDB in a medium or large enterprise.
- Experience with container technologies (e.g. Kubernetes) in an enterprise environment preferred.
- Strong understanding of and experience with AWS services including but not limited to EC2, RDS, Lambda, S3, EFS, IAM, CloudWatch, CloudTrail, AWS Systems Manager, AWS Service Catalog, ELB/ALB, Auto Scaling Groups, VPC, and related services.
- Fluency with at least one current-generation scripting language used by DevOps professionals (Python, Perl, PHP, Ruby).
- Experience with Application Performance Management tools (Dynatrace/AppDynamics) preferred.
- Experience working in high-availability (24x7) environments and with highly scalable sites/applications preferred.
- Experience creating meaningful dashboards, logging, alerting, and responses using log analytics tools such as SumoLogic, Splunk, or ELK.
- Ability to clearly document and diagram deployment-specific aspects of architectures and environments.
- Working knowledge of PCI or experience working in a regulated industry preferred.