Site Reliability Engineer
Be a part of a small team of Site Reliability Engineers who work within Engineering to build and operate Acquia's PaaS/SaaS products such as Acquia Cloud, Content Hub, and Lift. The successful candidate will have a tremendous ability to effect change while working on deep technical challenges using the latest cloud technology from Amazon Web Services.
Site Reliability Engineers (SREs) are embedded in engineering teams to help build highly resilient and scalable systems by automating, measuring, and monitoring everything. SREs ensure that their services are operationally ready through their contributions to sprint work and by participating in the engineering support on-call rotation. SREs have the explicit authority and responsibility to 'stop the line' on releases when a service falls below its SLA, and to hand off excess manual work to the overall engineering team when the level of manual work becomes unsustainable. Acquia products run 100% on Amazon Web Services using CloudFormation and other best practices and are managed by their respective engineering teams.
- Work with the team to implement highly available and scalable architectures for core and third-party components of Acquia's PaaS/SaaS products;
- Solve availability/performance problems and build software-based solutions to prevent recurrences;
- Guide and implement build pipelines and automated deployments;
- Implement metrics, monitoring, and incident response processes;
- Implement change management and capacity planning processes;
- Initiate automated production deployments for patches and features;
- Champion the needs of Operations and the Customer Support team;
- Be aware of operations-related issues affecting Acquia's PaaS/SaaS systems;
- Monitor levels of manual effort and signal when they grow beyond a sustainable level;
- Measure availability metrics and signal when a service falls below its SLA;
- Identify the specific information needed to clarify a situation or make a decision;
- Keep customers informed by providing status reports and progress updates;
- Share a 24/7 on-call rotation with development engineers;
- Contribute as part of a Scrum team to maintain a deep understanding of system functionality and architecture, with a primary focus on the operational aspects of the service (availability, performance, change management, emergency response, capacity planning, etc.);
- Develop and maintain effective relationships with internal customers, listen to them, and address their needs and concerns.
Skills and Attributes:
- BS in Computer Science or a comparable field of study, or equivalent practical experience.
- Experience with Unix/Linux systems administration using the CLI.
- Fundamental understanding of TCP/UDP networking concepts.
- Solid oral and written communication skills.
- Experience building systems on cloud technology (AWS, GCE, Rackspace, OpenStack).
- Understanding of Software Development Life Cycle, Test Driven Development, Continuous Integration, and Continuous Delivery.
- Experience gathering and analyzing application and host performance metrics.