Site Reliability Engineer
5 - 7 years experience • Professional, Scientific & Technical Services
Job Number : 84710
Division : Information Technology
The Site Reliability Engineer is responsible for continuously monitoring all applications and production environments and for responding to alerts and potential issues. The right candidate for this position will be able to manage multiple systems, analyze issues and their severity, report events promptly, and develop and track performance statistics. The position requires strong communication and technical skills and the ability to work with cross-functional teams in a rapidly growing environment.
DUTIES AND RESPONSIBILITIES:
* Partner with delivery teams to improve reliability and operational efficiency throughout the entire SDLC
* Monitor various applications to proactively identify system disruptions and preempt enterprise outages
* Monitor applications and ensure that required Service Level Agreements (SLAs) are met
* Notify internal and external departments of performance issues and trends
* Support maintenance and monthly outages
* Review and update tickets with most current status information
* Incorporate monitoring of any new applications or systems
* Review and suggest monitoring tools as needed
* Monitor and support the Testing, Education, and Production environments
* Develop after-action reports and provide input to post-mortems, as needed, on the following:
  * Performance issues
  * Scheduled server maintenance
  * Root Cause Analysis (RCA) and follow-up, both internal and external
* Provide updated reports
* Perform full system analysis of software performance, in addition to capacity planning and demand forecasting
* Triage tickets raised by our support organization and implement fixes
* Improve monitoring infrastructure, build out data aggregation and alerting rules
* Work closely with engineering to build scalable solutions
* Partner with delivery teams on change management to more effectively manage change to environments, especially Production.
* Leverage automation to enable progressive rollouts, speed up problem detection, and automate safe, quick rollbacks when problems occur
* Special projects and assignments as business dictates
This position has no supervisory responsibilities.
SKILLS AND QUALIFICATIONS:
* At least 5 years of relevant overall experience.
* BS in Computer Science or comparable field of study
* A background with distributed systems, databases and performance analysis.
* Very strong SQL Skills
* Excellent scripting skills for debugging and automation (Python knowledge is a plus)
* Experience with data platforms, data warehouses, and data pipelines/ETL
* Server hardware troubleshooting is a plus.
* Strong communication skills.
* Outstanding organizational skills and keen attention to detail
* Extensive knowledge of common Internet Protocols
* Experience with virtualization and cloud technologies
* Experience with writing code around infrastructure automation
* Understanding of how to architect and implement highly available, scalable, and secure networks in multiple cloud environments
* Strong affinity for, and experience working with, continuous integration and continuous deployment environments
* Full stack troubleshooting and instrumentation experience
* Understanding of AWS or another cloud platform and the Atlassian tool suite
PHYSICAL DEMANDS:
* Sitting for extended periods of time
* Dexterity of hands and fingers to operate a computer keyboard, mouse, and other computing equipment
* The employee is frequently required to talk or hear
* The employee is occasionally required to reach with hands and arms
* Specific vision abilities required by this job include close vision, distance vision, color vision, peripheral vision, depth perception, and ability to adjust focus
* Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.
WORK ENVIRONMENT:
* The noise level in the work environment is usually moderate
* Fast-paced office environment