Industry: Information Services
5 - 7 years
Posted 292 days ago by Charulatha Krishnamoorthy
The Site Reliability Engineer is responsible for ensuring that our digital space is up and running at all times and performing at its peak capability. This position will be responsible for planning, deploying and troubleshooting application stacks in support of our growing ecommerce business. The Site Reliability Engineer will interact with several functional areas across all levels of the organization.
Essential Job Functions:
• Work closely with our Store, ESB, and Development teams to design our application stack for an enhanced digital presence, performance, and availability on multiple cloud services.
• Conduct post-mortem reviews of system downtime with internal stakeholders to put short- and long-term solutions in place to eliminate repeat occurrences.
• Conduct risk analysis to review system shortcomings that present a risk of downtime for application stacks. Continuously improve our internal processes and controls to ensure optimal performance.
• Implement DevOps changes and rollouts, and shepherd deployments in a manner that leads to optimal results.
• Combine software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that our internally critical and externally visible systems have reliability and uptime appropriate to users' needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.
• Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.
• Use configuration management tools to create repeatable environments.
• Create dashboards that communicate overall system health, and alert on it, for less technical colleagues.
• Develop system configuration management templates, and audit systems against those templates over the system lifecycle.
• Work with developers to quickly identify and address issues, providing smooth code rollouts and seamless change back-out when there are problems.
Minimum knowledge, skills and abilities:
• Bachelor's degree in Computer Science or a similar area. Experience may be considered in lieu of a degree.
• Minimum of four (4) years of Linux systems administration experience.
• Experience with Apache, Tomcat, WildFly, .NET, or .NET Core hosting.
• Working knowledge of and experience with networking fundamentals.
• Expert-level skill in scripting and automation.
• Expert in high-availability and load-balancing technologies.
• Willingness to document technical processes and share knowledge with others. Capable of following and composing process and procedure documentation, as well as training other users on complex topics.
• Ability to interact with colleagues from all levels of the organization, both technical and non-technical, and communicate technical ideas effectively.
• Proven ability to work independently with minimal supervision.
• Someone driven to get an "extra 9" of availability.
Preferred knowledge, skills and abilities:
• QSR experience.
• Ability to learn new technologies or support existing applications quickly and with minimal guidance.
• Thrives on technical challenges and takes pride in solving them.
• Comfortable in a fast-paced, dynamic environment and with doing things differently.
This is a full-time position that provides Level 2 and Level 3 support, on a 24 x 7 schedule, for all operational and outage issues relating to the infrastructure.