The resource is a member of the Residential Reliability Engineering Support Team and is responsible for production support and for developing and maintaining standard operating procedures (SOPs) for customer-facing products. The resource will ensure that all incidents are identified, triaged, and resolved within the Service Level Agreement, and will serve as an escalation point for technical support. Additionally, this position is responsible for ensuring that root cause analyses for high-severity incidents are promptly and properly documented and delivered to the respective product owners. This position will interface with Comcast Product, Change, Problem, Release, Engineering, and Operations Management teams.
- Lead technical investigation and triage of production issues; analyze logs and perform end-to-end investigations, including but not limited to network, software, and infrastructure issues
- Lead technical outage bridges and engage appropriate resources to drive issues to closure
- Document triage and training procedures (including enhancing existing procedures) for complex application workflows (including APIs and endpoints)
- Draft engineering production support readiness documentation
- Actively manage relationships with key stakeholders, markets, and resolver groups
- Respond to service-level issues and work to restore normal service operations as quickly as possible
- Develop procedures for incident triage and management, metric and measure creation, and the management and administration of monitoring tools
- Oversee the timely execution of scheduled and repeatable processes such as periodic system validations, daily triage, system monitoring, and event log management
- Work with architecture, development and engineering teams to identify root cause for incidents and create an action plan for resolution
- Monitor systems and services for most efficient operation, identifying fault conditions as well as opportunities for further optimization
- Analyze problems in design, configuration, data flow, and data state within a highly complex multi-product provisioning system
- Assist in training and developing junior engineers and offshore resources
- Identify and lead the implementation of creative process and technology solutions within the team
- Provide mentorship and team development opportunities
- Assist in representing Production Support to the organization, ensuring that high availability and the ability to identify customer-facing issues are included in the development or deployment of new products and services
- Identify and recommend opportunities for "clean-slate" process improvement with regard to incident management, fault monitoring, triage procedures, and issue escalation
- Maintain escalation and contact lists for mission-critical systems and services
- Consistent exercise of independent judgment and discretion in matters of significance
- Regular, consistent, and punctual attendance. Must be able to work nights and weekends and variable schedule(s) as necessary, including participating in an on-call schedule for after-hours support
- Bachelor's Degree or Equivalent
- Engineering, Computer Science
- Generally requires 7-11 years related experience
- Experience with programming, writing queries, and scripting (e.g., shell, Linux, SQL, Splunk, Python, Bash, Perl)
- Experience in application development, engineering, and support, including reviewing and understanding Java stack traces, log files, and application diagnostic files
- An understanding of WebLogic, cloud infrastructure, and network and server architecture
- Strong understanding of ITIL, with Incident and Problem Management experience.
- Experience defining, implementing, and monitoring IT service level processes.
- Experience with monitoring technologies is a plus
- Must be able to work nights and weekends as part of an after-hours on-call support schedule