Site Reliability Engineer in Charlotte, NC

$100K - $150K (Ladders Estimates)

Moody's Analytics

Charlotte, NC 28202

Industry: Finance & Insurance


5 - 7 years

Posted 48 days ago

Moody's Shared Services are the front-line professionals, including Finance, Technology, Legal, Compliance and Human Resources, who operationally support our business units. Exceptional Shared Services teams are vital to the international success of our business.


The AVP – Systems Engineering role will be a part of the Infrastructure Services group in Central IT and will act as the Site Reliability Engineer and Disaster Recovery Lead for the organization.

The Site Reliability Engineering aspect of this role is responsible for, among other things, the monitoring, availability, performance and incident recovery of the infrastructure platforms and services that the IaaS organization owns and delivers. The role is also accountable for the overall strategy, assessment, setup, maintenance, continuous improvement and reporting of the monitoring needs for Moody's IT technology stack (applications, databases, middleware, infrastructure, cloud and on-premises data centers).

As Disaster Recovery Lead, the role will be responsible for establishing and enhancing the Business Continuity and Disaster Recovery plans, automation of tasks, planning and successful execution of scheduled Disaster Recovery and Operational Resiliency exercises, their reporting and cross-functional coordination.

Job Description

Site Reliability Engineering:

  • Act as the Subject Matter Expert (SME) in a variety of enterprise monitoring technologies and solutions.
  • Analyze and transform operational and/or functional needs of the organization into monitoring solutions, while remaining compliant with the standard IT policies and procedures.
  • Build a catalog with detailed descriptions of system monitoring parameters and integrate them to optimize the overall value and effectiveness.
  • Manage the life cycle (onboarding, maintenance, migration and retirement) of the monitoring tools in use (for example, IPCenter, Dynatrace, AppDynamics, Splunk and SCOM) and maximize the benefit from each tool. Administer and provide software support for monitoring tools, and perform the necessary customization and implementation within any of the tool suites.
  • Handle the day-to-day administration of the monitoring platform, with a focus on improvements that reduce alert volumes without compromising system stability and availability.
  • Maintain and support the infrastructure monitoring environment to ensure the highest availability while reducing the impact of incidents.
  • Collaborate with stakeholders across Moody's IT and business teams on projects and support initiatives and build automated solutions to help detect, log and resolve events and problems that can potentially cause service disruptions.
  • Perform and provide certification or feedback on Production Readiness for various technology solutions owned by the organization.
  • Conduct in-depth evaluations of monitoring and alert data to assist with the diagnosis of infrastructure and application problems.
  • Test, recommend and implement new monitoring technologies. Retire underused and outdated monitoring technologies that carry higher costs and/or diminishing returns.
  • Develop governance reports and analyze IT performance data using Tableau or Power BI.
  • Maintain the knowledge (documentation), reports and other artifacts in a central repository (ServiceNow Knowledge base).

Disaster Recovery Leadership:

  • Manage the strategy, design, implementation, execution, automation, documentation and communication of business continuity and disaster recovery plans and processes that ensure the seamless and successful failover, security and integrity of data, applications, databases, infrastructure systems and other related technologies.
  • Partner with internal and external stakeholders and supplier teams to understand the Infrastructure Services organization's objectives and challenges and the needs of the Business Continuity Management (BCM) and Disaster Recovery (DR) functions, and address them to deliver organizational goals.
  • Own, streamline, optimize, automate, document and continuously enhance the BCM and DR plans and the corresponding tasks.
  • Own, plan, execute and report on the scheduled BCM and DR exercises, with successful cross-functional coordination, matrix resource management, crisp communication, enhanced documentation and change management.
  • Reduce overall execution time and the risk of failure, and maximize the success rate, by automating tasks and eliminating redundant steps in the workflows.
  • Establish and oversee the successful delivery of DR plan roadmaps with Moody's internal and supplier IT teams, Info Risk, Audit and other stakeholders as applicable.
  • Conduct risk analysis to identify critical operations and systems that are core to continued business services in the event of a disruption and include them in the DR planning scope for successful delivery and risk mitigation.
  • Develop and deploy training, documentation, and communication of disaster procedures to the organization.


  • Own and manage the relevant contracts with suppliers for off-site and other resources required for the execution of Enterprise Monitoring and Disaster Recovery responsibilities.
  • Review suppliers' SLA and SLO details and be accountable for their improvement.
  • Ensure that organizational goals and milestones are met while adhering to approved budgets.
  • Develop, enhance and enforce knowledge of the organization's IT Service Management processes.


Minimum education and work experience required for this position include:

  • BS Computer Science or related technical discipline (or equivalent experience).
  • At least 5 years of hands-on experience in Monitoring and Disaster Recovery execution.
  • Competent in networking principles and OS operation and maintenance.
  • Experienced in design/implementation for reliability, availability, scalability and performance.
  • Development skills in at least two programming or scripting languages such as Java, Python, Perl, Shell or SQL, along with experience working with containers and APIs (provide GitHub account details or code examples, if available).
  • Experience with installing, configuring and maintaining monitoring software such as IPCenter (or equivalent), Dynatrace, AppDynamics, Splunk, SCOM, VMware vROps, AWS CloudWatch, Nagios, Azure Monitoring, etc.
  • Solid working knowledge of both Windows and Linux Operating Systems, file and directory structures, commands, command-line interfaces and utilities.
  • Knowledge of IT best practices in the following areas: IT infrastructure monitoring, data networks, IT security, virtualization, web servers, cloud and storage technologies.
  • Ability to use Excel (pivot tables, charts, tables, macros/VBA) and tools like Tableau or Power BI to analyze data and produce charts and reports.
  • Proficiency in ITSM (ITIL v3 Foundation knowledge).
  • Experience in cloud environments such as Azure, AWS, Google or private cloud would be a plus.
  • Understanding of containerization technologies such as Docker and Kubernetes, and of microservices, would be a plus.
  • Working knowledge of ServiceNow would be preferred.
  • Strong communication, presentation, analytical and problem-solving skills required. Must be able to understand technical issues and their solutions, communicate them effectively to multiple stakeholder groups and influence the outcome.
  • Strong customer focus with project management and follow-up skills.
  • This is not a 9 am to 5 pm job. Candidate must be willing to work during non-standard business hours and weekends – on demand and onsite, if necessary.

Valid Through: 2019-10-21