Drive the optimization of technology operations through system/service performance monitoring, data and analytics. Create, oversee and execute a monitoring program for normal operation (performance, usage and trends) of assigned environments, technical assets and/or services to ensure established service level agreements (SLAs) are met. Design, implement and champion monitoring tools/reports to ensure thorough monitoring, identify exceptional conditions and recommend opportunities for improvement. Provide technical and procedural direction to more junior members of the team.
- Process Ownership, Championship & Improvement - with a thorough understanding of technology assets/environments/services, business needs and SLAs, lead the creation, revision and implementation of monitoring tools, processes and reports. Regularly review and identify process improvement opportunities and implement changes in collaboration with the process owner and other technology functions. Champion and provide oversight to ensure adherence to established processes, tools and methodologies. Establish a baseline for the current state, a vision for the future state, and a strategic roadmap that moves the monitoring program toward its point of arrival.
- Service Design & Continuous Improvement - help establish availability, reliability and maintainability requirements for environments, technical assets and services. Review physical and logical configuration plans and provide feedback to the applicable technical operations/services team to address gaps or opportunities. Review availability information and identify developing issues and opportunities for improvement. Ensure effective hand-offs with the appropriate technology function(s). Provide input into and drive availability improvement plans.
- Data Collection, Tracking & Reporting - lead (with the process owner) the definition and creation of data collection/tracking tools and methods and standardized reports to understand and optimize technology systems and/or services. Execute and/or lead the use of a variety of techniques and systems to collect and understand performance data; design and implement reporting templates and dashboards to convey trends, consumption and performance; and monitor compliance with SLAs. Manage ad hoc reporting needs and requests.
- Monitor & Alert - monitor assigned environments, technical assets and/or services for behavior or performance outside of standards or SLAs. Identify the potential cause and evaluate the impact on infrastructure, delivery or services. Determine appropriate next steps (e.g. closer monitoring, further review or immediate action). Alert the appropriate team (per process) when a threshold has been reached or a change/failure has occurred. Provide advice and guidance to others in monitoring and analysis of assets, systems and services.
- Leadership & Oversight - provide oversight, technical direction and expertise to the operations support teams on data analysis, monitoring tools and processes, and event detection. Influence and lead cross-functional teams of internal and vendor resources to champion processes and drive service improvement initiatives.
- Documentation Management & Invoice Validation - document concerns and findings, collecting all pertinent data (including a comparison of exception data against normal data). Ensure incident/event tracking tools are current (per established guidelines and procedures). Review, improve and champion the accuracy and maintenance of knowledge base content and the known error database.
*Open to Virtual work arrangements
- 3+ years of experience with event monitoring, including setting up thresholds and views
- 5+ years of broad technical experience in a majority of the following areas: Java & ASP.NET containers (Apache, Tomcat, IIS), servers, networking (switches, firewalls, load balancers), hardware, operating systems (Windows, AIX, Linux), virtualization software, middleware, databases (Oracle, SQL Server) and related base build infrastructure and software
- 5+ years of hands-on experience with infrastructure monitoring solutions (Microsoft SCOM, SolarWinds, or equivalent) and/or application performance management solutions (HP Diagnostics, Dynatrace, or equivalent)
- Experience and subject matter expertise in the web and distributed computing environment
- Proven ability to analyze and interpret technical data that leads to performance improvements, greater availability & capability
- Self-starter with proven organizational and leadership skills to successfully lead and influence cross-functional teams without a direct line of authority
- Proven thought leader with excellent troubleshooting and reasoning skills and the ability to quickly understand complex architectures and operating environments
- Strong written and verbal communication skills with experience creating, championing and maintaining processes, procedures and policies