We are seeking a leader for our Reliability Engineering team. The ideal candidate will have advanced experience with Enterprise and Application Performance monitoring, as well as a demonstrable track record of effective Capacity monitoring. Experience with, and a passion for, automation and constant improvement are also desirable. You will work directly with product development, application and platform support teams and report directly to the Vice President.
GENERAL JOB FUNCTIONS
Major - Delivery and Execution
- Leads a team of Monitoring SMEs to achieve the following goals:
- Collaborates with other teams to develop secure, reliable, efficient and scalable software services.
- Works with Architecture, Development and Systems Engineering teams to develop innovative solutions to attain high availability, scalability, and reliability.
- Works with internal and external teams to develop automation for tool configuration and functional certification.
- Maximizes product reliability by developing implementations of commercial monitoring software to align with rapidly evolving business needs.
- Creates effective dashboards, reporting, alerting and responses to ensure that impact from issues is either avoided or rapidly resolved.
Medium - Support and Collaboration
- Develops team to act as thought leaders in the enterprise.
- Monitors effectiveness of implementation and plans for constant improvement in support of the goals of the organization.
- Provides first line application support for automation and tools.
- Proactively reviews system performance and capacity and aligns with customer roadmap to plan each release accordingly.
- Is familiar with the ITIL framework for change, incident and problem management.
Minor - Learning
- Proactively identifies learning opportunities for developing industry best practices and tools usage.
- Proactively seeks out knowledge on new technologies and techniques and how they are benefitting other organizations.
Preferred Skills and Experience:
Years of Relevant Work Experience: 8-10 years
- Proficient in production performance monitoring concepts and implementation.
- Experience with the following or similar tools and technologies: AppDynamics, Datadog, Apica, Elastic Stack, Linux, Java, Oracle.
- Understanding of production systems design concepts including Reliability, Security, High Availability and Disaster Recovery.
- Understanding of infrastructure automation tools (e.g. Chef) and associated concepts.
- Experience working in an application SaaS delivery channel with microservice-based architectures.
- Bachelor's degree in Information Systems, Computer Science or related field
- Operations exposure: deployment configuration, sustainability, scaling patterns, load balancing, performance tuning, SLA management, integration with enterprise systems
- Creativity in problem solving and analysis, particularly in resolving application technical issues
- Excellent verbal and written communication skills
- Experience working in public cloud
- Experience in managing compliance with PCI Data Security Standards
- Experience with web-based application development and industry trends