Job Description
What is the opportunity?
RBC Insurance Technology is seeking to hire a Senior Site Reliability Engineer for its Insurance Technology Platform Support team. The Insurance Technology Platform Support Team is a specialized unit dedicated to ensuring the optimal performance, availability, and resilience of IT applications used in the insurance line of business. With a unique blend of technical expertise and industry-specific knowledge, this team plays a critical role in ensuring the seamless operations of digital services that cater to both the business's internal and external stakeholders.
As a Senior Site Reliability Engineer, you will bring the engineering mindset of bold ambition, curiosity and outcome focus to ensuring the performance and reliability of our systems. This role calls for a dynamic individual who excels in a collaborative environment, working with cross-functional teams to implement best practices for observability, monitoring, logging, alerting, and automation. As we evolve toward AI-driven autonomous operations, you will play a key role in transitioning from traditional reactive incident response to intelligent, self-healing systems. This role will be responsible for the development, implementation, and support of Site Reliability Engineering (SRE) solutions for applications supported by RBC Insurance Technology. You'll leverage your proficiency in Elasticsearch, Ansible, GitHub Actions, Moogsoft, PagerDuty, Dynatrace, and emerging AIOps platforms to build and maintain robust automation, intelligent observability, and AI-enhanced SRE tooling.
What will you do?
- Contribute to the SRE product base (intelligent monitoring, alerting, machine learning anomaly detection, Agentic AI self-healing, reliability testing)
- Implement and enhance AI-driven monitoring and intelligent observability capabilities across supported applications
- Design and implement ML-based anomaly detection pilots, transitioning from rule-based to predictive alerting
- Architect and develop Agentic AI self-healing solutions that autonomously remediate common incidents
- Design human-AI workflows that balance automation efficiency with appropriate human oversight and governance
- Standardize application telemetry data to increase coverage of signal types, building the foundation for advanced AI/ML capabilities
- Contribute to centralization of observability and monitoring backends for advanced telemetry correlation
- Collaborate with cross-functional teams to implement best practices for monitoring, logging, and incident response, driving a proactive stance on system health
- Implement and manage automation processes with Ansible and GitHub Actions to streamline operational tasks
- Develop and maintain custom tooling and automation scripts in languages like Bash, Python, and PowerShell to enhance operational efficiency and system reliability
- Work closely with development teams to understand code changes and their impact on the production environment, ensuring that new releases meet our reliability standards
- Actively contribute to the definition and tracking of SLIs, SLOs, and other critical metrics, refining our alerting and monitoring strategies accordingly
- Evolve runbooks into automated remediation workflows and Agentic AI automation, reducing manual intervention
- Create and refine custom tooling and automation scripts using languages such as Bash, Python, and PowerShell, supporting the infrastructure's scalability and reliability needs
- Support deployments by advocating for reliability and performance improvements based on industry trends and company objectives
- Participate in incident management and problem management for applications in scope, and contribute to completing root cause analysis (RCA) action items
- Validate and govern AI outputs to ensure compliance with financial services regulations and maintain human accountability for AI-driven decisions
- Drive transformation by continuously looking for ways to automate existing processes and adopt intelligent operations
- Debug production issues across services and levels of the stack and provide primary operational support
- Provide production support, including off-hours support as part of an on-call rotation
Must-have
- 4+ years of SRE or Systems Engineering experience with strong technical expertise
- Bachelor's degree in Computer Science, Engineering, or a related field, or equivalent experience
- Expertise in infrastructure-as-code and configuration management, particularly Ansible
- Advanced scripting capabilities in Bash, Python, PowerShell, or other similar languages
- In-depth knowledge of tools such as Elasticsearch, Ansible, GitHub, OpenShift, Kubernetes, Dynatrace, Kafka, and their role in system reliability
- Knowledge of creating, maintaining, and alerting on SLIs, SLOs, and other reliability metrics
- Understanding of AI/ML concepts and their application to observability and operations (AIOps)
- Experience with or strong interest in intelligent monitoring, anomaly detection, and automation technologies
- Ability to design and implement human-AI workflows with appropriate governance controls
Nice-to-have
- Insurance or financial services industry experience
- Hands-on experience with AIOps platforms and intelligent observability tools
- Experience with ML anomaly detection, predictive analytics, or self-healing automation
- Knowledge of prompt engineering and AI model tuning for operational use cases
- Experience designing Agentic AI or autonomous remediation systems
- Familiarity with AI governance frameworks and validating AI outputs in regulated environments
- In-depth hands-on experience with a variety of SRE tools (Azure Automation, Catchpoint, Prometheus, Splunk, Grafana)
- Familiarity with containerization technologies such as Docker
- Hands-on experience with DevOps CI/CD tools such as Jenkins, Artifactory, and Vault
- Experience with telemetry standardization (OpenTelemetry) and observability data correlation
What's in it for you?
We thrive on the challenge to be our best, progressive thinking to keep growing, and working together to deliver trusted advice to help our clients thrive and communities prosper. We care about each other, reaching our potential, making a difference to our communities, and achieving success that is mutual.
- A comprehensive Total Rewards Program including bonuses and flexible benefits, competitive compensation, commissions, and stock where applicable
- Leaders who support your development through coaching and managing opportunities
- Ability to make a difference and lasting impact
- Work in a dynamic, collaborative, progressive, and high-performing team
- A world-class training program in financial services
- Flexible work/life balance options
- Opportunities to do challenging work
Job Skills
Agile Methodology, Application Infrastructure, Group Problem Solving, IT Automation, IT Monitoring, Operations Support, Production Support, Software Development Life Cycle (SDLC), Software Engineering, Software Product Technical Knowledge, System Applications, Systems Software
Additional Job Details
Address:MEADOWVALE BUSINESS PARK, 6880 FINANCIAL DR:MISSISSAUGA
City:Mississauga
Country:Canada
Work hours/week:37.5
Employment Type:Full time
Platform:TECHNOLOGY AND OPERATIONS
Job Type:Regular
Pay Type:Salaried
Posted Date:2026-06-18
Application Deadline:2026-07-17
Note: Applications will be accepted until 11:59 PM on the day prior to the application deadline date above