Manager, Platform & Site Reliability

Canadian Internet Registration Authority

$100K — $130K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 7+ years in Site Reliability Engineering, platform engineering, DevOps or cloud operations, with public cloud expertise, preferably AWS.
  • 3+ years of leadership experience in managing technical teams within SRE or platform engineering.
  • Proven success in mentoring and building high-performing engineering teams, fostering continuous learning and accountability.
  • Skilled in defining technical strategies relating to reliability, security, and operational excellence.
  • Strong understanding of public cloud operations including architecture and resilience strategies.
  • Experience with DevOps practices such as infrastructure as code, GitOps, and CI/CD principles.
  • Proficient in containerization technologies, incident management, and observability frameworks.

Responsibilities

  • Lead and develop a team of Site Reliability Engineers and Platform Specialists to enhance reliability and operational excellence.
  • Define and execute platform strategies aligned with organizational goals and customer needs.
  • Establish and mature SRE practices, including SLOs and operational acceptance criteria.
  • Drive continuous improvement of scalable cloud-native platforms using AWS or similar.
  • Champion automation practices like infrastructure as code to minimize operational toil.
  • Enhance monitoring and observability to ensure platform reliability and customer satisfaction.
  • Manage high-severity incidents, ensuring effective response and follow-up actions to improve platform resilience.

Benefits

  • Blended remote and in-office work arrangements to foster team connection.
  • Regular events and social activities to encourage community engagement.
  • People-centred recruitment process: AI-enabled tools assist with administrative tasks, while staff make all screening, interview, and selection decisions.

Full Job Description

By working with the CIRA registry team, you'll play a part in advancing the CIRA Registry Platform, which supports a wide range of domains globally. Help us drive innovation and maintain the high standards of stability and security that our platform is known for. Join us in advancing digital identity and technology in Canada and beyond.

Who You Are:

You are a people-first technology leader who thrives at the intersection of reliability, platform engineering, and operational excellence. You enjoy building high-performing teams, creating clarity in complex environments, and empowering engineers to do their best work. You balance strategic thinking with technical depth, helping teams deliver resilient, scalable services while continuously improving processes, tooling, and ways of working. Most importantly, you're motivated by solving meaningful challenges and contributing to infrastructure that Canadians and organizations around the world rely on every day.

What You'll Do:

  • Lead, coach, and develop a high-performing team of SRE and Platform Specialists responsible for the reliability, scalability, security, and operational excellence of CIRA's registry platforms and supporting technology services.
  • Define and execute the platform and site reliability strategy, aligning priorities and investments with organizational objectives and customer needs.
  • Define and mature SRE practices, including Service Level Objectives (SLOs), Service Level Indicators (SLIs), error budgets, production readiness standards, and operational acceptance criteria for mission-critical registry services.
  • Drive the design, operation, and continuous improvement of scalable, resilient, cloud-native platforms using public cloud technologies such as AWS.
  • Champion automation, infrastructure as code, GitOps, CI/CD, and self-service platform capabilities to reduce manual effort, operational toil, and engineering bottlenecks.
  • Establish and continuously improve observability, monitoring, alerting, and dashboarding practices to provide clear visibility into platform health, service reliability, and customer-impacting issues.
  • Lead incident management for high-severity events, providing incident command, stakeholder communication, root cause analysis, and driving follow-up actions that strengthen long-term platform resilience.
  • Collaborate with engineering, security, support, compliance, and business stakeholders to establish priorities, balance risk, and deliver platform improvements that support registry operations and organizational goals.
  • Drive performance engineering, capacity planning, disaster recovery testing, and resilience validation to ensure the ongoing reliability and availability of critical registry platforms and related services.
  • Foster a culture of ownership, accountability, continuous learning, operational excellence, and psychological safety that empowers the team to innovate and perform at their best.


What You Bring:

  • 7+ years of progressive experience in Site Reliability Engineering (SRE), platform engineering, DevOps, infrastructure, or cloud operations, including hands-on experience with public cloud platforms such as AWS.
  • 3+ years of experience leading, coaching, and developing technical teams in SRE, platform engineering, DevOps, infrastructure, or cloud operations.
  • Demonstrated success building and developing high-performing engineering teams through mentoring, coaching, performance management, and fostering a culture of continuous learning and accountability.
  • Experience defining technical strategy, influencing cross-functional stakeholders, and balancing reliability, security, operational excellence, and business priorities.
  • Strong hands-on background with public cloud platforms, preferably AWS, including cloud-native architecture, networking, security, resilience, scalability, and cost-aware operations.
  • Experience leading teams that implement and operate infrastructure as code (IaC), GitOps, and automation practices to manage cloud infrastructure, platform services, and deployment workflows.
  • Strong understanding of CI/CD principles, release automation, and modern software delivery practices.
  • Experience with containerization and orchestration technologies such as Docker and Kubernetes.
  • Experience with observability platforms, monitoring frameworks, incident management practices, and operational analytics tools.
  • Demonstrated experience defining and implementing SLOs, SLIs, error budgets, production readiness standards, and incident response processes.
  • Strong understanding of disaster recovery, business continuity, backup and recovery strategies, and resilience testing.
  • Experience supporting highly available, mission-critical, or regulated technology platforms where reliability, security, and operational discipline are essential.
  • Exceptional communication, collaboration, and stakeholder management skills, with the ability to translate complex technical concepts into clear business outcomes for both technical and non-technical audiences.


CIRA embraces a blend of remote and in-office work to keep our team connected and engaged. Our Ottawa headquarters is a hub for regular events and social activities that bring our team together, encouraging a strong sense of community within our organization. No matter where you work from, you'll always feel part of our vibrant team and our shared mission.

At CIRA, people remain at the centre of our recruitment process. While CIRA uses recruitment platforms that include artificial intelligence-enabled features, which may be used to support administrative processes or skills-based assessments, these features are intended to assist our recruitment activities and do not replace human judgment. All applicant screenings, interviews, evaluations and selection decisions are conducted by our staff. Artificial intelligence is not used to make autonomous or final hiring decisions.

This posting is for an existing vacancy.
