Site Reliability Engineer

Tecsys • $90K — $120K *

Montreal, QC H1A 0A1 • In-Person

Information Technology

5 - 7 years of experience

Job Overview by Ladders

Qualifications

5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.
Experience designing and deploying large-scale systems and multi-vendor platforms.
Proven cloud infrastructure management experience in AWS and Kubernetes at scale.
Strong hands-on experience with Infrastructure as Code (IaC) and automation tools like Terraform.
Familiarity with CI/CD pipelines and their automation, preferably GitLab.
Deep knowledge of monitoring and observability practices using Datadog.
Strong communication skills in English, both written and spoken.

Responsibilities

Collaborate with Engineering teams for pre-live service support through design consulting and launch reviews.
Innovate to identify pain points and propose creative solutions for platform improvement.
Measure and monitor service availability, latency, and system health post-launch.
Enhance observability features using Datadog and establish actionable dashboards.
Develop and improve internal tooling and automation frameworks to reduce manual interventions.
Act as incident commander during system incidents to manage responses and communications.
Support cross-functional collaboration to maintain performance and reliability during global growth.

Benefits

Work in a highly skilled team focused on continuous improvement and operational excellence.
Opportunity to innovate relentlessly and drive initiatives that strengthen the platform.
Engagement in post-incident reviews and long-term stability improvements.
Exposure to advanced tools like Amazon Kiro for automation.
Participate in quarterly offsites and conferences, enhancing professional development.

Full Job Description

About the Role

We are looking for a Site Reliability Engineer to join our Network and Security Operations Center (NOC), a team at the heart of platform reliability for mission-critical SaaS environments. You will help maintain, optimize, and ensure the reliability and performance of the systems that power our cloud infrastructure across AWS and Kubernetes, with a strong focus on automation, observability, and continuous improvement. This role blends reliability engineering with incident command, giving you real ownership over uptime, performance, and innovation. You will be part of a highly skilled team that values creative problem-solving, operational excellence, and continuous improvement through automation and resilience engineering.

Your responsibilities

Collaborate with other Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
Innovate relentlessly: Identify pain points, propose creative solutions, and drive initiatives that simplify, scale, and strengthen the platform.
Maintain services once they are live by measuring and monitoring availability, latency and overall system health.
Own observability: Enhance and expand monitoring and alerting using Datadog; define SLOs/SLIs and create actionable dashboards that drive reliability outcomes.
Drive automation: Develop and improve internal tooling, IaC frameworks, and pipelines (Terraform, GitLab CI/CD) to reduce manual intervention and enable self-healing systems.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Act as an agent orchestrator using Amazon Kiro: run multiple activities in parallel by leveraging AI agents to accelerate execution, while personally validating results and completing selected tasks manually when needed.
Be on-call.
Practice sustainable incident response and blameless postmortems. Lead post-incident reviews (RCAs) and identify long-term fixes that improve stability, reliability, and developer experience.
Implement monitoring, logging, alerting, and SLA reporting.
Create and maintain technical documentation.
Implement, maintain and mature SRE best practices.
Lead incidents: Act as Incident Commander; coordinate cross-team response, manage communications, and ensure rapid service restoration.
Provide support for our planning and deployment teams to enable stability, predictability, and scale in our continued growth.
Collaborate with members of the Platform Engineering team to implement and support far-reaching strategic efforts, provide constructive feedback, and foster a collaborative environment.
Work cross-functionally with internal teams and vendors to manage our growth around the globe, with a strong focus on maintaining the high level of performance, availability, and reliability for our users.

Requirements

Tools used

AWS (multi-account, VPC, EC2, EKS)

Kubernetes

Datadog

Terraform

GitLab CI/CD (Jenkins acceptable)

Amazon Kiro (licenses provided) - expected to be used proactively and heavily in day-to-day engineering tasks, with human validation of outputs.

Python, Bash, Java or equivalent for automation and diagnostics.

Qualifications

5+ years in Site Reliability, Cloud, or DevOps Engineering, ideally in SaaS or large-scale production environments.

Experience designing and deploying large-scale systems, multi-vendor platforms, and globally distributed infrastructure.

Proven experience managing cloud infrastructure in AWS (multi-account, VPC, EC2, EKS) and Kubernetes at scale.

Strong hands-on experience with IaC and automation (Terraform, Ansible, or similar).

Familiarity with CI/CD pipelines and release automation (GitLab preferred, Jenkins acceptable).

Deep understanding of monitoring and observability using Datadog (or equivalent), including metric design, log pipelines, alerting, and dashboards.

Experience with incident management, on-call participation, escalation, and structured postmortems.

Scripting skills in Python, Bash, Java or equivalent for automation and diagnostics.

Curiosity, ownership, and a bias for action; you see a problem, you solve it, and you share the lessons learned.

Experience with FedRAMP (the Federal Risk and Authorization Management Program) compliance is a strong asset.

Basic knowledge of Java- or .NET-based development is required.

Strong English communication skills, both written and spoken, are essential for effective correspondence with customers, business partners and colleagues beyond the province of Quebec.

Additional requirements:

Escalation on-call rotation
Occasional travel (quarterly offsites, conferences - less than 10%)

We understand that experience comes in many forms and that careers are not always linear. If you don't meet every requirement in this posting, we still encourage you to apply.

About Tecsys

Tecsys Inc. is a Canadian company that provides supply chain management software solutions. The company's software is used by healthcare providers, third-party logistics providers, and other organizations to manage their supply chains. Tecsys was founded in 1983 and is headquartered in Montreal, Quebec. The company has offices in the United States, Canada, and the United Kingdom. Tecsys is publicly traded on the Toronto Stock Exchange under the ticker symbol TCS.

Size

600 employees

Industry

Information Technology

Founded

1983

* Ladders Estimates
