Senior Site Reliability Engineer

Magnet Forensics

$110K - $160K
US-Anywhere / Remote in Canada
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience in cloud infrastructure and site reliability engineering (SRE) practices.
  • Proven ownership of production Kubernetes environments, including cluster health management and upgrades.
  • Experience in responding to production incidents with high stakes and real downtime consequences.
  • Expertise in writing and maintaining Terraform modules, understanding state and drift management.
  • Familiarity with GitOps environments, including Helm chart organization and ArgoCD patterns.
  • Proficient in balancing operational tasks and development work, ensuring production stability.
  • Background in observability practices, including building dashboards and managing alert systems.

Responsibilities

  • Own and manage production Kubernetes clusters on Amazon EKS, focusing on security and lifecycle management.
  • Design and maintain infrastructure-as-code with Terraform, contributing to standards and libraries.
  • Manage Helm chart definitions and implement ArgoCD workflows for global SaaS deployments.
  • Develop and maintain observability infrastructure, optimizing alert systems and log pipelines.
  • Identify and remediate security vulnerabilities, and contribute to compliance efforts such as FedRAMP support activities.
  • Create and maintain operational documentation and change management procedures.
  • Lead incident response efforts, including root cause analysis and post-incident reviews.

Benefits

  • Generous time off policies
  • Competitive compensation
  • Volunteer opportunities
  • Reward and recognition programs
  • Employee committees & resource groups
  • Healthcare and retirement benefits

Full Job Description

Role Overview

We're seeking a Senior Site Reliability Engineer to join our SaaS-Ops team within Shared Services Engineering. The team owns reliability and operational excellence for our highly available SaaS platform, a production Kubernetes environment serving law enforcement and government customers globally.

This role requires deep AWS expertise, infrastructure-as-code discipline, and CI/CD best practices. You'll work closely with Application, Platform, and Security teams to drive secure-by-design architectures and improve automation and reliability across our cloud environments. You'll ship infrastructure as code, respond to production incidents with discipline, and drive platform modernization through deliberate roadmap execution.

As part of the SaaS-Ops team, you'll work in a high-performing environment where members take ownership of outcomes and operate with a strong sense of trust and autonomy. You'll identify challenges, contribute to solutions, raise concerns proactively, support improvements, and navigate situations requiring timely decision-making. If you're looking for your next challenge where infrastructure quality directly impacts real-world outcomes, this role could be a great fit!

Note: This role includes participation in an on-call rotation.

What You'll Do

  • Own and operate production Kubernetes clusters (Amazon EKS) including upgrades, scaling, security hardening, and cluster lifecycle management;
  • Design, implement, and maintain infrastructure-as-code using Terraform; contribute to shared module libraries and enforce IaC standards across the team;
  • Manage and evolve Helm chart definitions and ArgoCD GitOps workflows for multi-region SaaS deployments;
  • Operate and maintain observability infrastructure including Grafana, alerts, dashboards, and log pipelines; act to eliminate noise and surface signal;
  • Contribute to pipeline reliability: identify flaky stages, reduce build times, and improve developer experience across CI/CD pipelines;
  • Remediate security vulnerabilities (CVEs) in container images and infrastructure components; participate in compliance work including FedRAMP support activities;
  • Develop and maintain runbooks, change management procedures, and operational documentation;
  • Ensure alignment with internal policies and frameworks such as ISO 27001, SOC 2, and NIST;
  • Contribute to AI-assisted tooling and automation (e.g., Claude-based Terraform agents, automated triage tools) as part of the team's operational efficiency roadmap;
  • Participate in on-call incident response rotation; lead or support incident command during active production incidents including root cause analysis and post-incident review.


What We're Looking For

  • 5+ years of industry experience with a trajectory that demonstrates growing depth in cloud infrastructure and SRE practices;
  • Managed production Kubernetes environments at scale: not just deployed workloads, but owned cluster health, upgrades, and failure modes;
  • Responded to production incidents in high-stakes environments where downtime has real consequences;
  • Written and maintained Terraform at the module level, not just as a consumer, with an understanding of state, dependencies, and the operational burden of drift;
  • Operated in an environment that uses GitOps, with a good understanding of Helm chart organization, ArgoCD app-of-apps patterns, or equivalent;
  • Balanced reactive operational work with proactive roadmap delivery; knows how to protect time for improvements while keeping production stable;
  • Worked with observability as a first-class discipline: built meaningful dashboards, eliminated alert fatigue, and used metrics to make operational decisions;
  • Contributed to security hardening in a regulated or compliance-adjacent environment: FedRAMP, SOC 2, or similar frameworks are a strong asset.


Compensation & Benefits

The compensation range is for the primary location for which the job is posted. Actual compensation may vary depending on location and job-related factors such as qualifications, experience, knowledge, and skills. If you are applying for this role outside of the primary location and you are selected for an interview, the Talent Acquisition Partner can share more information with you. If the compensation structure for the role includes an incentive component (e.g., most Sales roles), the range below represents total target compensation (TTC) (base salary + variable).

$110,000 - $160,000 CAD a year

Position Type: Current Vacancy

Magnet is proud to offer benefits such as:

  • Generous time off policies
  • Competitive compensation
  • Volunteer opportunities
  • Reward and recognition programs
  • Employee committees & resource groups
  • Healthcare and retirement benefits

Indicators of Success

We're looking for someone who checks off most, but not necessarily all, of the boxes listed in "What We're Looking For". It's more important to us to find candidates who can display indicators of success through the skills they have developed and the experiences they have been a part of than to find folks who have "been there, done that". We want to be part of your development journey, and we'll learn as much from you as you learn from us.

How We Work

At Magnet Forensics, we take a hybrid-flexible approach to support your productivity and work-life balance. If you're within a comfortable travel distance to one of our offices, you'll occasionally join us in person. How often you'll come in depends on your department and team needs, typically ranging from weekly to monthly. These in-person moments help us build stronger connections, spark new ideas, and celebrate our successes together. Most days, you can choose what works best for you, while staying in tune with your team's goals.

We're excited to welcome you to our team and look forward to achieving great things together - both in the office and wherever you work best!
