Site Reliability Engineer II

Kastle Systems • $90K — $120K *

Orlando, FL 32828 • In-Person

Information Technology

Less than 5 years of experience

1 month ago

Job Overview by Ladders

Qualifications

4-6 years in SRE, Platform Engineering, or Infrastructure Engineering with production system ownership.
Hands-on experience managing production infrastructure in Azure (AKS, Azure Monitor, etc.) or, for candidates with an AWS/GCP background, a willingness to operate in Azure.
Deep operational experience with Kubernetes, including resource management and debugging live issues.
Experience with GitOps tools like ArgoCD or Flux and familiarity with progressive delivery patterns.
Proven track record with Infrastructure as Code tools (Terraform, OpenTofu), including maintaining drift-free state.
Hands-on experience configuring observability with Prometheus, Grafana, and OpenTelemetry.
Proficiency in Python or Go for automation, with strong Bash scripting capabilities.

Responsibilities

Own and evolve multi-stage deployment pipelines using ArgoCD.
Maintain release governance standards across the engineering organization.
Manage the lifecycle of feature flags in coordination with product and QA teams.
Provision and manage Azure infrastructure using Terraform while adhering to GitOps principles.
Oversee Kubernetes cluster operations including scheduling and cost governance.
Define and enforce SLIs and SLOs in partnership with product engineering teams.
Participate in on-call rotation and lead incident mitigation efforts.

Benefits

Engagement in meaningful platform evolution and architecture discussions.
Collaborative environment with a focus on blameless post-incident reviews.
Opportunity to directly impact operational reliability and scalability.
Use of modern toolsets and practices like GitOps and Infrastructure as Code.
Collaboration with engineering teams across multiple time zones.

Full Job Description

Site Reliability Engineer II

The SRE II sits at the intersection of software engineering and platform operations. You will own the reliability, scalability, and operational hygiene of Kastle's core infrastructure: engineering away toil, hardening deployment pipelines, and partnering with product engineering teams to make new services production-ready from day one.

This is a mid-level individual contributor role. You are expected to execute technical work independently, drive reliability improvements end-to-end, and participate meaningfully in architecture discussions. You will carry on-call responsibilities as part of a shared rotation with a well-defined escalation model and a strong blameless post-incident review culture.

The team is in the middle of a meaningful platform evolution: formalizing multi-tier release pipelines (Dev → QA → Integration → UAT → Prod) with ArgoCD-based approval gates, building out SLI/SLO frameworks, and migrating toward full GitOps. You will be a hands-on contributor to all of it.

Key Responsibilities:

Release Engineering & GitOps

Own and evolve the multi-stage deployment pipeline using ArgoCD, including approval gates, promotion policies, and rollback mechanisms.
Maintain trunk-based branching discipline and enforce release governance standards across the engineering organization.
Manage the feature flag lifecycle, from creation and gradual rollout through deprecation, in coordination with product and QA teams.
Build and maintain CI/CD pipelines that enable safe, frequent, and auditable deployments.

Infrastructure as Code & Cloud Operations

Provision and manage Azure infrastructure using Terraform or OpenTofu, maintaining drift-free state aligned with GitOps principles.
Own Kubernetes cluster operations including workload scheduling, resource optimization, RBAC, network policy, and cost governance.
Identify and act on infrastructure cost optimization opportunities (compute rightsizing, storage tier selection, idle resource elimination).
Support Crossplane or similar operator patterns for Kubernetes-native infrastructure management where applicable.

Reliability & Observability

Define, instrument, and enforce SLIs and SLOs in partnership with product engineering teams.
Build and maintain observability infrastructure (metrics, logs, and distributed traces) using Prometheus, Grafana, OpenTelemetry, or equivalent tooling.
Conduct proactive capacity planning and performance tuning across multi-tenant, distributed environments.
Establish and maintain runbooks, dashboards, and alerting policies that reduce cognitive overhead during incidents.

Incident Management

Participate in shared on-call rotation covering core platform and infrastructure services; on-call load is balanced across the team with structured handoff practices.
Lead mitigation of live production incidents with a focus on minimizing MTTR and clear stakeholder communication under pressure.
Facilitate blameless post-incident reviews and drive preventive engineering work to closure, not just to documentation.

Engineering Partnership

Embed with product engineering teams during design and architecture phases to establish reliability, scalability, and security requirements before code is written.
Maintain clear, comprehensive documentation for infrastructure architecture, operational procedures, and onboarding guides.
Push back constructively when proposed designs compromise reliability or operability, proposing alternatives rather than just raising concerns.

Required Qualifications:

Experience: 4-6 years in an SRE, Platform Engineering, or Infrastructure Engineering role, with demonstrated ownership of production systems.
Cloud (Azure): Hands-on experience managing production infrastructure in Azure, such as AKS, Azure Container Registry, Azure Monitor, Cosmos DB, Key Vault, Azure Front Door, or equivalent services. Candidates with AWS/GCP backgrounds will be considered if they are clearly willing to operate in Azure.
Kubernetes: Deep operational experience with Kubernetes in production: resource management, network policies, RBAC, HPA/VPA, persistent volumes, and debugging live workload issues.
GitOps & Release Tooling: Experience with ArgoCD, Flux, or equivalent GitOps deployment tools. Familiarity with multi-stage progressive delivery and approval gate patterns is a strong plus.
Infrastructure as Code: Proven track record with Terraform, OpenTofu, or Pulumi in a production GitOps context: not just writing HCL, but maintaining drift-free state and managing state backends safely.
Observability: Hands-on configuration of Prometheus, Grafana, OpenTelemetry, and/or ELK/OpenSearch. Ability to go from symptom to instrumentation to dashboard without hand-holding.
Programming & Scripting: Proficiency in Python or Go for automation and tooling; strong Bash scripting. Ability to read and reason about application code when debugging production issues. Proficiency in C# and SQL for reviewing deliverables and participating in triage.
Linux & Networking: Solid understanding of Linux internals, TCP/IP, DNS, TLS, and HTTP semantics. Comfortable debugging at the network and OS layer.

Preferred Qualifications:

Experience with Crossplane or other Kubernetes-native infrastructure operators.
Familiarity with feature flag platforms (LaunchDarkly, Flagsmith, or similar) and gradual rollout strategies.
Background in IoT, physical security, access control, or other latency-sensitive, event-driven domains.
Comfort with asynchronous collaboration across distributed time zones (the team spans the US and India).
Experience with AI-assisted development tooling and an appetite to incorporate it into engineering workflows.
Knowledge of CMMC 2.0, SOC 2, or FedRAMP compliance postures as they apply to infrastructure and access control.

About Kastle Systems

Kastle Systems is a security services company that provides access control and video surveillance solutions to commercial and residential properties. The company's products and services include keyless entry systems, visitor management systems, and remote video monitoring. Kastle Systems was founded in 1972 and is headquartered in Falls Church, Virginia. The company has a team of experienced professionals dedicated to providing high-quality security solutions to its clients. Kastle Systems has been recognized for its innovative products and services and has won numerous awards for its work.

Size

500 employees

Industry

Aerospace & Defense

* Ladders Estimates