Site Reliability Engineer II
The SRE II sits at the intersection of software engineering and platform operations. You will own the reliability, scalability, and operational hygiene of Kastle's core infrastructure - engineering away toil, hardening deployment pipelines, and partnering with product engineering teams to make new services production-ready from day one.
This is a mid-level individual contributor role. You are expected to execute technical work independently, drive reliability improvements end-to-end, and participate meaningfully in architecture discussions. You will carry on-call responsibilities as part of a shared rotation with a well-defined escalation model and a strong blameless post-incident review culture.
The team is in the middle of a meaningful platform evolution: formalizing multi-tier release pipelines (Dev • QA • Integration • UAT • Prod) with ArgoCD-based approval gates, building out SLI/SLO frameworks, and migrating toward full GitOps. You will be a hands-on contributor to all of it.
Key Responsibilities:
Release Engineering & GitOps
- Own and evolve the multi-stage deployment pipeline using ArgoCD, including approval gates, promotion policies, and rollback mechanisms.
- Maintain trunk-based branching discipline and enforce release governance standards across the engineering organization.
- Manage feature flag lifecycle - from creation and gradual rollout to deprecation - in coordination with product and QA teams.
- Build and maintain CI/CD pipelines that enable safe, frequent, and auditable deployments.
Infrastructure as Code & Cloud Operations
- Provision and manage Azure infrastructure using Terraform or OpenTofu, maintaining drift-free state aligned with GitOps principles.
- Own Kubernetes cluster operations including workload scheduling, resource optimization, RBAC, network policy, and cost governance.
- Identify and act on infrastructure cost optimization opportunities (compute rightsizing, storage tier selection, idle resource elimination).
- Support Crossplane or similar operator patterns for Kubernetes-native infrastructure management where applicable.
Reliability & Observability
- Define, instrument, and enforce SLIs and SLOs in partnership with product engineering teams.
- Build and maintain observability infrastructure - metrics, logs, and distributed traces - using Prometheus, Grafana, OpenTelemetry, or equivalent tooling.
- Conduct proactive capacity planning and performance tuning across multi-tenant, distributed environments.
- Establish and maintain runbooks, dashboards, and alerting policies that reduce cognitive overhead during incidents.
Incident Management
- Participate in a shared on-call rotation covering core platform and infrastructure services; on-call load is balanced across the team with structured handoff practices.
- Lead mitigation of live production incidents with a focus on minimizing MTTR and clear stakeholder communication under pressure.
- Facilitate blameless post-incident reviews and drive preventative engineering to closure - not just documentation.
Engineering Partnership
- Embed with product engineering teams during design and architecture phases to establish reliability, scalability, and security requirements before code is written.
- Maintain clear, comprehensive documentation for infrastructure architecture, operational procedures, and onboarding guides.
- Push back constructively when proposed designs compromise reliability or operability, proposing alternatives rather than just raising concerns.
Required Qualifications
- Experience: 4-6 years in an SRE, Platform Engineering, or Infrastructure Engineering role, with demonstrated ownership of production systems.
- Cloud - Azure: Hands-on experience managing production infrastructure in Azure: AKS, Azure Container Registry, Azure Monitor, Cosmos DB, Key Vault, Azure Front Door, or equivalent services. Candidates with AWS or GCP backgrounds are considered if they show a clear willingness to operate in Azure.
- Kubernetes: Deep operational experience with Kubernetes in production: resource management, network policies, RBAC, HPA/VPA, persistent volumes, and debugging live workload issues.
- GitOps & Release Tooling: Experience with ArgoCD, Flux, or equivalent GitOps deployment tools. Familiarity with multi-stage progressive delivery and approval gate patterns is a strong plus.
- Infrastructure as Code: Proven track record with Terraform, OpenTofu, or Pulumi in a production GitOps context - not just writing HCL, but maintaining drift-free state and managing state backends safely.
- Observability: Hands-on configuration of Prometheus, Grafana, OpenTelemetry, and/or ELK/OpenSearch. Ability to go from symptom to instrumentation to dashboard without hand-holding.
- Programming & Scripting: Proficiency in Python or Go for automation and tooling; strong Bash scripting. Ability to read and reason about application code when debugging production issues. Proficiency in C# and SQL for reviewing deliverables and participating in triage.
- Linux & Networking: Solid understanding of Linux internals, TCP/IP, DNS, TLS, and HTTP semantics. Comfortable debugging at the network and OS layer.
Preferred Qualifications
- Experience with Crossplane or other Kubernetes-native infrastructure operators.
- Familiarity with feature flag platforms (LaunchDarkly, Flagsmith, or similar) and gradual rollout strategies.
- Background in IoT, physical security, access control, or other latency-sensitive, event-driven domains.
- Comfort with async collaboration across distributed time zones (US + India team structure).
- Experience with AI-assisted development tooling and an appetite to incorporate it into engineering workflows.
- Knowledge of CMMC 2.0, SOC 2, or FedRAMP compliance postures as they apply to infrastructure and access control.