Full Job Description
Roles and Responsibilities:
• Design and implement observability-as-code solutions using Terraform to deploy monitoring pipelines, dashboards, and alerting strategies across distributed systems.
• Drive observability improvements leveraging industry-leading tools (Dynatrace, ELK, Splunk, PagerDuty) to achieve real-time performance insights and comprehensive system visibility.
• Instrument applications for end-to-end observability, implementing distributed tracing, metrics collection, and log aggregation across Node.js and .NET microservices and event-driven architectures.
• Troubleshoot complex incidents in production environments, diagnosing root causes across multiple service layers, databases, caches, and APIs under load using SLI/SLO frameworks.
• Investigate and resolve Azure Kubernetes Service (AKS) infrastructure issues, ensuring reliability and scalability of containerized workloads with deep proficiency in Terraform and Azure managed services (SQL MI, Redis, Functions, Event Grid).
• Translate business requirements into observable, resilient systems that meet defined SLIs/SLOs and drive reliability improvements.
• Automate operational tasks to reduce toil and improve system resilience through infrastructure-as-code and CI/CD best practices.
• Lead incident response and remediation for mission-critical systems, conducting blameless postmortems and building resilience through chaos engineering and tabletop exercises.
• Collaborate cross-functionally with development, platform, and business teams to improve service availability, scalability, and operational excellence.
Skills Required:
• Must-have: 8 years of hands-on experience in observability, SRE, or DevOps roles with proven expertise across infrastructure and application-level reliability.
• Deep expertise in observability tooling (Dynatrace, ELK, Splunk, and PagerDuty); demonstrated understanding of observability principles (instrumentation, correlation IDs, SLI/SLO frameworks).
• Advanced proficiency with Azure Kubernetes Service (AKS), Terraform, and Azure
Salary Range: CA$100,000 - CA$120,000 per year
TCS does not use artificial intelligence tools for candidate screening or evaluation. This post is for a current vacancy. The hiring process includes an initial screening, followed by a technical evaluation and managerial discussion.