Full Job Description
ABOUT THE ROLE
We are seeking an experienced Observability Engineer to join our Enterprise Kubernetes Platform team at a leading financial services organization.
You'll own the complete observability stack across 50+ production Kubernetes clusters, providing metrics, logging, tracing, and alerting capabilities that ensure exceptional reliability and performance for mission-critical applications.
This role combines deep technical expertise in modern observability tools with emerging AI/ML capabilities to build intelligent monitoring solutions, predictive alerting, and self-healing infrastructure.
WHAT YOU'LL DO
• Design, deploy, and maintain enterprise-scale observability infrastructure including Prometheus, Grafana, Thanos, Loki, and modern collection agents
• Manage observability deployments using GitOps principles and infrastructure as code
• Implement long-term metrics storage solutions with cloud object storage
• Maintain and upgrade observability components across development, QA, UAT, production, and DR environments
• Configure distributed observability architecture spanning multiple datacenters and cloud providers
METRICS & MONITORING
• Design and implement Prometheus monitoring strategies for Kubernetes infrastructure and containerized applications
• Create ServiceMonitors and PodMonitors for automated metrics collection
• Develop rules for intelligent alerting with minimal false positives
• Configure multi-cluster metrics federation and aggregation
• Optimize metrics cardinality, storage efficiency, and query performance
Salary Range: CA$100,000 - CA$120,000 per year
TCS does not use artificial intelligence tools for candidate screening or evaluation. This post is for a current vacancy. The hiring process includes an initial screening, followed by a technical evaluation and managerial discussion.