Role Overview:
This role is for a highly skilled Site Reliability Engineer with strong expertise in Kubernetes and Google Cloud Platform (GCP), specifically GKE. The position requires a deep understanding of infrastructure as code (IaC) using Terraform, Helm, and GitHub Actions, alongside proficiency in Python, Ansible, and Node.js. The engineer will maintain and enhance the Prometheus and Grafana observability stack, apply solid Linux systems and networking fundamentals, and contribute to automation and CI/CD pipelines. A significant part of the role involves applying AI/ML concepts and AIOps practices to improve system reliability and incident management.
Key Responsibilities:
- Manage incidents, provide on-call support, and perform production triage to ensure system stability.
- Develop and maintain automation scripts and CI/CD pipelines for efficient software delivery and infrastructure management.
- Implement and manage infrastructure using IaC principles with Terraform, Helm, and GitHub Actions.
- Monitor system performance and health using Prometheus and Grafana observability tools.
- Apply AI/ML concepts and AIOps practices, including model lifecycle management, monitoring, and AI-driven alerting, to enhance operational efficiency.
- Support and operate ML/AI platforms or pipelines (MLOps) and integrate AI-driven automation into monitoring and incident response.
Required Skills:
- Strong experience with Kubernetes and GCP (GKE).
- Strong experience in IaC (Terraform), Helm, and GitHub Actions.
- Proficiency in Python, Ansible, and Node.js.
- Strong experience with Prometheus and Grafana observability stack.
- Solid understanding of Linux systems and networking fundamentals.
- Experience in incident management, on-call support, and production triage.
- Hands-on experience with automation and CI/CD pipelines.
- Strong understanding of AI/ML concepts and AIOps practices (model lifecycle, monitoring, or AI-driven alerting).
Qualifications:
- 10+ years of experience in Site Reliability Engineering or a related field.
- Google Cloud Architect Certification (Preferred).
- Certified Kubernetes Administrator (CKA) (Preferred).
Preferred Skills:
- Experience in Java/J2EE, Spring Boot.
- Experience supporting or operating ML/AI platforms or pipelines (MLOps).
- Exposure to AIOps tools, anomaly detection, or predictive analytics systems.
- Experience with large-scale distributed systems and microservices architecture.
- Experience with GPU-based workloads or ML infrastructure on GCP.
- Knowledge of Kubeflow, Vertex AI, or ML pipelines.
- Experience integrating AI-driven automation into monitoring and incident response.