Function
Cloud & Data Engineering
Job description
Meet Our Team
Join our Site Reliability Engineering (SRE) Operations team, where reliability, automation, and operational excellence are at the heart of everything we do. We ensure the stability, availability, and performance of enterprise applications running across modern cloud-native and hybrid platforms, including Kubernetes, APIs, cloud services, databases, Kafka, and API gateways.
As an L1 SRE Operations Engineer, you'll be the first line of defense, monitoring production environments, responding to alerts, executing operational runbooks, and partnering with senior engineers to maintain highly available and resilient platforms. This is an excellent opportunity for professionals looking to build hands-on experience in cloud operations, DevOps, and Site Reliability Engineering.
What You'll Be Doing
- Monitor enterprise applications, infrastructure, dashboards, logs, and alerts across cloud and on-premises environments.
- Perform first-level incident triage by analyzing alerts, collecting logs and metrics, and determining whether issues are application-related or platform-related.
- Execute standardized operational runbooks for incident resolution, deployments, maintenance activities, and routine operational tasks.
- Monitor and support Kubernetes environments by validating pod health, deployments, namespaces, logs, and service endpoints.
- Troubleshoot infrastructure and application issues using Linux utilities, networking tools, and monitoring platforms.
- Escalate complex incidents to L2/L3 engineering teams with complete diagnostic information to accelerate resolution.
- Support API gateways, web application firewalls (WAF), Kafka platforms, databases, and cloud infrastructure across AWS, Azure, and GCP.
- Maintain accurate incident documentation, operational records, and knowledge base updates while identifying opportunities to improve runbooks and automation.
- Collaborate with development, platform engineering, and infrastructure teams during incident response and production support.
- Assist with onboarding new applications into the operational support framework while ensuring monitoring, alerting, and operational readiness.
- Contribute to continuous improvement by identifying repetitive manual activities suitable for automation.
- Provide timely and professional communication to stakeholders during production incidents and operational events.
What You'll Bring to the Team
Required Qualifications
- 2–5 years of experience in IT Operations, NOC, SRE, DevOps, or Infrastructure Support.
- Working knowledge of Kubernetes administration and day-to-day cluster operations.
- Good understanding of Linux administration and command-line troubleshooting.
- Familiarity with cloud platforms such as AWS, Microsoft Azure, or Google Cloud Platform.
- Experience with observability and monitoring tools such as Prometheus, Grafana, Splunk, ELK Stack, Datadog, Argos, or AIOps platforms.
- Ability to execute operational runbooks and follow structured incident response procedures.
- Experience using the Kubernetes CLI (kubectl) to verify pod health, deployments, namespaces, and application logs.
- Basic scripting knowledge in Python, Bash, or PowerShell for operational automation.
- Understanding of networking fundamentals including DNS, HTTP/HTTPS, TCP/IP, firewalls, WAF, proxies, connectivity troubleshooting, and diagnostic tools such as ping, curl, netstat, and traceroute.
- Strong analytical and troubleshooting skills using structured problem-solving techniques such as 5 Whys and Fishbone Analysis.
- Excellent documentation, communication, and stakeholder management skills.
Preferred Qualifications
- Experience working with API gateways such as Apigee or Gloo API Gateway.
- Basic knowledge of SQL and NoSQL databases, including the ability to validate database connectivity.
- Familiarity with messaging platforms such as Apache Kafka.
- Experience with ITSM and incident management tools including ServiceNow, Jira, xMatters, or similar platforms.
- Exposure to automation and self-service operations initiatives.
- Experience using AI-assisted operational tools or chatbots for runbook search, log summarization, and incident analysis.
- Understanding of cloud-native application architectures, CI/CD pipelines, and production support best practices.
- Passion for continuous learning, operational excellence, and improving system reliability through automation.