Job Title: Site Reliability Engineer (SRE)
Job Summary: We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of mission-critical applications and infrastructure. The ideal candidate will combine software engineering and operations expertise to build automated solutions, improve system resilience, and minimize service disruptions. The SRE will work closely with development, DevOps, cloud, and support teams to enhance system stability and operational excellence.
Key Responsibilities:
- Design, implement, and maintain highly available and scalable infrastructure solutions.
- Monitor application and infrastructure performance to ensure optimal system health.
- Develop automation tools to streamline deployment, monitoring, incident response, and operational tasks.
- Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
- Perform root cause analysis (RCA) for production incidents and implement preventive measures.
- Collaborate with development teams to improve application reliability and performance.
- Manage capacity planning and infrastructure scaling strategies.
- Build and maintain observability solutions including monitoring, logging, and alerting systems.
- Participate in incident management, on-call rotations, and disaster recovery planning.
- Implement security, compliance, and operational best practices.
- Drive continuous improvement initiatives to reduce operational overhead through automation.
Required Skills:
- Strong understanding of Linux/Unix systems administration.
- Expertise in monitoring, alerting, and observability practices.
- Experience with cloud platforms and distributed systems.
- Strong troubleshooting and performance optimization skills.
- Knowledge of networking, security, and system architecture.
- Excellent problem-solving and communication abilities.
Technical Skills:
- Operating Systems: Linux, Unix, Windows Server
- Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP)
- Containerization: Docker, Kubernetes, OpenShift
- Infrastructure as Code (IaC): Terraform, CloudFormation, Ansible
- Monitoring Tools: Prometheus, Grafana, Datadog, New Relic, Dynatrace
- Logging Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
- CI/CD Tools: Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps
- Programming/Scripting: Python, Go, Bash, PowerShell
- Databases: PostgreSQL, MySQL, MongoDB, Redis
- Version Control: Git, GitHub, GitLab, Bitbucket
Qualifications:
- Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
- Relevant certifications preferred, such as:
  - AWS Certified DevOps Engineer - Professional
  - Google Professional Cloud DevOps Engineer
  - Microsoft Certified: DevOps Engineer Expert
  - Certified Kubernetes Administrator (CKA)
Experience:
- 4-8 years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or Infrastructure Operations.
- Hands-on experience supporting production environments and cloud-native applications.
- Experience with Kubernetes, container orchestration, and automation frameworks.
- Experience implementing monitoring and observability solutions.
Preferred Qualifications:
- Experience managing large-scale distributed systems and microservices architectures.
- Knowledge of chaos engineering and reliability testing practices.
- Experience with performance tuning and capacity planning.
- Familiarity with security best practices and compliance standards.
- Experience with serverless and event-driven architectures.
Preferred Qualities:
- Strong ownership mindset and accountability.
- Ability to remain calm and effective during critical incidents.
- Excellent analytical and debugging skills.
- Strong collaboration and cross-functional communication abilities.
- Passion for automation, reliability, and continuous improvement.
Employment Type: Full-Time
Location: Remote / Hybrid / On-site
Nice to Have:
- Experience with SaaS, FinTech, Healthcare, E-commerce, or HR Tech platforms.
- Knowledge of AI-driven observability and incident management tools.
- Experience implementing self-healing infrastructure and automated remediation.
- Familiarity with cost optimization strategies in cloud environments.
- Experience mentoring engineers and driving reliability best practices across teams.