Site Reliability Engineer (SRE)

Ova Technologies

$120K — $150K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science or related field
  • 4-8 years of experience in Site Reliability Engineering or a similar role
  • Hands-on experience with Kubernetes and container orchestration
  • Familiarity with monitoring and observability solutions
  • Relevant certifications (AWS, Google, Azure, CKA) preferred

Responsibilities

  • Design and maintain scalable infrastructure solutions
  • Monitor application performance for optimal health
  • Develop automation tools for various operational tasks
  • Define and manage SLIs, SLOs, and SLAs
  • Perform root cause analysis for production incidents
  • Collaborate with development teams to enhance reliability
  • Build observability solutions including monitoring and alerting

Benefits

  • Remote or hybrid work options available
  • Opportunity for continuous learning and improvement
  • Hands-on work with cloud and automation technologies
  • Potential for professional development through certifications
  • Exposure to diverse projects across industries like FinTech and E-commerce

Full Job Description

Job Title:

Site Reliability Engineer (SRE)

Job Summary:

We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of mission-critical applications and infrastructure. The ideal candidate will combine software engineering and operations expertise to build automated solutions, improve system resilience, and minimize service disruptions. The SRE will work closely with development, DevOps, cloud, and support teams to enhance system stability and operational excellence.

Key Responsibilities:
  • Design, implement, and maintain highly available and scalable infrastructure solutions.
  • Monitor application and infrastructure performance to ensure optimal system health.
  • Develop automation tools to streamline deployment, monitoring, incident response, and operational tasks.
  • Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
  • Perform root cause analysis (RCA) for production incidents and implement preventive measures.
  • Collaborate with development teams to improve application reliability and performance.
  • Manage capacity planning and infrastructure scaling strategies.
  • Build and maintain observability solutions including monitoring, logging, and alerting systems.
  • Participate in incident management, on-call rotations, and disaster recovery planning.
  • Implement security, compliance, and operational best practices.
  • Drive continuous improvement initiatives to reduce operational overhead through automation.

Required Skills:
  • Strong understanding of Linux/Unix systems administration.
  • Expertise in monitoring, alerting, and observability practices.
  • Experience with cloud platforms and distributed systems.
  • Strong troubleshooting and performance optimization skills.
  • Knowledge of networking, security, and system architecture.
  • Excellent problem-solving and communication abilities.

Technical Skills:
  • Operating Systems: Linux, Unix, Windows Server
  • Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP)
  • Containerization: Docker, Kubernetes, OpenShift
  • Infrastructure as Code (IaC): Terraform, CloudFormation, Ansible
  • Monitoring Tools: Prometheus, Grafana, Datadog, New Relic, Dynatrace
  • Logging Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
  • CI/CD Tools: Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps
  • Programming/Scripting: Python, Go, Bash, PowerShell
  • Databases: PostgreSQL, MySQL, MongoDB, Redis
  • Version Control: Git, GitHub, GitLab, Bitbucket

Qualifications:
  • Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
  • Relevant certifications are preferred:
    • AWS Certified DevOps Engineer - Professional
    • Google Cloud Professional Cloud DevOps Engineer
    • Microsoft Certified: DevOps Engineer Expert
    • Certified Kubernetes Administrator (CKA)

Experience:
  • 4-8 years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or Infrastructure Operations.
  • Hands-on experience supporting production environments and cloud-native applications.
  • Experience with Kubernetes, container orchestration, and automation frameworks.
  • Experience implementing monitoring and observability solutions.

Preferred Qualifications:
  • Experience managing large-scale distributed systems and microservices architectures.
  • Knowledge of chaos engineering and reliability testing practices.
  • Experience with performance tuning and capacity planning.
  • Familiarity with security best practices and compliance standards.
  • Experience with serverless and event-driven architectures.

Preferred Qualities:
  • Strong ownership mindset and accountability.
  • Ability to remain calm and effective during critical incidents.
  • Excellent analytical and debugging skills.
  • Strong collaboration and cross-functional communication abilities.
  • Passion for automation, reliability, and continuous improvement.

Employment Type:

Full-Time

Location:

Remote / Hybrid / On-site

Nice to Have:
  • Experience with SaaS, FinTech, Healthcare, E-commerce, or HR Tech platforms.
  • Knowledge of AI-driven observability and incident management tools.
  • Experience implementing self-healing infrastructure and automated remediation.
  • Familiarity with cost optimization strategies in cloud environments.
  • Experience mentoring engineers and driving reliability best practices across teams.
