Site Reliability Engineer (SRE)

HTP Solutions

$80K — $130K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 3-7+ years as a Site Reliability Engineer, DevOps Engineer, or Cloud Engineer
  • Hands-on experience with Microsoft Azure services
  • Proficiency in Terraform for infrastructure management
  • Experience with CI/CD tools like Azure DevOps or Jenkins
  • Strong understanding of Linux/Windows systems administration
  • Familiarity with containerization and Kubernetes (AKS)
  • Experience with monitoring tools such as Prometheus or Grafana
  • Proficiency in scripting languages such as PowerShell, Bash, and Python

Responsibilities

  • Design, implement, and maintain secure cloud infrastructure on Azure
  • Automate infrastructure provisioning and configuration using Terraform
  • Develop and enforce SRE principles including SLAs, SLOs, and error budgets
  • Build observability systems with Azure Monitor and Prometheus
  • Respond to production incidents and drive incident resolution
  • Implement CI/CD pipelines with automated testing and deployment
  • Collaborate with DevOps and IT operations teams to enhance reliability
  • Optimize cost and performance of cloud infrastructure

Benefits

  • Health, dental, and vision insurance
  • 401(k) plan with company match
  • Flexible working hours and remote work opportunities
  • Generous paid time off and holidays
  • Professional development and certification reimbursement

Full Job Description

We are seeking a highly skilled Site Reliability Engineer (SRE) with expertise in Microsoft Azure and Infrastructure as Code (IaC) using Terraform. As an SRE, you will be responsible for maintaining high availability, scalability, and performance of critical systems while driving infrastructure automation and reliability best practices.

Key Responsibilities:
• Design, implement, and maintain scalable, resilient, and secure cloud infrastructure on Azure.
• Automate provisioning, configuration, and monitoring of infrastructure using Terraform.
• Develop and enforce SRE principles like SLAs, SLOs, SLIs, and error budgets.
• Build observability and monitoring systems using tools such as Azure Monitor, Log Analytics, Prometheus, and Grafana.
• Respond to production incidents, conduct root cause analysis, and drive incident resolution.
• Implement CI/CD pipelines and integrate automated testing and deployment.
• Collaborate with software engineering, DevOps, and IT operations teams to enhance system performance and reliability.
• Optimize cost, performance, and security of cloud infrastructure.
• Maintain runbooks and documentation, and conduct regular disaster recovery and failover exercises.

Required Qualifications:
• 3-7+ years of experience as a Site Reliability Engineer, DevOps Engineer, or Cloud Engineer.
• Strong hands-on experience with Microsoft Azure services (VMs, AKS, App Services, Azure SQL, Storage, etc.).
• Proficiency with Terraform for infrastructure provisioning and configuration management.
• Experience with CI/CD tools such as Azure DevOps, GitHub Actions, Jenkins, or similar.
• Strong understanding of Linux/Windows systems administration.
• Familiarity with containerization and Kubernetes (especially AKS).
• Experience with monitoring and logging tools (e.g., Azure Monitor, App Insights, ELK, Prometheus, Grafana).
• Proficiency in scripting languages (e.g., PowerShell, Bash, Python).
• Strong troubleshooting and incident management skills.

Preferred Qualifications:
• Certifications: Microsoft Certified: Azure Administrator Associate or Microsoft Certified: DevOps Engineer Expert.
• Experience with GitOps, Helm, or Service Mesh (Istio, Linkerd).
• Knowledge of security best practices and compliance in cloud environments.
• Exposure to Agile/DevOps culture and practices.

Soft Skills:
• Strong communication and collaboration skills.
• Ability to thrive in a fast-paced, dynamic environment.
• Problem-solving mindset with attention to detail.
• Team player with a proactive attitude.
