Site Reliability Engineer (SRE)

System One Holdings, LLC

$100K - $130K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Minimum four years of experience with AWS cloud platforms focused on reliability and scalability.
  • Bachelor's degree required, or four additional years of relevant experience in lieu of degree.
  • Hands-on experience with CI/CD and Infrastructure as Code tools like Terraform and Ansible.
  • Strong scripting abilities, especially in Python; familiarity with PowerShell and Bash is also valuable.
  • Experience with Windows and Linux environments is essential.
  • Understanding of networking concepts and cloud incident response is required.
  • Experience in Agile methodologies such as Scrum or Kanban.

Responsibilities

  • Establish and enhance SRE practices within an Agile Scrum framework.
  • Conduct system design reviews to identify potential reliability and scalability issues.
  • Improve operational readiness through contributions to code and deployment reviews.
  • Engage in incident management, focusing on root-cause analysis and post-incident improvements.
  • Automate processes to enhance reliability and minimize manual labor.
  • Support AWS cloud operations, including monitoring, logging, and security.
  • Collaborate with various teams to foster best practices in DevOps and reliability.

Benefits

  • Health and welfare benefits including medical, dental, and vision coverage.
  • Access to spending accounts and life insurance options.
  • Participation in a 401(k) retirement plan.

Full Job Description
Site Reliability Engineer (SRE)

Remote
No sponsorship available. Must be able to obtain a Public Trust clearance.

What You Will Do

We are seeking a Site Reliability Engineer (SRE) to support the SBA Disaster Lending Platform modernization effort in a remote capacity. This role will help establish and mature SRE practices across AWS cloud environments, with a focus on reliability, automation, scalability, observability, incident response, and operational excellence.

In this role, you will work closely with engineering, DevOps, cloud, security, and product teams to improve system resilience, reduce downtime, strengthen deployment practices, and support reliable cloud-based application delivery in an Agile environment.

Responsibilities include:
• Help establish and mature SRE practices within an Agile Scrum delivery environment.
• Support system design reviews to identify reliability risks, failure points, scalability concerns, and opportunities for automation.
• Improve operational readiness by contributing to code reviews, deployment reviews, monitoring practices, and reliability-focused engineering standards.
• Support incident management activities, including troubleshooting, root-cause analysis, mitigation planning, and post-incident improvements.
• Build and maintain automation to improve reliability, reduce manual effort, and support self-healing cloud infrastructure.
• Support AWS cloud platform operations across monitoring, logging, security, scalability, and availability.
• Work with CI/CD and Infrastructure as Code tools to support repeatable, secure, and reliable deployments.
• Create and maintain clear technical documentation for systems, processes, runbooks, and operational procedures.
• Collaborate with cross-functional teams and stakeholders to promote DevOps, automation, and reliability best practices.

What You Will Need
• Minimum of four years of experience supporting the reliability, scalability, security, and operational excellence of AWS cloud platforms.
• Bachelor's degree required, or four additional years of relevant experience in lieu of a degree.
• Hands-on experience with CI/CD and Infrastructure as Code tools such as Terraform, Ansible Automation Platform, GitLab, Artifactory, and Packer.
• Strong scripting and automation experience using Python, PowerShell, and Bash; Python experience is preferred.
• Experience supporting Windows and Linux environments.
• Strong understanding of networking concepts, cloud troubleshooting, monitoring, logging, and incident response.
• Experience designing, deploying, or supporting cloud-based systems with a focus on reliability, scalability, security, and performance.
• Knowledge of source control best practices.
• Experience working in Agile delivery environments, including Scrum, Kanban, SAFe, or similar methodologies.
• Strong analytical, troubleshooting, and problem-solving skills, including the ability to resolve complex technical issues in high-pressure situations.
• Strong communication skills and the ability to collaborate effectively with technical teams, stakeholders, and cross-functional partners.
• Must be authorized to work in the United States without sponsorship and able to obtain a Public Trust clearance.

Nice to Have
• Current or prior government contracting experience.
• Red Hat, CompTIA, AWS, or related technical certifications.
• Experience mentoring technical teams or helping promote DevOps/SRE practices across engineering groups.

System One not only serves as a valued partner for our clients but also offers eligible employees health and welfare benefits coverage options, including medical, dental, vision, spending accounts, life insurance, and voluntary plans, as well as participation in a 401(k) plan.

Ref: #851-Rockville-S1