Applied Research Associates Inc

Senior Site Reliability Engineer

$100K to $130K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 8+ years in Site Reliability Engineering, DevOps, or related fields
  • Expertise in Linux systems administration
  • Proficiency in managing on-prem Kubernetes platforms and related tools
  • Experience with deployment tools like Helm and Kustomize
  • Familiarity with DevOps tools such as GitLab and Jira
  • Scripting skills in Python, Go, or Bash
  • Strong knowledge of reliability engineering principles
  • Must be able to obtain a U.S. security clearance

Responsibilities

  • Collaborate with developers and IT to enhance system design and support
  • Maintain operational standards and support procedures
  • Assess architecture changes for performance and compliance
  • Enhance platform stability and availability
  • Provide advanced troubleshooting for complex platform issues

Benefits

  • Flexible work options: fully remote, hybrid, or onsite
  • Opportunity to work in a collaborative and impact-driven environment
  • Access to advanced tools and technologies in a leading-edge sector
  • Engagement with diverse teams and stakeholders in technology
  • Potential for career advancement with a focus on continuous improvement

Full Job Description

Essential Functions:

  • Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness.
  • Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives.
  • Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs.
  • Drive continuous improvement in platform stability, maintenance, and availability.
  • Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders.


Experience and Skills Required:

  • 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services.
  • Strong experience with Linux systems administration and troubleshooting in enterprise environments.
  • Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins.
  • Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling.
  • Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, and Confluence.
  • Experience with GitOps tools such as FluxCD or ArgoCD.
  • Proficiency scripting with at least one of Python, Go, or Bash.
  • Strong experience designing, maintaining, and maturing observability tooling, including monitoring, dashboards, logging, and tracing, and supporting SLOs.
  • Strong understanding of reliability engineering concepts:
    • Service health indicators
    • High availability design, failure reduction, and testing
    • Operational readiness practices, including developing documentation, runbooks, and architectural descriptions
    • Incident response, root cause analysis, and remediation/recovery
  • Ability to obtain a security clearance, which requires U.S. citizenship.


Preferred:

  • Experience with multiple Linux distributions including Ubuntu.
  • Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, Canonical Kubernetes.
  • Experience with cloud platforms such as AWS and Azure.
  • Experience with infrastructure automation and configuration management.
  • Experience managing AI tooling on Kubernetes, such as MCP servers, LLM platforms (vLLM, Ollama), and Kubeflow.
  • Experience with security and compliance considerations in regulated environments.
  • DoD experience.
  • Active or inactive Secret Security Clearance.


Education:

  • Bachelor's degree in Computer Science, Software Engineering, or another IT-related field, or equivalent experience.


REMOTE WORK NOTICE: This position may be performed fully remote, hybrid, or onsite at an ARA office. Preference will be given to candidates located onsite in the Albuquerque area.

About Applied Research Associates Inc

Applied Research Associates, Inc. (ARA) is an employee-owned research and engineering company that provides technical solutions to complex problems in various fields, including defense, homeland security, intelligence, transportation, and energy. The company was founded in 1979 and is headquartered in Albuquerque, NM. ARA has over 1,600 employees and operates from more than 20 locations across the United States and Canada. The company's services include research and development, engineering, testing and evaluation, and consulting.
Size: 1,600 employees
Founded: 1979