Ellucian

Senior Site Reliability Engineer

$120K - $150K *
US-Anywhere
Remote in Virginia, US
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years in Site Reliability Engineering, DevOps, or similar roles
  • Hands-on expertise with DataDog for APM, logs, metrics, dashboards, and alerting (mandatory)
  • Experience with cloud platforms like AWS, Azure, or GCP
  • Proficient in CI/CD and Infrastructure as Code tools such as Terraform
  • Strong troubleshooting and root cause analysis skills in distributed systems
  • Familiar with containers and orchestration technologies like Docker and Kubernetes
  • Scripting or programming experience in Python, Bash, or similar languages

Responsibilities

  • Own and enhance system reliability, availability, and performance in production environments
  • Design and manage monitoring and observability using DataDog
  • Lead incident response and post-incident reviews
  • Conduct root cause analysis to implement long-term fixes
  • Collaborate with teams to create scalable and resilient infrastructure
  • Automate operations to improve efficiency and minimize risks
  • Analyze and optimize cloud-related costs

Benefits

  • Comprehensive health coverage including medical, dental, and vision
  • Flexible time off policy
  • Thrive Flex Lifestyle Account for health, financial, or learning contributions
  • 401k with matching and financial planning assistance via BrightPlan
  • Parental leave offered
  • 5 charitable days per year
  • Telemedicine access
  • Wellness programs like Headspace Care for mental health and Wellbeats for fitness
  • Caregiver support through RethinkCare and Wellthy
  • Diversity and inclusion programs with access to employee resource groups
  • Employee referral bonuses
  • Education Assistance Program and professional development opportunities
Full Job Description
About the Opportunity

We are seeking a Senior Site Reliability Engineer (SRE) to ensure the reliability, performance, and cost-efficiency of our production systems. This role requires deep expertise in DataDog for observability and will focus on DevOps practices, incident management, root cause analysis, and cost optimization across cloud infrastructure and services.

Where You Will Make an Impact
  • Own and improve system reliability, availability, and performance for production environments
  • Design, implement, and manage monitoring, alerting, and observability using DataDog (required)
  • Lead incident response efforts, including troubleshooting, mitigation, and post-incident reviews
  • Perform detailed root cause analysis (RCA) and drive permanent resolutions
  • Partner with engineering and DevOps teams to build scalable, resilient infrastructure
  • Automate operational processes to improve efficiency and reduce risk
  • Analyze and optimize infrastructure and application costs
  • Define and manage SLIs/SLOs to meet reliability targets
  • Continuously improve deployment, monitoring, and operational practices

What You Will Bring
  • 5+ years of experience in Site Reliability Engineering, DevOps, or similar roles
  • Mandatory: Strong, hands-on expertise with DataDog (APM, logs, metrics, dashboards, alerting)
  • Experience with cloud platforms (AWS, Azure, or GCP)
  • Proficiency in DevOps practices and tools (CI/CD, Infrastructure as Code such as Terraform)
  • Strong troubleshooting skills and experience conducting root cause analysis in distributed systems
  • Experience with containers and orchestration (Docker, Kubernetes)
  • Scripting or programming experience (Python, Bash, or similar)
  • Proven ability to analyze and optimize cloud costs

Preferred Qualifications
  • Experience with cost management tools (e.g., AWS Cost Explorer, Azure Cost Management)
  • Familiarity with cloud security and compliance best practices
  • Experience supporting high-availability, customer-facing systems
  • Strong collaboration and communication skills

What Success Looks Like
  • Improved system reliability and reduced incident frequency
  • Faster incident detection and resolution (MTTR)
  • Effective, actionable observability driven by DataDog
  • Measurable cost savings and optimized infrastructure usage

Benefits
  • Comprehensive health coverage: medical, dental, and vision
  • Flexible time off
  • Thrive Flex Lifestyle Account (LSA) that allows you to contribute towards your health, financial or learning interests
  • 401k with match and BrightPlan to help you save for the future
  • Parental Leave
  • 5 charitable days to support the community that supports us
  • Telemedicine
  • Wellness
    • Headspace Care (mental health)
    • Wellbeats (virtual fitness classes)
  • RethinkCare & Wellthy (caregiver support)
  • Diversity and inclusion programs which provide access to internal employee resource groups
  • Employee referral bonuses to encourage the addition of great new people to the team
  • We foster a learning culture with:
    • Education Assistance Program
    • Professional development opportunities


About Ellucian

Ellucian is a provider of software and services to higher education institutions. The company was founded in 1968 and offers a range of solutions, including student information systems, financial management systems, and analytics. Ellucian's technology is designed to help colleges and universities improve their operations, enhance the student experience, and achieve their strategic goals. The company has a global presence and serves more than 2,700 institutions in over 50 countries.
Size: 3,000 employees
Founded: 1968
