EPAM Systems

Site Reliability Engineer (SRE)

EPAM Systems
$120K - $150K *
US-Anywhere
Remote in Canada
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science or related field or equivalent experience
  • 5+ years of experience in DevOps or SRE teams
  • Proven track record in supporting production infrastructure
  • Strong knowledge of CI/CD principles and pipelines
  • Hands-on experience with Dynatrace and Splunk
  • Experience with a major cloud provider (AWS, Azure, or GCP)
  • Familiarity with operating high-availability and fault-tolerant systems in production

Responsibilities

  • Implement and promote DevOps and SRE best practices
  • Drive technology roadmap discussions for the SRE team
  • Define and maintain SLIs and SLOs, along with key metrics
  • Design and manage monitoring and observability solutions
  • Conduct performance assessments and recommend enhancements
  • Collaborate with application teams on performance and availability SLAs
  • Participate in on-call rotation for production events and outages
  • Lead troubleshooting, incident management, and root cause analysis

Benefits

  • Opportunities for career advancement in a growing company
  • Collaborative work environment with cross-functional teams
  • Access to resources for professional development and training
  • Flexible work arrangements to promote work-life balance
  • Engagement in cutting-edge technology projects

Full Job Description

A large Wealth Management firm operating under a Broker-Dealer model is seeking an experienced Site Reliability Engineer to support feature development on its newly built Trading Platform. The platform has been in development for two years and is currently in a stabilization phase, with a production launch targeted in four months.

Req# [redacted]

Responsibilities

  • Implement and champion DevOps and SRE best practices across the organization
  • Drive technology roadmap discussions for the SRE team
  • Define, craft, and maintain SLIs and SLOs, along with key metrics including MTTR, Lead Time for Change, Deployment Frequency, and Change Failure Rate
  • Design, develop, and manage monitoring, alerting, and observability solutions using Dynatrace, Splunk, and Grafana
  • Conduct performance assessments, identify bottlenecks, and recommend enhancements to improve system performance
  • Partner with application teams to enforce performance and availability SLAs
  • Collaborate with product owners to manage error budgets, prioritize toil backlogs, and validate against team, application, and incident metrics
  • Participate in an on-call rotation to respond to production events and outages
  • Continuously improve CI/CD pipelines and deployment processes
  • Lead troubleshooting efforts, incident management, and root cause analysis
  • Identify and build automated processes wherever possible
  • Implement cybersecurity measures through ongoing vulnerability assessments and risk management
  • Provide periodic progress reports to management and stakeholders
  • Partner with application teams to support and ease their adoption of the platform
  • Facilitate clear coordination and communication within the team and with customers
  • Analyze existing systems and develop plans for enhancements and improvements

Requirements

  • Bachelor's degree in Computer Science or a related field, and/or equivalent work experience
  • 5+ years of experience working within DevOps or SRE teams
  • Proven experience supporting production infrastructure
  • Strong knowledge of CI/CD principles and pipelines
  • Solid understanding of observability concepts, including monitoring, logging, and tracing
  • Hands-on experience with Dynatrace and Splunk
  • Experience with at least one major cloud provider (AWS, Azure, or GCP)
  • Demonstrated experience operating high-availability, fault-tolerant, scalable, and distributed systems in production

About EPAM Systems

EPAM Systems, Inc. is a leading global provider of digital platform engineering and development services. The company has a strong presence in North America, Europe, and Asia, and serves clients in a variety of industries, including financial services, healthcare, and retail. EPAM's services include software engineering, product development, and digital platform engineering, and the company has a reputation for delivering high-quality solutions that help its clients achieve their business goals. EPAM has been recognized as a leader in the digital services industry by a number of independent research firms, and the company has won numerous awards for its work.
Size: 58,824 employees
Market Cap: $18.2 billion
Industry:
Net Income: $327.1 million
Founded: 1993
5 Year Trend: +26.5%
Revenue: $2.6 billion
NASDAQ
