Site Reliability Engineer (SRE)

Ova Technologies

$120K – $150K *

New York, NY 10025 (In-Person)

Information Technology

Less than 5 years of experience

Today


Job Overview by Ladders

Qualifications

Bachelor's degree in Computer Science or related field
4-8 years in Site Reliability Engineering or similar role
Hands-on experience with Kubernetes and container orchestration
Familiarity with monitoring and observability solutions
Relevant certifications (AWS, Google, Azure, CKA) preferred

Responsibilities

Design and maintain scalable infrastructure solutions
Monitor application performance for optimal health
Develop automation tools for various operational tasks
Define and manage SLIs, SLOs, and SLAs
Perform root cause analysis for production incidents
Collaborate with development teams to enhance reliability
Build observability solutions including monitoring and alerting

Benefits

Remote or hybrid work options available
Opportunity for continuous learning and improvement
Involvement in innovative technologies like cloud and automation
Potential for professional development through certifications
Exposure to diverse projects across industries like FinTech and E-commerce

Full Job Description

Job Title:

Site Reliability Engineer (SRE)

Job Summary:

We are seeking a highly skilled Site Reliability Engineer (SRE) to ensure the reliability, scalability, performance, and availability of mission-critical applications and infrastructure. The ideal candidate will combine software engineering and operations expertise to build automated solutions, improve system resilience, and minimize service disruptions. The SRE will work closely with development, DevOps, cloud, and support teams to enhance system stability and operational excellence.

Key Responsibilities:

Design, implement, and maintain highly available and scalable infrastructure solutions.
Monitor application and infrastructure performance to ensure optimal system health.
Develop automation tools to streamline deployment, monitoring, incident response, and operational tasks.
Define and manage Service Level Indicators (SLIs), Service Level Objectives (SLOs), and Service Level Agreements (SLAs).
Perform root cause analysis (RCA) for production incidents and implement preventive measures.
Collaborate with development teams to improve application reliability and performance.
Manage capacity planning and infrastructure scaling strategies.
Build and maintain observability solutions including monitoring, logging, and alerting systems.
Participate in incident management, on-call rotations, and disaster recovery planning.
Implement security, compliance, and operational best practices.
Drive continuous improvement initiatives to reduce operational overhead through automation.

Required Skills:

Strong understanding of Linux/Unix systems administration.
Expertise in monitoring, alerting, and observability practices.
Experience with cloud platforms and distributed systems.
Strong troubleshooting and performance optimization skills.
Knowledge of networking, security, and system architecture.
Excellent problem-solving and communication abilities.

Technical Skills:

Operating Systems: Linux, Unix, Windows Server
Cloud Platforms: AWS, Azure, Google Cloud Platform (GCP)
Containerization: Docker, Kubernetes, OpenShift
Infrastructure as Code (IaC): Terraform, CloudFormation, Ansible
Monitoring Tools: Prometheus, Grafana, Datadog, New Relic, Dynatrace
Logging Tools: ELK Stack (Elasticsearch, Logstash, Kibana), Splunk
CI/CD Tools: Jenkins, GitHub Actions, GitLab CI/CD, Azure DevOps
Programming/Scripting: Python, Go, Bash, PowerShell
Databases: PostgreSQL, MySQL, MongoDB, Redis
Version Control: Git, GitHub, GitLab, Bitbucket

Qualifications:

Bachelor's degree in Computer Science, Information Technology, Engineering, or a related field.
Relevant certifications are preferred:
- AWS Certified DevOps Engineer
- Google Professional Cloud DevOps Engineer
- Microsoft Azure DevOps Engineer Expert
- Certified Kubernetes Administrator (CKA)

Experience:

4-8 years of experience in Site Reliability Engineering, DevOps, Cloud Engineering, or Infrastructure Operations.
Hands-on experience supporting production environments and cloud-native applications.
Experience with Kubernetes, container orchestration, and automation frameworks.
Experience implementing monitoring and observability solutions.

Preferred Qualifications:

Experience managing large-scale distributed systems and microservices architectures.
Knowledge of chaos engineering and reliability testing practices.
Experience with performance tuning and capacity planning.
Familiarity with security best practices and compliance standards.
Experience with serverless and event-driven architectures.

Preferred Qualities:

Strong ownership mindset and accountability.
Ability to remain calm and effective during critical incidents.
Excellent analytical and debugging skills.
Strong collaboration and cross-functional communication abilities.
Passion for automation, reliability, and continuous improvement.

Employment Type:

Full-Time

Location:

Remote / Hybrid / On-site

Nice to Have:

Experience with SaaS, FinTech, Healthcare, E-commerce, or HR Tech platforms.
Knowledge of AI-driven observability and incident management tools.
Experience implementing self-healing infrastructure and automated remediation.
Familiarity with cost optimization strategies in cloud environments.
Experience mentoring engineers and driving reliability best practices across teams.

* Ladders Estimates




