SITE RELIABILITY ENGINEER

United States Cold Storage

$130K - $150K *

Camden, NJ 08105

In-Person

Information Technology

Less than 5 years of experience

2 months ago

Job Overview by Ladders

Qualifications

3+ years in SRE, DevOps, Systems Engineering, or related roles
Strong Linux and Windows systems administration skills
Hands-on experience with automation and scripting
Experience in designing monitoring and observability solutions
Practical experience in Azure environments
Experience supporting warehouse management systems or industrial automation platforms

Responsibilities

Ensure reliability of Phenix WMS and facility automation systems
Define and implement meaningful SLIs and SLOs
Enhance observability across cloud and on-premise operations
Automate operational tasks to reduce manual workloads
Develop self-healing behaviors for common failure modes
Participate in on-call rotations and lead blameless post-incident reviews
Test disaster recovery strategies across various environments

Benefits

Opportunity to help define reliability frameworks from scratch
Impact daily warehouse operations directly
Hands-on involvement in engineering for physical systems
Chance to innovate and reduce operational toil
Scope for professional growth as an SRE in a foundational role

Full Job Description

Site Reliability Engineer (SRE)
Engineer Reliability into the Systems That Move the Nation's Food Supply

The Role
The Site Reliability Engineer is a founding member of US Cold's SRE practice.
This role exists to move the organization from reactive operations to engineered reliability. You will study how our most critical systems fail - particularly our Phenix WMS and facility automation interfaces - and design controls, automation, and observability that reduce incidents over time.
Success in this role means fewer false alerts, faster recovery, less manual intervention, and systems that heal themselves when possible.
You will work closely with application, infrastructure, and operations teams and participate directly in on-call and incident response.

What You Will Own

Reliability of the Phenix WMS and its integration with facility automation systems (robotics, conveyors, and control interfaces)
Definition and implementation of SLIs and SLOs that measure meaningful system health, not just availability
Observability across the full stack, correlating cloud services, APIs, and on-premise facility operations
Automation to eliminate operational toil, including patching, data corrections, restarts, and recovery tasks
Development of self-healing behaviors for common failure modes
Participation in on-call rotations and leadership of blameless post-incident reviews
Design and execution of disaster recovery tests across SaaS, cloud, and on-premise environments

This is hands-on reliability engineering. The systems you improve will directly impact daily warehouse operations.

Technical Environment

Hybrid environments spanning cloud and on-premise infrastructure
Azure cloud services
Warehouse Management Systems (Phenix WMS) and facility automation interfaces
Observability tooling across logs, metrics, and alerting
Automation using Python, PowerShell, Bash, or Ansible
CI/CD tools and modern deployment practices
Exposure to containerized and distributed systems environments

What We're Looking For

3+ years of experience in SRE, DevOps, Systems Engineering, or related roles
Strong Linux and Windows systems administration and troubleshooting skills
Hands-on experience with automation and scripting
Experience designing and operating monitoring, alerting, and observability solutions
Practical experience working in Azure environments
Strong analytical skills and a bias toward eliminating root causes, not symptoms
Ability to collaborate across application, infrastructure, and operations teams
Experience supporting warehouse management systems or industrial automation platforms
Exposure to Kubernetes, microservices, or container orchestration
Hands-on experience with infrastructure-as-code tools such as Terraform or Ansible
Understanding of distributed systems and high-availability design
Experience with SRE practices such as SLO-based operations, runbook automation, or chaos testing

Why This Role Is Different
This is not an inherited SRE function.
There is no mature framework to maintain.
You will:

Help define what reliability means at US Cold
Work on systems that operate in the physical world
Engineer solutions that reduce toil and operational load
See the direct impact of your work on warehouse uptime and performance
Build practices that scale as the platform modernizes

This is an opportunity to grow as an SRE while helping establish the reliability foundation of a mission-critical platform.

Compensation & Structure

Location: Hybrid - Camden, NJ

Reports to: IT - Site Reliability Engineering Manager

Salary Range: $130,000 - $150,000

Operational Context

Systems operate continuously across warehouse facilities
Reliability failures have physical and operational consequences
On-call participation is part of the role
Work occurs across cloud, SaaS, and on-premise environments

* Ladders Estimates
