Site Reliability Engineer

Optimum

$100K — $130K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Telecommunications, Computer Engineering, or related field
  • 2-4 years in mobile network operations or systems engineering roles
  • Deep proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX)
  • Hands-on experience with Google Cloud Platform (IAM, VPC, Compute Engine)
  • Proven experience using Terraform and Ansible for managing environments
  • Proficiency in storage protocols such as Fibre Channel and iSCSI
  • Strong scripting ability in Python or Go

Responsibilities

  • Audit, harden, and standardize Unix and Linux environments across GCP and on-premises servers.
  • Architect and manage enterprise-grade SAN/NAS environments and optimize for low latency.
  • Serve as the engineering lead for Eastern U.S. data centers, ensuring hardware health and security standards.
  • Design and maintain automation pipelines to eliminate configuration drift between environments.
  • Establish a sustainable, automated patching cadence to enhance fleet security.
  • Implement and scale a monitoring stack to provide real-time health metrics across the hybrid estate.
  • Participate in on-call rotation and lead blameless post-mortems after incidents.

Benefits

  • Opportunity to work on hybrid cloud infrastructure
  • Focus on automation and modern SRE practices
  • Engage in continuous learning through blameless post-mortems
  • Participation in cutting-edge storage and security technologies
  • Collaborative team environment fostering innovation
Full Job Description
Job Summary

As a Site Reliability Engineer II, you will be a primary driver in the long-term management and stabilization of our Hybrid Cloud infrastructure. We maintain a permanent dual-hosting strategy, operating both Google Cloud Platform (GCP) and a mission-critical On-Premises Unix/Linux footprint. You will bridge the gap between physical hardware and modern cloud-native operations, applying software engineering principles to ensure our systems are scalable, secure, and predictable across all platforms.

The Mission: Hybrid Reliability & Stabilization

Your mission is to unify our GCP and On-Premises environments into a single, reliable platform. Your first 12 months will focus on Stabilization and Observability. You will lead the transition away from "toil" (manual, repetitive operations) toward high-leverage automation, aggressively addressing on-prem technical debt while implementing modern SRE practices across our global data centers and cloud projects.

Responsibilities

  • Hybrid Platform Standardization: Audit, harden, and standardize Unix (Solaris/AIX) and Linux (RHEL/Ubuntu) environments across both GCP Compute Engine and physical bare-metal servers.
  • Storage Engineering (Specialization): Architect and manage enterprise-grade SAN/NAS environments alongside GCP Cloud Storage/Persistent Disk.
    Optimize for low latency and high IOPS while ensuring all data-at-rest complies with our Annual Encryption Strategy.
  • Infrastructure Stewardship (DC Support): Serve as the engineering lead for our Eastern U.S. data centers; ensure hardware health, power redundancy, and physical security standards are enforced through code and automated checks.
  • Automation of Toil: Design and maintain robust automation pipelines (Ansible, Terraform, Python) to ensure configuration parity and eliminate drift between cloud and on-premises environments.
  • Vulnerability Management: Transition the fleet from a "vulnerable" state to a "reliable" one by establishing a sustainable, automated monthly patching cadence.
  • Unified Observability: Implement and scale a "single pane of glass" monitoring stack (Prometheus, Grafana, Loki) to provide real-time health metrics for the entire hybrid estate.
  • Incident Response & Post-Mortems: Participate in a sustainable on-call rotation. Lead Blameless Post-Mortems for incidents involving cross-platform dependencies to ensure we "fix the system, not the person."


Qualifications

  • Bachelor's degree in Telecommunications, Computer Engineering, or related technical field
  • 2-4 years of experience in mobile network operations or systems engineering roles
  • OS Internals: Deep proficiency in Linux (RHEL/Ubuntu) and Unix (Solaris/AIX) administration and kernel tuning
  • Cloud Proficiency: Hands-on experience with GCP (IAM, VPC, Compute Engine) or equivalent public cloud providers
  • Infrastructure as Code: Proven ability to manage complex environments using Terraform and Ansible
  • Storage Protocols: Proficiency in Fibre Channel, iSCSI, and NFS. Experience with enterprise arrays (NetApp, Dell/EMC, or Pure Storage) is highly preferred
  • Software Engineering: Strong scripting ability in Python or Go to build internal tools and automation
  • Security: Strong understanding of CVE lifecycles and cryptographic standards (AES-256)

All job descriptions and required skills, qualifications and responsibilities for a particular position are subject to modification by the Company from time to time, in the Company's discretion based on business necessity.
