Senior Staff Network Engineer, Operations

Crusoe

$225K - $275K *

San Francisco, CA 94112 (In-Person)

Telecommunications & Hardware

11 - 15 years of experience

Today

Job Overview by Ladders

Qualifications

12+ years of production network engineering experience focused on large-scale operations and reliability in hyperscale environments.
Hands-on experience with streaming telemetry, SNMP, NetFlow, and monitoring tools such as Grafana and Prometheus.
Experience operating RDMA/RoCE lossless fabrics for GPU and HPC workloads, including PFC and ECN tuning.
Proven record of owning production reliability at scale and leading post-incident reviews that drive change.
Comfort operating a fleet of 10K+ devices across multi-region environments with on-call responsibility.
Expert knowledge of network protocols such as BGP, OSPF, and MPLS in large data center setups.
Proficiency in Python for operational automation and tooling.

Responsibilities

Own production reliability for Crusoe's global edge, backbone, and GPU cluster networks.
Lead incident response for high-severity network events, ensuring rapid mitigation and communication.
Drive root cause analyses for incidents, authoring remediation plans and tracking issues to resolution.
Define SLIs and SLOs for network reliability metrics, collaborating with Architecture and Site Reliability teams.
Set operational standards by maintaining runbooks, SOPs, and escalation playbooks for the operations team.
Enhance network monitoring and observability using tools like Kentik and ThousandEyes.
Mentor Staff and Senior engineers, fostering a culture of operational excellence and continuous learning.

Benefits

Competitive compensation package and Restricted Stock Units.
Generous paid time off and holidays.
Comprehensive health, dental, and vision insurance coverage.
Employer contributions to HSA accounts and paid parental leave.
Life insurance and short/long-term disability coverage.
Professional development opportunities and tuition reimbursement.
Mental health and wellness support initiatives.
Commuter benefits for parking and transit.
Cell phone stipend provided.
401(k) retirement plan with a company match.
Volunteer time off to support community engagement.

Full Job Description

About this Role

Crusoe Cloud is seeking a Senior Staff Network Operations Engineer to own production reliability across our global network, including edge, backbone, data center fabric, and GPU cluster interconnects. You will drive incident response, root cause analysis, and the operational excellence initiatives that keep our hyperscale AI infrastructure healthy at scale.

This is a senior production ownership role. It is not an architecture role, a pre-sales role, or a purely automation role. You will set operational standards, define SLIs and SLOs, mentor Staff and Senior engineers, and serve as the senior escalation point during high-severity events. This is the role that keeps the network up.

What You'll Be Working On

Own Production Reliability: Serve as the senior technical owner for uptime of Crusoe's global edge, backbone, data center, and GPU cluster networks, directly affecting the availability of AI workloads running on hundreds of thousands of GPUs.
Lead Incident Response: Own end-to-end response for high-severity network events, including rapid mitigation, stakeholder communication, and postmortem documentation that prevents recurrence.
Drive Root Cause Analysis: Lead RCAs for production incidents, identify systemic issues, author remediation plans, and track them to closure.
Define SLIs and SLOs: Partner with Architecture and Site Reliability to define network reliability metrics and service level objectives, backed by real-time dashboards and alerting.
Set Operational Standards: Author and maintain runbooks, escalation playbooks, and SOPs used by the broader operations team.
Improve Observability: Drive continuous improvement of Crusoe's network monitoring stack including streaming telemetry, SNMP, NetFlow, and tools such as Kentik, Grafana, Prometheus, and ThousandEyes.
Build Operational Automation: Write Python-based auto-remediation tooling that reduces toil and accelerates mean time to resolution for known failure modes.
Mentor and Multiply: Provide technical guidance to Staff and Senior engineers. Drive post-incident learning and build a culture of operational excellence across the team.

What You'll Bring to the Team

12+ years of production network engineering experience with a demonstrated focus on large-scale operations, incident response, and reliability in hyperscale or internet-scale environments.
Observability and Monitoring: Hands-on experience with streaming telemetry, SNMP, NetFlow, sFlow, and tools such as Kentik, Grafana, Prometheus, ThousandEyes, and Arbor.
GPU Cluster and RDMA Networking: Hands-on experience operating RDMA/RoCE (v1 and v2) lossless fabrics for GPU and HPC workloads, including PFC, ECN, and DCQCN tuning. Required at this level.
Demonstrated Technical Leadership: Proven track record owning production reliability at scale, leading RCAs that drove systemic change, and setting operational standards the broader org executes against.
Hyperscale Operational Depth: Comfort operating 10K+ device fleets across multi-region environments with 24/7 on-call responsibility. You have been the senior escalation point during critical network events.
Protocol Fluency: Expert hands-on knowledge of BGP, EVPN-VXLAN, IS-IS, OSPF, MPLS, QoS, and TCP/IP across production DC fabric environments at scale.
Hardware Platform Depth: Expert knowledge of Arista (EOS), Juniper (Junos), and NVIDIA/Mellanox platforms in leaf-spine Clos architectures across multi-vendor environments.
Operational Automation: Proficiency in Python for auto-remediation scripts, diagnostic tooling, and operational workflows that reduce toil and accelerate incident resolution.
SLI and SLO Ownership: Experience defining and owning network reliability metrics and service level objectives in partnership with engineering and product leadership.
Education: Bachelor's degree in Computer Science, Electrical Engineering, or a related field, or equivalent practical experience in hyperscale or internet-scale environments.

Benefits:

Competitive compensation
Restricted Stock Units
Paid time off & paid holidays
Comprehensive health, dental & vision insurance
Employer contributions to HSA account
Paid parental leave
Paid life insurance, short-term and long-term disability
Professional development & tuition reimbursement
Mental health & wellness support
Commuter benefits (parking & transit)
Cell phone stipend
401(k) retirement plan with company match up to 4% of salary
Volunteer time off

Compensation:

Compensation will be paid in the range of $225,000 - $275,000 + Bonus. Restricted Stock Units are included in all offers. Compensation to be determined by the applicant's education, experience, knowledge, skills, and abilities, as well as internal equity and alignment with market data.

* Ladders Estimates
