About the Role
We're building infrastructure that has to perform under real-world scale, reliability, and security demands - and we're looking for an engineer who wants to own the foundation it runs on. This isn't a traditional "keep the lights on" role.
You'll design and operate the global network and reliability layer behind one of the world's fastest private supercomputers - the fabric powering distributed compute, ML workloads, real-time analytics, and mission-critical enterprise systems. You'll work across networking, systems, automation, observability, and reliability engineering to scale a platform where performance genuinely matters, with real influence over architecture decisions.
It's a strong fit if you like solving deep infrastructure problems, building resilient systems, automating everything repetitive, and owning architecture rather than just maintaining it.
What You'll Do
- Design and operate scalable, secure network architecture that meets high-security requirements and supports large-scale machine learning workloads.
- Own network device configuration management end to end, ensuring consistency and reliability across the fleet.
- Improve system and network reliability and performance through automation, observability, and proactive capacity planning.
- Implement and manage complex network protocols and connectivity, including BGP, VPNs, WAN circuits, and external peering.
- Build and maintain comprehensive monitoring, alerting, and incident response - SLOs, runbooks, and on-call rotations - and drive post-incident analysis and continuous improvement.
- Ensure security, compliance, and operational readiness across our network and cloud infrastructure.
- Partner across engineering and data science to drive a culture of performance and reliability.
What Will Help You Succeed
- 8+ years in network or infrastructure engineering, including 5+ years in datacenter operations and/or systems and network administration.
- A strong background in network security, architecture, design, and operations.
- Extensive hands-on experience with network devices (firewalls, switches, load balancers) and large-scale architectures and protocols - BGP, QoS, MPLS, and IPsec VPNs.
- Experience designing and operating modern datacenter network fabrics (spine-leaf, EVPN/VXLAN, ECMP).
- Experience with network automation and IaC tooling (Ansible, Terraform, Nornir, or similar), plus IPAM/DCIM platforms (NetBox, Infoblox, or similar).
- WAN engineering experience, including carrier circuit provisioning and external network peering.
- Familiarity with Kubernetes networking (CNI plugins, ingress, service networking, network policy) and strong operational experience with Linux-based production infrastructure.
- Experience with monitoring and observability stacks (Prometheus, Grafana, Datadog, ELK, OpenTelemetry).
- Solid scripting skills (Python, Bash) for debugging complex network and system issues and automating fixes, plus excellent cross-functional communication.
Also Helpful
- NVIDIA networking technologies - Cumulus Linux, InfiniBand, Spectrum-X, and BlueField DPUs (this is the fabric behind our SuperPOD).
- Familiarity with data-intensive platforms (Spark, Airflow, Kafka) and storage network protocols (NFS, LustreFS, iSCSI).
- Security practices for applications and infrastructure, and experience in high-compliance or SOC 2 environments.
The Role Is Right for You If
- You want to own mission-critical network and infrastructure end to end - from architecture to incident management - not just keep it running.
- You'd rather build and automate than direct from a distance, and you want meaningful influence over how a high-performance platform scales.