Job Description:
We are seeking a Kubernetes Platform Engineer (High-Performance Networking) to lead Kubernetes-native, RDMA-class networking for distributed AI inference platforms on HPC clusters. You will own the end-to-end technical design that allows Kubernetes-orchestrated inference workloads (NVIDIA NIMs, vLLM, TensorRT-LLM) to transparently consume high-speed fabrics (e.g., HPE Slingshot/CXI) using Operators, DRA, CDI, Multus/secondary CNI, and Kubernetes networking abstractions, without container rebuilds, privileged pods, or manual tuning. This role is central to transforming a traditionally HPC-centric fabric into a first-class Kubernetes resource, aligned with modern AI Factory and inference-as-a-service deployment models.
Make HPC fabric capabilities consumable from standard containers
Design the mechanisms to expose RDMA-capable NIC resources and required runtime components without baking the fabric into images, including mounting/injecting host user-space libraries (e.g., libcxi + libfabric) in a controlled, supportable way.
Define and implement the reference design for Kubernetes-native RDMA enablement across:
Dynamic Resource Allocation (DRA)
Container Device Interface (CDI)
Multus + secondary CNIs
Operator-driven lifecycle management
Own API and CRD design (ResourceClaims, DeviceClasses, custom CRDs) with long-term compatibility guarantees.
Make and defend architectural tradeoffs between:
Device plugins vs DRA
CDI vs runtime hooks vs admission webhooks
Shared vs exclusive NIC models
Performance vs operability vs isolation
Kubernetes Operator Ownership
Ensure out-of-the-box compatibility with:
NVIDIA NIMs and the NIM Operator
KServe ServingRuntime / InferenceService
GPU Operator (CDI mode)
Publish deployment patterns and validated manifests for inference workloads using RDMA fast paths.
Cloud Architectures, Cross Domain Knowledge, Design Thinking, Development Fundamentals, DevOps, Distributed Computing, Microservices Fluency, Full Stack Development, Security-First Mindset, Solutions Design, Testing & Automation, User Experience (UX)
Health & Wellbeing
We strive to provide our team members and their loved ones with a comprehensive suite of benefits that supports their physical, financial and emotional wellbeing.
Personal & Professional Development
We also invest in your career because the better you are, the better we all are. We have specific programs tailored to helping you reach your career goals, whether you want to become a knowledge expert in your field or apply your skills to another division.
Unconditional Inclusion
We are unconditionally inclusive in the way we work and celebrate individual uniqueness. We know varied backgrounds are valued and succeed here. We have the flexibility to manage our work and personal needs. We make bold moves, together, and are a force for good.
Follow @HPECareers on Instagram to see the latest on people, culture and tech at HPE.
#unitedstates
Job:
Engineering
Job Level:
TCP_03
The expected salary/wage range for this position is provided below. Actual offer may vary from this range based upon geographic location, work experience, education/training, and/or skill level.
United States of America: Annual Salary USD 111,500 - 211,500 in Colorado // 106,000 - 243,000 in Minnesota & Texas
The listed salary range reflects base salary. Variable incentives may also be offered.
Information about employee benefits offered in the US can be found at https://myhperewards.com/main/new-hire-enrollment.html
The estimated job application period closure is June 4 2026; this timeline is provided for transparency and internal planning purposes.