We are seeking a Senior Engineering Manager to lead our Inference Orchestration team, driving the strategy, execution, and scaling of our Kubernetes-based AI infrastructure. You will be responsible for balancing business needs with technical excellence, ensuring high throughput, optimal GPU utilization, and robust fault tolerance for our next-generation disaggregated inference, fine-tuning, and training workloads.
What You'll Do:
- Team Leadership & Development: Recruit, mentor, and coach engineers on the team, fostering a culture of ownership, technical excellence, and continuous improvement.
- Execution & Delivery: Own the team's project execution, translating high-level business goals into clear technical roadmaps, measurable milestones, and successful, on-time delivery.
- Cross-Functional Partnership: Collaborate with Product Management, other engineering teams, and key stakeholders to align priorities, manage dependencies, and communicate progress and risks.
- Operational Health: Ensure the production health and stability of all services owned by the Inference Orchestration team, and manage the team's on-call rotation.
- Strategic Architecture & Planning: Define the technical roadmap and oversee the architecture of high-throughput scheduling systems for massive Kubernetes clusters (1,000+ nodes, 10,000+ pods), focusing on scalability techniques like multi-scheduler architectures and batch dispatching.
- Maximize GPU Utilization: Eliminate GPU waste in multi-tenant environments by implementing fractional GPU allocation, leveraging mechanisms like KAI-Scheduler's Reservation Pods or hard-isolation tools like HAMi, and configuring time-based fairshare scheduling to balance over-quota pool access.
- Orchestrate Complex Inference: Implement and manage disaggregated AI inference pipelines using frameworks like NVIDIA Grove, coordinating multicomponent deployments (e.g., prefill leaders, decode workers, KV routers) with multilevel autoscaling and explicit startup ordering.
- Optimize Placement & Topology: Deploy topology-aware scheduling to align pod placement with physical hardware dimensions, such as NVLink connections, PCIe lanes, and NUMA nodes, minimizing communication latency for multi-GPU operations.
- Platform Performance & Reliability: Drive initiatives to enhance overall cluster performance, including reducing scheduling latency and API server load, and implement fault tolerance mechanisms like Checkpoint/Restore for long-running AI training jobs.
- Manage AI Storage & Fault Tolerance: Orchestrate efficient model weight distribution using OCI Image Volumes and implement Checkpoint/Restore capabilities (via CRIU and NVIDIA cuda-checkpoint) for long-running training fault recovery.
- Security and Isolation: Define and enforce security best practices for AI workloads, ensuring multi-layered isolation environments and agent sandboxes are deployed to safely execute untrusted code (e.g., using Kata Containers, gVisor, or microVMs).
What You'll Bring:
- Engineering Leadership Experience: Proven track record of managing and growing high-performing engineering teams, preferably within a distributed systems or infrastructure domain.
- Kubernetes and AI Infrastructure Domain Knowledge: Deep expertise in Kubernetes at scale and a strong foundational understanding of the core challenges in AI workload orchestration, scheduling, and resource management.
- Hardware-Aware Optimization: Strategic knowledge of GPU architectures (NVIDIA and/or AMD), interconnects (like NVLink), and hardware topology, and of their direct impact on AI training and inference performance.
- Resource and Cost Management: Experience in balancing performance against cost, applying principles like Dominant Resource Fairness (DRF), and directing strategies for maximizing cluster efficiency.
- Systems Engineering & Security: Familiarity with concepts in container runtime internals, system isolation, and security contexts to manage risk in shared infrastructure.
- AI/ML Serving Architectures: Strong understanding of modern LLM serving architectures, disaggregation patterns, and common serving engines (e.g., vLLM, Triton, SGLang).
- Observability and SLOs: Expertise in defining, tracking, and operationalizing deep infrastructure and inference metrics (e.g., TTFT, TPOT) to drive performance improvements and meet service level objectives.
Compensation Range:
*This is a hybrid role.