Senior Engineering Manager, Kernel and Virt

DigitalOcean

$200K — $251K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Proven track record of leading high-performing engineering teams in distributed systems or infrastructure.
  • Deep expertise in Kubernetes and AI workload orchestration at scale.
  • Strategic knowledge of GPU architectures and their impact on AI performance.
  • Experience balancing performance and cost, applying principles like Dominant Resource Fairness.
  • Familiarity with container runtime internals and security contexts for shared infrastructure risk management.
  • Strong understanding of LLM serving architectures and disaggregated patterns.
  • Expertise in defining and tracking performance metrics for infrastructure.

Responsibilities

  • Recruit, mentor, and develop engineering talent, fostering a culture of ownership and improvement.
  • Translate high-level business goals into actionable technical roadmaps for timely project delivery.
  • Collaborate with cross-functional teams to align priorities and manage project dependencies.
  • Ensure operational health and stability of production services under the team's purview.
  • Define the technical roadmap for high-throughput scheduling systems on large Kubernetes clusters.
  • Implement strategies to maximize GPU utilization in multi-tenant environments, such as fractional GPU allocation and fairshare scheduling.
  • Manage complex AI inference pipelines, coordinating deployment and scaling of various components.

Benefits

  • Hybrid work model providing flexibility between remote and on-site.
  • Opportunities for professional development and continual learning.
  • Access to cutting-edge AI technologies and infrastructure.
  • Collaborative team environment fostering innovation and creativity.
  • Potential for career advancement within a growing organization.

Full Job Description

We are seeking a Senior Engineering Manager to lead our Inference Orchestration team, driving the strategy, execution, and scaling of our Kubernetes-based AI infrastructure. You will be responsible for balancing business needs with technical excellence, ensuring high throughput, optimal GPU utilization, and robust fault tolerance for our next-generation disaggregated inference, fine-tuning, and training workloads.
What You'll Do:
  • Team Leadership & Development: Recruit, mentor, and coach engineers on the team, fostering a culture of ownership, technical excellence, and continuous improvement.
  • Execution & Delivery: Own the team's project execution, translating high-level business goals into clear technical roadmaps, measurable milestones, and successful, on-time delivery.
  • Cross-Functional Partnership: Collaborate with Product Management, other engineering teams, and key stakeholders to align priorities, manage dependencies, and communicate progress and risks.
  • Operational Health: Ensure the production health, stability, and on-call rotation of all services owned by the Inference Orchestration team.
  • Strategic Architecture & Planning: Define the technical roadmap and oversee the architecture of high-throughput scheduling systems for massive Kubernetes clusters (1,000+ nodes, 10,000+ pods), focusing on scalability techniques like multi-scheduler architectures and batch dispatching.
  • Maximize GPU Utilization: Eliminate GPU waste in multi-tenant environments by implementing fractional GPU allocation, leveraging mechanisms like KAI-Scheduler's Reservation Pods or hard-isolation tools like HAMi, and configuring time-based fairshare scheduling to balance over-quota pool access.
  • Orchestrate Complex Inference: Implement and manage disaggregated AI inference pipelines using frameworks like NVIDIA Grove, coordinating multicomponent deployments (e.g., prefill leaders, decode workers, KV routers) with multilevel autoscaling and explicit startup ordering.
  • Optimize Placement & Topology: Deploy topology-aware scheduling to align pod placement with physical hardware dimensions, such as NVLink connections, PCIe lanes, and NUMA nodes, minimizing communication latency for multi-GPU operations.
  • Platform Performance & Reliability: Drive initiatives to enhance overall cluster performance, including reducing scheduling latency and API server load, and implementing fault tolerance mechanisms like Checkpoint/Restore for long-running AI training jobs.
  • Manage AI Storage & Fault Tolerance: Orchestrate efficient model weight distribution using OCI Image Volumes and implement Checkpoint/Restore capabilities (via CRIU and NVIDIA cuda-checkpoint) for fault recovery in long-running training jobs.
  • Security and Isolation: Define and enforce security best practices for AI workloads, ensuring multi-layered isolation environments and agent sandboxes are deployed to safely execute untrusted code (e.g., using Kata Containers, gVisor, or microVMs).
What You'll Bring:
  • Engineering Leadership Experience: Proven track record of managing and growing high-performing engineering teams, preferably within a distributed systems or infrastructure domain.
  • Kubernetes and AI Infrastructure Domain Knowledge: Deep expertise in Kubernetes at scale and a strong foundational understanding of the core challenges in AI workload orchestration, scheduling, and resource management.
  • Hardware-Aware Optimization: Strategic knowledge of GPU architectures (NVIDIA and/or AMD), interconnects (like NVLink), and hardware topology, and of their direct impact on AI training and inference performance.
  • Resource and Cost Management: Experience in balancing performance against cost, applying principles like Dominant Resource Fairness (DRF), and directing strategies for maximizing cluster efficiency.
  • Systems Engineering & Security: Familiarity with concepts in container runtime internals, system isolation, and security contexts to manage risk in shared infrastructure.
  • AI/ML Serving Architectures: Strong understanding of modern LLM serving architectures, disaggregation patterns, and common serving engines (e.g., vLLM, Triton, SGLang).
  • Observability and SLOs: Expertise in defining, tracking, and operationalizing deep infrastructure and inference metrics (e.g., time to first token (TTFT), time per output token (TPOT)) to drive performance improvements and meet service level objectives.
Compensation Range:
  • $200,800 - $251,000

*This is a hybrid role


