Staff Software Engineer - Managed Kubernetes

Lambda • $130K — $180K *

San Jose, CA 95123 • In-Person

Information Technology

8 - 10 years of experience

2 weeks ago

Job Overview by Ladders

Qualifications

10+ years of experience in software engineering, platform engineering, or SRE, with 5+ years focused on Kubernetes at scale
Expert-level understanding of Kubernetes internals and extension patterns
Holistic infrastructure expertise across compute, network, storage, and security
Strong software engineering skills in Go and Python
Deep experience with GPU orchestration in Kubernetes
Proven track record of technical leadership in cross-team design and mentorship
Experience designing and operating managed services or multi-tenant platforms

Responsibilities

Drive technical vision for Lambda's Managed Kubernetes platform
Integrate NVIDIA's open-source ecosystem into the platform
Design GPU-aware orchestration systems to enhance performance
Lead development of services that enable managed offerings
Inform networking and storage architecture for AI workloads
Design self-healing systems for incident response and resilience
Shape AIOps vision for automated capacity planning and infrastructure maintenance

Benefits

Collaborative work environment with NVIDIA and open-source communities
Opportunity to shape the next generation of AI cloud infrastructure
Technical leadership in a foundational role
Focus on innovative solutions rather than maintenance
Access to modern tools and AI-assisted development
Commitment to diverse team backgrounds and experiences

Full Job Description

About the Role

Lambda is building the AI Cloud of the future. We are seeking a Staff Engineer to help develop our Managed Kubernetes platform. Think GKE, but purpose-built for AI workloads and running on bare metal. This is a foundational technical leadership role where you will shape the infrastructure that powers the next generation of AI training and inference at scale.

As a Staff Engineer on our Orchestration team, you will collaborate to help drive the technical vision for Lambda's managed orchestration services, including Managed Kubernetes, Managed Slurm on Kubernetes, and higher-level platform services for inference and AIOps. You'll work at the intersection of distributed systems, GPU-accelerated computing, and Cloud Native infrastructure to build systems that are reliable, performant, and elegantly simple for our customers.

This is not a role for someone who just operates Kubernetes; it is a technical leadership role for an engineer who has synthesized the core domains of infrastructure (compute, network, storage, security) and can design holistic solutions across all of them. You'll be working closely with NVIDIA's open-source ecosystem, and partnering with internal teams across the stack to deliver a world-class managed platform.

What You'll Do:

Product Engineering

Drive technical vision for Lambda's Managed Kubernetes bare-metal platform, including control plane scalability, multi-tenancy, cluster lifecycle management, and high availability
Integrate and extend NVIDIA's open-source ecosystem: GPU Operator, Network Operator, DCGM, NCCL, and emerging projects like AICR and Topograph for topology-aware scheduling and placement
Design GPU-aware orchestration systems
Lead development of services that power our managed services
Inform and help shape networking solutions for AI workloads: CNI integration (Cilium, Multus), high-performance fabrics (InfiniBand, RoCE), RDMA, and GPUDirect. You will work closely with our Network team to define and drive requirements.
Inform and help define storage architecture requirements for AI workloads. You will partner with Storage teams on what managed K8s, Slurm, and future services need.
Build the foundation for Managed Slurm on Kubernetes, enabling traditional HPC workloads to run seamlessly alongside Kubernetes workloads
Design higher-level platform services for inference, including model serving infrastructure, autoscaling based on inference load, and multi-model deployment patterns
Design self-healing systems and automation for incident response, root cause analysis, and platform resilience
Lead chaos engineering efforts to validate system behavior under failure conditions at scale
Establish operational excellence for a managed service: upgrade automation, security patching, and zero-downtime maintenance

Cross-Functional Infrastructure Leadership

Serve as the technical bridge between Orchestration and other infrastructure teams (Network, Storage, Security), translating platform requirements into actionable specifications
Drive infrastructure-wide decisions that enable successful managed services. You're someone who understands what's needed end-to-end, not just at the Kubernetes layer.
Provide input on bare-metal provisioning, network topology, and storage systems to ensure they meet the needs of the managed services being built by the Orchestration organization
Champion consistency and standardization across Lambda's infrastructure stack
Work directly with customers and internal teams to understand existing deployments and chart a path to the managed platform

Technical Leadership

Set technical direction for Kubernetes services across the Orchestration team, influencing roadmap and prioritization
Drive reviews and design sessions, ensuring we build systems that are scalable, maintainable, and aligned with customer needs
Mentor and grow engineers, establishing best practices for Kubernetes development, distributed systems, and Cloud Native engineering
Collaborate cross-functionally with Network, Storage, Security, and Customer Success teams
Engage with NVIDIA and the open-source community to stay current on GPU orchestration technologies and contribute back where appropriate
Represent Lambda externally through technical blog posts, conference talks, and strategic customer engagements
Shape our AIOps vision: design intelligent systems for automated capacity planning, anomaly detection, and predictive maintenance of cloud infrastructure

Who You Are

You are a creative, innovative engineer who operates at high velocity. You don't just solve problems. You find elegant solutions and ship them quickly. You embrace modern tools and AI-assisted development (like Claude Code) to accelerate your productivity and multiply your impact. You're energized by building new things, not maintaining the status quo.

Required Qualifications

10+ years of experience in software engineering, platform engineering, or SRE, with at least 5 years focused on Kubernetes at scale
Expert-level understanding of Kubernetes internals: API machinery, controllers, schedulers, operators, CRDs, CSI, CNI, and the extension patterns that make Kubernetes powerful
Holistic infrastructure expertise: you've synthesized knowledge across compute, networking, storage, and security, not just Kubernetes in isolation. You can build solutions that span the full stack.
Strong software engineering skills in Go (required) and Python; you write production-quality code, not just scripts
Deep experience with GPU orchestration in Kubernetes: NVIDIA GPU Operator, device plugins, DCGM, MIG, time-slicing, and GPU-aware scheduling. Familiarity with NVIDIA Network Operator and GPUDirect is strongly preferred.
Proven track record of technical leadership: driving design decisions across teams, mentoring engineers, and influencing infrastructure direction beyond your immediate scope
Deep experience designing and operating managed services or multi-tenant platforms. You understand what it takes to run infrastructure for external customers.
Strong understanding of distributed systems principles: consensus, fault tolerance, consistency models, and graceful degradation
Experience with observability at scale: Prometheus, Grafana, distributed tracing, and building actionable alerting systems
Solid knowledge of Linux systems and networking (L2-L7), including high-performance networking concepts (RDMA, InfiniBand, RoCE)
Experience with infrastructure-as-code and GitOps workflows

Preferred Qualifications

Experience building and operating managed Kubernetes services (GKE, EKS, AKS, or similar) or working on Kubernetes control plane components
Hands-on experience with NVIDIA's open-source ecosystem beyond GPU Operator: Network Operator, NCCL tuning, Topograph, AICR, or similar emerging projects
Familiarity with HPC and traditional job schedulers (Slurm) and Kubernetes-native batch scheduling (KAI, Volcano, Kueue)
Background in confidential computing
Experience migrating customers or workloads from legacy/bespoke infrastructure to standardized platforms
Contributions to CNCF projects, Kubernetes SIGs, or NVIDIA open-source projects
Familiarity with security and compliance in multi-tenant environments: RBAC, Pod Security Standards, network policies, workload isolation
Background in ML infrastructure: training clusters, inference serving, simulation

Salary Range Information

The annual salary range for this position has been set based on market data and other factors. However, a salary higher or lower than this range may be appropriate for a candidate whose qualifications differ meaningfully from those listed in the job description.

A Final Note:

You do not need to match all of the listed expectations to apply for this position. We are committed to building a team with a variety of backgrounds, experiences, and skills.

About Lambda

Lambda is an AI cloud company. It provides infrastructure for AI training and inference at scale, including managed orchestration services such as Managed Kubernetes and Managed Slurm on Kubernetes.

Size

1,000 employees

Industry

Education, Government & Non-Profit

Net Income

-$5 million

Founded

2017

5 Year Trend

+100%

Revenue

$100 million

NASDAQ

LMBDA

* Ladders Estimates
