About the Role
We are seeking a highly technical Vice President of Infrastructure to build and scale the foundational infrastructure powering our AI cloud platform.
This is a hands-on executive leadership role. While you will own infrastructure strategy, organizational growth, and executive-level decision making, we expect you to remain deeply engaged in architecture, design, and engineering execution. You should expect to spend approximately 30-40% of your time directly contributing to technical design, architecture reviews, debugging critical production issues, and partnering with engineers on implementation.
The ideal candidate has previously built and scaled cloud platforms, preferably GPU-native cloud infrastructure supporting AI training and inference workloads. You have experience operating at the intersection of executive leadership and hands-on engineering and are excited to help build both the technology and the team.
What You'll Own
Cloud Infrastructure Architecture
- Lead the design and evolution of our AI cloud platform
- Define the architecture for GPU orchestration, compute scheduling, networking, storage, and distributed systems
- Make critical decisions regarding cloud infrastructure, bare-metal deployments, and platform scalability
- Personally participate in architecture reviews and key technical initiatives
GPU Cloud Platform
- Build and scale large GPU clusters supporting customer workloads
- Design systems for GPU provisioning, scheduling, utilization optimization, and capacity management
- Drive platform reliability and performance for AI training and inference workloads
- Partner closely with engineering teams on infrastructure requirements for next-generation AI systems
Technical Leadership
- Remain deeply involved in engineering decisions and technical direction
- Contribute directly to infrastructure design and implementation efforts
- Review architecture proposals, system designs, and major infrastructure changes
- Act as the technical escalation point for complex infrastructure challenges
Infrastructure & Reliability
- Establish best practices for Kubernetes, observability, CI/CD, security, and operational excellence
- Build SRE and Platform Engineering functions from the ground up
- Define reliability standards including SLOs, SLIs, incident response processes, and capacity planning
- Drive automation across infrastructure operations
Organizational Leadership
- Recruit and develop world-class Infrastructure, Platform, and SRE teams
- Build a high-performance engineering culture focused on ownership and execution
- Partner with executive leadership on company strategy and infrastructure investments
- Manage infrastructure budgets, vendor relationships, and capacity planning
Required Experience
Must-Have Background
- 12+ years building and operating large-scale infrastructure systems
- Experience leading infrastructure organizations while remaining hands-on technically
- Previous experience building or operating a cloud platform at scale
- Experience building GPU infrastructure or AI/ML compute platforms
- Proven track record scaling infrastructure in high-growth startup environments
Deep Technical Expertise
- Expert-level Kubernetes knowledge
- Experience designing and operating multi-region cloud infrastructure
- Strong understanding of Linux, networking, distributed systems, and storage architecture
- Experience with Infrastructure-as-Code and automation frameworks
- Deep expertise in observability, monitoring, and reliability engineering
- Experience building highly available production systems
Strongly Preferred
- Experience with GPU scheduling, Slurm, Kubernetes GPU operators, Ray, or distributed training systems
- Experience managing thousands of GPUs in production environments
- Background supporting AI training and inference platforms