About the Role
We are building and operating large-scale infrastructure platforms to support high-performance AI workloads across multiple data centers. Our environment includes GPU-intensive systems, high-throughput networking, and rapidly scaling compute clusters.
We are looking for a Virtualization Operations Engineer to focus on the day-to-day operation, stability, and performance of our virtualization platforms. This role is responsible for ensuring that our hypervisor environments are reliable, performant, and scalable as we continue to grow.
This is a hands-on operations role working across hypervisors, virtual machines, and underlying infrastructure systems.
What You'll Do
- Operate and maintain large-scale virtualization environments (Proxmox and/or KVM-based systems)
- Manage the full lifecycle of virtual machines: provisioning, configuration, migration, and decommissioning
- Monitor and respond to platform health issues, including host failures, VM performance degradation, and resource contention (CPU, memory, disk, network)
- Troubleshoot and resolve issues across hypervisors, guest operating systems, and storage and networking layers
- Execute infrastructure changes safely, including cluster expansions, host maintenance and upgrades, and configuration updates
- Work with automation tools to standardize deployments, reduce manual intervention, and improve operational consistency
- Collaborate with DevOps (automation and platform tooling), Network Engineering (connectivity and performance), and Storage Engineering (I/O performance and reliability)
- Participate in incident response and root cause analysis
- Contribute to runbooks, documentation, and operational best practices
Who You Are
Required Qualifications
- 4-7+ years of experience in infrastructure, systems, or platform operations
- Hands-on experience operating Linux-based virtualization platforms, such as KVM/QEMU, Proxmox, or VMware (with strong Linux fundamentals)
- Strong Linux systems knowledge, including process management, networking, and disk and filesystem management
- Experience troubleshooting CPU and memory contention, disk I/O bottlenecks, and network performance issues
- Familiarity with virtualization concepts: VM lifecycle, resource allocation, and live migration
- Experience with infrastructure automation tools (e.g., Ansible or similar)
- Ability to work effectively during incidents and production issues
Preferred Qualifications
- Experience operating infrastructure at scale (100+ hosts)
- Familiarity with GPU-based systems or high-performance workloads, and with NUMA awareness and performance tuning
- Exposure to high-throughput networking (bonding, VLANs, SR-IOV) and to distributed or high-performance storage systems
- Experience working alongside Kubernetes or container platforms
- Experience in cloud or CSP environments
What We Offer
- 100% paid Medical, Dental, and Vision insurance for Employees
- Company Health Savings Account Contributions
- 100% paid Short Term and Long Term Disability Insurance for Employees
- Life and Voluntary Supplemental Insurance Options
- Other Insurance Options, such as Pet & Legal Insurance
- Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
- Flexible Spending Account
- Employee Assistance Program