Senior Manager, Platform, Lifecycle, & Troubleshooting

Vultr

$120K - $140K *

US-Anywhere, Remote in United States

Information Technology

8 - 10 years of experience

2 weeks ago

Job Overview by Ladders

Qualifications

8+ years of experience in Linux systems administration, platform engineering, or SRE operations in cloud infrastructure.
Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
Proven leadership experience managing technical teams through on-call rotations and complex migrations.
Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
Excellent problem-solving and documentation skills, coupled with strong cross-team communication abilities.
Bachelor's degree in Computer Science, Engineering, or equivalent experience.

Responsibilities

Lead the troubleshooting team in resolving complex platform issues and incidents.
Own and manage server repurposing and OS/distribution upgrades during migrations.
Perform advanced troubleshooting for RDMA links, GPU, storage, and networking.
Validate firmware choices and handle complex firmware updates.
Provide 24/7 on-call leadership and drive improvements to incident response.
Develop automation and self-healing processes to enhance efficiency and reduce downtime.
Collaborate with hardware and onboarding teams to streamline operations and communications.
Mentor senior engineers and build a high-performing team focused on root-cause analysis.
Track and analyze metrics related to uptime and migration success for operational improvements.

Benefits

Opportunities for hands-on technical leadership in a high-growth company.
Involvement in cutting-edge projects related to AI and enterprise workloads.
Chance to make a significant impact on cloud infrastructure performance and reliability.
Collaborative work environment with cross-functional teams for broader impact.
Mentoring and growth opportunities for both personal and team development.

Full Job Description

We are seeking a highly skilled and experienced Platform & Lifecycle Team Manager to drive deep technical troubleshooting and lifecycle excellence across our expanding server fleet. The ideal candidate is a technical leader with strong Linux/platform expertise and a passion for solving complex issues in high-performance cloud environments (GPU, storage, RDMA, etc.). This is a highly visible role in a high-growth technology company, which will require both hands-on engineering depth and team leadership.

This is your opportunity to join our fast-growing team and leave your mark on Vultr and the future of Cloud Infrastructure. You will lead the team responsible for keeping thousands of production servers running reliably, owning complex platform troubleshooting, large-scale migrations (including OS/distribution changes), and the post-onboarding lifecycle. This work contributes directly to Vultr's uptime, its performance leadership in GPUs and bare metal, and its ability to support demanding AI and enterprise workloads.

Key Responsibilities

Lead the Platform, Lifecycle & Troubleshooting team in resolving complex incidents and platform issues.
Own server repurposing, migrations (e.g., OS/distribution upgrades), and deeper lifecycle management.
Perform and guide advanced troubleshooting for RDMA links, GPU, storage, and server-side networking.
Validate firmware choices and handle complex/ongoing firmware updates.
Provide 24/7 on-call leadership and drive incident response improvements.
Develop runbooks, automation, and self-healing processes to reduce toil and improve MTTR.
Collaborate closely with Hardware and Onboarding teams on handoffs and mixed tickets.
Partner with Engineering, Networking, and Solutions teams on technical escalations and improvements.
Mentor senior engineers and build a high-performing team focused on root-cause analysis.
Track key metrics (uptime, incident trends, migration success) and drive operational maturity.

Qualifications

8+ years of experience in Linux systems administration, platform engineering, or SRE-style operations in cloud or large-scale infrastructure environments.
Deep expertise in troubleshooting GPU, storage, RDMA, and high-performance networking issues.
Proven track record leading technical teams, including on-call rotations and complex migrations.
Strong scripting/automation skills (Python, Bash, Ansible, etc.) and experience with monitoring tools.
Excellent problem-solving, documentation, and cross-team communication abilities.
Bachelor's degree in Computer Science, Engineering, or equivalent experience.

Compensation

$120,000 - $140,000

Final compensation will vary depending on years of experience, background/skill set, location, and applicable laws.

* Ladders Estimates
