Full Job Description
As a Principal Software Engineer leading Fleet Management, you will be the overall technical lead across three pods and the person who sets the technical direction for the fleet management layer of Roblox. This is a hands-on, deeply technical leadership role that owns all of Roblox's compute capacity end to end: from low-level provisioning and the data plane, up through the control planes that operate it, and all the way to the UI and internal-facing products that let teams self-serve capacity. Your org centralizes security, maintenance operations, and the uptime of every Roblox Kubernetes cluster, and governs the internal customer contracts that drive automation across the fleet, which spans Roblox data centers and cloud providers. You will guide architecture, raise the engineering bar, and make sure compute capacity supply and demand stay in balance as the fleet grows.
You will:
• Serve as the overall technical lead for three Fleet Management pods, setting and aligning the technical direction across low-level provisioning, the data plane, and the control plane and product surfaces above them.
• Architect the declarative, Kubernetes-style control planes that operate Roblox's compute fleet across on-prem and cloud, and define how capacity is provisioned, reconciled, and exposed at scale.
• Own the design of the internal customer contracts and APIs that govern automation across the fleet, so that every infrastructure team can operate capacity safely and predictably.
• Drive the strategy for self-serve capacity, including the internal-facing products and UIs that let teams request, manage, and reason about the compute they depend on.
• Centralize and raise the bar on security, maintenance operations, and the uptime of all Roblox Kubernetes clusters, defining how fleet-wide changes ship reliably without impacting production.
• Partner broadly with stakeholders inside and outside infrastructure to understand compute needs and drive innovation for our backend services, AI, and edge computing.
• Write code daily, staying deep in the systems your org owns and leading by example on the hardest design and implementation problems.
You have:
• 10+ years of experience building and operating large-scale distributed systems and infrastructure.
• A track record as the technical anchor an organization relies on, with the leadership to set direction across multiple teams and up-level the engineers around you.
• Strong proficiency in Go, with deep experience designing and operating production services at fleet scale.
• Hands-on experience building declarative, Kubernetes-style control planes and the reconciliation patterns behind them.
• Strong proficiency with gRPC for service-to-service APIs and with SQL and Postgres for durable, high-scale state.
• Experience operating compute capacity across both on-prem data centers and cloud providers, and a feel for the realities of running fleets at the scale of hundreds of thousands of instances.
• A history of being highly cross-functional, partnering with stakeholders across and beyond infrastructure to design systems that keep compute supply and demand in balance.
For roles that are based at our headquarters in San Mateo, CA: The starting base pay for this position is as shown below. The actual base pay is dependent upon a variety of job-related factors such as professional background, training, work experience, location, business needs and market demand. Therefore, in some circumstances, the actual salary could fall outside of this expected range. This pay range is subject to change and may be modified in the future. All full-time employees are also eligible for equity compensation and for benefits as described on this page.
Annual Salary Range
$345,040-$399,420 USD
Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).