Principal Software Engineer, GPU Compute

ROBLOX Corporation • $345K — $399K *

San Mateo, CA 94403 • In-Person

Information Technology

8 - 10 years of experience

Today

Job Overview by Ladders

Qualifications

10+ years of experience in large-scale distributed systems and infrastructure.
Deep hands-on GPU expertise at the machine management layer and above.
Proven track record in scalable GPU or accelerator infrastructure.
Strong proficiency in Go or similar programming languages.
Experience with GPU and AI workloads in production settings.
Familiarity with Kubernetes for GPU workloads and bare-metal concepts is a plus.
History of being a key technical resource for challenging GPU and compute issues.

Responsibilities

Serve as GPU technical leader for the Compute team, collaborating with various engineering teams.
Own GPU host lifecycle management above basic fleet management.
Architect the exposure of GPU capacity to compute platforms.
Drive reliability and performance for GPUs at fleet scale, automating detection and repair.
Evaluate and onboard new GPU/AI accelerator platforms and networking topologies.
Establish standards and APIs for efficient and safe GPU compute consumption.

Benefits

Eligibility for equity compensation.
Flexible onsite work schedule with remote options on certain days.

Full Job Description

As a Principal Software Engineer on the Compute team, you will be the technical anchor for Roblox's GPU and AI accelerator capabilities. This is a battle-tested GPU expert role focused on the machine management layer and above: how GPU hosts are made production-ready, kept healthy, and turned into reliable compute for the workloads that depend on them. You will own the hard problems that show up only at scale, from driver and firmware management to GPU health, reliability, and performance across a rapidly growing fleet of accelerators spanning Roblox data centers and cloud environments. You will set the technical direction for GPU compute and up-level the entire organization's GPU expertise.

You will:

• Serve as the GPU technical leader for the Compute team, partnering across Kubernetes, Machine Bootstrap, Networking, and Cloud to drive GPU strategy end to end.
• Own the GPU host lifecycle above raw fleet management: driver, firmware, and CUDA stack management, GPU health and telemetry, and remediation of GPU-specific failures (XID errors, ECC, thermal, NVLink and fabric faults).
• Architect how GPU capacity is exposed to compute platforms, including scheduling, isolation, and integration with Kubernetes for GPU and AI workloads.
• Drive GPU reliability and performance at fleet scale, defining the detection, diagnosis, and automated repair of unhealthy accelerators before they impact production.
• Evaluate and onboard new GPU and AI accelerator platforms, networking topologies (NVLink, InfiniBand, RoCE), and multi-node training and inference patterns.
• Establish the standards, tooling, and APIs that let other engineering teams consume GPU compute safely and efficiently, reducing toil and raising the bar for the org.

You have:

• 10+ years of experience building and operating large-scale distributed systems and infrastructure.
• Deep, hands-on GPU expertise at the machine management layer and above: GPU host provisioning, driver and firmware lifecycle, GPU health and reliability, and the realities of running accelerators in production.
• A track record as an expert for compute, not just fleet management, with the scars to prove you have scaled GPU or accelerator infrastructure that other teams depend on.
• Strong proficiency in Go or other well-structured programming languages.
• Experience operating GPU and AI workloads in production, including familiarity with CUDA, GPU scheduling, and high-performance networking (NVLink, InfiniBand, RoCE).
• Familiarity with Kubernetes for GPU workloads and with bare-metal concepts (firmware, BMC/IPMI/Redfish, OS imaging) is a strong plus.
• A history of being the anchor expert that an organization relies on for its hardest GPU and compute problems, and the leadership to up-level the engineers around you.

For roles that are based at our headquarters in San Mateo, CA: The starting base pay for this position is as shown below. The actual base pay is dependent upon a variety of job-related factors such as professional background, training, work experience, location, business needs and market demand. Therefore, in some circumstances, the actual salary could fall outside of this expected range. This pay range is subject to change and may be modified in the future. All full-time employees are also eligible for equity compensation and for benefits as described on this page.

Annual Salary Range: $345,040-$399,420 USD

Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).

About ROBLOX Corporation

Roblox Corporation is a video game company that operates a massively multiplayer online game platform. The platform allows users to create and play games in a virtual world, with a focus on user-generated content. Roblox was founded in 2004 and is headquartered in San Mateo, California. The company has grown rapidly in recent years, and now has over 100 million monthly active users. In 2021, Roblox went public through a direct listing on the New York Stock Exchange.

Size: 960 employees
Market Cap: $15.6 billion
Industry: Retail & Consumer Goods
Net Income: -$242.8 million
Founded: 2004
Revenue: $727 million
Ticker: NYSE: RBLX

* Ladders Estimates
