ROBLOX Corporation

Senior Machine Learning Engineer, Reliability

$196K to $243K
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience in Machine Learning Engineering, focusing on production systems.
  • Expertise in various machine learning modeling techniques and customizing them for use cases.
  • Strong understanding of distributed systems and high-throughput architectures.
  • Proven ability to architect infrastructures for scalable ML systems.
  • Experience with data pipelines and integrating various streams of production data.

Responsibilities

  • Define the machine learning roadmap for improving production systems reliability at Roblox.
  • Enhance real-time anomaly detection using advanced ML techniques.
  • Create data pipelines to process metrics, logs, traces, and change management data.
  • Develop reasoning frameworks for root cause analysis of production issues.
  • Predict capacity and traffic patterns to inform automated scaling decisions.

Benefits

  • Equity compensation for full-time employees.
  • Comprehensive health, dental, and vision insurance.
  • Generous paid time off and holiday policies.
  • Flexible work arrangements with remote work options.
  • Professional development opportunities and training support.

Full Job Description

Why Reliability?

Roblox serves over 100 million people every day across a platform that is constantly evolving - and behind every experience is infrastructure that has to work, every time, at massive scale. The Reliability team at Roblox operates at the depth and breadth of the Roblox stack. Availability of the platform is a key company goal. We are hiring the first Senior Machine Learning Engineer on our team.

As a Senior Machine Learning Engineer within Reliability, you will help set the direction for how machine learning systems and practices can be leveraged to improve the reliability of the overall Roblox platform. You will own the architectural and execution roadmap for leveraging massive data across logs, traces, metrics, and production changes to detect issues before they become real problems (reducing mean time to detect, MTTD) and/or to reduce the time to resolve incidents (mean time to resolve, MTTR). You will have the opportunity to collaborate cross-functionally with similar teams at Roblox to define best practices and software.

You will:
  • Help define the roadmap for leveraging Machine Learning Engineering to improve Production Systems Reliability at Roblox.
  • Improve real-time anomaly detection capabilities by leveraging state-of-the-art ML techniques, directly contributing to a lower Mean Time to Detect (MTTD) for production issues.
  • Build pipelines that consume various streams of data (metrics, logs, traces, change management systems, etc.).
  • Build a reasoning layer that interacts with the streams of data to find possible root causes of problems happening in production.
  • Build time-series models to predict capacity exhaustion and seasonal traffic spikes to drive automated scaling.

You have:
  • Beyond off the shelf: We are looking for an expert who knows a range of modeling techniques and can go deep and fine-tune models to fit our use cases.
  • Ability to propose and architect the infrastructure that allows us to implement systems that learn from user and/or automated feedback.
  • Good distributed systems fundamentals and an understanding of large-scale, high-throughput systems.

You are:
  • Comfortable with Ambiguity: You thrive in undefined or open-ended problem spaces, providing structure, clarity, and decisive direction to your teams.
  • A Pragmatic Builder: You are scrappy and impact-oriented. You view undefined data and messy systems as opportunities to build structure rather than blockers to progress.
  • An Executive Communicator: Highly effective at communicating complex technical concepts to both engineering teams and non-technical executive leadership.
  • Data & System Oriented: You understand that robust data and systems are the foundation of any production application, and you design infrastructure for scale, correctness, and reliability.
  • Curious & Creative: You enjoy tackling hard problems, exploring new technologies, and driving continuous improvements in both systems and workflows.


For roles that are based at our headquarters in San Mateo, CA: The starting base pay for this position is as shown below. The actual base pay is dependent upon a variety of job-related factors such as professional background, training, work experience, location, business needs and market demand. Therefore, in some circumstances, the actual salary could fall outside of this expected range. This pay range is subject to change and may be modified in the future. All full-time employees are also eligible for equity compensation and for benefits as described on this page.

Annual Salary Range

$196,750-$243,290 USD

Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).

About ROBLOX Corporation

Roblox Corporation is a video game company that operates a massively multiplayer online game platform. The platform allows users to create and play games in a virtual world, with a focus on user-generated content. Roblox was founded in 2004 and is headquartered in San Mateo, California. The company has grown rapidly in recent years, and now has over 100 million monthly active users. In 2021, Roblox went public through a direct listing on the New York Stock Exchange.
Size: 960 employees
Market Cap: $15.6 billion
Net Income: -$242.8 million
Founded: 2004
Revenue: $727 million
