Why Reliability?
Roblox serves over 100 million people every day across a platform that is constantly evolving, and behind every experience is infrastructure that has to work, every time, at massive scale. The Reliability team operates across the full depth and breadth of the Roblox stack. Platform availability is a key company goal. We are hiring the first Senior Machine Learning Engineer on our team.
As a Senior Machine Learning Engineer within Reliability, you will help set the direction for how machine learning systems and practices can be leveraged to improve the reliability of the overall Roblox platform. You will own the architectural and execution roadmap for leveraging massive data across logs, traces, metrics, and production changes to proactively detect issues before they become real problems (MTTD) and/or reduce the time to resolve incidents (MTTR). You will have the opportunity to collaborate cross-functionally with similar teams at Roblox to define best practices and software.
You will:
- Help define the roadmap for leveraging Machine Learning Engineering to improve Production Systems Reliability at Roblox.
- Improve real-time anomaly detection capabilities by leveraging various state-of-the-art ML techniques, thereby directly improving Mean Time to Detect for production issues.
- Develop methods for building pipelines that consume various streams of data (metrics, logs, traces, change management systems, etc.).
- Build a reasoning layer that interacts with the streams of data to find possible root causes of problems happening in production.
- Build time-series models that predict capacity exhaustion and seasonal traffic spikes to drive automated scaling.
You have:
- Beyond off the shelf: We are looking for an expert who has knowledge of various modeling techniques and the ability to go deep and fine-tune models to fit our use cases.
- Ability to propose and architect the infrastructure that allows us to implement systems that learn from user and/or automated feedback.
- Good distributed systems fundamentals and an understanding of large-scale, high-throughput systems.
You are:
- Comfortable with Ambiguity: You thrive in undefined or open-ended problem spaces, providing structure, clarity, and decisive direction to your teams.
- A Pragmatic Builder: You are scrappy and impact-oriented. You view undefined data and messy systems as opportunities to build structure rather than blockers to progress.
- An Executive Communicator: Highly effective at communicating complex technical concepts to both engineering teams and non-technical executive leadership.
- Data & System Oriented: You understand that robust data and systems are the foundation of any production application, and you design infrastructure for scale, correctness, and reliability.
- Curious & Creative: You enjoy tackling hard problems, exploring new technologies, and driving continuous improvements in both systems and workflows.
For roles that are based at our headquarters in San Mateo, CA: The starting base pay for this position is as shown below. The actual base pay is dependent upon a variety of job-related factors such as professional background, training, work experience, location, business needs and market demand. Therefore, in some circumstances, the actual salary could fall outside of this expected range. This pay range is subject to change and may be modified in the future. All full-time employees are also eligible for equity compensation and for benefits as described on this page.
Annual Salary Range
$196,750-$243,290 USD
Roles that are based in an office are onsite Tuesday, Wednesday, and Thursday, with optional presence on Monday and Friday (unless otherwise noted).