Staff ML Software Engineer (L6) — Platform Systems, AIMS Engineering

Netflix • $500K+*

Los Gatos, CA 95032 • In-Person

Information Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

5+ years designing and operating large-scale AI/ML systems
Experienced in migrating AI/ML systems across tech generations
Proficient in Python and at least one JVM language (Scala or Java)
Track record of improving AI/ML system reliability and cost efficiency
Built observability systems for AI/ML workloads
Background in distributed systems and real-time serving
Strong cross-functional collaboration skills
High technical judgment in architectural decisions

Responsibilities

Define architecture for AIMS AI/ML stack modernization
Drive migration to a modern Python-native platform
Build tooling and abstractions to simplify adoption for teams
Ensure scalability of training and data pipelines
Design observability systems for model and pipeline health
Optimize training and serving infrastructure costs
Architect reliability improvements across the AI/ML stack
Prototype GenAI-powered tools for operational automation

Benefits

Health Plans
Mental Health support
401(k) Retirement Plan with employer match
Stock Option Program
Disability Programs
Health Savings and Flexible Spending Accounts
Family-forming benefits
Paid time off with flexible options

Full Job Description

About the Job

AI for Member Systems (AIMS) runs the AI systems behind every recommendation, search result, and personalized experience for 300M+ members. The stack powering it is large and battle-tested, built to meet the demands of its time, and remarkably effective at doing so. But AI/ML is moving fast, and the infrastructure that got us here needs to evolve to meet what's next: new model paradigms, tighter cost and efficiency expectations, and the operational maturity that comes with running AI at this scale. Migrating to a next-generation AI/ML platform is one of the highest-leverage programs in AIMS. So is building the observability and cost infrastructure that makes that platform trustworthy. This role owns that problem end-to-end.

Platform Systems is the engineering foundation of AIMS, owning reliability, scalability, cost efficiency, and developer experience across the org. We are looking for a Staff ML Software Engineer to own the technical health of the AIMS AI/ML stack — modernizing it, and building the observability and cost infrastructure that makes that modernization trustworthy. This is a high-leverage, cross-cutting role — the work you do here will define how AIMS builds AI/ML systems for the next decade. While the initial migration marks our first major initiative, our ongoing goal is to establish sustainable practices for the long term.

Responsibilities

Define the end-state architecture for the modernized AIMS AI/ML stack: how it is organized, what contracts each layer exposes, and what the migration path looks like across training pipelines, AI frameworks, and data infrastructure
Drive end-to-end migration of AIMS AI/ML systems onto a modern, Python-native platform, coordinating across multiple AIMS teams and external platform partners, with dozens of production models in flight
Build migration tooling and shared abstractions that reduce the cost of adoption for individual teams, so modernization does not require each team to solve the same problems independently
Own scalability across training throughput and data pipelines, ensuring AIMS AI/ML systems stay performant as model complexity and member traffic grow
Design and build observability systems that give AIMS ML practitioners deep visibility into model behavior, training pipeline health, serving latency, and data quality, making issues detectable and diagnosable before they become incidents
Identify and drive cost optimization across AIMS training and serving infrastructure, developing frameworks and tooling that make compute efficiency a first-class concern, not an afterthought
Architect reliability improvements across the AIMS AI/ML stack, reducing toil, improving on-call ergonomics, and setting the standard for operational excellence across the org
Prototype and productionize GenAI-powered tooling for anomaly detection, root cause analysis, and operational automation, applying LLM-based systems to the problems of AI/ML reliability and cost at scale
Surface systemic cost, reliability, and migration gaps by embedding with AI/ML teams across AIMS, and translate their friction into concrete engineering investments with org-wide leverage
Set technical standards for the modernized stack and raise the engineering bar across AIMS through design reviews, architectural guidance, and leading by example
Own the long-term architectural evolution of the AIMS AI/ML stack — continuously evaluating emerging infrastructure patterns, model paradigms, and platform capabilities, and translating them into a forward-looking roadmap before they become urgent migrations

What We're Looking For

Significant experience designing, building, and operating large-scale production AI/ML systems, including training pipelines and familiarity with model serving and online inference at high-traffic scale
Hands-on experience migrating production AI/ML systems across technology generations; you have done this before and understand where it goes wrong
Strong software engineering fundamentals with deep Python expertise and working proficiency in at least one JVM language (Scala or Java)
Proven track record of improving AI/ML system reliability, reducing infrastructure costs, and improving operational scalability
Experience building observability and monitoring systems for AI/ML workloads; you understand what good visibility looks like across training, serving, and data pipelines
Strong distributed systems background, including large-scale batch processing and real-time serving infrastructure
Ability to collaborate with partner teams and drive cross-functional technical programs: setting direction, managing dependencies, and building consensus without formal authority
High technical judgment: able to identify common patterns, build reusable frameworks, and make pragmatic calls on what to migrate, what to rewrite, and what to leave alone
Comfortable operating without full information; you can scope a problem, define an approach, and course-correct as you learn more

Preferred Qualifications

Experience with compute and cost optimization for AI/ML workloads at scale, including capacity management and efficiency tooling
Hands-on experience building GenAI-powered tooling for operational automation, root cause analysis, or anomaly detection in AI/ML systems
Experience building developer tooling or platform abstractions that improve AI/ML practitioner velocity
Applied experience in personalization domains such as recommendation systems, search, or discovery
Familiarity with modern AI/ML infrastructure patterns including feature stores, model serving platforms, and experiment frameworks

Generally, our compensation structure consists solely of an annual salary; we do not have bonuses. You choose each year how much of your compensation you want in salary versus stock options. To determine your personal top of market compensation, we rely on market indicators and consider your specific job family, background, skills, and experience to determine your compensation in the market range. The range for this role is $600,000.00 - $1,066,000.00.

Netflix provides comprehensive benefits including Health Plans, Mental Health support, a 401(k) Retirement Plan with employer match, Stock Option Program, Disability Programs, Health Savings and Flexible Spending Accounts, Family-forming benefits, and Life and Serious Injury Benefits. We also offer paid leave of absence programs. Full-time hourly employees accrue 35 days annually for paid time off to be used for vacation, holidays, and sick paid time off. Full-time salaried employees are immediately entitled to flexible time off. See more details about our Benefits here.

About Netflix

Netflix, Inc. is an American media company founded on August 29, 1997 by Reed Hastings and Marc Randolph in Scotts Valley, California, and currently based in Los Gatos, California, with production offices and stages at the Los Angeles-based Hollywood studios (formerly old Warner Brothers studios) and the Albuquerque Studios (formerly ABQ studios). It operates an eponymous over-the-top subscription video on-demand service, which showcases acquired and original programming as well as third-party content licensed from other production companies and distributors. Netflix is also the first streaming media company to be a member of the Motion Picture Association.

Size: 11,300 employees
Market Cap: $127.6 billion
Industry: Hospitality & Recreation
Net Income: $2.7 billion
Founded: 1997
5 Year Trend: +27.5%
Revenue: $24.9 billion
NASDAQ: NFLX

* Ladders Estimates
