Inference Optimization ML Engineer

Rhoda AI

• $130K — $180K *

Mountain View, CA 94040

In-Person

Information Technology

Less than 5 years of experience

Today


Job Overview by Ladders

Qualifications

3+ years of experience in inference optimization or ML systems
Deep hands-on experience with PyTorch (JAX is a plus)
Strong understanding of compute, memory bandwidth, and I/O bottlenecks
Experience with optimization techniques like quantization and pruning
Familiarity with inference serving frameworks such as Triton and TensorRT
Exceptional debugging and measurement skills
High ownership mindset and adaptability to fast-paced environments

Responsibilities

Own inference performance by diagnosing and improving latency, throughput, and efficiency of models
Build systematic performance attribution to identify and prioritize bottlenecks
Apply and develop optimization techniques such as quantization and model compilation
Optimize attention mechanisms and memory layouts for multimodal models
Work with kernel-level tooling to identify hotspots and implement optimizations
Build benchmarking infrastructure for latency baselines and regression detection
Collaborate with research engineers to translate innovations into optimized implementations

Benefits

Direct leverage on research velocity and real-world robot performance
Ownership over optimization processes with significant real-world implications
Work in a small, elite team with high impact on product outcomes

Full Job Description

We're looking for an Inference Optimization MLE to help build and operate the systems that make our foundation models run fast and efficiently in production. You'll be responsible for squeezing maximum performance out of large multimodal models across cloud and on-robot deployment targets. You will work closely with research and robotics teams to close the gap between training and real-world deployment.

What You'll Do

Own inference performance end-to-end - diagnose and improve latency, throughput, and efficiency of large foundation models in production
Build systematic performance attribution: latency decomposition (compute vs. memory bandwidth vs. I/O), bottleneck identification, and prioritization across model families
Apply and develop optimization techniques including quantization, pruning, distillation, operator fusion, and model compilation (e.g., TensorRT, torch.compile, XLA)
Optimize attention mechanisms, KV caching, and memory layouts for large multimodal models (vision, video, language, proprioception)
Work with kernel-level tooling (e.g., CUDA, Triton) to identify hotspots and implement or tune custom kernels where needed
Build benchmarking and regression detection infrastructure: latency baselines, throughput curves, and automated detection of performance regressions across model versions
Collaborate closely with research engineers to translate model innovations into optimized, deployment-ready implementations

What We're Looking For

3+ years of experience in inference optimization, ML systems, or a closely related field
Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
Strong understanding of compute, memory bandwidth, and I/O bottlenecks in large model inference
Experience with model optimization techniques: quantization (INT8/FP8/AWQ), distillation, pruning, and compilation
Familiarity with inference serving frameworks (e.g., Triton, TensorRT, vLLM, TorchServe)
Exceptional debugging and measurement ability: turn "inference is slow" into clear bottlenecks, experiments, and validated improvements
High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)

GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
Experience with multimodal or video model inference (variable-length sequences, packing/bucketing)
Familiarity with edge/cloud hybrid deployment patterns and on-robot inference constraints
Experience with speculative decoding, continuous batching, or other LLM serving optimizations
Background in streaming or low-latency systems relevant to real-time robot control

Why This Role

Direct leverage on research velocity and real-world robot performance - every efficiency gain you make accelerates model iteration and tightens the loop between model and robot behavior
Own the optimization layer that determines how quickly and efficiently our foundation models run in the real world - high ownership, high impact, small elite team

* Ladders Estimates




