Inference Optimization ML Engineer

Rhoda AI

$130K — $180K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 3+ years of experience in inference optimization or ML systems
  • Deep hands-on experience with PyTorch (JAX is a plus)
  • Strong understanding of compute, memory bandwidth, and I/O bottlenecks
  • Experience with optimization techniques like quantization and pruning
  • Familiarity with inference serving frameworks such as Triton and TensorRT
  • Exceptional debugging and measurement skills
  • High ownership mindset and adaptability to fast-paced environments

Responsibilities

  • Own inference performance by diagnosing and improving latency, throughput, and efficiency of models
  • Build systematic performance attribution to identify and prioritize bottlenecks
  • Apply and develop optimization techniques such as quantization and model compilation
  • Optimize attention mechanisms and memory layouts for multimodal models
  • Work with kernel-level tooling to identify hotspots and implement optimizations
  • Build benchmarking infrastructure for latency baselines and regression detection
  • Collaborate with research engineers to translate innovations into optimized implementations

Benefits

  • Direct leverage on research velocity and real-world robot performance
  • Ownership over optimization processes with significant real-world implications
  • Work in a small, elite team with high impact on product outcomes

Full Job Description

We're looking for an Inference Optimization MLE to help build and operate the systems that make our foundation models run fast and efficiently in production. You'll be responsible for squeezing maximum performance out of large multimodal models across cloud and on-robot deployment targets. You will work closely with research and robotics teams to close the gap between training and real-world deployment.

What You'll Do
  • Own inference performance end-to-end: diagnose and improve latency, throughput, and efficiency of large foundation models in production
  • Build systematic performance attribution: latency decomposition (compute vs. memory bandwidth vs. I/O), bottleneck identification, and prioritization across model families
  • Apply and develop optimization techniques including quantization, pruning, distillation, operator fusion, and model compilation (e.g., TensorRT, torch.compile, XLA)
  • Optimize attention mechanisms, KV caching, and memory layouts for large multimodal models (vision, video, language, proprioception)
  • Work with kernel-level tooling (e.g., CUDA, Triton) to identify hotspots and implement or tune custom kernels where needed
  • Build benchmarking and regression detection infrastructure: latency baselines, throughput curves, and automated detection of performance regressions across model versions
  • Collaborate closely with research engineers to translate model innovations into optimized, deployment-ready implementations

What We're Looking For
  • 3+ years of experience in inference optimization, ML systems, or a closely related field
  • Deep hands-on experience with modern ML stacks (PyTorch required; JAX a plus)
  • Strong understanding of compute, memory bandwidth, and I/O bottlenecks in large model inference
  • Experience with model optimization techniques: quantization (INT8/FP8/AWQ), distillation, pruning, and compilation
  • Familiarity with inference serving frameworks (e.g., Triton, TensorRT, vLLM, TorchServe)
  • Exceptional debugging and measurement ability: turn "inference is slow" into clear bottlenecks, experiments, and validated improvements
  • High ownership mindset and comfort in a fast-moving environment

Nice to Have (But Not Required)
  • GPU kernel or compiler-level experience (CUDA, Triton, graph capture, operator fusion)
  • Experience with multimodal or video model inference (variable-length sequences, packing/bucketing)
  • Familiarity with edge/cloud hybrid deployment patterns and on-robot inference constraints
  • Experience with speculative decoding, continuous batching, or other LLM serving optimizations
  • Background in streaming or low-latency systems relevant to real-time robot control

Why This Role
  • Direct leverage on research velocity and real-world robot performance: every efficiency gain you make accelerates model iteration and tightens the loop between model and robot behavior
  • Own the optimization layer that determines how quickly and efficiently our foundation models run in the real world, with high ownership and high impact on a small, elite team
