Machine Learning Engineer, Inference & Serving (Speech LLM) - San Francisco

Plaud

$180K - $270K

Enterprise Technology

Less than 5 years of experience

3 weeks ago


Qualifications

Hands-on experience with high-throughput, low-latency inference engines for large language or speech models.
Understanding of latency, throughput, and Time-To-First-Token in real-time streaming services.
Experience with continuous batching and KV cache management, both crucial for real-time conversational AI.
Strong knowledge of GPU architectures and memory hierarchies, enough to identify and eliminate hardware bottlenecks.
Ability to communicate and collaborate across teams effectively.
Comfortable in dynamic environments with a focus on performance optimization.
Passion for developing AI systems that enhance productivity through natural speech understanding.

Benefits

Opportunity to contribute as an early member of the SpeechLLM lab with significant impact.
Top-tier healthcare coverage for employees and dependents.
401(k) retirement plan with company matching.
Unlimited paid time off and 13 paid holidays.
12 weeks of paid leave for new parents, irrespective of gender.
Hybrid work environment with a minimum of three in-office days per week to promote collaboration.
Access to premium tech equipment and additional company perks.

You may be a good fit if you:

Have hands-on experience building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
Understand the intricate tradeoffs between latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments.
Have practical experience with continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
Possess a deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and the memory hierarchy, allowing you to identify and eliminate hardware bottlenecks.
Communicate clearly and collaborate effectively, as you will sit at the critical intersection between the core ML training team and the backend infrastructure team.
Thrive in fast-moving environments and genuinely enjoy the systems-engineering challenge of squeezing every last drop of performance out of a cluster of GPUs.
Are obsessed with building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.

Strong candidates may also have experience with:

Frontier Serving Frameworks: Deep, under-the-hood familiarity with modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server (bonus points for active open-source contributions to these repositories).
Real-Time Audio Streaming: Experience handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency.
Advanced Inference Techniques: Implementing cutting-edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
Model Compression & Quantization: Hands-on experience with post-training quantization (PTQ), deploying models in FP8 or INT8, or quantized with AWQ or GPTQ, without degrading audio naturalness or ASR accuracy.
Large-Scale Distributed Systems: Deploying multi-GPU (tensor parallelism) and multi-node inference pipelines, and managing autoscaling infrastructure using Kubernetes.

What We Offer

Founding Team Initiative: Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast-growing startup.
Competitive Compensation: $180K - $270K base salary + performance bonus + equity.
Comprehensive Benefits: Top-tier healthcare for employees and dependents, including dental and vision, and a generous employer subsidy.
Retirement Planning: 401(k) plan for full-time employees with company matching.
Paid Time Off: Unlimited PTO, plus 13 paid holidays.
New Parent Leave: 12 weeks of paid time off to spend time with your new family, regardless of gender.
Hybrid Office: Minimum of three in-office days per week to foster highly collaborative, fast-paced research.
Gear & Perks: Choice of top-of-the-line laptops/workstations, annual offsites, and a fully stocked office.
