Machine Learning Engineer, Inference & Serving (Speech LLM) - San Francisco

Plaud

$180K - $270K
Enterprise Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Hands-on experience with high-throughput, low-latency inference engines for large language or speech models.
  • Understanding of latency, throughput, and Time-To-First-Token in real-time streaming services.
  • Experience with continuous batching and KV cache management for real-time conversational AI.
  • Strong knowledge of GPU architectures and memory hierarchies, applied to identifying and eliminating hardware bottlenecks.
  • Ability to communicate and collaborate across teams effectively.
  • Comfortable in dynamic environments with a focus on performance optimization.
  • Passion for developing AI systems that enhance productivity through natural speech understanding.

Responsibilities

  • Build and deploy inference engines for cutting-edge language and speech models.
  • Tune latency, throughput, and Time-To-First-Token for real-time streaming services.
  • Implement continuous batching, KV cache management, and stateful connections for real-time conversational AI.
  • Collaborate with machine learning and backend teams on infrastructure needs.
  • Drive performance improvements on GPU clusters in real-time applications.
  • Design and build AI voice technology that improves user productivity.

Benefits

  • Opportunity to contribute as an early member of the SpeechLLM lab with significant impact.
  • Top-tier healthcare coverage for employees and dependents.
  • 401(k) retirement plan with company matching.
  • Unlimited paid time off and 13 paid holidays.
  • 12 weeks of paid leave for new parents, irrespective of gender.
  • Hybrid work environment with a minimum of 3 in-office days per week.
  • Access to premium tech equipment and additional company perks.

Full Job Description

You may be a good fit if you:
  • Have hands-on experience building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
  • Understand the intricate tradeoffs between latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments.
  • Have practical experience with continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
  • Possess a deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and the memory hierarchy, allowing you to identify and eliminate hardware bottlenecks.
  • Communicate clearly and collaborate effectively, as you will sit at the critical intersection between the core ML training team and the backend infrastructure team.
  • Thrive in fast-moving environments and genuinely enjoy the systems-engineering challenge of squeezing every last drop of performance out of a cluster of GPUs.
  • Are obsessed with building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.


Strong candidates may also have experience with:
  • Frontier Serving Frameworks: Deep, under-the-hood familiarity with modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server (bonus points for active open-source contributions to these repositories).
  • Real-Time Audio Streaming: Experience handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency.
  • Advanced Inference Techniques: Implementing cutting-edge generation algorithms such as speculative decoding, lookahead decoding, or chunked prefill.
  • Model Compression & Quantization: Hands-on experience with post-training quantization (PTQ), deploying models in FP8, INT8, AWQ, or GPTQ, without degrading audio naturalness or ASR accuracy.
  • Large-Scale Distributed Systems: Deploying multi-GPU (Tensor Parallelism) and multi-node inference pipelines, and managing autoscaling infrastructure using Kubernetes.


What We Offer
  • Founding Team Initiative: Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast-growing startup.
  • Competitive Compensation: $180K - $270K base salary + performance bonus + equity.
  • Comprehensive Benefits: Top-tier healthcare for employees and dependents, including dental and vision, and a generous employer subsidy.
  • Retirement Planning: 401(k) plan for full-time employees with company matching.
  • Paid Time Off: Unlimited PTO, plus 13 paid holidays.
  • New Parent Leave: 12 weeks of paid time off to spend time with your new family, regardless of gender.
  • Hybrid Office: Minimum of 3 in-office days per week to foster highly collaborative, fast-paced research.
  • Gear & Perks: Choice of top-of-the-line laptops/workstations, annual offsites, and a fully stocked office.
