Hands-on experience with high-throughput, low-latency inference engines for large language or speech models.
Understanding of the tradeoffs between latency, throughput, and Time-To-First-Token in real-time streaming services.
Experience with continuous batching and KV cache management for real-time conversational AI.
Strong knowledge of GPU architectures and memory hierarchies, sufficient to identify and eliminate hardware bottlenecks.
Ability to communicate and collaborate across teams effectively.
Comfortable in dynamic environments with a focus on performance optimization.
Passion for developing AI systems that enhance productivity through natural speech understanding.
Responsibilities
Build and deploy inference engines for cutting-edge language and speech models.
Tune real-time streaming performance, balancing latency, throughput, and Time-To-First-Token.
Implement continuous batching, KV cache management, and stateful connections for real-time conversational AI.
Collaborate with machine learning and backend teams on infrastructure needs.
Drive performance improvements on GPU clusters in real-time applications.
Design and deliver AI voice technology that improves user productivity.
Benefits
Opportunity to contribute as an early member of the SpeechLLM lab with significant impact.
Top-tier healthcare coverage for employees and dependents.
401(k) retirement plan with company matching.
Unlimited paid time off and 13 paid holidays.
12 weeks of paid leave for new parents, irrespective of gender.
Hybrid work environment, with in-office days to promote collaboration.
Access to premium tech equipment and additional company perks.
Full Job Description
You may be a good fit if you:
Have hands-on experience building and deploying high-throughput, ultra-low-latency inference engines for large language models or foundational speech models.
Understand the intricate tradeoffs between latency, throughput, and Time-To-First-Token (or Time-To-First-Audio) in real-time streaming environments.
Have practical experience with continuous batching, KV cache management (e.g., PagedAttention), and stateful connections necessary for real-time conversational AI.
Possess a deep understanding of GPU architectures (NVIDIA Ampere/Hopper) and the memory hierarchy, allowing you to identify and eliminate hardware bottlenecks.
Communicate clearly and collaborate effectively, as you will sit at the critical intersection between the core ML training team and the backend infrastructure team.
Thrive in fast-moving environments and genuinely enjoy the systems-engineering challenge of squeezing every last drop of performance out of a cluster of GPUs.
Are obsessed with building AI systems that natively understand and generate speech, ultimately creating a hardware-software AI companion that amplifies human productivity.
Strong candidates may also have experience with:
Frontier Serving Frameworks: Deep, under-the-hood familiarity with modern LLM serving frameworks like vLLM, TensorRT-LLM, SGLang, or NVIDIA Triton Inference Server (bonus points for active open-source contributions to these repositories).
Real-Time Audio Streaming: Experience handling continuous audio streams over WebSockets or WebRTC, deploying neural audio codecs, and managing chunked audio generation to minimize conversational latency.
Advanced Inference Techniques: Implementing cutting-edge inference optimizations such as speculative decoding, lookahead decoding, or chunked prefill.
Model Compression & Quantization: Hands-on experience with post-training quantization (PTQ), deploying models in FP8 or INT8 or with methods such as AWQ or GPTQ without degrading audio naturalness or ASR accuracy.
Large-Scale Distributed Systems: Deploying multi-GPU (tensor-parallel) and multi-node inference pipelines, and managing autoscaling infrastructure using Kubernetes.
What We Offer
Founding Team Initiative: Opportunity to be an early, foundational member of our core SpeechLLM lab, with meaningful ownership and impact on a fast-growing startup.