Lilt

AI Researcher / ML Engineer (ASR & Speech Specialist)

$120K - $150K *
Consumer Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Master's or Ph.D. in Computer Science, Electrical Engineering, Computational Linguistics, Data Science or a related field.
  • 3-5 years of experience developing Automatic Speech Recognition (ASR) systems.
  • Proficiency with deep learning frameworks like PyTorch and specialized speech toolkits.
  • Experience converting and running PyTorch models on mobile inference runtimes such as ExecuTorch, LiteRT (formerly TensorFlow Lite), or ONNX Runtime Mobile.
  • Strong software engineering skills in Python and understanding of complex multilingual tokenization.
  • Familiarity with large-scale audio datasets and data augmentation techniques.

Responsibilities

  • Architect, train, and evaluate advanced ASR models across multiple languages.
  • Design scalable algorithms for dynamic vocabulary insertion and customer-specific terminology.
  • Implement automated evaluations that benchmark model performance on metrics such as Word Error Rate (WER), Character Error Rate (CER), and latency.
  • Develop multilingual benchmarks for end-to-end conversational AI agents.
  • Collaborate with teams to build and optimize high-throughput speech processing systems.
  • Refine components of the speech processing pipeline, such as voice activity detection, speaker diarization, and punctuation restoration.
  • Translate product requirements into actionable AI technical roadmaps.

Benefits

  • Opportunity to work on cutting-edge AI and speech technologies.
  • Collaborative cross-functional team environment.
  • Access to ongoing professional development and training resources.
  • Flexibility in work arrangements, promoting a healthy work-life balance.
Full Job Description
Role Summary

We are seeking a highly skilled and visionary Senior AI Researcher / Machine Learning Engineer specializing in Automatic Speech Recognition (ASR) to anchor our core speech intelligence and benchmarking initiatives. In this role, you will serve as our principal subject matter expert in AI speech data processing, responsible for architecting, training, and scaling high-performance, multilingual ASR models, as well as developing rigorous quality benchmarks for agentic conversational AI.

A critical component of this position involves developing robust domain-adaptation frameworks that allow our models to dynamically incorporate proprietary customer terminology, specialized industry jargon, and multilingual nuances. You will collaborate with the Engineering, Product, and AI Research teams to transform state-of-the-art speech research into production-ready systems powering on-device real-time streaming translation and novel frontier model benchmarks.

Key Challenge: Scaling ASR models that support dynamic vocabulary insertion in enterprise-grade, ultra-low-latency, real-time environments, and building end-to-end agentic AI benchmarks that go beyond surface metrics.

Key Responsibilities
  • Model Development & Innovation: Architect, train, fine-tune, and evaluate state-of-the-art speech representations and ASR models (e.g., End-to-End Conformer, Whisper, RNN-T, and hybrid CTC/Attention architectures) across multiple global languages.
  • Customization & Domain Adaptation: Design and deploy highly scalable algorithms for dynamic vocabulary insertion, contextual biasing, and language model (LM) personalization to precisely capture customer-specific terminology, acronyms, and product names.
  • Evaluation: Implement automated evaluation frameworks to benchmark model performance, rigorously tracking Word Error Rate (WER), Character Error Rate (CER), embedding-based metrics, latency budgets and real-time factor (RTF), and compute-efficiency profiles under varying acoustic environments.
  • Agentic Benchmarking: Develop pioneering multilingual benchmarks for end-to-end conversational AI agents that cover speech-to-text and text-to-speech components and target the weaknesses of state-of-the-art frontier models.
  • Real-Time & Batch Speech Systems: Partner with core engineering teams to build, optimize, and maintain high-throughput pipelines for both ultra-low-latency real-time streaming inference and high-efficiency asynchronous (batch) multi-channel speech analysis.
  • Speech Pipeline Engineering: Develop and refine standard auxiliary components of the speech processing chain, including Voice Activity Detection (VAD), speaker diarization, punctuation restoration, noise/acoustic normalization, and audio pre-processing filters.
  • Cross-Functional Productization: Translate product requirements into technical AI roadmaps, working hand-in-hand with Product Managers to ship speech-to-text, simultaneous translation, and semantic speech analytics features.


Required Technical Qualifications
  • Education: Master's or Ph.D. degree in Computer Science, Electrical Engineering, Computational Linguistics, Data Science, or a related quantitative field with an emphasis on speech processing or deep learning (or equivalent proven industry track record).
  • Speech Domain Expertise: 3-5 years of dedicated professional experience developing ASR systems, speech-to-text translation pipelines, or advanced audio processing models.
  • Deep Learning Frameworks: Advanced proficiency with PyTorch or equivalent frameworks, along with extensive experience using dedicated speech models and toolkits such as Whisper, NVIDIA NeMo, Hugging Face Transformers, Kaldi, ESPnet, or SpeechBrain.
  • On-Device Runtimes: Hands-on experience converting and running PyTorch models on at least one mobile inference runtime: ExecuTorch, LiteRT (formerly TensorFlow Lite), or ONNX Runtime Mobile. You have personally taken a non-trivial model through conversion, including resolving unsupported operations and dynamic-shape or decoder-loop issues.
  • Software & Infrastructure: Strong software engineering skills in Python, with a clear understanding of data structures, algorithm optimization, and complex multilingual text/audio tokenization schemes.
  • Data Pipeline Mastery: Proven experience working with large-scale audio datasets, audio augmentation techniques (e.g., SpecAugment, noise injection), and text normalization/inverse text normalization (ITN) pipelines.
Preferred & Specialization Qualifications
  • High-Performance and On-Device Inference: Experience optimizing models for constrained on-device and production environments using quantization (INT4/INT8/FP16) and distillation, and serving them with ONNX Runtime, TensorRT, or Triton Inference Server.
  • Research Footprint: Peer-reviewed publications at premier speech and machine learning conferences (e.g., ICASSP, INTERSPEECH, NeurIPS, ICLR, ACL), or active contributions to open-source speech communities, are a strong plus.
  • Hardware Acceleration: Working knowledge of mobile NPU/DSP acceleration on the Android SoC landscape (Qualcomm QNN / Hexagon, GPU, and NNAPI delegates) and the trade-offs across Snapdragon, MediaTek, and Google Tensor.
  • Streaming Architectures: Deep technical familiarity with streaming neural architectures (e.g., block-processing, streaming transformers, or transducer models) and real-time network transport constraints (WebSockets, gRPC).
  • Multilingual Engineering: Professional exposure to building zero-shot multilingual speech systems or managing cross-lingual acoustic and phonological data.


Core Competencies & Soft Skills
  • Analytical Problem Solving: Ability to break down ambiguous business or product requirements into well-defined, actionable machine learning experiments.
  • Collaborative Communication: Strong ability to explain complex machine learning concepts to non-technical stakeholders across product, design, and executive leadership.
  • Ownership Mindset: Comfortable working in a fast-paced environment, taking accountability from initial algorithmic hypothesis and exploratory research through to final production monitoring.

About Lilt

Founded: 2015
