Job Description
Want to own the data infrastructure behind some of the most naturalistic voice models in production?
You'll be joining a well-funded speech AI startup that has just closed its Series A, with strong enterprise traction and revenue that more than doubled last quarter. They're building ultra-realistic voice technology that handles natural laughter, breathing, seamless language switching, and accurate pronunciation across languages and accents. Their models power hundreds of millions of conversations monthly.
Before training a single model, they built their own corpus - full-duplex, studio-quality conversational speech annotated by PhD linguists. As their MLE, you'll own the pipelines that turn that raw material into clean, training-ready data.
What you'll do
- Own end-to-end data pipelines from raw audio ingestion through to versioned, training-ready datasets
- Build quality systems that catch annotation errors and alignment issues before they reach a training run
- Maintain the training infrastructure that keeps GPUs fed - dataloaders, streaming datasets, multi-modal batching
- Build and iterate on tooling across speech representations including neural codecs, semantic tokens and mel features
- Handle full- and half-duplex pipeline work including two-channel alignment and overlap handling
What you'll bring
- Strong engineering fundamentals with experience building ML data pipelines at scale
- Hands-on experience with speech or audio data
- Solid understanding of speech representations and the tradeoffs between them
- Experience with multi-channel audio data including diarisation and alignment
Nice to have
- Experience with multilingual data pipelines
- Large-scale training infrastructure experience - FSDP, DeepSpeed, Ray
- Annotation tooling and human-in-the-loop systems
Remote-friendly. Competitive base plus stock.