Research Scientist (Model Evaluation)

Sanas

$120K–$150K
Consumer Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 4+ years of research experience in speech, audio, or NLP focusing on evaluation methodology
  • Proficient in various speech and audio quality metrics such as MOS, PESQ, and WER
  • Skilled in designing and conducting statistically rigorous human evaluation studies
  • Strong engineering capabilities to build production-quality evaluation pipelines, proficient in Python and PyTorch
  • Innovative in creating new quantitative metrics for subjective and behavioral assessments
  • Ability to translate open-ended research questions into reliable evaluation systems
  • Curious and rigorous approach to measuring progress in AI models

Responsibilities

  • Design and manage evaluation frameworks across the full model portfolio, focusing on meaningful progress measurement
  • Develop novel quantitative metrics to evaluate subjective qualities in speech AI
  • Build robust evaluation systems combining automated metrics and human judgment
  • Define evaluation splits and test sets that accurately reflect diverse production conditions
  • Establish continuous automated evaluation pipelines to detect regressions early
  • Implement model quality monitoring in production across various conditions
  • Communicate evaluation results effectively to diverse stakeholders

Benefits

  • Collaborative environment at the intersection of research, product, and infrastructure
  • Impactful role that shapes evaluation practices across model teams
  • Opportunity to innovate within the field of speech AI evaluation
  • Direct involvement in translating research insights into practical applications
  • Diverse challenges across various dimensions of speech technology

Full Job Description

About the Role

Progress in speech AI is only as meaningful as our ability to measure it. At Sanas, model quality spans dimensions that automated metrics struggle to capture - accent naturalness, perceptual clarity, speaker identity preservation, noise suppression without speech distortion, translation fluency under real-world disfluency. We're looking for a Research Scientist who can define what "better" actually means across all of Sanas's model families, build the evaluation infrastructure to measure it rigorously, and close the loop between research progress and real-world impact. This role sits at the intersection of research, product, and infrastructure - and directly shapes how every model team at Sanas measures progress.

Job Description

Evaluation framework design

  • Design and own evaluation frameworks across Sanas's full model portfolio - Accent Translation, Noise Cancellation, Speech Enhancement, Language Translation, and more - ensuring each captures meaningful progress, not just benchmark performance.
  • Develop novel quantitative metrics for subjective and perceptual qualities: accent similarity, naturalness, speaker identity preservation, intelligibility under noise, and translation fluency in spoken-language domains.
  • Build evaluation systems that bridge automated metrics and human judgment - designing listening studies, MOS/MUSHRA protocols, and preference tests that are statistically rigorous and operationally scalable.
  • Define evaluation splits, test sets, and benchmark suites that accurately reflect production conditions - diverse accents, languages, noise environments, recording devices, and telephony codecs.

Evaluation infrastructure & tooling

  • Build and maintain automated evaluation pipelines that run continuously against model checkpoints - surfacing regressions early and tracking quality trends across training runs.
  • Develop reference-based and reference-free metrics calibrated to Sanas's specific model tasks: SI-SDR, PESQ, STOI, DNSMOS, speaker similarity, WER delta, COMET, and task-specific custom metrics where off-the-shelf measures fall short.
  • Instrument model quality monitoring in production - detecting degradation across language pairs, accent profiles, and acoustic conditions in live customer traffic.
  • Build tooling that allows research scientists and ML engineers to run rigorous ablations, compare model versions, and understand quality tradeoffs without needing to design the evaluation from scratch each time.

Human evaluation & research

  • Design and operate human evaluation programs - listener panels, crowdsourced annotation, and expert evaluator workflows - that produce reliable signal on dimensions automated metrics cannot capture.
  • Conduct research into evaluation methodology itself: when do automated metrics correlate with human perception, when do they diverge, and what does that tell us about model behavior?
  • Partner directly with research scientists across model teams to translate open-ended quality questions into concrete, measurable evaluation protocols.

Cross-functional impact

  • Work closely with ML research, product, and customer success teams to ensure evaluation reflects what customers actually experience - not just what lab conditions optimize for.
  • Feed evaluation insights back into data acquisition and model training priorities - identifying which failure modes require more data, architectural changes, or training procedure improvements.
  • Communicate evaluation results clearly to both technical and non-technical stakeholders, translating metric movements into product quality narratives that inform roadmap decisions.

Qualifications

  • 4+ years of research or applied research experience in speech, audio, or NLP, with a demonstrated focus on evaluation methodology and quality measurement.
  • Deep familiarity with speech and audio quality metrics - perceptual (MOS, MUSHRA, PESQ, STOI), signal-level (SI-SDR, SNR), and task-specific (WER, speaker similarity, DNSMOS) - and an understanding of when each is and isn't the right tool.
  • Experience designing and running human evaluation studies - listener panels, crowdsourced annotation, inter-annotator agreement analysis - with statistical rigor.
  • Strong engineering skills: you can build production-quality evaluation pipelines, not just run scripts. Proficiency in Python and PyTorch or equivalent.
  • Creativity in defining novel quantitative metrics for subjective or behavioral qualities - you've identified gaps in existing evaluation approaches and built something better.
  • Ability to take open-ended research questions and translate them into concrete, measurable evaluation systems that run reliably at scale.
  • Curiosity and rigor in equal measure - you're as motivated by discovering the right way to measure progress as by the progress itself.

Bonus

  • Experience evaluating models across multiple speech tasks - ASR, TTS, speech enhancement, speaker verification, or machine translation.
  • Familiarity with real-time or streaming model evaluation - latency-quality tradeoffs, codec-degraded audio, telephony channel conditions.
  • Background in psychoacoustics or perceptual audio quality - understanding of how humans perceive speech naturalness, noise, and distortion.
  • Experience with multilingual evaluation - cross-lingual quality metrics, language-specific annotation challenges, low-resource language evaluation.
  • Published research at INTERSPEECH, ICASSP, ACL, EMNLP, or equivalent venues on evaluation methodology, speech quality, or related topics.
