Research Scientist (Model Evaluation)

Sanas

$120K - $150K *

Palo Alto, CA 94303 (In-Person)

Consumer Technology

Less than 5 years of experience

2 weeks ago

Job Overview by Ladders

Qualifications

4+ years of research experience in speech, audio, or NLP focusing on evaluation methodology
Proficient in various speech and audio quality metrics such as MOS, PESQ, and WER
Skilled in designing and conducting statistically rigorous human evaluation studies
Strong engineering capabilities to build production-quality evaluation pipelines, proficient in Python and PyTorch
Innovative in creating new quantitative metrics for subjective and behavioral assessments
Ability to translate open-ended research questions into reliable evaluation systems
Curious and rigorous approach to measuring progress in AI models

Responsibilities

Design and manage evaluation frameworks for all model portfolios, focusing on meaningful progress measurement
Develop novel quantitative metrics to evaluate subjective qualities in speech AI
Build robust evaluation systems combining automated metrics and human judgment
Define accurate evaluation splits and test sets reflecting diverse production conditions
Establish continuous automated evaluation pipelines to detect regressions early
Implement model quality monitoring in production across various conditions
Communicate evaluation results effectively to diverse stakeholders

Benefits

Collaborative environment at the intersection of research, product, and infrastructure
Impactful role that shapes evaluation practices across model teams
Opportunity to innovate within the field of speech AI evaluation
Direct involvement in translating research insights into practical applications
Diverse challenges across various dimensions of speech technology

Full Job Description

About the Role

Progress in speech AI is only as meaningful as our ability to measure it. At Sanas, model quality spans dimensions that automated metrics struggle to capture - accent naturalness, perceptual clarity, speaker identity preservation, noise suppression without speech distortion, translation fluency under real-world disfluency. We're looking for a Research Scientist who can define what "better" actually means across all of Sanas's model families, build the evaluation infrastructure to measure it rigorously, and close the loop between research progress and real-world impact. This role sits at the intersection of research, product, and infrastructure - and directly shapes how every model team at Sanas measures progress.

Job Description

Evaluation framework design

Design and own evaluation frameworks across Sanas's full model portfolio - Accent Translation, Noise Cancellation, Speech Enhancement, Language Translation, and more - ensuring each captures meaningful progress, not just benchmark performance.
Develop novel quantitative metrics for subjective and perceptual qualities: accent similarity, naturalness, speaker identity preservation, intelligibility under noise, and translation fluency in spoken-language domains.
Build evaluation systems that bridge automated metrics and human judgment - designing listening studies, MOS/MUSHRA protocols, and preference tests that are statistically rigorous and operationally scalable.
Define evaluation splits, test sets, and benchmark suites that accurately reflect production conditions - diverse accents, languages, noise environments, recording devices, and telephony codecs.

Evaluation infrastructure & tooling

Build and maintain automated evaluation pipelines that run continuously against model checkpoints - surfacing regressions early and tracking quality trends across training runs.
Develop reference-based and reference-free metrics calibrated to Sanas's specific model tasks: SI-SDR, PESQ, STOI, DNSMOS, speaker similarity, WER delta, COMET, and task-specific custom metrics where off-the-shelf measures fall short.
Instrument model quality monitoring in production - detecting degradation across language pairs, accent profiles, and acoustic conditions in live customer traffic.
Build tooling that allows research scientists and ML engineers to run rigorous ablations, compare model versions, and understand quality tradeoffs without needing to design the evaluation from scratch each time.
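
As an illustration of one metric named above, here is a minimal Python sketch of WER and a "WER delta". The posting does not define WER delta; this sketch assumes it means the WER of an ASR transcript of the processed audio minus the WER of an ASR transcript of the unprocessed input, both scored against the same reference transcript. The function names are hypothetical, and a real pipeline would normally normalize case and punctuation before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # consumed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def wer_delta(reference: str, asr_of_input: str, asr_of_output: str) -> float:
    """Assumed definition: WER after processing minus WER before processing.

    A positive value means the processed audio was transcribed less accurately.
    """
    return word_error_rate(reference, asr_of_output) - word_error_rate(
        reference, asr_of_input
    )
```

Under this assumed definition, a value near zero suggests the model left intelligibility (as judged by the ASR system) roughly unchanged, while a large positive value would flag a regression worth investigating.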

Human evaluation & research

Design and operate human evaluation programs - listener panels, crowdsourced annotation, and expert evaluator workflows - that produce reliable signal on dimensions automated metrics cannot capture.
Conduct research into evaluation methodology itself: when do automated metrics correlate with human perception, when do they diverge, and what does that tell us about model behavior?
Partner directly with research scientists across model teams to translate open-ended quality questions into concrete, measurable evaluation protocols.
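
One common way to approach the question of when automated metrics track human perception is rank correlation between per-system metric scores and human ratings. The sketch below is a minimal, standard-library Python version of Spearman's rank correlation with average ranks for ties. It is an illustration only, not a method described in the posting, and the function names are hypothetical.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(metric_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(metric_scores) != len(human_scores) or len(metric_scores) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(metric_scores), average_ranks(human_scores)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5
```

A correlation close to 1 would suggest the automated metric orders systems the way listeners do; a low or negative value points to a dimension the metric is not capturing.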

Cross-functional impact

Work closely with ML research, product, and customer success teams to ensure evaluation reflects what customers actually experience - not just what lab conditions optimize for.
Feed evaluation insights back into data acquisition and model training priorities - identifying which failure modes require more data, architectural changes, or training procedure improvements.
Communicate evaluation results clearly to both technical and non-technical stakeholders, translating metric movements into product quality narratives that inform roadmap decisions.

Qualifications

4+ years of research or applied research experience in speech, audio, or NLP, with a demonstrated focus on evaluation methodology and quality measurement.
Deep familiarity with speech and audio quality metrics - perceptual (MOS, MUSHRA, PESQ, STOI), signal-level (SI-SDR, SNR), and task-specific (WER, speaker similarity, DNSMOS) - and an understanding of when each is and isn't the right tool.
Experience designing and running human evaluation studies - listener panels, crowdsourced annotation, inter-annotator agreement analysis - with statistical rigor.
Strong engineering skills: you can build production-quality evaluation pipelines, not just run scripts. Proficiency in Python and PyTorch or equivalent.
Creativity in defining novel quantitative metrics for subjective or behavioral qualities - you've identified gaps in existing evaluation approaches and built something better.
Ability to take open-ended research questions and translate them into concrete, measurable evaluation systems that run reliably at scale.
Curiosity and rigor in equal measure - you're as motivated by discovering the right way to measure progress as by the progress itself.

Bonus

Experience evaluating models across multiple speech tasks - ASR, TTS, speech enhancement, speaker verification, or machine translation.
Familiarity with real-time or streaming model evaluation - latency-quality tradeoffs, codec-degraded audio, telephony channel conditions.
Background in psychoacoustics or perceptual audio quality - understanding of how humans perceive speech naturalness, noise, and distortion.
Experience with multilingual evaluation - cross-lingual quality metrics, language-specific annotation challenges, low-resource language evaluation.
Published research at INTERSPEECH, ICASSP, ACL, EMNLP, or equivalent venues on evaluation methodology, speech quality, or related topics.

* Ladders Estimates
