The Role
Ro is building a team focused on shipping LLM-powered products across the patient experience, clinical operations, and internal tooling.
We're hiring a Senior Applied AI Scientist to own the evaluation, measurement, and optimization of our AI systems. This role sits at the intersection of data science, applied machine learning, and product engineering. You'll design the frameworks that tell us whether our AI systems are actually working and use those insights to continuously improve them.
This is not a research role. You'll work closely with engineers and product teams to evaluate production systems, run experiments, identify failure modes, and ensure our AI products become more accurate, reliable, and cost-effective over time.
What You'll Do
- Design and own evaluation frameworks for production LLM features, including LLM-as-a-judge evaluations, regression suites, synthetic datasets, golden datasets, and human review workflows.
- Analyze production behavior to identify quality issues, hallucinations, latency bottlenecks, cost regressions, and emerging failure modes.
- Design and run experiments, including prompt variations, workflow changes, retrieval improvements, and model comparisons, and quantify their impact on quality, operational metrics, and user outcomes.
- Define the metrics that matter and build dashboards that make AI performance visible across the organization.
- Partner with engineering to determine which optimizations should be productionized and how to measure ongoing success.
- Mentor teammates on experimental design, statistical rigor, evaluation methodology, and measurement best practices.
Who You Are
- 5+ years of experience in data science, applied machine learning, experimentation, or a closely related field, with at least the last year focused on applied LLMs or AI evaluation.
- Strong Python and SQL skills with experience working on production data pipelines and experimentation.
- You have experience designing reproducible evaluation frameworks rather than relying on manual spot checks or qualitative assessments.
- You have strong statistical intuition: you think in terms of distributions, confidence intervals, variance, and sample sizes rather than anecdotes.
- You're comfortable working closely with engineers and product teams to translate experimental findings into production improvements.
- Bonus: Experience with evaluation platforms (e.g. Braintrust, LangSmith, OpenAI Evals), experimentation platforms, causal inference, healthcare, or operations-heavy environments.
A note on reporting structure
This is a new function at Ro, and we're being deliberate about not over-defining it. Your manager and where you sit on the org chart will depend on the specific shape of the team we end up with. We'd rather find the right people and figure out the lines around them than pre-draw boxes and miss great candidates. If that ambiguity is a deal-breaker, this isn't the right role; if it sounds like an opportunity, we want to talk.
The target base salary for this position ranges from $182,300 to $220,000, in addition to a competitive equity and benefits package (as applicable). When determining compensation, we analyze and carefully consider several factors, including location, job-related knowledge, skills and experience. These considerations may cause your compensation to vary.