Pitchbook

Research Engineer - Evals

Pitchbook • $120K — $160K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in evaluation frameworks for AI systems
  • Strong understanding of model capability and agent behavior
  • Demonstrated ability to translate technical language for non-technical stakeholders
  • Background in designing human-rated evaluation rubrics
  • Experience with real-user behavior instrumentation and analysis

Responsibilities

  • Develop eval suites for every model and agent release
  • Create dashboards and tools for efficient researcher and leadership decision-making
  • Establish criteria for model readiness and shipping
  • Support research by ensuring accurate measurement of desired outcomes
  • Collaborate with product engineers to track user behavior on devices
  • Assist in translating evaluation criteria for OEM partnerships

Benefits

  • Comprehensive relocation and immigration support
  • Opportunity to work in a collaborative, fast-paced environment
  • Engagement with cutting-edge AI technology and research
  • Potential for meaningful equity in the company
  • In-person work environment in San Francisco

Full Job Description

You decide what "better" means.

Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust - and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI - across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

🤩 Tasks you will own

  • The eval suites that gate every model and agent release - capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
  • The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
  • The bar - what counts as ready to ship, and how we know

🤚 Areas where you will assist

  • Research, by making sure what we measure is what we want
  • Product engineers, by instrumenting real-user behavior on real devices
  • Partnerships, by translating "did it get better" into language an OEM partner can hold us to

Skills you'll be expected to teach

  • How to measure non-deterministic systems - agent eval, tool use, long-horizon tasks, multilingual behavior
  • How to push back on a metric that's being gamed without breaking the team

Skills you'll be expected to learn

  • On-device perf trade-offs and how they show up in real-user evals
  • What QA-ing AI at OEM scale actually looks like
  • The realities of shipping consumer agents to production partners

Timeline of success

After 30 days - You've audited every eval we run today and produced a sharp doc on what's good, what's noise, and what's missing. You've fixed the most embarrassing gap.

After 60 days - You've stood up a new eval surface - agentic, on-device, or behavioral - and the team is making real decisions on its output. Researchers come to you before launching a run, not after.

After 90 days - Releases now ship against your eval bar, not a vibe-check. You've caught a regression that would have shipped, and cleared a launch the team was nervous about. You're shaping the research roadmap by surfacing where we're flat, where we're climbing, and where we're lying to ourselves.

Compensation

Competitive cash and meaningful equity. Top-tier relocation and immigration support. SF, in person.

How to apply

Send a link to an eval, benchmark, or measurement system you built - and one paragraph on what decision it changed. Plus your resume or LinkedIn. Every exceptional candidate hears back within 48 hours.

About PitchBook

PitchBook is a financial data and software company that provides research, analysis, and data on private equity, venture capital, and M&A transactions. The company's platform offers a range of tools and services, including market research, deal sourcing, due diligence, and portfolio management. PitchBook serves a variety of clients, including investment banks, private equity firms, venture capital firms, and corporate development teams. The company was founded in 2007 and is headquartered in Seattle, Washington.

Size: 1,000 employees
Founded: 2007
