Research Engineer - Evals

Pitchbook • $120K — $160K *

San Francisco, CA 94112 • In-Person

Information Technology

Less than 5 years of experience

1 month ago

Job Overview by Ladders

Qualifications

5-7 years of experience in evaluation frameworks for AI systems
Strong understanding of model capability and agent behavior
Demonstrated ability to translate technical language for non-technical stakeholders
Background in designing human-rated evaluation rubrics
Experience with real-user behavior instrumentation and analysis

Responsibilities

Develop eval suites for every model and agent release
Create dashboards and tools for efficient researcher and leadership decision-making
Establish criteria for model readiness and shipping
Support research by ensuring accurate measurement of desired outcomes
Collaborate with product engineers to track user behavior on devices
Assist in translating evaluation criteria for OEM partnerships

Benefits

Comprehensive relocation and immigration support
Opportunity to work in a collaborative, fast-paced environment
Engagement with cutting-edge AI technology and research
Potential for meaningful equity in the company
In-person work environment in San Francisco

Full Job Description

You decide what "better" means.

Models, agents, and product features all ship behind one question: did this actually get better? Without a strong evals function, the lab ships vibes. With one, every training run, every prompt change, every agent capability moves a number we trust - and the team makes decisions on real signal, not the loudest opinion in the room.

You'll build the eval harness for AGI - across model capability, agentic behavior, on-device performance, and end-user experience. You'll set the bar for what counts as "shipped" and protect it from the gravity of product deadlines.

🤩 Tasks you will own

The eval suites that gate every model and agent release - capability, behavior, regressions, and human-rated rubrics that catch what automated evals miss
The dashboards and tooling that make researcher experiment loops fast and leadership decisions easy
The bar - what counts as ready to ship, and how we know

🤚 Areas where you will assist

Research, by making sure what we measure is what we want
Product engineers, by instrumenting real-user behavior on real devices
Partnerships, by translating "did it get better" into language an OEM partner can hold us to

Skills you'll be expected to teach

How to measure non-deterministic systems - agent eval, tool use, long-horizon tasks, multilingual behavior
How to push back on a metric that's being gamed without breaking the team

Skills you'll be expected to learn

On-device perf trade-offs and how they show up in real-user evals
What QA-ing AI at OEM scale actually looks like
The realities of shipping consumer agents to production partners

Timeline of success

After 30 days - You've audited every eval we run today and produced a sharp doc on what's good, what's noise, and what's missing. You've fixed the most embarrassing gap.

After 60 days - You've stood up a new eval surface - agentic, on-device, or behavioral - and the team is making real decisions on its output. Researchers come to you before launching a run, not after.

After 90 days - Releases now ship against your eval bar, not a vibe-check. You've caught a regression that would have shipped, and cleared a launch the team was nervous about. You're shaping the research roadmap by surfacing where we're flat, where we're climbing, and where we're lying to ourselves.

Compensation

Competitive cash and meaningful equity. Top-tier relocation and immigration support. SF, in person.

How to apply

Send a link to an eval, benchmark, or measurement system you built - and one paragraph on what decision it changed. Plus your resume or LinkedIn. Every exceptional candidate hears back within 48 hours.

About PitchBook

PitchBook is a financial data and software company that provides research, analysis, and data on private equity, venture capital, and M&A transactions. The company's platform offers a range of tools and services, including market research, deal sourcing, due diligence, and portfolio management. PitchBook serves a variety of clients, including investment banks, private equity firms, venture capital firms, and corporate development teams. The company was founded in 2007 and is headquartered in Seattle, Washington.

Size

1,000 employees

Industry

Finance & Insurance

Founded

2007

* Ladders Estimates

Similar Jobs

Senior AI Product Manager
$153K — $194K *
Cotiviti
Remote
Today
AI Product Manager
$150K — $170K *
Anywhere, Inc
San Jose, CA 95123 (Santa Clara County)
Today
Product Manager - AI Data Platform
$150K — $250K *
Glint Tech Solutions LLC
Sunnyvale, CA 94087 (Santa Clara County)
Today
Sr. Conversational AI Consultant
$120K — $150K *
Verint Systems
Remote
Today
Product Manager
$150K — $200K *
Radiology Partners
Remote
Reposted Today
AI Deployment Strategist
$120K — $180K *
BackOps AI
San Francisco, CA 94112 (San Francisco County)
Today

More Jobs at Pitchbook

Product Designer
$120K — $150K *
San Francisco, CA 94112 (San Francisco County)
1 week ago
Consumer Technology
In-Person
AI Researcher
$150K — $200K *
San Francisco, CA 94112 (San Francisco County)
Reposted 1 week ago
Consumer Technology
In-Person
AI Engineer - Backend
$150K — $200K *
San Francisco, CA 94112 (San Francisco County)
Reposted 1 week ago
Information Technology
In-Person
AI Product Engineer
$120K — $180K *
San Francisco, CA 94112 (San Francisco County)
Reposted 1 week ago
Consumer Technology
In-Person
iOS Engineer
$130K — $180K *
San Francisco, CA 94112 (San Francisco County)
1 month ago
Consumer Technology
In-Person

More Information Technology Jobs

Sales Operations Specialist
Dotcomteam LLC
Salem, NH 03079 (Rockingham County)
Today
Project Manager 3 (IT Projects)
$100K — $130K *
First Tek, Inc.
Vancouver, WA 98682 (Clark County)
Reposted Today
Senior Advisor - Cybersecurity Governance, Risk and Compliance
$90K — $120K *
Exo - Réseau de Transport Métropolitain
Montreal, QC H1A 0A1
Today
Senior Software Engineer, Edge
$130K — $180K *
Kargo
San Francisco, CA 94112 (San Francisco County)
Today
Manager, Data Management
$160K — $185K *
Goodwin Procter
Los Angeles, CA 90011 (Los Angeles County)
Reposted Today
