Staff Engineer, AI Evals

Sema4.ai

• $120K — $150K *

Madison, WI 53711 (In-Person)

Enterprise Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

7+ years of software engineering experience, including 2+ years building AI/ML systems in production
Deep experience with backend systems in Python
Hands-on experience evaluating LLM-based systems, such as agents and workflows
Strong intuition for metrics, experimentation, and failure analysis
Exceptional communication skills for collaboration with diverse stakeholders
High-ownership mindset toward system integrity and decision-making

Responsibilities

Design, build, and operate core evaluation infrastructure for LLMs and agents
Translate fuzzy goals around correctness and reliability into measurable signals
Solve ambiguous problems regarding the evaluation of complex agent systems
Guide architectural decisions and model selections with evaluation results
Participate in design reviews and set technical standards
Mentor other engineers and assist in interviewing senior technical candidates

Benefits

Flexible working hours
Collaborative team environment
Opportunities for professional growth and mentorship
Access to the latest tools and technologies
Contributions to impactful AI initiatives

Full Job Description

The Opportunity

As a Staff Engineer, AI Evals, you'll design and own the evaluation systems that determine whether our agents are actually good: correct, reliable, efficient, and improving over time. You'll build the measurement backbone that guides model choice, agent design, product decisions, and customer trust.

This is an early, high-impact role. You'll be defining how we measure success for AI agents in production, where ambiguity is real and ground truth can be messy. We're looking for an engineer who brings rigor, judgment, and strong opinions about what "good" looks like, and who knows how to operationalize it.

Who You Are

AI Systems & Evaluation Expert

You understand that AI systems are only as good as the way they're measured. You've worked with LLMs and agentic systems in production and have seen how offline benchmarks, synthetic data, and human judgment can all fail in different ways. You know how to design evaluations that are meaningful, repeatable, and decision-useful, not just theoretically impressive.

You're familiar with the sharp edges: non-determinism, prompt drift, regression risk, overfitting, data leakage, and the tension between fast iteration and statistical rigor.

In-Depth Technologist

You stay close to research and industry practice in evaluation, alignment, and reliability. You understand where automated metrics work, where they break down, and how to combine them with human review, golden datasets, and production signals. You bring creativity to building evaluation sets and scenarios, and to sourcing (or synthesizing) the data you need.

Builder With High Standards

You care deeply about correctness, clarity, and operational behavior. You can move fast, but you don't confuse speed with rigor. You design eval systems that engineers trust, product relies on, and leadership uses to make decisions. You know when to build custom infrastructure and when to leverage existing tools without outsourcing critical thinking.

What You'll Do

Build and Own the Evaluation Platform

Design, build, and operate Sema4.ai's core evaluation infrastructure for LLMs and agents: offline benchmarks, regression tests, task-level metrics, and production feedback loops. These systems will directly inform product launches, model upgrades, and customer requirements.

Define "Good" for Agents in Production

Work closely with agent, product, and field engineering teams to translate fuzzy goals around correctness, reliability, and usefulness into concrete, measurable signals. You'll help define success criteria for new capabilities and ensure we can detect regressions before customers do.

Tackle Ambiguous, High-Leverage Problems

Solve hard problems where the answer isn't obvious:

How to evaluate long-running, multi-step agents
How to balance automated scoring with human judgment
How to measure improvement when tasks evolve
How to compare models under cost and latency constraints

Influence Technical and Product Direction

Use evaluation results to guide architectural decisions, model selection, and roadmap tradeoffs. You'll participate in design reviews, set technical standards for eval rigor, mentor other engineers, and help interview senior technical candidates.

What You Bring

7+ years of software engineering experience, including 2+ years building AI/ML systems in production
Deep experience with backend systems in Python, including data pipelines, observability, and reliability
Hands-on experience evaluating LLM-based systems (agents, retrieval, tool use, workflows, etc.)
Strong intuition for metrics, experimentation, and failure analysis in non-deterministic systems
Strong communication skills: whether you're talking to colleagues, customers, or machines, you communicate clearly, concisely, and collaboratively
A high-ownership mindset: you care deeply about the integrity of the systems you build and the decisions they inform

* Ladders Estimates
