Software Engineer (Model Evaluation & Benchmarking)

SpreeAI

• $120K — $160K *

San Francisco, CA 94112

Hybrid

Information Technology

Less than 5 years of experience

1 week ago


Job Overview by Ladders

Qualifications

Degree in Computer Science, AI, Engineering, or relevant experience.
Strong programming skills in Python.
Familiarity with object-oriented programming (C++, Java, Python, etc.).
Strong understanding of data structures and algorithms.
Knowledge of machine learning experimentation workflows.

Responsibilities

Build automated evaluation pipelines for multimodal AI models.
Benchmark diffusion models, vision systems, and generative workflows.
Validate model checkpoints and detect regressions across versions.
Develop evaluation metrics for realism, consistency, and performance.
Integrate evaluation tools into CI/CD workflows.
Collaborate with ML researchers and infrastructure teams for production readiness.
Analyze failure modes and propose evaluation strategies to address them.

Benefits

Opportunity to work at the cutting edge of AI technologies.
Cross-disciplinary collaboration with researchers and engineers.
Impactful role in defining quality standards for generative AI systems.
Access to modern infrastructure and evaluation tools.
Involvement in a dynamic and rapidly evolving field.

Full Job Description

About the Role

We are hiring Engineers focused on AI Model Evaluation to build the systems that ensure multimodal AI behaves reliably, consistently, and predictably as it moves from research into production. This position focuses on evaluating generative and vision-based models through automated benchmarking, dataset-driven testing, and performance validation pipelines.

You will work at the intersection of applied science, infrastructure, and product, helping define how we measure realism, consistency, and quality across image, video, and multimodal AI systems.

Why This Role Exists

Modern AI evaluation extends beyond pass/fail testing. Multimodal generative systems require:

benchmarking across visual realism, pose consistency, and identity preservation,
automated regression detection across model checkpoints,
scalable evaluation pipelines integrated into continuous deployment workflows.

We are building evaluation systems where research velocity and product reliability must coexist. This role is for engineers interested in defining how quality is measured in generative AI systems.

What You'll Do

Build automated evaluation pipelines for multimodal AI models.
Benchmark diffusion models, vision systems, and generative workflows.
Validate model checkpoints and detect regressions across versions.
Develop evaluation metrics for realism, consistency, and performance.
Integrate evaluation tooling into CI/CD workflows.
Collaborate with ML researchers and infrastructure teams to ensure production readiness.
Analyze failure modes and propose evaluation strategies.

Core Areas & Tooling

Candidates should be familiar with or interested in:

LLM, VLM, or Stable Diffusion model evals
Image and video benchmarking techniques
Multimodal evaluation frameworks
Dataset-driven testing workflows
Research experiment validation pipelines

Qualifications

Degree in Computer Science, AI, Engineering, or comparable combination of education and practical experience.
Strong programming skills in Python.
Familiarity with object-oriented programming (C++, Java, Python, or similar).
Strong data structures and algorithms fundamentals.
Understanding of machine learning experimentation workflows.

Preferred Qualifications

Experience evaluating vision or generative models.
Familiarity with HuggingFace ecosystem or open-source ML toolkits.
Experience building automated test frameworks or benchmarking tools.
Knowledge of diffusion models or multimodal architectures.
Experience with data analysis tools (NumPy, Pandas, visualization libraries).

* Ladders Estimates




