AI Evaluation & Benchmarking Engineer IRC299413

GlobalLogic • $150K — $180K *

Minneapolis, MN 55407 • In-Person

Consumer Technology

Less than 5 years of experience

Today

Job Overview by Ladders

Qualifications

5-7 years of hands-on reinforcement learning experience.
Familiarity with using LLMs for evaluation and workflow automation.
Proficient in Python for ML applications and data analysis.
Expertise in designing experiments with statistical rigor.
Strong understanding of evaluation metrics and benchmarking techniques.
Experience analyzing performance using structured logs and outputs.
Ability to work with client infrastructure while adhering to security protocols.

Responsibilities

Develop and integrate reinforcement learning and baseline algorithms onto the evaluation platform.
Incorporate LLM-based agents for improved performance within game environments.
Execute benchmarks across various configurations and algorithm versions.
Define effective evaluation strategies for diverse algorithm approaches.
Extract and validate meaningful performance metrics from experimental results.
Build scoring frameworks and comparison metrics to summarize algorithm performance.
Act as a key user of the platform, providing insights on improvement areas.

Benefits

Opportunities for professional growth in a cutting-edge field.
Collaborative work environment promoting innovation.
Access to advanced AI technologies and resources.
Chance to contribute to impactful projects in gaming and AI evaluation.

Full Job Description

Description

We are looking for an AI Evaluation & Benchmarking Engineer with experience in reinforcement learning, LLM-based agents, experiment design, benchmarking, and performance evaluation. This role will support the productionization of an AI evaluation platform used to execute and evaluate algorithms within video game environments.

The engineer will develop and integrate baseline algorithms, reinforcement learning approaches, LLM-based agents, and externally developed algorithms into the platform. This person will also design experiments, define evaluation metrics, run benchmarks, analyze performance, and serve as a primary power user of the platform to provide feedback to the engineering team.

Ideal Candidate Profile

The ideal candidate is a hands-on AI evaluation engineer who can both build and use the platform. This person should be comfortable integrating algorithms, running experiments, defining metrics, analyzing results, and giving practical feedback to engineering teams. The role requires a blend of ML experimentation, LLM agent evaluation, Python engineering, and strong platform-user instincts.

Important Note

GlobalLogic estimates that the starting pay range for this role, performed in Minneapolis, MN, is $150K to $180K. This range reflects base salary only and does not include additional performance-linked variable compensation, benefits, or other items that may apply to the role. The pay range is provided as a good-faith estimate, and the amount offered may be higher or lower. GlobalLogic considers many factors in making an offer, including candidate qualifications, work experience, operational needs, travel and onsite requirements, internal peer equity, prevailing wage, responsibilities, and other market and business considerations.

Requirements

* Hands-on reinforcement learning experience.
* Experience using LLMs for agents, evaluation, reasoning, automation, or benchmark workflows.
* Strong Python experience for ML, data workflows, experimentation, and analysis.
* Experience designing and running experiments with statistical and analytical rigor.
* Strong understanding of evaluation metrics, scoring frameworks, performance comparison, and benchmark design.
* Experience analyzing structured logs, run outputs, model/agent performance, and experiment results.
* Ability to work across APIs, logs, CLI/tools, data structures, and platform workflows.
* Strong communication skills to translate experiment findings into platform improvement requirements.
* Ability to work inside client-owned repositories, infrastructure, workflows, and security controls.

Preferred Skills

* Experience with game environments, simulation environments, Gym-like interfaces, RL environments, or agentic AI test harnesses.
* Experience benchmarking LLM agents, RL policies, autonomous agents, or hybrid AI systems.
* Experience with experiment tracking, run comparison tools, metrics dashboards, or evaluation pipelines.
* Experience with prompt engineering, agent orchestration, tool use, and LLM evaluation frameworks.
* Experience with data visualization and performance analytics.
* Experience working with externally developed algorithms, reproducible experiments, and version-controlled evaluation workflows.

Job responsibilities

* Develop, adapt, and integrate reinforcement learning algorithms and baseline approaches into the shared evaluation platform.
* Integrate LLM-based agents and/or evaluators for solving, interacting with, and benchmarking game environments.
* Integrate external or off-the-shelf algorithms into the platform using defined execution and ingestion workflows.
* Design and run benchmark experiments across games, environments, configurations, agents, and algorithm versions.
* Define evaluation strategies for comparing RL, LLM-based, hybrid, and baseline approaches.
* Define, extract, and validate meaningful performance metrics from logs, outputs, run results, and environment interactions.
* Build comparison logic, scoring approaches, rankings, verdicts, and performance summaries.
* Develop analytics and visualizations to evaluate algorithm performance across runs and environments.
* Act as a primary power user of the platform, running experiments and identifying gaps in tooling, APIs, metrics, workflows, logs, and user experience.
* Provide structured feedback to Platform and Full Stack engineers to improve execution, logging, evaluation, and reporting capabilities.
* Validate existing game environments and support development or validation of new game environments.
* Evaluate environment operability using baseline/reference frontier LLM models, harnesses, and agents.
* Collaborate with client technical teams and engineering resources within 3M-owned repositories, workflows, infrastructure, and security processes.
* Ensure all algorithms, experiments, notebooks/scripts, configuration, documentation, and outputs comply with 3M-defined standards and policies.
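
As an illustration of the comparison and scoring work listed above, here is a minimal sketch of ranking agents from structured run logs with a bootstrap confidence interval on the mean score. The record fields (agent, env, seed, score), the agent names, and all numbers are hypothetical and are not taken from the actual platform; a real evaluation would likely use more runs and platform-specific metrics.

```python
import random
import statistics
from collections import defaultdict

# Hypothetical structured run logs: one record per benchmark run.
# Field names and values are illustrative assumptions only.
RUN_LOGS = [
    {"agent": "random_baseline", "env": "maze", "seed": 0, "score": 0.10},
    {"agent": "random_baseline", "env": "maze", "seed": 1, "score": 0.15},
    {"agent": "random_baseline", "env": "maze", "seed": 2, "score": 0.12},
    {"agent": "random_baseline", "env": "maze", "seed": 3, "score": 0.08},
    {"agent": "ppo", "env": "maze", "seed": 0, "score": 0.62},
    {"agent": "ppo", "env": "maze", "seed": 1, "score": 0.70},
    {"agent": "ppo", "env": "maze", "seed": 2, "score": 0.66},
    {"agent": "ppo", "env": "maze", "seed": 3, "score": 0.58},
    {"agent": "llm_agent", "env": "maze", "seed": 0, "score": 0.55},
    {"agent": "llm_agent", "env": "maze", "seed": 1, "score": 0.75},
    {"agent": "llm_agent", "env": "maze", "seed": 2, "score": 0.60},
    {"agent": "llm_agent", "env": "maze", "seed": 3, "score": 0.50},
]


def group_scores(logs):
    """Collect the per-run scores for each agent."""
    scores = defaultdict(list)
    for record in logs:
        scores[record["agent"]].append(record["score"])
    return dict(scores)


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the analysis is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def rank_agents(logs):
    """Return one summary row per agent, sorted by mean score (best first)."""
    rows = []
    for agent, values in group_scores(logs).items():
        rows.append({
            "agent": agent,
            "runs": len(values),
            "mean": statistics.fmean(values),
            "ci": bootstrap_ci(values),
        })
    return sorted(rows, key=lambda row: row["mean"], reverse=True)


if __name__ == "__main__":
    for row in rank_agents(RUN_LOGS):
        low, high = row["ci"]
        print(f'{row["agent"]:<16} runs={row["runs"]} '
              f'mean={row["mean"]:.3f} ci=[{low:.3f}, {high:.3f}]')
```

Overlapping intervals between two agents suggest that the observed difference may not be reliable at this number of runs, which is one reason to report an interval alongside each ranking rather than a single score.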

About GlobalLogic

GlobalLogic is a digital product engineering company that provides software development, design, and consulting services to businesses in various industries. The company was founded in 2000 and is headquartered in San Jose, California. GlobalLogic has over 20,000 employees in 14 countries and has worked with over 400 clients, including many Fortune 500 companies. The company's services include product engineering, digital transformation, and customer experience design. GlobalLogic is committed to delivering innovative solutions that help its clients stay ahead of the competition.

Size

20,000 employees

Industry

Information Technology

Founded

2000

* Ladders Estimates
