GlobalLogic

AI Evaluation & Benchmarking Engineer IRC299413

$150K to $180K (estimated)
Consumer Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Hands-on reinforcement learning experience.
  • Familiarity with using LLMs for evaluation and workflow automation.
  • Proficient in Python for ML applications and data analysis.
  • Expertise in designing experiments with statistical rigor.
  • Strong understanding of evaluation metrics and benchmarking techniques.
  • Experience analyzing performance using structured logs and outputs.
  • Ability to work with client infrastructure while adhering to security protocols.

Responsibilities

  • Develop and integrate reinforcement learning and baseline algorithms into the evaluation platform.
  • Integrate LLM-based agents and evaluators that solve, interact with, and benchmark game environments.
  • Execute benchmarks across various configurations and algorithm versions.
  • Define effective evaluation strategies for diverse algorithm approaches.
  • Extract and validate meaningful performance metrics from experimental results.
  • Build scoring frameworks and comparison metrics to summarize algorithm performance.
  • Act as a key user of the platform, providing insights on improvement areas.

Benefits

  • Opportunities for professional growth in a cutting-edge field.
  • Collaborative work environment promoting innovation.
  • Access to advanced AI technologies and resources.
  • Chance to contribute to impactful projects in gaming and AI evaluation.

Full Job Description

We are looking for an AI Evaluation & Benchmarking Engineer with experience in reinforcement learning, LLM-based agents, experiment design, benchmarking, and performance evaluation. This role will support the productionization of an AI evaluation platform used to execute and evaluate algorithms within video game environments.

The engineer will develop and integrate baseline algorithms, reinforcement learning approaches, LLM-based agents, and externally developed algorithms into the platform. This person will also design experiments, define evaluation metrics, run benchmarks, analyze performance, and serve as a primary power user of the platform to provide feedback to the engineering team.

Ideal Candidate Profile

The ideal candidate is a hands-on AI evaluation engineer who can both build and use the platform. This person should be comfortable integrating algorithms, running experiments, defining metrics, analyzing results, and giving practical feedback to engineering teams. The role requires a blend of ML experimentation, LLM agent evaluation, Python engineering, and strong platform-user instincts.

Important Note

GlobalLogic estimates the starting pay range for this role, to be performed in Minneapolis, MN, at $150K to $180K. This range reflects base salary only and does not include additional performance-linked variable compensation, benefits, or other elements that may apply to the role. The pay range is provided as a good-faith estimate, and the amount offered may be higher or lower. GlobalLogic takes many factors into consideration in making an offer, including candidate qualifications, work experience, operational needs, travel and onsite requirements, internal peer equity, prevailing wage, responsibilities, and other market and business considerations.

Requirements

* Hands-on reinforcement learning experience.
* Experience using LLMs for agents, evaluation, reasoning, automation, or benchmark workflows.
* Strong Python experience for ML, data workflows, experimentation, and analysis.
* Experience designing and running experiments with statistical and analytical rigor.
* Strong understanding of evaluation metrics, scoring frameworks, performance comparison, and benchmark design.
* Experience analyzing structured logs, run outputs, model/agent performance, and experiment results.
* Ability to work across APIs, logs, CLI/tools, data structures, and platform workflows.
* Strong communication skills to translate experiment findings into platform improvement requirements.
* Ability to work inside client-owned repositories, infrastructure, workflows, and security controls.

Preferred Skills

* Experience with game environments, simulation environments, Gym-like interfaces, RL environments, or agentic AI test harnesses.
* Experience benchmarking LLM agents, RL policies, autonomous agents, or hybrid AI systems.
* Experience with experiment tracking, run comparison tools, metrics dashboards, or evaluation pipelines.
* Experience with prompt engineering, agent orchestration, tool use, and LLM evaluation frameworks.
* Experience with data visualization and performance analytics.
* Experience working with externally developed algorithms, reproducible experiments, and version-controlled evaluation workflows.

Job Responsibilities

* Develop, adapt, and integrate reinforcement learning algorithms and baseline approaches into the shared evaluation platform.
* Integrate LLM-based agents and/or evaluators for solving, interacting with, and benchmarking game environments.
* Integrate external or off-the-shelf algorithms into the platform using defined execution and ingestion workflows.
* Design and run benchmark experiments across games, environments, configurations, agents, and algorithm versions.
* Define evaluation strategies for comparing RL, LLM-based, hybrid, and baseline approaches.
* Define, extract, and validate meaningful performance metrics from logs, outputs, run results, and environment interactions.
* Build comparison logic, scoring approaches, rankings, verdicts, and performance summaries.
* Develop analytics and visualizations to evaluate algorithm performance across runs and environments.
* Act as a primary power user of the platform, running experiments and identifying gaps in tooling, APIs, metrics, workflows, logs, and user experience.
* Provide structured feedback to Platform and Full Stack engineers to improve execution, logging, evaluation, and reporting capabilities.
* Validate existing game environments and support development or validation of new game environments.
* Evaluate environment operability using baseline or reference frontier LLMs, harnesses, and agents.
* Collaborate with client technical teams and engineering resources within 3M-owned repositories, workflows, infrastructure, and security processes.
* Ensure all algorithms, experiments, notebooks/scripts, configuration, documentation, and outputs comply with 3M-defined standards and policies.

About GlobalLogic

GlobalLogic is a digital product engineering company that provides software development, design, and consulting services to businesses in various industries. The company was founded in 2000 and is headquartered in San Jose, California. GlobalLogic has over 20,000 employees in 14 countries and has worked with over 400 clients, including many Fortune 500 companies. The company's services include product engineering, digital transformation, and customer experience design. GlobalLogic is committed to delivering innovative solutions that help its clients stay ahead of the competition.