AI Eval / Testing (Eval Engineer)

NTT DATA Services • $100K — $130K *

Dallas, TX 75217 • In-Person

Information Technology

8 - 10 years of experience

Today


Job Overview by Ladders

Qualifications

10+ years of experience in AI evaluation and testing disciplines.
Strong proficiency in Python and software testing automation tools (e.g., Pytest).
Hands-on experience with AI evaluation tools such as LangSmith and DeepEval.
Familiarity with orchestration tools like LangChain or CrewAI.
Strong analytical and debugging skills with attention to detail.

Responsibilities

Build and maintain AI evaluation pipelines for testing AI system performance.
Implement observability features to track error propagation in pipelines.
Define and monitor AI quality metrics and performance KPIs.
Automate evaluation and regression testing processes for AI applications.
Collaborate with cross-functional teams to enhance user experiences with AI systems.

Benefits

Collaborative work environment with cross-functional teams.
Exposure to cutting-edge AI technologies and tools.
Opportunities for professional growth in a fast-paced setting.

Full Job Description

Req ID: 375677

We are currently seeking an AI Eval / Testing (Eval Engineer) to join our team in Dallas, Texas (US-TX), United States (US).

Job title: AI Eval / Testing (Eval Engineer)

Experience level: 10+ years

Job Summary
We are looking for an AI Evaluation & Test Engineer to ensure generative AI models and applications are safe, accurate, trustworthy, and deliver an elegant user experience.

The engineer validates AI models and agents for accuracy, safety, bias, and performance through structured testing, benchmarking, and continuous evaluation pipelines, and will be responsible for the following:

Build and maintain AI evaluation pipelines to test, measure, and evaluate the behavior and performance of AI systems.
Implement traces, spans, and session tracking for observability and identify error propagation in multi-step pipelines.
Define AI quality metrics and KPIs around factuality, faithfulness, toxicity, grounding precision/recall, latency, cost, etc., with clear acceptance bars.
Implement evaluation and testing automation to enable end-to-end system and regression testing at scale.
Define criteria for and implement release gates in the CI/CD pipeline.
Find creative ways to break products.
Assist in root cause analysis and troubleshooting of bugs and field issues.
Collaborate with cross-functional teammates from product, engineering, linguistics, and customer support to shape human-AI interaction paradigms and ensure that our AI models and applications deliver the desired outcome and user experience.

Platform & Enablement Roles

AI Platform Admin (M365, Copilot Studio): Manages AI platforms and environments, including access provisioning, governance controls, and policy enforcement (e.g., DLP, security, and compliance).
AI Reusable Utility: Develops reusable components (e.g., prompts, connectors, APIs, templates) to accelerate AI solution delivery and promote standardization across use cases.
AI Common Infrastructure, Framework & Observability Architect (AWS and Azure): Designs and maintains the foundational AI infrastructure, frameworks, and observability capabilities (telemetry, monitoring, metrics) required for scalable, reliable, and governed AI operations.

Core Responsibilities

Adversarial Testing (Red Teaming): Design prompts to manipulate agent behavior, stress-test edge cases, and expose security vulnerabilities (e.g., prompt injection or PII leakage) before deployment.
Pipeline Automation: Build and maintain automated regression testing, CI/CD release gates, and testing data sets (golden sets) to measure system drift.
Grader Development: Implement "LLM-as-a-judge" frameworks, rule-based checks, and human-in-the-loop scoring rubrics to objectively evaluate open-ended AI outputs.
Root Cause Analysis: Trace multi-turn conversations and agent tool interactions to diagnose when and why the AI chose the wrong path.
Metric Definition: Establish and monitor AI KPIs such as factual accuracy, latency, cost, and grounding precision.

Required Skills & Tech Stack

Programming: 5+ years of strong proficiency in Python and testing frameworks like pytest.
AI & LLM Frameworks: 5+ years of hands-on experience with evaluation tools like LangSmith, DeepEval, TruLens, or Promptfoo.
Orchestration Tools: 3 to 5 years of familiarity with agentic workflows built on LangChain, CrewAI, or LlamaIndex.
Observability: Understanding of tracing and session tracking to map how errors propagate in RAG systems.

5+ years of strong software testing fundamentals and expertise in writing test plans, executing test cases, and generating detailed reports and dashboards.
Strong analytical and debugging skills, and attention to detail.
5+ years of proficiency in Python, scripting, and software testing automation frameworks and tools such as Pytest, Selenium, Robot Framework, etc.
Working knowledge of generative AI models, AI agents, and related concepts such as retrieval augmented generation (RAG), prompt engineering, context engineering, explainability, traceability, observability, guard rails, reasoning, specificity, etc.
Sound understanding of the fundamental differences in the approach for testing conventional software versus evaluating generative AI systems.
Team player with excellent interpersonal skills and the ability to collaborate effectively with remote and cross-functional team members.
Go-getter attitude and ability to flourish in a fast-paced, startup environment.
Experience in any of the following would be a big plus:
AI evaluation frameworks such as Arize, Braintrust, DeepEval, LangSmith, and Ragas.
AI safety and red teaming experience, e.g., prompt injection, jailbreak, adversarial and stress testing.
Different types of AI evaluation methods, e.g., human-in-the-loop and LLM-as-a-judge.

Typical Qualifications

Experience: Usually 2-4+ years of hands-on experience as an ML Engineer, AI Engineer, or a specialized QA/Testing Engineer focusing on machine learning.
Education: Degree in Computer Science, Data Science, Linguistics, or a closely related technical field.

Common expectations for all roles:

Compliance with Client's responsible AI principles and Acceptable Use policy

Adherence to data residency, privacy (GDPR, HIPAA where applicable), and 21 CFR Part 11 controls where in scope
Third-party risk assessment and SOC 2 Type II (or equivalent) certification
Disclosure of subcontractors and offshore delivery locations
Disclosure of model providers, training data practices, and any use of client data for model improvement (opt-out required)


About NTT DATA Services

NTT DATA Corporation is a Japanese multinational information technology service and consulting company headquartered in Tokyo, Japan. It is a partially owned subsidiary of Nippon Telegraph and Telephone (NTT). Japan Telegraph and Telephone Public Corporation, a predecessor of NTT, started its Data Communications business in 1967. Following its privatization in 1985, NTT spun off the Data Communications division as NTT DATA in 1988, which has since become the largest IT services company headquartered in Japan.


Size

151,991 employees

Industry

Technical Services

Founded

1988

NASDAQ

NTTDF

* Ladders Estimates
