Senior / Staff Software AI Test Engineer, AI Engineering

TWG Global AI

$190K to $250K *

New York, NY 10025 (In-Person)

Enterprise Technology

Less than 5 years of experience

2 weeks ago


Job Overview by Ladders

Qualifications

3-7 years in software engineering with focus on test automation
Expert in Python with design and library creation experience
Hands-on experience in Java for testing services
Familiarity with LangGraph or Vercel frameworks
Experience building eval sets for LLMs
Testing experience on iOS, plugins, and Chrome extensions
Proficient in automated test suite creation with frameworks like pytest and Selenium
Experience with CI/CD integrations like GitHub Actions or Jenkins
Strong data manipulation skills including SQL

Responsibilities

Design scalable test automation frameworks for AI and LLM applications
Write maintainable Python for test harnesses and internal tools
Build evaluation infrastructure to benchmark agent performance
Run performance and load tests for streaming and async systems
Integrate tests into the CI/CD pipeline for model validation
Implement human-in-the-loop workflows for quality validation
Champion quality engineering practices across the team

Benefits

On-site position in Santa Monica, CA
Bonus in addition to base pay
Comprehensive medical benefits
Financial benefits and support
Additional undisclosed perks

Full Job Description

The Role

TWG Global is seeking a Senior or Staff AI Software Engineer in Test to join our AI Engineering team building commercial-grade AI products. This is a software engineering role focused on test automation. You won't just write test cases; you'll design and build the frameworks, harnesses, evaluation infrastructure, and tooling that make testing AI agents and LLM-powered applications possible at scale.

Our agents are written in LangGraph and run on Azure on the TWG side, with a parallel Vercel-based stack on the Palantir side. You'll write eval sets against both, and you'll validate the surfaces our users actually touch: iOS apps, plugins, and Chrome extensions, not just the model layer.

You'll work shoulder-to-shoulder with AI engineers and data scientists, contributing production-quality code to shared repositories. The ideal candidate is a strong coder, fluent in Python and Java, who has shipped automated test infrastructure in a production environment and has hands-on experience evaluating LLM and agentic systems.

Key Responsibilities

Framework and harness engineering

Design and build scalable, reusable test automation frameworks for AI agents, LLM-powered applications, and underlying APIs.
Write clean, maintainable Python for test harnesses, eval pipelines, synthetic data generation utilities, and internal tooling.
Treat test code as production code: code review, type hints, documentation, library design.

Evaluation infrastructure

Build evaluation infrastructure for benchmarking agent performance against SOTA LLMs, competitors, and internal baselines.
Own regression suites, golden datasets, rubric-based evals, and metric dashboards.
Build tooling for synthetic test data generation, edge-case discovery, and adversarial testing.
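
As a rough illustration of what a rubric-based regression check over a golden dataset might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: `GoldenCase`, `score_case`, `run_regression`, the phrase-matching rubric, and the 0.8 threshold are hypothetical and do not describe TWG's actual tooling.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    """One entry in a hypothetical golden dataset."""
    prompt: str
    required_phrases: tuple[str, ...]         # rubric: must appear in the answer
    forbidden_phrases: tuple[str, ...] = ()   # rubric: must not appear


def score_case(case: GoldenCase, answer: str) -> float:
    """Return the fraction of rubric checks the answer passes (0.0 to 1.0)."""
    text = answer.lower()
    checks = [p.lower() in text for p in case.required_phrases]
    checks += [p.lower() not in text for p in case.forbidden_phrases]
    return sum(checks) / len(checks) if checks else 1.0


def run_regression(
    cases: Iterable[GoldenCase],
    agent: Callable[[str], str],
    threshold: float = 0.8,  # assumed pass mark, not a standard value
) -> list[str]:
    """Return the prompts whose rubric score falls below the threshold."""
    return [c.prompt for c in cases if score_case(c, agent(c.prompt)) < threshold]


# Stand-in for a real agent call; a real harness would invoke the deployed agent.
def fake_agent(prompt: str) -> str:
    return "Paris is the capital of France." if "France" in prompt else "I am not sure."


cases = [
    GoldenCase("Capital of France?", ("Paris",)),
    GoldenCase("Capital of Peru?", ("Lima",), ("not sure",)),
]
failures = run_regression(cases, fake_agent)
```

In practice the phrase checks would likely be replaced by richer graders (structured comparisons or model-graded rubrics), but the shape stays the same: fixed cases, a scoring function, and a threshold that fails the build.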

Resilience and load

Design and run release, system, performance, and load tests against streaming, stateful, and async systems.
Build chaos and fault injection tooling for token expiry, connection pool exhaustion, provider failover, and cache pressure scenarios.
Drive contract testing across LLM providers (Bedrock, Anthropic, OpenAI) to catch parity drift.

CI/CD and observability

Integrate automated tests into CI/CD so every model, prompt, and code change is validated before it ships.
Build trace-based assertions on LangGraph state, tool calls, and agent decisions: debugging an agent failure means replaying graph state, not re-running a prompt.
Make observability a first-class testing surface (LangSmith, audit logs).
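
To make the idea of trace-based assertions concrete, the sketch below checks a recorded agent run instead of re-running a prompt. It deliberately does not use the LangGraph or LangSmith APIs: the `TraceStep` record, the helper names, and the sample trace are all hypothetical, assuming only that each executed node can be exported with its tool calls and a state snapshot.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Callable, Sequence


@dataclass(frozen=True)
class TraceStep:
    """Hypothetical record of one executed graph node."""
    node: str
    tool_calls: tuple[str, ...] = ()
    state: dict[str, Any] = field(default_factory=dict)  # state after the node ran


def tool_sequence(trace: Sequence[TraceStep]) -> list[str]:
    """Flatten the tool calls of a trace in execution order."""
    return [tool for step in trace for tool in step.tool_calls]


def assert_tool_order(trace: Sequence[TraceStep], first: str, then: str) -> None:
    """Assert that `first` was called before the first call to `then`."""
    calls = tool_sequence(trace)
    assert first in calls, f"{first!r} was never called"
    assert then in calls, f"{then!r} was never called"
    assert calls.index(first) < calls.index(then), f"{then!r} ran before {first!r}"


def first_step_where(
    trace: Sequence[TraceStep],
    predicate: Callable[[dict[str, Any]], bool],
) -> TraceStep | None:
    """Replay state snapshots and return the first step where predicate holds."""
    for step in trace:
        if predicate(step.state):
            return step
    return None


# Sample recorded run (invented data for illustration only).
trace = [
    TraceStep("plan", (), {"budget": 100}),
    TraceStep("search", ("web_search",), {"budget": 100}),
    TraceStep("book", ("book_flight",), {"budget": -20}),
]

assert_tool_order(trace, "web_search", "book_flight")
bad_step = first_step_where(trace, lambda s: s["budget"] < 0)
```

Here `bad_step` points at the node where the state first violated an invariant, which is the kind of localization that replaying graph state gives you and re-running a prompt does not.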

Human-in-the-loop and partnership

Implement human-in-the-loop (HIL) review workflows where automation alone cannot validate quality, then push the automation boundary outward.
Partner with AI engineers and data scientists on model evaluation, training and eval data prep, and root-cause debugging of complex end-to-end failures.
Champion quality engineering practices across the team: code review, coverage standards, observability, reproducibility.
Ensure user-centric validation so AI outputs are accurate, reliable, and meet real-world application needs.

Requirements

3-7 years of software engineering experience, with a meaningful portion focused on test automation, SDET, or software engineering in test roles.
Expert-level Python. You write Python every day, design libraries other engineers use, and apply OOP and clean-code practices.
Hands-on Java experience, enough to read, write, and test Java services, not just touch them.
Working understanding of the LangGraph or Vercel frameworks: graph state, nodes, edges, tool calls, and how to write evals against agentic flows.
Demonstrated experience building eval sets for LLMs (this is critical to the role).
Experience testing across multiple client surfaces: iOS apps, plugins, and Chrome extensions.
Hands-on experience building automated test suites with frameworks such as pytest, Selenium, Playwright, Cypress, or similar.
Proven experience integrating test automation into CI/CD systems (GitHub Actions, Jenkins, CircleCI, GitLab CI, or similar).
Strong skills in data manipulation, test data preparation, and SQL.
Bachelor's degree or higher in Computer Science, Engineering, or a related field.

Strongly preferred:

Experience with Azure (our primary cloud) and containerization (Docker).
Experience testing RAG pipelines, agentic workflows, or multi-step tool-calling systems.

Benefits

Position Location:

This position is located in Santa Monica, CA (on-site).

Compensation:

The base pay for this position is $190,000-$250,000. A bonus will be provided as part of the compensation package, in addition to a full range of medical, financial, and/or other benefits.

* Ladders Estimates
