Lead AI Engineer (Production Agentic & RAG Systems)

EPAM Systems • $130K — $180K *

US-Anywhere

+ 2 other locations

Remote

Information Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

6+ years in software engineering, with 3+ years shipping production LLM / agentic systems
1+ years of experience leading engineers or technical workstreams
Expert-level proficiency in Python and FastAPI (async, REST, SSE)
Deep production expertise in LangChain and LangGraph or equivalent stacks
Strong background in production RAG, embeddings, chunking, and hybrid retrieval with caching
Advanced skills in vector databases like Pinecone, Weaviate, or OpenSearch
Solid command of observability tools (MLflow, OpenTelemetry) and CI/CD for AI systems

Responsibilities

Own the end-to-end architecture of GenAI platforms across services and teams
Lead the design of agent orchestration in LangGraph / LangChain
Architect production RAG: chunking, embeddings, vector stores, and hybrid retrieval
Drive the design and delivery of Python / FastAPI services
Define the observability strategy across the platform
Own the deployment platform on Docker + Kubernetes with CI/CD
Establish GenAI safety & governance practices

Benefits

Opportunity to lead engineering efforts in innovative AI technology
Collaborative environment with mentorship and career growth
Exposure to cutting-edge frameworks and tools
Involvement in strategic decisions impacting company direction
Engagement with cross-functional teams, including product and executive stakeholders

Full Job Description

We are looking for a seasoned Lead AI Engineer who architects, builds, and operates production GenAI platforms - agentic workflows, RAG pipelines, and LLM-backed services with real users and real SLAs - while leading engineers and setting the technical direction across multiple workstreams. This is an engineering leadership role, not a research role. The bar is reliability, latency, cost, observability, and safe deployment at scale, with end-to-end ownership from architecture through on-call, and accountability for the technical quality and delivery of the team. Typical workloads include enterprise knowledge platforms, conversational analytics, agentic automation, and LLM-augmented data products.

Responsibilities

Own the end-to-end architecture of GenAI platforms across multiple services and teams, defining standards, patterns, and reference implementations
Lead the design of agent orchestration (graph/state, conditional routing, tool calling, memory, checkpointing) in LangGraph / LangChain or equivalent, and set best practices for the team
Architect production RAG end-to-end: chunking, embeddings, vector stores, hybrid retrieval, reranking, caching, and grounded synthesis - and mentor engineers in building it
Drive the design and delivery of Python / FastAPI services - async, SSE streaming, session handling, and structured error contracts - establishing service templates and conventions
Define the observability and evaluation strategy (MLflow, OpenTelemetry, or equivalent) for accuracy, cost, and regression across the platform
Own the deployment platform on Docker + Kubernetes (EKS/AKS/GKE) with CI/CD, test, eval, and canary gates - setting release standards for AI systems
Lead LLM cost engineering strategy - model routing, prompt optimization, caching, token accounting, and build-vs-buy decisions at portfolio level
Establish GenAI safety & governance practices: hallucination control, prompt-injection defense, PII handling, and HITL where required
Partner with data engineering leadership on semantic layers and pipelines (PySpark / SQL where applicable), and align roadmaps across teams
Mentor and grow senior and mid-level engineers through design reviews, pairing, and technical coaching; conduct hiring and technical interviews
Represent engineering in conversations with clients, product, and executive stakeholders; translate business goals into technical strategy and delivery plans

Requirements

6+ years in software engineering, with 3+ years shipping production LLM / agentic systems (not POCs or research)
1+ years of experience leading engineers or technical workstreams
Proven track record of owning architecture for multi-service GenAI or distributed systems in production
Expert-level proficiency in Python and FastAPI (async, REST, SSE)
Deep production expertise in LangChain and LangGraph (or equivalent serious production experience with LlamaIndex, AutoGen, or MCP stacks)
Strong background in production RAG: embeddings, chunking, and hybrid retrieval with reranking and caching - with the ability to define standards across teams
Advanced skills in vector databases such as Pinecone, Weaviate, pgvector, OpenSearch, or Databricks Vector Search
Hands-on production experience with at least one major LLM provider - AWS Bedrock (preferred), OpenAI / Azure OpenAI, or Anthropic - including model selection, routing trade-offs, and multi-provider strategy
Strong competency in Kubernetes and Docker in real production environments (EKS/AKS/GKE), including platform-level decisions
Deep expertise in cloud engineering on AWS, including cost, security, and scalability trade-offs
Solid command of observability and tracing tools (MLflow, LangSmith, OpenTelemetry), evaluation harnesses, and latency/cost ownership at platform scale
Experience designing and owning CI/CD for AI systems (GitHub Actions, Jenkins, or equivalent) with test/eval gates
Demonstrated experience mentoring engineers, leading design reviews, and driving technical decisions across teams
Strong written and spoken English (B2+ level); able to lead design discussions, present to senior stakeholders, and influence technical direction with clients and executives

Nice to have

Databricks depth - MLflow (tracking & serving), Vector Search, Unity Catalog / Metric Views, PySpark / SQL
Experience with LLM fine-tuning - PEFT, LoRA, QLoRA - and the ability to guide build-vs-fine-tune-vs-prompt decisions
Strong understanding of MCP servers and tool integration patterns
Expertise in GenAI governance & FinOps - auditability, prompt-injection hardening, PII, and token cost in regulated environments
Background in classical ML / DL - NLP, BERT-family, time-series, and CV

About EPAM Systems

EPAM Systems, Inc. is a leading global provider of digital platform engineering and development services. The company has a strong presence in North America, Europe, and Asia, and serves clients in a variety of industries, including financial services, healthcare, and retail. EPAM's services include software engineering, product development, and digital platform engineering, and the company has a reputation for delivering high-quality solutions that help its clients achieve their business goals. EPAM has been recognized as a leader in the digital services industry by a number of independent research firms, and the company has won numerous awards for its work.

Size: 58,824 employees
Market Cap: $18.2 billion
Industry: Information Technology
Net Income: $327.1 million
Founded: 1993
5 Year Trend: +26.5%
Revenue: $2.6 billion
NASDAQ: EPAM

* Ladders Estimates
