Senior AI Engineer (Production Agentic & RAG Systems)

EPAM Systems • $130K — $180K *

US-Anywhere

+ 2 other locations

Remote

Information Technology

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

5+ years in software engineering; 2+ years in production LLM/agentic systems.
Proficient in Python and FastAPI (async, REST, SSE).
Production expertise in LangChain/LangGraph or equivalent.
Experience with RAG: embeddings, chunking, hybrid retrieval.
Familiarity with Kubernetes and Docker in production environments.
Knowledge of major LLM providers in production, preferably AWS Bedrock.
Strong English communication skills; able to lead design discussions.

Responsibilities

Design agent orchestration in LangGraph / LangChain or similar.
Build end-to-end production RAG: chunking, embeddings, vector stores.
Own development of Python/FastAPI services with streaming and error contracts.
Instrument services with tracing and evaluation harnesses for accuracy, cost, and regression.
Ship applications using Docker + Kubernetes through CI/CD pipelines.
Drive LLM cost engineering: model routing and prompt optimization.
Collaborate with data engineering on semantic layers and pipelines.

Benefits

Work in a hands-on engineering role with real user impact.
Contribute to cutting-edge AI and GenAI technologies.
Gain experience in deploying scalable production systems.
Opportunity for end-to-end ownership of complex projects.

Full Job Description

We are seeking a hands-on Senior AI Engineer who designs, builds, and operates production GenAI systems - agentic workflows, RAG pipelines, and LLM-backed services with real users and real SLAs. This is an engineering role, not a research role. The bar is reliability, latency, cost, observability, and safe deployment at scale, with end-to-end ownership from architecture through on-call. Typical workloads include enterprise knowledge platforms, conversational analytics, agentic automation, and LLM-augmented data products.

Responsibilities

Design agent orchestration (graph/state, conditional routing, tool calling, memory, checkpointing) in LangGraph / LangChain or equivalent
Build production RAG end-to-end: chunking, embeddings, vector stores, hybrid retrieval, reranking, caching, and grounded synthesis
Own Python / FastAPI services - async, SSE streaming, session handling, and structured error contracts
Instrument with tracing and evaluation harnesses (MLflow, OpenTelemetry, or equivalent) for accuracy, cost, and regression
Ship on Docker + Kubernetes (EKS/AKS/GKE) via CI/CD with test, eval, and canary gates
Drive LLM cost engineering - model routing, prompt optimization, caching, token accounting, and build-vs-buy decisions
Apply GenAI safety & governance: hallucination control, prompt-injection defense, PII handling, and HITL where required
Partner with data engineering on semantic layers and pipelines (PySpark / SQL where applicable)

Requirements

5+ years in software engineering, with 2+ years shipping production LLM / agentic systems (not POCs or research)
Proficiency in Python and FastAPI (async, REST, SSE)
Production expertise in LangChain and LangGraph (or equivalent serious production experience with LlamaIndex, AutoGen, or MCP stacks)
Background in production RAG: embeddings, chunking, and hybrid retrieval with reranking and caching
Skills in vector databases such as Pinecone, Weaviate, pgvector, OpenSearch, or Databricks Vector Search
Knowledge of at least one major LLM provider in production - AWS Bedrock (preferred), OpenAI / Azure OpenAI, or Anthropic - with model selection and routing trade-offs
Competency in Kubernetes and Docker in real production environments (EKS/AKS/GKE)
Expertise in cloud engineering on AWS
Familiarity with observability and tracing tools (MLflow, LangSmith, OpenTelemetry), evaluation harnesses, and latency/cost ownership
Capability to build CI/CD for AI systems (GitHub Actions, Jenkins, or equivalent) with test/eval gates
Strong written and spoken English (B2 level); able to own design discussions with engineering and business stakeholders independently

Nice to have

Databricks depth - MLflow (tracking & serving), Vector Search, Unity Catalog / Metric Views, PySpark / SQL
Experience with LLM fine-tuning - PEFT, LoRA, QLoRA
Understanding of MCP servers and tool integration
Qualifications in GenAI governance & FinOps - auditability, prompt-injection hardening, PII, and token cost in regulated environments
Background in classical ML / DL - NLP, BERT-family, time-series, and CV

About EPAM Systems

EPAM Systems, Inc. is a leading global provider of digital platform engineering and development services. The company has a strong presence in North America, Europe, and Asia, and serves clients in a variety of industries, including financial services, healthcare, and retail. EPAM's services include software engineering, product development, and digital platform engineering, and the company has a reputation for delivering high-quality solutions that help its clients achieve their business goals. EPAM has been recognized as a leader in the digital services industry by a number of independent research firms, and the company has won numerous awards for its work.

Size

58,824 employees

Market Cap

$18.2 billion

Industry

Information Technology

Net Income

$327.1 million

Founded

1993

5 Year Trend

+26.5%

Revenue

$2.6 billion

NASDAQ

EPAM

* Ladders Estimates
