EPAM Systems

Lead AI Engineer (Production Agentic & RAG Systems)

$130K - $180K *
US-Anywhere
+ 2 other locations
Remote
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 6+ years in software engineering, with 3+ years shipping production LLM / agentic systems
  • 1+ years of experience leading engineers or technical workstreams
  • Expert-level proficiency in Python and FastAPI (async, REST, SSE)
  • Deep production expertise in LangChain and LangGraph or equivalent stacks
  • Strong background in production RAG, embeddings, chunking, and hybrid retrieval with caching
  • Advanced skills in vector databases like Pinecone, Weaviate, or OpenSearch
  • Solid command of observability tools (MLflow, OpenTelemetry) and CI/CD for AI systems

Responsibilities

  • Own the end-to-end architecture of GenAI platforms across services and teams
  • Lead the design of agent orchestration in LangGraph / LangChain
  • Architect production RAG: chunking, embeddings, vector stores, and hybrid retrieval
  • Drive the design and delivery of Python / FastAPI services
  • Define the observability strategy across the platform
  • Own the deployment platform on Docker + Kubernetes with CI/CD
  • Establish GenAI safety & governance practices

Benefits

  • Opportunity to lead engineering efforts in innovative AI technology
  • Collaborative environment with mentorship and career growth
  • Exposure to cutting-edge frameworks and tools
  • Involvement in strategic decisions impacting company direction
  • Engagement with cross-functional teams, including product and executive stakeholders

Full Job Description

We are looking for a seasoned Lead AI Engineer who architects, builds, and operates production GenAI platforms - agentic workflows, RAG pipelines, and LLM-backed services with real users and real SLAs - while leading engineers and setting the technical direction across multiple workstreams. This is an engineering leadership role, not a research role. The bar is reliability, latency, cost, observability, and safe deployment at scale, with end-to-end ownership from architecture through on-call, and accountability for the technical quality and delivery of the team. Typical workloads include enterprise knowledge platforms, conversational analytics, agentic automation, and LLM-augmented data products.

Responsibilities

  • Own the end-to-end architecture of GenAI platforms across multiple services and teams, defining standards, patterns, and reference implementations
  • Lead the design of agent orchestration (graph/state, conditional routing, tool calling, memory, checkpointing) in LangGraph / LangChain or equivalent, and set best practices for the team
  • Architect production RAG end-to-end: chunking, embeddings, vector stores, hybrid retrieval, reranking, caching, and grounded synthesis - and mentor engineers in building it
  • Drive the design and delivery of Python / FastAPI services - async, SSE streaming, session handling, and structured error contracts - establishing service templates and conventions
  • Define the observability and evaluation strategy (MLflow, OpenTelemetry, or equivalent) for accuracy, cost, and regression across the platform
  • Own the deployment platform on Docker + Kubernetes (EKS/AKS/GKE) with CI/CD, test, eval, and canary gates - setting release standards for AI systems
  • Lead LLM cost engineering strategy - model routing, prompt optimization, caching, token accounting, and build-vs-buy decisions at portfolio level
  • Establish GenAI safety & governance practices: hallucination control, prompt-injection defense, PII handling, and HITL where required
  • Partner with data engineering leadership on semantic layers and pipelines (PySpark / SQL where applicable), and align roadmaps across teams
  • Mentor and grow senior and mid-level engineers through design reviews, pairing, and technical coaching; conduct hiring and technical interviews
  • Represent engineering in conversations with clients, product, and executive stakeholders; translate business goals into technical strategy and delivery plans

Requirements

  • 6+ years in software engineering, with 3+ years shipping production LLM / agentic systems (not POCs or research)
  • 1+ years of experience leading engineers or technical workstreams
  • Proven track record of owning architecture for multi-service GenAI or distributed systems in production
  • Expert-level proficiency in Python and FastAPI (async, REST, SSE)
  • Deep production expertise in LangChain and LangGraph (or equivalent serious production experience with LlamaIndex, AutoGen, or MCP stacks)
  • Strong background in production RAG: embeddings, chunking, and hybrid retrieval with reranking and caching - with the ability to define standards across teams
  • Advanced skills in vector databases such as Pinecone, Weaviate, pgvector, OpenSearch, or Databricks Vector Search
  • Hands-on production experience with at least one major LLM provider - AWS Bedrock (preferred), OpenAI / Azure OpenAI, or Anthropic - including model selection, routing trade-offs, and multi-provider strategy
  • Strong competency in Kubernetes and Docker in real production environments (EKS/AKS/GKE), including platform-level decisions
  • Deep expertise in cloud engineering on AWS, including cost, security, and scalability trade-offs
  • Solid command of observability and tracing tools (MLflow, LangSmith, OpenTelemetry), evaluation harnesses, and latency/cost ownership at platform scale
  • Experience designing and owning CI/CD for AI systems (GitHub Actions, Jenkins, or equivalent) with test/eval gates
  • Demonstrated experience mentoring engineers, leading design reviews, and driving technical decisions across teams
  • Strong written and spoken English (B2+ level); able to lead design discussions, present to senior stakeholders, and influence technical direction with clients and executives

Nice to have

  • Databricks depth - MLflow (tracking & serving), Vector Search, Unity Catalog / Metric Views, PySpark / SQL
  • Experience with LLM fine-tuning - PEFT, LoRA, QLoRA - and the ability to guide build-vs-fine-tune-vs-prompt decisions
  • Strong understanding of MCP servers and tool integration patterns
  • Expertise in GenAI governance & FinOps - auditability, prompt-injection hardening, PII, and token cost in regulated environments
  • Background in classical ML / DL - NLP, BERT-family, time-series, and CV

About EPAM Systems

EPAM Systems, Inc. is a leading global provider of digital platform engineering and development services. The company has a strong presence in North America, Europe, and Asia, and serves clients in a variety of industries, including financial services, healthcare, and retail. EPAM's services include software engineering, product development, and digital platform engineering, and the company has a reputation for delivering high-quality solutions that help its clients achieve their business goals. EPAM has been recognized as a leader in the digital services industry by a number of independent research firms, and the company has won numerous awards for its work.
Size: 58,824 employees
Market Cap: $18.2 billion
Industry
Net Income: $327.1 million
Founded: 1993
5 Year Trend: +26.5%
Revenue: $2.6 billion
NASDAQ
