EPAM Systems

Generative AI Operations Engineer (GenAI Ops)

$120K to $150K *
US-Anywhere
+ 2 other locations
Remote
Enterprise Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Minimum 3 years of experience in DevOps or Site Reliability Engineering (SRE)
  • At least 1 year of MLOps experience focused on cloud infrastructure
  • Expertise in AWS, Google Cloud, or Azure platforms
  • Proficient in Python or Bash and containerization/orchestration tools like Docker and Kubernetes
  • Experienced in CI/CD pipeline development with tools such as Jenkins or GitLab CI
  • Familiarity with cloud-native GenAI platforms and LLM architectures
  • B2+ level of English proficiency

Responsibilities

  • Design, implement, and maintain CI/CD pipelines for Large Language Models (LLMs)
  • Build and manage agentic AI systems, facilitating agent collaboration and workflow orchestration
  • Integrate AI agents with external tools and APIs using standards like Model Context Protocol (MCP)
  • Leverage AI development tools to enhance software delivery and infrastructure management
  • Define and manage cloud infrastructure for GenAI workloads with IaC tools like Terraform
  • Implement monitoring and observability solutions for system health using tools like Prometheus or Grafana
  • Ensure AI security and compliance with governance standards

Benefits

  • Dynamic work environment encouraging innovation
  • Opportunity to work with cutting-edge AI technologies
  • Career growth potential in a rapidly evolving field
  • Flexible work policies and remote work options
  • Engagement with a collaborative and expert team

Full Job Description

We are seeking a highly skilled Generative AI Operations Engineer (GenAI Ops) to join our cutting-edge AI team. The ideal candidate will have strong expertise in operationalizing large-scale generative AI systems, building CI/CD pipelines, and managing AI agent infrastructures across cloud environments. You will play a key role in ensuring the scalability, security, and performance of multi-agent AI systems and generative applications.

Responsibilities

  • Design, implement, and maintain automated CI/CD pipelines for the development, training, and deployment of Large Language Models (LLMs) and AI agents
  • Build and manage agentic AI systems, ensuring efficient agent-to-agent collaboration and orchestration of complex workflows
  • Integrate AI agents with external tools and APIs using modern standards such as the Model Context Protocol (MCP)
  • Leverage AI-powered development tools to streamline software delivery, infrastructure management, and troubleshooting processes
  • Define and manage cloud infrastructure for GenAI workloads using Infrastructure as Code (IaC) tools such as Terraform, AWS CDK, or CloudFormation
  • Implement monitoring and observability solutions for models, agents, and system health using tools like Prometheus, Grafana, or Datadog
  • Optimize scalability, performance, and cost-efficiency of GenAI services in production environments
  • Enforce AI security, safety, and governance practices, ensuring compliance with organizational and industry standards

Requirements

  • Minimum 3 years of experience in DevOps or Site Reliability Engineering (SRE)
  • Minimum 1 year of experience in MLOps roles with a strong focus on cloud infrastructure
  • Proven experience with AWS, Google Cloud, or Azure
  • Proficiency in Python or Bash, and experience with containerization/orchestration tools such as Docker and Kubernetes
  • Strong background in building and maintaining CI/CD pipelines using Jenkins, GitLab CI, or similar tools
  • Experience with cloud-native GenAI platforms (e.g., AWS Bedrock, Azure AI Foundry, Google Vertex AI)
  • Familiarity with LLM architectures and the challenges of deploying large-scale models
  • Experience designing or managing multi-agent systems and orchestrated AI workflows
  • Hands-on experience implementing infrastructure using IaC frameworks
  • B2+ level of English proficiency

Nice to have

  • Master's or PhD in Computer Science, AI, or related field
  • Relevant cloud or DevOps certifications (e.g., AWS Certified DevOps Engineer, Google Cloud Professional DevOps Engineer)
  • Strong problem-solving mindset and ability to thrive in a fast-paced, innovative environment

About EPAM Systems

EPAM Systems, Inc. is a leading global provider of digital platform engineering and development services. The company has a strong presence in North America, Europe, and Asia, and serves clients in a variety of industries, including financial services, healthcare, and retail. EPAM's services include software engineering, product development, and digital platform engineering, and the company has a reputation for delivering high-quality solutions that help its clients achieve their business goals. EPAM has been recognized as a leader in the digital services industry by a number of independent research firms, and the company has won numerous awards for its work.

Size: 58,824 employees
Market Cap: $18.2 billion
Net Income: $327.1 million
Founded: 1993
5 Year Trend: +26.5%
Revenue: $2.6 billion
NASDAQ
