Advanced Micro Devices, Inc

Principal Kubernetes GPU Infrastructure Engineer

$150K - $200K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience in AI infrastructure engineering and GPU-accelerated computing
  • Hands-on experience with Kubernetes and large-scale distributed training
  • Strong background in validating production-ready inference frameworks
  • Expertise in Kubernetes-native distributed training and scheduling
  • Proficient in optimizing AI workloads with benchmark performance analysis

Responsibilities

  • Design reference architectures for LLM training and inference using AMD GPUs
  • Architect Kubernetes-based training stacks for large-scale workloads
  • Define gang scheduling and optimize GPU placement for training
  • Deploy and optimize AMD GPU clusters with enterprise customers
  • Implement GPU orchestration and validation using Kubernetes
  • Benchmark LLM inference frameworks and produce performance playbooks
  • Create guides for communication and optimization of GPU workloads

Benefits

  • Comprehensive health, wellness, and retirement plans
  • Opportunities for career growth and development
  • Inclusive and collaborative company culture
  • Flexible working environment options
  • Access to cutting-edge technology and resources
Full Job Description
THE ROLE:

As a Principal AI Infrastructure Solution Engineer, you will partner with AMD's AI software teams and customers to enable large-scale LLM training and inference on AMD Instinct GPUs. You will design and validate production-ready Kubernetes architectures and translate inference frameworks such as vLLM and SGLang into deployable customer solutions. Your work will accelerate customer time-to-production and strengthen AMD's leadership in AI infrastructure.

THE PERSON:

You are a solution-oriented AI infrastructure engineer with strong expertise in GPU-accelerated computing and large-scale AI deployments. You excel at translating complex technologies into customer-ready solutions and delivering production-grade Kubernetes-based inference and training systems. You bring hands-on experience with Kubernetes-native distributed training, including scheduling, topology-aware GPU placement, and operating resilient, high-performance AI workloads at scale.

KEY RESPONSIBILITIES:
  • Design and deliver reference architectures for LLM training and inference on AMD GPUs, from single-node to multi-datacenter deployments using Kubernetes and SLURM.
  • Architect and validate Kubernetes-based distributed training stacks for large-scale LLM workloads on AMD GPUs.
  • Define and implement gang scheduling and topology-aware GPU placement for multi-node training workloads.
  • Enable Kubernetes-native training controllers including Kubeflow Training Operator, MPI Operator, Volcano, and Kueue.
  • Partner with enterprise customers and cloud providers to deploy and optimize production AMD GPU clusters for distributed inference and multi-tenant workloads.
  • Implement and validate GPU orchestration using Kubernetes GPU Operator, device plugins, metrics exporters, and SLURM controllers.
  • Benchmark and optimize LLM inference frameworks (vLLM, SGLang) on AMD hardware, producing customer-ready performance playbooks.
  • Develop repeatable benchmarks for Kubernetes-based distributed training, covering scaling efficiency, step time, communication, and checkpointing.
  • Create tuning guides for RCCL/NCCL-equivalent communication, CPU/GPU affinity, interconnect utilization, and workload-specific optimizations.
  • Serve as the feedback loop between customers and AMD engineering, translating requirements into validated performance improvements.

PREFERRED EXPERIENCE:
  • Deployed and operated large-scale GPU clusters for production AI training and inference
  • Deep expertise in Kubernetes GPU orchestration (operators, device plugins, scheduling, multi-tenancy, observability)
  • Hands-on experience with distributed training on Kubernetes (Kubeflow, MPI Operator, Volcano, Kueue, Ray)
  • Strong knowledge of gang scheduling, elastic jobs, quotas, priority, and shared GPU environments
  • Tuned Kubernetes networking and storage for AI workloads (high-performance CNI, RDMA where applicable, scalable checkpointing)
  • Implemented ML observability for training (GPU/comms metrics, step-time analysis, SLO-driven ops)
  • Experience in AI/ML infrastructure, solution architecture, and production GPU deployments
  • Proven success enabling customers through complex AI platform deployments and migrations
  • Strong background working across engineering and customer-facing roles
  • Understanding of AI accelerator architectures and inference optimization techniques
  • Experience operationalizing Kubernetes-based distributed training at scale
  • Open-source contributions or AI infrastructure community engagement (plus)


LOCATION:
  • Santa Clara, CA, or open to discussing other locations.


This role is not eligible for visa sponsorship.

Benefits offered are described in AMD benefits at a glance.

About Advanced Micro Devices, Inc

Advanced Micro Devices, Inc. Careers

Join the innovative forefront of technology with a career at Advanced Micro Devices, Inc. (AMD), a leader in semiconductor development. As part of our global team, you will contribute to an organization renowned for its dedication to innovation, leadership, and diversity in the tech industry.

Work You’ll Do

At AMD, we offer job opportunities that push the boundaries of what is possible. Our team is composed of professionals who lead the way in microprocessor and graphics technology, driving industry standards and innovation. With AMD, you will be part of a culture that values growth and professional development, ensuring that every team member has the opportunity to excel.

Transform Your Career

AMD is not just about advancing technology, but also about advancing careers. Whether you are looking for an internship, a full-time position, or a leadership role, AMD provides the platform to propel your career to new heights. Our commitment to professional growth is matched by our dedication to diversity and inclusion, making AMD a place where everyone can thrive.

Innovative Work Environment

Join a team of over 12,000 dedicated professionals at the intersection of technology, industry expertise, and digital innovation. At AMD, you will work on groundbreaking projects that shape the future of computing and graphics. Our collaborative environment encourages networking and the sharing of ideas across teams and disciplines.

Career Development and Benefits

AMD is committed to the development of its employees. We offer robust training programs, including leadership development and diversity training, to ensure our team is equipped for both current challenges and future opportunities. Our benefits package is designed to support the well-being and financial security of our employees and their families.

Explore Job Opportunities

From engineering to marketing, AMD offers a range of career paths that cater to diverse skills and interests. Our hiring process is designed to be transparent and engaging, helping you to understand where you fit within our team and how you can contribute to our collective goals.

Stay Connected

Join our team: search open positions that match your skills and interests. We look for passionate, curious, creative, and solution-driven team players. Explore the opportunities to join a company that’s committed to your career growth and to innovation in the technology sector.

Keep Up to Date

Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today—all from the people who work here.

Job Alert Emails

Personalize your subscription to receive job alerts, latest news, and insider tips tailored to your preferences. Discover the exciting and rewarding career opportunities that await at Advanced Micro Devices, Inc.

Interview and Resume Tips

Prepare for your future with AMD by accessing resources that help you craft your resume and excel in interviews. Our goal is to help you showcase your best professional self and align your skills with the needs of our dynamic team. At Advanced Micro Devices, Inc., we empower our employees to innovate, lead, and grow. Join us in driving the future of technology while building a rewarding and sustainable career.
Learn more about Advanced Micro Devices, Inc
Size: 15,500 employees
Market Cap: $100.9 billion
Net Income: $2.4 billion
Founded: 1969
5 Year Trend: +30.9%
Revenue: $9.7 billion
Listed on: NASDAQ
