Advanced Micro Devices, Inc

Director of Software Validation Engineering - ROCm

$140K - $180K *
Enterprise Technology
11 - 15 years of experience
Job Overview by Ladders

Qualifications

  • 12+ years in software or test engineering, with systems-level experience.
  • 5+ years in engineering management, building and scaling test engineering organizations.
  • Deep expertise in test automation and CI/CD pipeline development.
  • Experience in hardware/software test automation and validation.
  • Strong knowledge of GPU architecture and software interfaces.
  • Track record of scaling AI/automation tooling for enhanced throughput.
  • Proficient in Python for writing test automation and scripts.

Responsibilities

  • Define and lead the test engineering strategy for ROCm across the entire hardware/software stack.
  • Transform the quality organization into an AI-first team, enhancing coverage and speed without increasing headcount.
  • Build and manage continuous testing and validation infrastructures for product reliability.
  • Promote automation-first quality through SDET-level practices in the QA team.
  • Prototype new testing frameworks and agentic pipelines, especially in ambiguous situations.
  • Track and improve quality KPIs, including defect detection and test coverage.
  • Collaborate with cross-functional teams to integrate quality across all stages of development.

Benefits

  • Comprehensive health and wellness benefits.
  • Retirement savings plan with company match.
  • Paid time off including vacation and sick leave.
  • Professional development opportunities and training.
  • Flexible work environment options.
Full Job Description
THE TEAM

The ROCm software organization at AMD builds and maintains the open-source GPU software stack powering AI training, inference, and HPC workloads across AMD's data center and consumer GPU portfolio. ROCm is the foundation on which developers, researchers, and enterprises run their most demanding AI and HPC workloads. Quality and reliability are existential to our success. We operate at the intersection of cutting-edge hardware and software - and we move fast. Our team is deeply invested in open-source, community-driven development, and engineering excellence at every layer of the stack.

THE ROLE

We're looking for a hands-on Director of Test Engineering to lead and transform the quality function for ROCm. This is not a program management role - it's a deeply technical leadership position for someone who understands the hardware/software interface of GPUs, has built test engineering organizations from the ground up, and is ready to lead the next wave of AI-native, agentic quality engineering.

You will own the vision, strategy, and execution of test engineering for ROCm - from kernel-level driver validation to user-space ML framework testing. Critically, you will be the driving force behind scaling your team's impact through AI and agentic tooling, building a modern, autonomous quality organization that moves faster than any traditional QA team could.

THE IMPACT YOU WILL HAVE
  • Define and own the test engineering strategy for ROCm across the full HW/SW stack, from driver interfaces to ML framework validation.
  • Transform the quality organization into an AI-first, agentic team - scaling coverage, speed, and reliability without proportional headcount growth.
  • Build and operate continuous testing and validation infrastructure including long-running soak, stress, failure/recovery, and staging environments for product reliability.
  • Raise the bar on test engineering discipline: shift-left practices, SDET-caliber test development, and deep ownership of quality metrics.
  • Partner directly with hardware, firmware, and software engineers to ensure quality is embedded at every stage of development.
  • Drive adoption of AI-assisted testing workflows, intelligent test selection, automated root cause analysis, and agentic CI/CD pipelines across the organization.

THE PERSON

The ideal candidate is a technical leader who has built and scaled test engineering teams in complex, hardware-adjacent software environments. You are hands-on when it matters - able to prototype a test framework, debug a GPU driver failure, or design a validation architecture. You also understand how customers actually use the product: the AI inference and training workloads they run, the parallelism strategies they deploy, the performance they expect, and the failure modes they hit. That customer-workload knowledge is what separates a QA team that writes blackbox sanity checks from one that designs tests targeting the exact code paths real users exercise. You see AI agents not as a novelty but as the primary lever for scaling your team's output. You are impatient with manual, reactive QA and energized by building systems that catch bugs before humans even see them.

KEY RESPONSIBILITIES
  • Own the overall test engineering strategy and architecture for ROCm, spanning driver validation, runtime testing, compiler/toolchain quality, and ML framework integration - with test coverage designed around real customer workload patterns, not synthetic benchmarks.
  • Lead, grow, and mentor a team of SDETs and test engineers, instilling SDET-level engineering discipline and a culture of automation-first quality.
  • Architect and operate continuous testing/validation infrastructure: staging environments for soak testing, stress testing, failure injection, recovery validation, and long-duration reliability runs.
  • Champion AI-first and agentic test engineering: drive adoption of LLM-assisted test generation, autonomous failure triage, intelligent test prioritization, and agentic CI/CD workflows.
  • Build hands-on prototypes of new test frameworks, validation tooling, and agentic testing pipelines - especially in early-stage or high-ambiguity situations.
  • Define, track, and improve quality KPIs: test coverage, defect escape rate, time-to-detection, device utilization, and validation cycle time.
  • Collaborate closely with hardware, firmware, and software engineering teams to ensure quality is integrated from design through release.
  • Partner with DevOps and infrastructure teams to evolve the CI/CD pipeline with robust, scalable, GPU-aware test automation.
  • Engage with the open-source ROCm community and external customers on quality feedback loops and reliability expectations, translating their workload patterns and failure reports into structured test coverage.
  • Partner with compiler, runtime, and framework integration teams on numerical correctness validation - understanding shared scope boundaries and ensuring the test organization contributes meaningfully to catching precision regressions across floating-point formats and parallelism configurations.
  • Establish and maintain HW/SW test automation for both Linux and Windows platforms across AMD's GPU product lines.

REQUIRED QUALIFICATIONS
  • 12+ years of experience in software engineering or test engineering, with significant experience in hardware-adjacent or systems-level software.
  • 5+ years of engineering management, including building and scaling test engineering or SDET organizations.
  • Deep hands-on expertise in test automation at scale - framework design, CI/CD pipeline development, and continuous validation systems.
  • Demonstrated experience with hardware + software test automation, including HW bring-up, driver validation, or firmware/software co-testing.
  • Strong understanding of GPU architecture or hardware/software interfaces (PCIe, memory subsystems, compute kernels, or equivalent).
  • Experience designing and operating always-on test infrastructure: soak/stress environments, failure injection, and reliability/recovery validation pipelines.
  • Proven track record of adopting and scaling AI or automation tooling to multiply team throughput.
  • Python proficiency: able to write test automation, tooling, and scripted validation workflows independently.
  • Practical understanding of how AI inference and training workloads are deployed on GPU hardware - including common parallelism strategies (tensor parallel, pipeline parallel, data parallel), serving configurations, and performance expectations - sufficient to translate customer use cases into targeted test coverage.
  • Hands-on software development skills sufficient to prototype test frameworks, write automation tooling, and review SDET-level code.

PREFERRED QUALIFICATIONS
  • Direct experience with ROCm, CUDA, or GPU compute software stacks (runtime, compiler, ML frameworks).
  • Experience integrating LLMs, AI agents, or agentic workflows into software development or test engineering processes.
  • Expertise in open-source development practices and community-facing quality processes (GitHub Actions, open CI, etc.).
  • Background as an SDET or test engineer at a semiconductor, HPC, or AI infrastructure company.
  • Experience with GPU-specific test challenges: non-determinism, thermal behavior, multi-device coordination, driver stability.
  • Track record of shipping test frameworks or validation tools used across large engineering organizations.
  • Familiarity with ML training/inference workload validation: throughput, latency, numerical stability across precision formats (FP32/BF16/FP8), and multi-GPU collective communication correctness.
  • Experience with GPU profiling and trace analysis tooling (e.g., rocprof, omniperf, PyTorch profiler) to identify kernel-level performance and correctness anomalies.
  • Familiarity with HIP, CUDA, or low-level GPU programming - sufficient to understand what is being tested at the runtime and kernel level, even if not writing kernels directly.


Note: This role is intentionally scoped as a hands-on technical leadership position. Candidates whose primary background is program management or traditional QA management without deep engineering execution experience may not be the right fit.

Benefits offered are described in "AMD benefits at a glance."

About Advanced Micro Devices, Inc

Advanced Micro Devices, Inc. Careers

Join the innovative forefront of technology with a career at Advanced Micro Devices, Inc. (AMD), a leader in semiconductor development. As part of our global team, you will contribute to an organization renowned for its dedication to innovation, leadership, and diversity in the tech industry.

Work You’ll Do

At AMD, we offer job opportunities that push the boundaries of what is possible. Our team is composed of professionals who lead the way in microprocessor and graphics technology, driving industry standards and innovation. With AMD, you will be part of a culture that values growth and professional development, ensuring that every team member has the opportunity to excel.

Transform Your Career

AMD is not just about advancing technology, but also about advancing careers. Whether you are looking for an internship, a full-time position, or leadership roles, AMD provides the platform to propel your career to new heights. Our commitment to professional growth is matched by our dedication to diversity and inclusion, making AMD a place where everyone can thrive.

Innovative Work Environment

Join a team of over 12,000 dedicated professionals at the intersection of technology, industry expertise, and digital innovation. At AMD, you will work on groundbreaking projects that shape the future of computing and graphics. Our collaborative environment encourages networking and the sharing of ideas across teams and disciplines.

Career Development and Benefits

AMD is committed to the development of its employees. We offer robust training programs, including leadership development and diversity training, to ensure our team is equipped for both current challenges and future opportunities. Our benefits package is designed to support the well-being and financial security of our employees and their families.

Explore Job Opportunities

From engineering to marketing, AMD offers a range of career paths that cater to diverse skills and interests. Our hiring process is designed to be transparent and engaging, helping you to understand where you fit within our team and how you can contribute to our collective goals.

Stay Connected

Join our team: search open positions that match your skills and interests. We look for passionate, curious, creative, and solution-driven team players. Explore the opportunities to join a company that’s committed to your career growth and to innovation in the technology sector.

Keep Up to Date

Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today—all from the people who work here.

Job Alert Emails

Personalize your subscription to receive job alerts, latest news, and insider tips tailored to your preferences. Discover the exciting and rewarding career opportunities that await at Advanced Micro Devices, Inc.

Interview and Resume Tips

Prepare for your future with AMD by accessing resources that help you craft your resume and excel in interviews. Our goal is to help you showcase your best professional self and align your skills with the needs of our dynamic team. At Advanced Micro Devices, Inc., we empower our employees to innovate, lead, and grow. Join us in driving the future of technology while building a rewarding and sustainable career.
Learn more about Advanced Micro Devices, Inc

Size: 15,500 employees
Market Cap: $100.9 billion
Net Income: $2.4 billion
Founded: 1969
5 Year Trend: +30.9%
Revenue: $9.7 billion
Stock Exchange: NASDAQ
