Advanced Micro Devices, Inc

Senior Failure Analysis Engineer - Test Development

$120K - $150K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in failure analysis or test development for GPU or server environments.
  • Strong background in developing custom test methodologies for hard-to-reproduce failures.
  • Expertise in building and managing VPOD environments for scalable experimentation.
  • Proficient in Python and shell scripting for automation and analysis.
  • Experience with AI/ML workloads and real-time inference in test systems.
  • Ability to interpret complex telemetry and debugging data to refine experiments.
  • Excellent communication skills to document and present findings clearly.

Responsibilities

  • Architect advanced test methods for elusive GPU behaviors.
  • Design new workload patterns to expose conditions missed by standard diagnostics.
  • Maintain VPOD environments for long-duration and controlled experiments.
  • Use inferencing as stimuli for probing platform limits and sensitivities.
  • Develop automation scripts for workload management and data analysis.
  • Interpret system data to enhance experiment designs and reliability.
  • Collaborate with cross-functional teams to develop repeatable test content.

Benefits

  • Comprehensive health plans including medical, dental, and vision coverage.
  • Retirement savings plan with company match.
  • Generous paid time off and holiday schedule.
  • Employee assistance programs and wellness initiatives.
  • Opportunities for career development and continuous learning.
Full Job Description

THE ROLE:

The Quality Engineering team is looking for an experienced Senior Failure Analysis Engineer - Test Development to create advanced test methods that surface elusive failures in GPU accelerator platforms. This role is centered on designing custom execution flows that go beyond standard validation, using stress-based scenarios, VPOD environments, AI/ML workloads, and adaptive test logic to make hard-to-capture issues observable and actionable. The engineer will expand failure analysis capability across lab, factory, and customer-return cases by building test content that improves repeatability, shortens debug cycles, and increases confidence in root cause findings. They will also help shape intelligent test systems that use internal engineering knowledge and live model inference to guide execution decisions in real time. Working across FA, validation, firmware, diagnostics, and data teams, this person will help convert unclear symptoms into testable conditions that accelerate resolution.

THE PERSON:

The ideal candidate is inventive, methodical, and technically versatile, with a strong instinct for designing experiments that reveal behavior hidden under normal test conditions. They are comfortable navigating hardware, firmware, software, and system-level interactions, and know how to choose the right levers (environment, timing, workload composition, instrumentation, or automation) to provoke meaningful behavior. They are effective in VPOD-based test environments, capable of using model-driven compute activity as part of system stimulation, and confident building AI-enabled workflows that draw from team-specific knowledge during execution. Just as importantly, they can turn messy observations into disciplined experiments, communicate clearly across teams, and document approaches in a way others can reuse.

KEY RESPONSIBILITIES:

  • Architect targeted test methods for hard-to-capture platform behaviors across GPU, server, and rack-scale environments.
  • Invent new workload patterns, sequencing approaches, and stress combinations that reveal conditions not covered by conventional diagnostics.
  • Build and maintain VPOD-based environments that support scalable experimentation, long-duration execution, and controlled reproduction studies.
  • Use inference and training activity as system stimuli to probe platform limits, timing sensitivities, and failure-prone operating regions.
  • Develop automation, scripting, and orchestration tools to launch workloads, monitor execution, collect logs, and analyze results at scale across Windows and Linux environments.
  • Interpret telemetry, logs, and observed signatures to refine experiments, isolate trigger conditions, and improve confidence in reproduced behavior.
  • Create AI-enabled execution flows that use internal FA knowledge and live inference to guide test branching, detect emerging patterns, and support faster triage decisions.
  • Partner closely with FA, validation, diagnostics, firmware, and manufacturing teams to translate vague symptoms or sporadic field issues into targeted and repeatable test content.
  • Document workload intent, test methods, reproduction conditions, and findings clearly so they can be reused across teams and incorporated into future FA workflows.
  • Drive continuous improvement of test development methods, workload libraries, and failure reproduction strategies to expand FA coverage and reduce time to root cause.


PREFERRED EXPERIENCE:

  • Proven track record of developing custom test methodologies for intermittent, low-occurrence, or otherwise difficult-to-observe failure modes.
  • Strong foundation in GPU and server platform behavior, including system stress interactions, concurrency effects, and stability characterization.
  • Demonstrated ability to build, run, and optimize VPOD environments and related infrastructure for large-scale FA or validation test execution.
  • Hands-on familiarity with inference and training environments, including their use as controllable system stressors in platform investigation.
  • Proficient in Python, shell scripting, and automation development for workload launch, orchestration, telemetry capture, and post-run analysis.
  • Ability to interpret system data and debug artifacts to uncover meaningful signals and guide the next experimental step.
  • Familiarity with diagnostics, firmware interactions, drivers, and hardware/software boundaries that influence failure behavior under stress workloads.
  • Experience building AI-enabled test systems that incorporate internal engineering knowledge and support real-time inference during execution.
  • Strong communication, documentation, collaboration, and presentation skills, with the ability to explain complex reproduction strategies and findings across technical teams.
  • Experience with GPU data center infrastructure, AI/ML technologies, and non-standard workload development is a strong plus.


ACADEMIC CREDENTIALS:

  • Bachelor's degree in Electrical Engineering, Computer Engineering, Computer Science, or a related field.


LOCATION:

  • Secaucus, NJ

This role is not eligible for visa sponsorship.

Benefits offered are described in AMD benefits at a glance.

About Advanced Micro Devices, Inc

Advanced Micro Devices, Inc. Careers

Join the innovative forefront of technology with a career at Advanced Micro Devices, Inc. (AMD), a leader in semiconductor development. As part of our global team, you will contribute to an organization renowned for its dedication to innovation, leadership, and diversity in the tech industry.

Work You’ll Do

At AMD, we offer job opportunities that push the boundaries of what is possible. Our team is composed of professionals who lead the way in microprocessor and graphics technology, driving industry standards and innovation. With AMD, you will be part of a culture that values growth and professional development, ensuring that every team member has the opportunity to excel.

Transform Your Career

AMD is not just about advancing technology, but also about advancing careers. Whether you are looking for an internship, a full-time position, or a leadership role, AMD provides the platform to propel your career to new heights. Our commitment to professional growth is matched by our dedication to diversity and inclusion, making AMD a place where everyone can thrive.

Innovative Work Environment

Join a team of over 12,000 dedicated professionals at the intersection of technology, industry expertise, and digital innovation. At AMD, you will work on groundbreaking projects that shape the future of computing and graphics. Our collaborative environment encourages networking and the sharing of ideas across teams and disciplines.

Career Development and Benefits

AMD is committed to the development of its employees. We offer robust training programs, including leadership development and diversity training, to ensure our team is equipped for both current challenges and future opportunities. Our benefits package is designed to support the well-being and financial security of our employees and their families.

Explore Job Opportunities

From engineering to marketing, AMD offers a range of career paths that cater to diverse skills and interests. Our hiring process is designed to be transparent and engaging, helping you to understand where you fit within our team and how you can contribute to our collective goals.

Stay Connected

Join our team: search open positions that match your skills and interests. We look for passionate, curious, creative, and solution-driven team players. Explore the opportunities to join a company that’s committed to your career growth and to innovation in the technology sector.

Interview and Resume Tips

Prepare for your future with AMD by accessing resources that help you craft your resume and excel in interviews. Our goal is to help you showcase your best professional self and align your skills with the needs of our dynamic team. At Advanced Micro Devices, Inc., we empower our employees to innovate, lead, and grow. Join us in driving the future of technology while building a rewarding and sustainable career.

Learn more about Advanced Micro Devices, Inc

Size: 15,500 employees
Market Cap: $100.9 billion
Net Income: $2.4 billion
Founded: 1969
5 Year Trend: +30.9%
Revenue: $9.7 billion
NASDAQ
