Job Summary
You will be the L10 System Validation Engineer. You will define what it means for the accelerator platform to be production-ready, build the infrastructure to prove it, and personally drive the most critical issues from detection through root cause to verified fix.
Key Responsibilities
- Author and own the L10 system debug guide, serving as the definitive reference for factory failure analysis and debug teams.
- Own end-to-end escalation, debug, and resolution of L10 hardware failures across internal and field teams.
- Own bring-up and integration across accelerator cards, interconnects (Retimers, C2C fabric), power (PSU, power integrity), and thermal domains.
- Lead system-level debug with no guaranteed starting point: hardware, firmware (BMC, BIOS, CPLD), and software all in scope simultaneously.
- Partner directly with CMs to validate incomplete systems, close coverage gaps, and gate production ramp.
- Build ad-hoc instrumentation, automation scripts, and debug workflows to extract signals from unstable, pre-spec systems.
- Run sustained workloads using real model variants under thermal and power stress.
- Drive issue closure across cross-functional engineering teams: HW, FW, SW, thermal, power, and manufacturing.
- Represent validation readiness to leadership with crisp, data-backed assessments.
You may be a good fit if you have (Must-have qualifications)
- 10+ years in system validation, platform bring-up, or a tightly adjacent discipline - with direct, solo ownership of full L10 (tray/rack) integration on multi-component platforms.
- Demonstrated hands-on signal integrity and power integrity debug: SerDes, PCIe Gen5/6, Ethernet, HBM signaling.
- Proven ability to build a coverage model that produces quantitative platform readiness metrics.
- Strong Linux and Python/Bash scripting - you should be comfortable writing tools to instrument, automate, and extract data from systems that aren't fully debuggable yet.
- Willingness to travel overseas occasionally to contract manufacturers.
- Experience validating AI accelerator or high-performance compute platforms with accelerators, NVMe, high-speed interconnects, and complex power delivery.
- Track record of detecting integration issues before they escape to production - in environments where specs, tooling, and infrastructure were still being built.
Strong candidates may also have experience with (Nice-to-have qualifications)
- Familiarity with BMC/IPMI, UEFI/BIOS bring-up, and CPLD firmware debug.
- Experience with HBM or CSRAM memory subsystem validation.
- Background validating multi-rack or tray-level systems at hyperscaler or AI hardware company scale.
- Exposure to Llama-class inference workloads as stress and coverage tools.
- Prior work with ODM/CM partners (Pegatron, Foxconn, Wistron, or equivalent) in a validation or NPI capacity.
What success looks like
- Critical escapes to production are zero - not because nothing broke, but because you assumed everything would break and proved otherwise.
- You created structure where there was none: coverage models, automation, debug workflows, and RCA templates exist because you built them.
- Platform stress runs are stable, repeatable, and instrument-verified under sustained thermal and power load.
- Manufacturing partners have clear, data-driven pass/fail gates - and they trust the gates because you defined them.
- Engineering leadership has a real-time, quantified view of platform readiness at every stage of ramp.
Benefits
- Medical, dental, and vision packages with generous premium coverage
- $500 per month credit for waiving medical benefits
- Housing subsidy of $2k per month for those living within walking distance of the office
- Relocation support for those moving to San Jose (Santana Row)
- Various wellness benefits covering fitness, mental health, and more
- Daily lunch and dinner in our office