RAS Validation Lead

Aetherflux

$150K - $225K *

San Carlos, CA 94070, In-Person

Aerospace & Defense

5 - 7 years of experience

Today

Job Overview by Ladders

Qualifications

5+ years in hardware validation or reliability engineering for server systems.
Deep understanding of CPU and GPU architecture, including memory and interconnects.
Strong familiarity with RAS concepts like ECC and fault containment.
Hands-on experience with fault injection across hardware and software.
Experience with BMC, IPMI, Redfish, and MCTP/PLDM system management interfaces.
Direct experience collaborating with silicon vendors on failure analysis.
Proficient scripting skills in Python or equivalent for automation.

Responsibilities

Lead RAS validation strategy and execution for GPU server platforms.
Collaborate with designers to analyze hardware failures and align on requirements.
Characterize fault propagation paths and validate error signal handling.
Ensure BMC visibility into hardware health events via relevant protocols.
Debug complex failure modes across architecture and subsystems.
Drive root-cause analysis for RAS failures and inform design decisions.
Define RAS coverage metrics and ensure hardware fault model traceability.
Validate OS-level error handling and recovery flows.

Benefits

Equity in Cowboy Space Corp.
Medical, dental, and vision insurance for employees and eligible dependents.
401(k) retirement savings plan.
Paid time off and 10 paid holidays per year.
Paid parental leave.
Relocation assistance if needed.
Daily office lunch and stocked kitchen with snacks.

Full Job Description

RAS Validation Lead - Orbital Data Center

The Role

Deploying high-performance GPU compute in Low Earth Orbit introduces a fundamentally different fault landscape from ground-based datacenter operation. This role sits at the frontier of that problem. When a fault occurs 500 km above Earth, the system must detect it, classify it, contain it, and recover from it autonomously. You will own the end-to-end RAS validation strategy for GPU server systems, working directly with GPU and HBM silicon partners to analyze failures, characterize fault propagation paths, and ensure detection and recovery mechanisms function correctly. The right candidate combines deep knowledge of processor and memory architecture with hands-on system-level validation experience and the ability to drive partner engagements to resolution. This role is located in San Carlos or Seattle.

Key Responsibilities:

Lead RAS validation strategy and execution for GPU server platforms, including fault injection, detection coverage, and recovery verification.
Partner directly with GPU system designers to analyze hardware failures, review silicon errata, and align on fault handling requirements for DDR, HBM, CPU, and GPU subsystems.
Characterize fault propagation paths from hardware detection through firmware and OS layers, and validate that error signals are correctly classified, logged, and acted upon.
Validate BMC and out-of-band management visibility into hardware health events via IPMI, Redfish, and MCTP/PLDM protocols.
Debug complex failure modes spanning GPU and CPU architecture, memory subsystems, PCIe/NVLink fabric, and system management firmware.
Drive root-cause analysis for RAS failures discovered during validation and work with partners to provide input on platform design decisions that affect fault detection and serviceability.
Define RAS coverage metrics and maintain traceability from hardware fault models to test coverage.
Collaborate with firmware, software, and platform teams to validate OS-level error handling, ACPI error interfaces (EINJ, BERT, HEST), and runtime error recovery flows.

Basic Qualifications:

5+ years of experience in hardware validation, platform reliability engineering, or silicon validation on server-class compute systems.
Deep understanding of CPU and GPU architecture, including memory subsystems (DDR, HBM), cache hierarchies, and interconnect fabrics (PCIe, NVLink, XGMI).
Strong knowledge of RAS concepts: error detection and correction (ECC), fault containment, error propagation, machine check architecture (MCA/MCI), and recovery mechanisms.
Hands-on experience with fault injection methodologies at hardware, firmware, and software levels.
Familiarity with system management interfaces including BMC, IPMI, Redfish, and MCTP/PLDM.
Experience working directly with silicon vendors or ODM partners on hardware failure analysis and RAS gap closure.
Strong scripting skills in Python or equivalent for test automation and log analysis.

Compensation and Benefits:

The salary range for this position is $150,000 - $225,000 annually. The actual base salary offered will depend on factors such as job-related skills, experience, qualifications, and internal equity.

Equity in Cowboy Space Corp.
Employees and their eligible dependents may enroll in medical, dental, and vision insurance
401(k) retirement savings plan
Paid time off
10 paid holidays per calendar year
Paid parental leave
Relocation assistance if applicable
Daily lunch in the office and a fully stocked kitchen with beverages and snacks

ITAR Requirements

Export Control Requirement: To conform to U.S. Government space technology export regulations, including the International Traffic in Arms Regulations (ITAR), applicants must be a U.S. citizen, lawful permanent resident of the U.S., protected individual as defined by 8 U.S.C. 1324b(a)(3), or eligible to obtain the required authorizations from the U.S. Department of State.

Disclaimer

This job description is a summary of the primary duties and responsibilities of the job and position. It is not intended to be a comprehensive or all-inclusive listing of duties and responsibilities. Contents are subject to change at Cowboy Space Corp.'s discretion.

* Ladders Estimates
