Full Job Description
We are looking for a disciplined and dynamic Systems Engineer with a focus on server CPU-based systems to join our growing compute rack validation team. The candidate we are seeking should have demonstrated work experience leading server rack and blade hardware system deployment, hardware installation, and inventory management activities in the Austin, TX area. As a diligent leader in Systems Engineering, you will drive multiple aspects of post-silicon validation throughout the life cycle of the program. In this high-visibility position, you will be part of a technical team chartered to innovate and improve system bring-up and enablement capabilities, as well as silicon and system validation, to deliver the highest-quality, industry-leading technologies to market. Your technical leadership skills and your expertise in systems engineering, hardware bring-up, validation, and debug will be essential to product development and definition, and to root-cause analysis and resolution of issues. Your agility and collaborative approach will be essential to working within System Validation and with other engineering teams (System Architects, SoC and Rack FW, etc.).
The ideal candidate will drive key areas of at-scale system validation, including bring-up of Arm-based server nodes and rack-level systems. The candidate will be immersed in challenging system enablement work, ramping up post-silicon capabilities in engineering lab environments and executing and triaging validation tests. The candidate will be a leading contributor to state-of-the-art HW bring-up and lab capabilities for Graphcore's system engineering. The candidate should be able to work in a global environment while maintaining a synergetic culture.
Primary Responsibilities:
- Install, configure, commission (and decommission if needed) blade servers, chassis, switches, and supporting infrastructure.
- Lead rack and stack activities, including mounting equipment, cable management, and labelling. Execute hardware upgrades, replacements, and troubleshooting of server and network components.
- Maintain accurate asset records within DCIM platforms and inventory management systems.
- Conduct physical audits and reconcile inventory discrepancies.
- Track hardware movements, deployments, and decommissions through established change management processes.
- Document installation procedures, rack layouts, cabling diagrams, and inventory updates.
- Support data center migration, expansion, and refresh projects.
- Collaborate with engineering, operations, logistics, and project management teams.
- Adhere to all data center safety, security, and operational standards.
- Develop, set up, and scale key methodologies for at-scale test execution, lab HW and system SW capabilities, and the system visibility and debug tools necessary for successful system (HW/SW/FW) bring-up and validation at blade and rack level for AI compute racks.
- Work independently in a production-ready environment, with a commitment to maintaining accurate inventory and asset records.
- Triage issues found during server rack validation bring-up, Post-Silicon Validation, and production phases of the program. Ensure issues are solved on time with quality.
- Lead test execution for key domains within AI compute solutions, such as CPU, GPU, memory, HBM, and IO.
- Drive technical innovation to improve capabilities across system validation, including tools, script development, technical and procedural methodology enhancement, and various internal and cross-functional technical initiatives.
Qualifications:
- Strong analytical and problem-solving skills and pronounced attention to detail.
- Experience in blade server installation and maintenance (Cisco UCS, HPE Synergy, Dell MX, or similar).
- Experience with rack and stack deployments in enterprise or hyperscale environments.
- Experience with copper and fiber cabling installation and management.
- Experience with DCIM and asset management platforms.
- Strong understanding of server, storage, and networking hardware.
- Experience performing inventory audits and maintaining asset accuracy.
- Ability to read rack elevation diagrams, cabling schematics, and deployment documentation.
- Familiarity with ticketing and change management systems.
- Exposure to Linux (Ubuntu) bootable OS images and system firmware basics for image building, provisioning, and firmware flashing.
- Exposure to test automation that enables execution of hardware acceptance tests, best-known-config testing, etc.
- Exposure to Python script development and execution.
- Proven experience in understanding, defining, and enabling storage (storage rack) and networking capabilities (network rack, DNS, DHCP, etc.) in a lab environment to add end-to-end validation and debug capabilities for rack and blade validation.
- Excellent communication and coordination skills.
- Detail-oriented and highly organized; able to prioritize and juggle multiple work streams against tight deadlines.
- Technical leadership: capable of championing new tools, methods, and capabilities to drive platform validation improvements in schedule, quality, or coverage.
- Experience working with data center technical staff, third-party vendors, ODMs, etc. throughout the life cycle of server system product development.
- Must be a self-starter, able to independently drive tasks to completion.
Preferred Qualifications:
- Master's or PhD in Electrical Engineering, Computer Engineering, or a related field.
- 10+ years of work experience tackling complex systems engineering challenges, validating and debugging HW-FW-SW issues in a server compute rack or data center blade environment.
- Experience designing and deploying modern AI/ML rack-scale systems.
- Knowledge of industry standards and best practices for hardware development.
- Familiarity with emerging technologies in AI and Data Center infrastructure.
- Comfortable meeting, engaging, and collaborating with ODM partners and staffing vendors across the globe.