Celestica

Lead Test Engineer, Server Compute Firmware - AI Data Center 1

Celestica: $110K - $140K *
Enterprise Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or related field.
  • 5+ years of testing experience, with a focus on enterprise storage and server systems.
  • 1+ years in a lead or senior technical role, mentoring engineers.
  • Deep knowledge of server architectures and subsystems, including CPU and memory management.
  • Strong background in storage technologies including NVMe and distributed file systems.
  • Proficiency in scripting languages for automation and data analysis.
  • Familiarity with networking concepts and methodologies.

Responsibilities

  • Define and implement test plans for storage and server hardware in AI data centers.
  • Lead the team in creating and executing complex test cases for performance and reliability.
  • Mentor junior test engineers, promoting technical excellence.
  • Develop automated test frameworks and scripts to enhance testing efficiency.
  • Analyze performance issues in server hardware and software interactions.
  • Maintain robust testbeds and infrastructure for continuous integration.
  • Collaborate with cross-functional teams to integrate testing throughout the lifecycle.

Benefits

  • Opportunities for professional growth and mentorship.
  • Access to cutting-edge AI technology and projects.
  • Collaborative environment with diverse engineering teams.
  • Flexible work arrangements to support work-life balance.
Full Job Description
Req ID: 137753
Region: Americas
Country: USA
State/Province: Texas
City: Austin

General Overview

Functional Area: Engineering
Career Stream: Design - Software Engineering
SAP Short Name: LEN-ENG-DSE
Job Level: Level 08
IC/MGR: Individual Contributor
Direct/Indirect Indicator: Indirect

Summary

The Senior Lead Server Compute CPU & GPU Firmware Test Engineer will play a pivotal role in the design, development, and execution of comprehensive test strategies for our AI data center's server infrastructure. This leadership position requires deep expertise in server architectures, enterprise storage systems, networking, and a strong understanding of the unique performance and reliability demands of AI/ML workloads. The ideal candidate will be a hands-on technical leader, capable of mentoring junior engineers, driving test automation, and collaborating across engineering teams to deliver robust and high-performing solutions.

Knowledge / Skills / Competencies

  • Define, develop, and implement comprehensive test plans and strategies for all storage and server hardware, firmware, and software components within the AI data center environment.
  • Lead the test team in designing, executing, and analyzing complex test cases, including functional, performance, reliability, stress, and endurance testing.
  • Mentor and provide technical guidance to junior test engineers, fostering a culture of technical excellence and continuous improvement.
  • Design and implement automated test frameworks and scripts using languages like Python, Go, or similar, to improve efficiency and coverage of testing.
  • Conduct in-depth performance analysis and bottleneck identification for server platforms (e.g., CPU, GPU, memory, PCIe, networking), OpenBMC interfaces/features, and storage systems (e.g., NVMe, SSD, HDD arrays, distributed storage, SAN/NAS), including debugging issues related to BMC functionality and its interaction with server hardware.
  • Develop and maintain robust testbeds and infrastructure for continuous integration and validation.
  • Utilize open-source and commercial test tools relevant to server, OpenBMC and storage validation.
  • Collaborate closely with hardware design, software development, infrastructure, and AI/ML engineering teams to understand requirements and integrate testing throughout the product lifecycle.
  • Communicate test progress, results, and critical issues effectively to stakeholders, including executive leadership.
  • Develop specialized test methodologies to validate performance and reliability under heavy AI/ML workloads (e.g., large model training, inference at scale, data ingestion).
  • Understand and test the interactions between GPU-accelerated computing, high-speed networking, and storage systems.


Qualifications

  • Bachelor's or Master's degree in Computer Science, Electrical Engineering, or a related technical field.
  • 5+ years of experience in hardware and/or software testing, with at least 5 years focused on enterprise-level storage and server systems.
  • 1+ years of proven experience in a lead or senior technical role, mentoring and guiding junior engineers or leading test initiatives.
  • Deep expertise in server architectures (x86, ARM, GPU servers), CPU/memory subsystems, PCIe, and power management.
  • Strong understanding of Baseboard Management Controller (BMC) functionality.
  • Strong understanding of storage technologies such as NVMe, SAS/SATA SSDs/HDDs, RAID, distributed file systems (e.g., Ceph, Lustre, GPFS), SAN, and NAS.
  • Proficiency in scripting languages (e.g., Python, Bash) for test automation and data analysis.
  • Experience with Linux operating systems (e.g., Ubuntu, CentOS, RHEL) and command-line tools.
  • Familiarity with networking concepts (Ethernet, TCP/IP, InfiniBand) and network testing methodologies.
  • Experience with test methodologies such as performance testing, reliability testing, stress testing, and fault injection.
  • Excellent problem-solving, analytical, and debugging skills.
  • Strong communication and interpersonal skills, with the ability to collaborate effectively across diverse teams.

Preferred Qualifications

  • Familiarity with OCP (Open Compute Project).
  • Experience with cloud environments (AWS, Azure, GCP) and virtualization technologies.
  • Knowledge of containerization technologies (Docker, Kubernetes).
  • Familiarity with AI/ML frameworks (e.g., TensorFlow, PyTorch) and their infrastructure requirements.
  • Experience with performance profiling tools (e.g., fio, Iometer, perf, VTune).
  • Contributions to open-source projects related to storage, servers, or testing.
  • Certifications in relevant technologies (e.g., NetApp, Dell EMC, HPE, NVIDIA).


Notes

This job description is not intended to be an exhaustive list of all duties and responsibilities of the position. Employees are held accountable for all duties of the job. Job duties and the % of time identified for any function are subject to change at any time.

About Celestica

Celestica is a Canadian multinational electronics manufacturing services company headquartered in Toronto, Ontario. The company provides a range of services to original equipment manufacturers (OEMs) in the aerospace and defense, communications, enterprise computing, healthcare, industrial, semiconductor, and smart energy industries. Celestica's services include design and engineering, supply chain management, assembly and testing, and after-market services. The company operates in North America, Europe, and Asia and has manufacturing facilities in over 10 countries. Celestica was founded in 1994 as a subsidiary of IBM Canada and became an independent company in 1997.
Size: 23,915 employees
Market Cap: $1.3 billion
Founded: 1994
5 Year Trend: -1.3%
NASDAQ
