Meta is seeking a Production Systems Engineer, Tooling to join our Production Systems Engineering organization, where you will help improve the reliability, efficiency, and scalability of Meta's large-scale hardware infrastructure through test automation. You will design and build the systems tooling, test automation, and frameworks that keep Meta's global production fleet - spanning compute, storage, networking, and custom silicon - operating at peak performance. Working at the intersection of hardware and software, you will partner with data center operations, hardware engineering, platform teams, and ODM/vendor partners to drive systemic improvements across the full infrastructure stack.
Responsibilities
• Design, build, and scale test orchestration and validation tooling, CI/CD pipelines, and automation frameworks that qualify large-scale AI hardware platforms at cluster scale - spanning provisioning, monitoring, and lifecycle management of compute, storage, and networking infrastructure
• Develop tooling for hardware lifecycle management, fleet health observability, and automated remediation of production system failures across Meta's data center fleets
• Identify and resolve systemic reliability and performance issues by analyzing hardware telemetry, failure patterns, and system-level diagnostics at scale
• Collaborate with hardware engineering teams to define software interfaces, firmware integration requirements, and bring-up workflows for new server and accelerator platforms
• Lead cross-functional efforts to evaluate, qualify, and integrate new hardware technologies into the production environment, including validation and qualification workflows
• Develop scalable infrastructure automation that reduces operational toil and accelerates hardware deployment and remediation across the global fleet
• Mentor other engineers on systems software design, debugging methodologies, and production infrastructure best practices
• Communicate technical designs and architectural decisions through written documentation and cross-functional stakeholder alignment
Minimum Qualifications
• Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
• 3+ years of experience in production systems engineering or infrastructure software engineering, including development in C, C++, or Python for Linux-based environments
• 3+ years of experience with large-scale hardware infrastructure systems, including fleet automation, hardware lifecycle management, or data center operations software
• 3+ years of experience in designing and operating distributed systems software at scale, including monitoring, alerting, and automated remediation pipelines
• 3+ years of experience in communicating system designs and technical decisions through written documentation and cross-functional stakeholder engagement
• Demonstrated troubleshooting skills across hardware products and automation software
Preferred Qualifications
• Master's degree in Computer Science, Computer Engineering, or a similar field
• 6+ years of experience across a variety of infrastructure components, such as network and compute, in a data center or large-scale production environment
• 3+ years of experience in building or operating CI/CD pipelines and test automation frameworks for infrastructure software
• Familiarity with custom silicon or accelerator platform integration, including firmware and platform management interfaces
• Experience guiding cross-functional teams or ODM/vendor partners through the setup, integration, and execution of automation and validation frameworks at scale