Production Systems Engineer, Automation

Meta

$130K — $180K *
Technical Services
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science, Computer Engineering, or equivalent experience
  • 3+ years in production systems engineering or infrastructure software engineering with C, C++, or Python for Linux
  • 3+ years with large-scale hardware infrastructure systems and fleet automation
  • 3+ years designing distributed systems software at scale, including monitoring and alerting
  • 3+ years communicating system designs and technical decisions in writing and to cross-functional stakeholders

Responsibilities

  • Design and build test orchestration, CI/CD pipelines, and automation frameworks for large-scale AI hardware platforms
  • Develop hardware lifecycle management and automated remediation tooling for production system failures
  • Analyze telemetry and diagnostics to resolve systemic reliability and performance issues
  • Collaborate with hardware teams on software interfaces and firmware integration
  • Evaluate and integrate new hardware technologies into the production environment
  • Create scalable infrastructure automation to enhance hardware deployment and reduce operational toil
  • Mentor engineers on systems software design and production infrastructure best practices

Benefits

  • Opportunity to work at the intersection of hardware and software
  • Collaborative environment with cross-functional teams
  • Exposure to large-scale AI hardware platforms
  • Mentorship opportunities and professional development
  • Engagement with ODM and vendor partners to drive systemic infrastructure improvements

Full Job Description

Meta is seeking a Production Systems Engineer, Tooling to join our Production Systems Engineering organization, where you will help drive the reliability, efficiency, and scalability of Meta's large-scale hardware infrastructure through improvements in test automation. You will design and build the systems tooling, test automation, and frameworks that keep Meta's global production fleet - spanning compute, storage, networking, and custom silicon - operating at peak performance. Working at the intersection of hardware and software, you will partner with data center operations, hardware engineering, platform teams, and ODM/vendor partners to drive systemic improvements across the full infrastructure stack.

Responsibilities

• Design, build, and scale test orchestration and validation tooling, CI/CD pipelines, and automation frameworks that qualify large-scale AI hardware platforms at cluster scale - spanning provisioning, monitoring, and lifecycle management of compute, storage, and networking infrastructure
• Develop tooling for hardware lifecycle management, fleet health observability, and automated remediation of production system failures across Meta's data center fleets
• Identify and resolve systemic reliability and performance issues by analyzing hardware telemetry, failure patterns, and system-level diagnostics at scale
• Collaborate with hardware engineering teams to define software interfaces, firmware integration requirements, and bring-up workflows for new server and accelerator platforms
• Lead cross-functional efforts to evaluate, qualify, and integrate new hardware technologies into the production environment, including validation and qualification workflows
• Develop scalable infrastructure automation that reduces operational toil and accelerates hardware deployment and remediation across the global fleet
• Mentor other engineers on systems software design, debugging methodologies, and production infrastructure best practices
• Communicate technical designs and architectural decisions through written documentation and cross-functional stakeholder alignment

Minimum Qualifications
• Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
• 3+ years of experience in production systems engineering or infrastructure software engineering, including development in C, C++, or Python for Linux-based environments
• 3+ years of experience with large-scale hardware infrastructure systems, including fleet automation, hardware lifecycle management, or data center operations software
• 3+ years of experience in designing and operating distributed systems software at scale, including monitoring, alerting, and automated remediation pipelines
• 3+ years of experience in communicating system designs and technical decisions through written documentation and cross-functional stakeholder engagement
• Demonstrated troubleshooting skills across hardware products and automation software

Preferred Qualifications
• Master's Degree in Computer Science, Computer Engineering, or similar field
• 6+ years of experience across a variety of infrastructure components, such as network and compute, in a data center or large-scale production environment
• 3+ years of experience in building or operating CI/CD pipelines and test automation frameworks for infrastructure software
• Familiarity with custom silicon or accelerator platform integration, including firmware and platform management interfaces
• Expertise guiding cross-functional teams or ODM/vendor partners through the setup, integration, and execution of automation and validation frameworks at scale
