Job Summary
Graphcore's AI/ML training and inference infrastructure is rapidly scaling to meet the growing demands of AI workloads across mobile, edge, and datacenter environments. This role focuses on optimizing performance across ARM-based architectures and large-scale distributed systems, ensuring efficiency, scalability, and reliability across the full hardware-software stack.
The Team
The System Engineering Performance team architects and optimizes high-performance infrastructure for large-scale datacenter deployments. The team works across hardware, software, networking, and system architecture to deliver cutting-edge AI solutions and ensure optimal system performance at scale.
Responsibilities and Duties
- Analyze ML models' compute and memory requirements using roofline analysis and simulations
- Collaborate across hardware and software teams to optimize large-scale AI workloads
- Benchmark, monitor, and troubleshoot system performance across distributed systems
- Optimize communication stacks including MPI, NCCL, UCX, RDMA, and networking fabrics
- Profile and optimize AI workloads, focusing on performance bottlenecks
- Develop high-quality, ARM-compatible code and documentation
Candidate Profile
Essential:
- BS/MS in Computer Science, Electrical Engineering, or related field
- Experience with distributed systems and communication libraries (MPI, NCCL, UCX, libfabric)
- Strong programming skills in C++ and Python
- Experience profiling and optimizing HPC or AI/ML workloads
- Familiarity with ML benchmarks such as MLPerf
Desirable:
- Experience with GPUs or accelerated computing architectures
- Knowledge of HPC networking and interconnect technologies (InfiniBand, RoCE)
- Familiarity with ML frameworks such as PyTorch or TensorFlow
- Understanding of ARM architectures and toolchains
- Strong debugging, profiling, and performance optimization skills