The AWS Neuron Collectives team is seeking a Software Engineer to optimize collective operations for AWS Trainium. Trainium is one of Amazon's highest-priority initiatives, powering the frontier AI models being trained today.
Collectives are the critical operations that scale AI compute across the data center. You'll work in depth to optimize compute for the specific topologies used to train modern LLMs. Working closely with the hardware team, you'll push for maximum performance in C/C++, interface with DMA and firmware, and investigate topologies in detail.
You'll analyze current collective algorithms using publicly accessible tools like Neuron Explorer and optimize them to fully utilize compute and bus bandwidth as workloads scale across the data center. This is a unique opportunity to impact how AI training runs at AWS scale while growing your technical breadth and depth.
Key job responsibilities
As a Neuron Collectives Software Engineer, you will:
* Enhance collective algorithms and topologies for optimal training performance
* Use tools like Neuron Explorer to identify bottlenecks in compute and bus bandwidth utilization
* Monitor and analyze processor, DMA, firmware, and workload metrics
* Optimize collective operations to scale AI compute across the data center
* Work closely with the hardware team to co-optimize software and Trainium silicon
* Develop and optimize C/C++ implementations of collective communication patterns
* Investigate and implement improvements for specific training topologies used by modern LLMs
* Build and maintain analysis frameworks and automation solutions
The role offers opportunities to work on cutting-edge AI training hardware while contributing to one of Amazon's most critical initiatives.
BASIC QUALIFICATIONS
- Experience building complex software systems that have been successfully delivered to customers
- Experience contributing to the architecture and design (design patterns, reliability, and scaling) of new and current systems
- Bachelor's degree in computer science or equivalent
- Knowledge of engineering practices and patterns for the full software/hardware/network development life cycle, including coding standards, code reviews, source control management, build processes, testing, certification, and live-site operations
- Software development experience within the last 3 years, or embedded development experience in C/C++
PREFERRED QUALIFICATIONS
- Master's degree in computer science or equivalent
- Experience with hardware/software integration and real-time systems
- Familiarity with collective communication algorithms (e.g., all-reduce, all-gather) or distributed training frameworks
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, and prescription coverage; Basic Life and AD&D insurance with an option for supplemental life plans; EAP; Mental Health Support; Medical Advice Line; Flexible Spending Accounts; and Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.
USA, CA, Cupertino - 165,200.00 - 223,600.00 USD annually