Amazon

Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training

Amazon$193K — $261K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in computer science or equivalent.
  • 5+ years of professional software development experience.
  • 5+ years of programming experience in at least one programming language.
  • 5+ years of design or architecture experience for new and existing systems.
  • 5+ years of experience with full software development life cycle processes.
  • Experience as a mentor, tech lead, or leading an engineering team.
  • Experience in machine learning, particularly with large-scale training of LLMs, and expertise in Pytorch.

Responsibilities

  • Design and implement distributed training solutions for large-scale ML models.
  • Optimize distributed training frameworks like FSDP, torchtitan, and Hugging Face for Neuron ecosystem.
  • Develop and implement mixed-precision and low-precision training techniques.
  • Profile and analyze end-to-end training pipelines for optimal Trainium performance.
  • Collaborate with hardware, compiler, and runtime teams to enhance system capabilities.
  • Work with AWS solution architects and clients to scale and optimize training workloads.

Benefits

  • Comprehensive health insurance (medical, dental, vision, prescription coverage).
  • 401(k) matching.
  • Paid time off and parental leave.
  • Flexible Spending Accounts and support for adoption/surrogacy reimbursement.
  • Access to Mental Health Support and Employee Assistance Programs.
Full Job Description
In this role, you will be responsible for the development, enablement, and performance optimization of large scale ML model training across diverse model families. This includes massive scale pre-training and post-training of LLMs with Dense and Mixture-of-Experts architectures, Multimodal models that are transformer and diffusion based, and Reinforcement Learning workloads. You will work at the intersection of ML research and high performance systems, collaborating closely with chip architects, compiler engineers, runtime engineers and AWS solution architects to deliver cost-effective, performant machine learning solutions on AWS Trainium based systems.

Key job responsibilities

You will design, implement and optimize distributed training solutions for large scale ML models running on Trainium instances. A significant part of your work will involve extending and optimizing popular distributed training frameworks including FSDP (Fully-Sharded Data Parallel), torchtitan and Hugging Face libraries for the Neuron ecosystem.

A core focus of this role involves developing and optimizing mixed-precision and low-precision training techniques. You will work with BF16, FP8, and emerging numerical formats to maximize training throughput while maintaining model accuracy and convergence quality. This requires implementing precision aware training strategies, loss scaling techniques, and careful gradient management to ensure training stability across reduced precision formats. Understanding the tradeoffs between computational efficiency and numerical fidelity is essential to success in this position.

Beyond precision optimization, you will profile, analyze, and tune end-to-end training pipelines to achieve optimal performance on Trainium hardware. You will partner with hardware, compiler, and runtime teams to influence system design and unlock new capabilities. Additionally, you will work directly with AWS solution architects and customers to deploy and optimize training workloads at scale.

BASIC QUALIFICATIONS

- Bachelor's degree in computer science or equivalent

- 5+ years of non-internship professional software development experience

- 5+ years of programming with at least one software programming language experience

- 5+ years of leading design or architecture (design patterns, reliability and scaling) of new and existing systems experience

- 5+ years of full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations experience

- Experience as a mentor, tech lead or leading an engineering team

- Experience in machine learning, large scale training with LLMs and expertise in Pytorch.

PREFERRED QUALIFICATIONS

- Master's degree in computer science or equivalent

- Experience in computer architecture

- Previous software engineering expertise with Pytorch/Jax/Tensorflow, Distributed libraries and Frameworks, End-to-end Model Training.

The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.

USA, CA, Cupertino - 193,300.00 - 261,500.00 USD annually

About Amazon

Audible is a provider of spoken audio information and entertainment , on the Internet. They provide premium spoken audio content, such as audio versions of books and newspapers and radio programs, that is delivered over the Internet and played back on personal computers and hand-held electronic devices. The Audible service allows consumers to purchase and download their content from their Website, store it in digital files and play it back on personal computers and electronic devices. More than 15,000 hours of audio content are available on their Web site, including audio versions of books, periodicals and radio programs. Several manufacturers have agreed to support and promote the playback of their content on their hand-held audio-enabled electronic devices.

Amazon Careers

Joining Amazon presents an unparalleled opportunity to become part of a vibrant team pushing the boundaries of innovation and growth in the global marketplace. As a leader in e-commerce, technology, and logistics, Amazon offers a variety of job opportunities that cater to a range of skills and professional interests. Work You’ll Do At Amazon, every day is an opportunity to collaborate with the brightest minds in technology and business to redefine what’s possible. Whether you’re interested in software development, marketing, human resources, or customer service, Amazon has a position waiting for you. Transform the way the world shops and innovates with our diverse and inclusive team. Amazon is not just a company; it’s a community where you can drive real change and contribute to projects impacting millions globally. Lead with Innovation and Leadership Amazon is the perfect place to enhance your leadership and innovation skills. Our culture encourages pushing the envelope and imagining the unimaginable. Here, you will lead projects that challenge the status quo and define new industry standards. Work with a team that values diversity and is committed to creating an inclusive environment. Our leadership is focused on harnessing the collective power of unique perspectives to foster growth and innovation. Explore Amazon’s Employment Benefits Amazon’s commitment to its employees extends beyond just career growth. We offer competitive benefits, including health care, parental leave, and diversity training, ensuring that our team not only excels professionally but also enjoys well-being and security. Internship and Networking Opportunities Start your career with an Amazon internship and gain hands-on experience that matters. Our internships provide a gateway to full-time employment and an opportunity to network with professionals across various sectors of the company. Future-Proof Your Career With Amazon, your career path is filled with numerous opportunities for advancement. Our learning and development programs are designed to nurture your professional growth and keep you at the forefront of industry trends. Stay Connected Join Our Team Discover the job opportunities at Amazon that match your skills and interests. We are constantly on the lookout for passionate, curious, and innovative team players ready to make a difference. Keep Up to Date Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today—all from the people who work here. Job Alert Emails Customize your subscription to receive job alerts, the latest news, and insider tips tailored to your preferences. Explore the exciting and rewarding career opportunities that await at Amazon. Amazon is more than just a company—it’s a platform for building a promising future. Whether you’re starting or looking to advance your career, Amazon offers the resources, support, and network you need to succeed. Join us, and be a part of our continuing mission to be Earth's most customer-centric company.
Learn more about Amazon
Size
1,608 employees
Market Cap
$832.6 billion
Industry
Net Income
$21.3 billion
Founded
1994
5 Year Trend
+28.1%
Revenue
$386 billion
NASDAQ

Similar Jobs

More Jobs at Amazon

More Information Technology Jobs

Find similar Sr. Software Engineer- AI/ML, AWS Neuron Distributed Training jobs: