ML Kernel Performance Engineer, Edge AI and Science
Within Edge AI & Science, the AI Platform team builds a first-of-its-kind compression platform that enables 20-100x neural network compression for edge and cloud deployment. As model sizes grow from billions to hundreds of billions of parameters, compute efficiency becomes the single largest return on engineering investment during training. The gap between eager-mode Python and optimized GPU execution is where months of training time are won or lost.
We are looking for an ML Kernel Performance Engineer to work at the hardware-software boundary of this platform, crafting high-performance CUDA and Triton kernels that make our compression algorithms run at peak efficiency during training, fine-tuning, and inference. You will build the tooling and kernel libraries that democratize GPU performance optimization across the team, enabling scientists and engineers to profile, diagnose, and fix kernel bottlenecks without needing to be CUDA experts themselves.
Working alongside compression scientists and platform engineers, you will ensure that novel quantization schemes (ternary, nonary, mixed-precision) and sparse computation patterns translate into real throughput gains on GPU hardware. Your work will directly accelerate every training run in the organization and unlock deployment of compressed models to both edge devices and cloud inference.
Key job responsibilities
- Design and implement high-performance CUDA and Triton kernels for quantization-aware training, sparse matrix operations, and low-bit inference on modern GPU accelerators
- Analyze and optimize kernel-level performance for compression training workloads, using profiling tools to identify and resolve the bottlenecks that stretch model training from days to weeks
- Implement kernel-level optimizations such as operator fusion, tiling, memory access pattern optimization, and scheduling for compression-specific compute patterns
- Build a kernel development harness that enables any team member to profile kernel performance, test forward/backward accuracy, and validate at production scale, lowering the bar from "CUDA expert" to "any engineer with agents"
- Maintain and extend the team's training kernels library with clean interfaces, CI, and examples that enable scientists to contribute kernel improvements alongside platform engineers
- Collaborate closely with Applied Scientists, compiler engineers, and hardware architects to co-design ML-centric solutions that unify software and hardware for both cloud and edge deployment
- Develop inference kernels for cloud deployment (custom backends for quantized models that keep weights packed in memory and reconstruct them on the fly for compute)
- Build and maintain performance regression tests and benchmarking infrastructure that track kernel efficiency as models scale from billions to hundreds of billions of parameters
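To make "keep weights packed in memory and reconstruct on the fly" concrete, here is a minimal pure-Python sketch for ternary weights. It is an illustration only, not the team's actual storage format or kernel. The base-3 layout (five weights per byte, since 3**5 = 243 fits in one byte) is an assumption, and the function names `pack_ternary`, `unpack_ternary`, and `packed_dot` are hypothetical. A production version would do the decode inside a CUDA or Triton kernel rather than in Python.

```python
# Illustrative sketch only: layout and names are assumptions, not a real library API.
TRITS_PER_BYTE = 5  # 3**5 = 243, so five ternary digits fit in one byte


def pack_ternary(weights):
    """Pack weights drawn from {-1, 0, 1} into bytes, five weights per byte."""
    packed = bytearray()
    for start in range(0, len(weights), TRITS_PER_BYTE):
        chunk = weights[start:start + TRITS_PER_BYTE]
        value = 0
        # Encode the chunk as a base-3 number; the first weight is the lowest digit.
        for w in reversed(chunk):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value = value * 3 + (w + 1)
        packed.append(value)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Yield the first `count` weights, decoding one byte at a time."""
    emitted = 0
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            if emitted == count:
                return  # stop before the padding digits of the last byte
            byte, digit = divmod(byte, 3)
            yield digit - 1
            emitted += 1


def packed_dot(packed, activations):
    """Dot product that decodes weights lazily; no full weight list is built."""
    decoded = unpack_ternary(packed, len(activations))
    return sum(w * a for w, a in zip(decoded, activations))


weights = [1, -1, 0, 0, 1, -1, 1]
activations = [0.5, 2.0, 3.0, -1.0, 4.0, 1.0, 2.0]
packed = pack_ternary(weights)
print(len(packed), packed_dot(packed, activations))
```

Seven weights occupy two bytes here, and the dot product never materializes the unpacked weights, which is the property a packed inference backend relies on to cut memory traffic.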
A day in the life
A scientist files a ticket: "QAT training on our large model is 4x slower than expected." You pull up the profiler, identify that a custom quantizer kernel is thrashing shared memory at scale, write a Triton replacement that tiles correctly for the layer shapes at that model size, validate accuracy in the test harness, and push it to the kernels repo. By end of day, the training run that was taking four days now takes one.
You will also build the tooling that makes this workflow repeatable by others. You will participate in design discussions with Applied Scientists, translate their algorithmic ideas into efficient GPU implementations, and work in a startup-like environment where every engineering hour directly accelerates the team's ability to ship compressed models.
About the team
The AI Platform team builds Amazon's neural network compression platform. We compress models using knowledge distillation, network restructuring, and advanced quantization to achieve 20-100x compression while preserving model quality. Our platform packages these into automated pipelines that deploy to both custom edge silicon and GPU-based cloud inference.
As model sizes grow, the proprietary advantage shifts from the science to the software (making it work at hundreds of billions of parameters is the moat). GPU kernel performance is the biggest single lever on training throughput, and we expect AI-assisted development tooling to significantly multiply engineering productivity, meaning a small team with the right harness can operate at the scale of a much larger one.
The ML Kernel Performance Engineer bridges science and platforms: you turn algorithmic innovations into production-grade GPU code that runs at scale. You will work alongside Applied Scientists, compiler engineers, hardware architects, and platform developers in a small, agile team building the next generation of edge AI for Amazon's consumer products.
BASIC QUALIFICATIONS
- 3+ years of non-internship professional software development experience
- 2+ years of non-internship experience in the design or architecture (design patterns, reliability, and scaling) of new and existing systems
- Knowledge of Python and/or C++ programming
- Experience with CUDA kernels or ML/low-level kernels, or experience in developing and deploying LLMs in production on GPUs, Neuron, TPU or other AI acceleration hardware
PREFERRED QUALIFICATIONS
- Bachelor's degree in computer science or equivalent
- 3+ years of experience with the full software development life cycle, including coding standards, code reviews, source control management, build processes, testing, and operations
- Experience with GPU kernel optimization and GPGPU computing (CUDA, Triton, SYCL, or ROCm)
- Proficiency in low-level performance optimization for GPUs
- Understanding of GPU memory hierarchies and optimization strategies (shared memory, L1/L2 cache, register pressure, memory coalescing)
- Experience developing high-performance libraries for ML or HPC applications
- Knowledge of ML frameworks (PyTorch, TensorFlow) and their GPU backends
- Experience implementing custom PyTorch operators (torch.autograd.Function, C++ extensions)
- Experience with parallel programming and optimization techniques
- Background in neural network compression (quantization, pruning, knowledge distillation, low-rank factorization)
- Knowledge of mixed-precision training and inference (FP16, BF16, FP8, INT8, INT4)
- Experience with inference optimization (TensorRT, ONNX Runtime, vLLM, or similar)
- Familiarity with Transformer architectures, attention mechanisms, and their compute/memory profiles
- Experience with AWS Trainium/Inferentia or the Neuron Kernel Interface (NKI)
- Experience with edge deployment, model compilation, or hardware-aware optimization
The base salary range for this position is listed below. Your Amazon package will include sign-on payments and restricted stock units (RSUs). Final compensation will be determined based on factors including experience, qualifications, and location. Amazon also offers comprehensive benefits including health insurance (medical, dental, vision, prescription, Basic Life & AD&D insurance and option for Supplemental life plans, EAP, Mental Health Support, Medical Advice Line, Flexible Spending Accounts, Adoption and Surrogacy Reimbursement coverage), 401(k) matching, paid time off, and parental leave. Learn more about our benefits at https://amazon.jobs/en/benefits.
USA, CA, Sunnyvale - 165,200.00 - 223,600.00 USD annually