Job Description:
What You Will Work On
- Optimize and deploy high-performance LLM inference pipelines
- Own inference runtimes across data center, edge, and embedded platforms
- Push model performance through quantization, kernel fusion, and cache optimization
- Drive latency and throughput improvements that directly impact production products
- Enable efficient, reliable deployment without external vendor dependency
Core Responsibilities
Inference Engines & Runtime
- Build deep expertise in and ownership of inference engines such as vLLM, TensorRT-LLM, llama.cpp, and QAIRT
- Extend and tune inference engines using custom CUDA kernels
- Adapt runtimes for constrained and embedded deployment environments
Quantization & Numerical Optimization
- Implement and evaluate quantization strategies: INT8, INT4, FP4, FP8, and mixed precision
- Balance accuracy, latency, memory footprint, and throughput
KV Cache Optimization
- Optimize key-value cache performance through cache-aware memory layout design
- Reduce memory pressure while sustaining high throughput
Latency & Throughput Optimization
- Optimize tail latency and tokens/sec under real production traffic patterns
What Success Looks Like
- Models deploy efficiently on edge and embedded devices, not just servers
- Tokens/sec throughput significantly exceeds that of baseline implementations
- End-to-end latency is minimized and predictable
- Inference cost per request is materially reduced
- The company is no longer dependent on partners for inference optimization
Required Experience & Skills
Strongly Required
- Proven experience optimizing ML inference performance in production
- Deep understanding of GPU architecture and memory hierarchies
- Hands-on experience with CUDA and low-level performance tuning
- Experience deploying models beyond research environments
Critical Technical Skills
- Inference engines: vLLM, TensorRT-LLM, llama.cpp, QAIRT
- CUDA kernel development and profiling
- Quantization techniques: INT8/INT4/FP4/FP8, AWQ, GPTQ
- KV cache optimization and memory layout design
- Latency optimization: batching, speculative decoding, continuous batching
Common Problems You'll Be Solving
- Deploy efficiently on edge or embedded targets
- Achieve competitive tokens/sec
- Reduce and stabilize inference latency
You will be responsible for closing these gaps, creating a major competitive advantage.
What We Offer
We offer a generous compensation and benefits package (in addition to the base salary), including:
- Salary range: $141,400 USD - $226,300 USD. It is not typical for offers to be made at or near the top of the range. The actual salary will be determined based on experience and other job-related factors.
- Insurance coverage (medical, dental, vision, life, and disability)
- Company contribution to the RRSP (Registered Retirement Savings Plan)
- Equity awards for certain positions and levels
- Remote and/or hybrid work available depending on the position
All compensation and benefits are subject to the terms and conditions of the underlying plans or programs, as applicable, and may be amended, terminated, or replaced from time to time.