Sr. Principal Software Engineer

Cerence Inc.
$141K - $226K
US-Anywhere
Remote in United States
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in ML inference performance optimization
  • In-depth knowledge of GPU architecture and memory hierarchies
  • Proficiency in CUDA and low-level performance tuning
  • Experience deploying models in production environments
  • Familiarity with inference engines like vLLM, TensorRT-LLM, llama.cpp, QAIRT
  • Expertise in quantization techniques including INT8, INT4, FP4, FP8, AWQ, GPTQ
  • A solid understanding of latency optimization strategies

Responsibilities

  • Optimize and deploy LLM inference pipelines
  • Manage inference runtimes across various platforms
  • Enhance model performance using techniques like quantization and kernel fusion
  • Drive improvements in latency and throughput for production products
  • Enable efficient deployment independent of external vendors
  • Build expertise in key inference engines
  • Adapt runtimes for constrained environments

Benefits

  • Annual bonus opportunity
  • Comprehensive insurance coverage including medical, dental, vision, life, and disability
  • Paid time off and holidays
  • Company contributions to RRSP
  • Equity awards available for select roles
  • Remote or hybrid work options based on position
Full Job Description

What You Will Work On

  • Optimize and deploy high-performance LLM inference pipelines
  • Own inference runtimes across data center, edge, and embedded platforms
  • Push model performance through quantization, kernel fusion, and cache optimization
  • Drive latency and throughput improvements that directly impact production products
  • Enable efficient, reliable deployment without external vendor dependency

Core Responsibilities

Inference Engines & Runtime

  • Build deep expertise and ownership of:
      • vLLM
      • TensorRT-LLM
      • llama.cpp
      • QAIRT
  • Extend and tune inference engines using custom CUDA kernels
  • Adapt runtimes for constrained and embedded deployment environments

Quantization & Numerical Optimization

  • Implement and evaluate quantization strategies:
      • INT8, INT4, FP4, FP8, mixed precision
      • AWQ
      • GPTQ
  • Balance accuracy, latency, memory footprint, and throughput

KV Cache Optimization

  • Optimize key-value cache performance through:
      • Paging
      • Prefix caching
      • Cache-aware memory layout design
  • Reduce memory pressure while sustaining high throughput

Latency & Throughput Optimization

  • Design and tune:
      • Batching strategies
      • Continuous batching
      • Speculative decoding
  • Optimize tail latency and tokens/sec under real production traffic patterns

What Success Looks Like

  • Models deploy efficiently on edge and embedded devices, not just servers
  • Throughput (tokens/sec) significantly exceeds that of baseline implementations
  • End-to-end latency is minimized and predictable
  • Inference cost per request is materially reduced
  • The company is no longer dependent on partners for inference optimization

Required Experience & Skills

Strongly Required

  • Proven experience optimizing ML inference performance in production
  • Deep understanding of GPU architecture and memory hierarchies
  • Hands-on experience with CUDA and low-level performance tuning
  • Experience deploying models beyond research environments

Critical Technical Skills

  • Inference engines: vLLM, TensorRT-LLM, llama.cpp, QAIRT
  • CUDA kernel development and profiling
  • Quantization techniques: INT8/INT4/FP4/FP8, AWQ, GPTQ
  • KV cache optimization and memory layout design
  • Latency optimization: batching, speculative decoding, continuous batching

Common Problems You'll Be Solving

  • Deploy efficiently on edge or embedded targets
  • Achieve competitive tokens/sec
  • Reduce and stabilize inference latency

You will be responsible for closing these gaps, creating a major competitive advantage.

What we offer

We offer a generous compensation and benefits package (in addition to the base salary), including:

  • Salary range: $141,400 USD - $226,300 USD. It is not typical for offers to be made at or near the top of the range. The actual salary will be determined based on experience and other job-related factors.
  • Annual bonus opportunity
  • Insurance coverage (medical, dental, vision, life, and disability)
  • Paid time off
  • Paid holidays
  • Company contribution to the RRSP (Registered Retirement Savings Plan)
  • Equity awards for certain positions and levels
  • Remote and/or hybrid work available depending on the position

All compensation and benefits are subject to the terms and conditions of the underlying plans or programs, as applicable, and may be amended, terminated, or replaced from time to time.

About Cerence Inc.

Cerence Inc. is a software company that specializes in voice recognition and natural language understanding technology. The company was spun off from Nuance Communications in 2019 and is headquartered in Burlington, Massachusetts. Cerence's software is used in a variety of applications, including automotive infotainment systems, smart speakers, and virtual assistants. The company's clients include many of the world's leading automakers, as well as companies in the consumer electronics and mobile device industries. Cerence has received several awards for its technology, including the 2020 CES Innovation Award for its Cerence Drive platform.
Size: 1,200 employees
Market Cap: $726.1 million
Net Income: $12.7 million
5 Year Trend: +6%
Revenue: $347.1 million
