Senior Inference Engineer

DEEPREC.AI

$130K - $180K
Consumer Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of engineering experience in inference acceleration and model deployment.
  • Proven expertise in inference optimization, including quantization and attention acceleration.
  • Deep knowledge of GPU programming with CUDA and NCCL.
  • Familiarity with video generation models and large language models (LLMs).
  • Strong cross-discipline communication skills for collaboration.

Responsibilities

  • Lead and implement advanced inference acceleration techniques for efficient model serving.
  • Engineer and optimize GPU parallelism strategies for maximal efficiency and scalability.
  • Develop and optimize high-performance computing kernels and distributed workloads.
  • Collaborate with teams to bring video generation and large language models into production.
  • Contribute to improvements in model training speed and resource utilization.
  • Drive code reviews and mentor engineers on best practices in GPU programming.

Benefits

  • Equity in a fast-growing company driving innovation in generative AI.
  • Comprehensive health benefits and monthly stipends.
  • Company retreats promoting team collaboration.
  • A collaborative culture emphasizing teamwork and collective success.
Full Job Description
Senior Inference Engineer | AI Video Generation Company (Stealth) | Palo Alto, CA | Hybrid
About the Role
We are seeking a Senior Inference Engineer to accelerate the performance of our AI-driven video generation products. In this highly technical role, you will operate at the intersection of cutting-edge inference acceleration, GPU parallelism, advanced model deployment, and video generation technologies. Your expertise will drive significant improvements to model speed and efficiency, ensuring our creative AI systems deliver industry-leading user experiences at scale.

You will design and optimize inference pipelines, implement state-of-the-art acceleration techniques, and work closely with researchers and engineers across the team to push the boundaries of what's possible in real-time AI deployment. Your efforts will play a foundational role in powering the next generation of our video and language models.
What You'll Do
  • Accelerate Inference: Lead and implement advanced inference acceleration techniques, including attention optimization and quantization, for efficient model serving.
  • Maximize GPU Parallelism: Engineer and optimize GPU strategies across tensor, sequence, and pipeline parallelism (TP, SP, PP) for maximal efficiency and scalability.
  • Programming for Performance: Develop and optimize high-performance computing kernels and distributed workloads using CUDA and NCCL.
  • Advance AI Deployment: Collaborate with research and engineering teams to bring state-of-the-art video generation and large language models into production.
  • Improve Training Efficiency: Contribute to improvements in model training speed, stability, and resource utilization as part of our deployment lifecycle. (Bonus)
  • Technical Excellence: Drive rigorous code reviews, participate in technical discussions, and mentor fellow engineers on best practices in inference and GPU programming.
What We're Looking For
  • Experience: 5+ years of engineering experience, with a strong track record in inference acceleration and model deployment at scale.
  • Inference Mastery: Proven expertise in inference optimization, including quantization, attention acceleration, and deep learning compiler stacks.
  • GPU and Parallelism: Deep knowledge of GPU programming (CUDA, NCCL) and experience with SP, TP, PP, and other forms of parallelism for distributed inference.
  • AI Domain Knowledge: Familiarity with video generation models and large language models (LLMs).
  • Collaboration: Strong cross-discipline communication skills; able to drive shared goals across research and engineering functions.
  • Ownership Mindset: Self-driven, solutions-oriented, and capable of managing ambiguity in a fast-paced startup environment.
Nice to Have
  • Experience with high-throughput video or real-time streaming model deployment.
  • Familiarity with distributed training and optimization toolkits.
  • Contributions to open source projects in AI infrastructure or deep learning compilers.
  • Startup or rapid prototyping experience.
What We Offer
  • Competitive salary commensurate with AI industry benchmarks.
  • Equity in a fast-growing company shaping the future of generative AI.
  • Comprehensive health benefits, monthly stipends, and company retreats.
  • A collaborative, in-office culture focused on building and shipping together.
