5-7 years of experience deploying ML models in production environments
Proficient in PyTorch and contemporary ML architectures
Experience with GPU systems and CUDA debugging
Proven track record in building scalable data pipelines
Familiarity with model compression techniques
Responsibilities
Collaborate with researchers to transition experimental models to production systems
Improve model inference latency and throughput through quantization, pruning, and distillation
Accelerate multimodal models with optimization tools such as TensorRT, ONNX, and vLLM
Design and implement large-scale data ingestion and training pipelines
Create frameworks for evaluating model quality and guiding improvements
Benefits
In-person collaboration at the Seattle HQ, fostering teamwork and innovation
Full Job Description
Responsibilities
Operationalize Research: Collaborate with researchers to move models from experimental checkpoints to production-ready systems. Establish patterns for large-scale training, rapid experimentation, and deployment of new architectures.
Optimize Model Performance: Profile and improve model inference for latency and throughput using quantization, pruning, distillation, and architectural refinements to ensure viable unit economics.
Model Acceleration: Apply optimization tools such as TensorRT, ONNX, and vLLM to accelerate multimodal models, including video diffusion models, LLMs, and speech models.
Design Data Pipelines: Design and implement efficient pipelines for video data ingestion, preprocessing, and training at petabyte scale using tools like Dagster and Ray.
Evaluate and Iterate: Build evaluation frameworks to measure model quality, establish benchmarks, and guide continuous improvement of model capabilities.
Requirements
Production ML: Experience deploying ML models to production. You understand common failure modes and operational challenges (resource contention, OOMs, batch optimization) and how to address them.
Deep Learning Experience: Strong knowledge of PyTorch and modern ML architectures. Experience training and optimizing large models (transformers, diffusion models, or similar).
Systems Proficiency: Comfortable working with GPUs, debugging CUDA issues, and profiling model workloads to identify compute or memory bottlenecks.
Data Engineering: Experience building scalable data pipelines for high-bandwidth media processing and training workflows.
Preferred Experience
Experience with video or audio models in research or production settings
Familiarity with low-level optimization (CUDA kernels, Triton, custom operators)
Knowledge of real-time ML systems and latency-critical inference
Prior work with model compression techniques (quantization, distillation, pruning)
Nuance Labs Key Facts
In-person collaboration, 5 days a week at Seattle HQ