About the Role
Specialize in low-latency, high-throughput inference for OCR and multimodal models. Own profiling, batching, and autoscaling across single-tenant and multi-tenant environments.
Responsibilities
- Build inference services with smart batching and caching
- Optimize kernels, tokenization, and model graphs
- Evaluate vLLM, TensorRT-LLM, and Triton tradeoffs
- Implement autoscaling and admission control with clear SLOs
- Own performance dashboards and capacity planning
Requirements
- 3+ years in performance engineering or ML systems
- Strong Python, plus C++ or CUDA exposure
- Experience with GPU profiling and model serving
Nice to have
- Experience reducing p95 latency and cost in production ML systems
Sponsorship
Sponsorship available.
Compensation and benefits
Competitive base salary plus equity, performance-based bonus, relocation assistance for Bay Area moves, a daily meal stipend, and medical, vision, and dental coverage.