The Role
We are looking for an engineer to own the inference systems that power our models in production and research. You'll work across the full inference stack, from serving infrastructure down to hardware-level optimization. Example areas you might work on include (but are not limited to):
- Design and build high-throughput, low-latency inference serving systems for frontier models, optimizing for both research iteration and production deployment
- Optimize inference performance across GPU and accelerator hardware, maximizing FLOPs utilization, memory-bandwidth utilization, and overall compute efficiency for large-scale models
- Enable and extend distributed inference frameworks (e.g. vLLM, SGLang, TensorRT-LLM) to support novel architectures, long-context workloads, and agentic inference patterns
- Implement and validate inference-time optimizations: speculative decoding, quantization, KV cache management, and batching strategies
- Build observability and reliability infrastructure so the team can measure latency, throughput, and cost across every serving configuration
- Partner directly with teams to bring new model architectures and post-training techniques into production quickly
If you're excited about pushing the performance limits of frontier model inference, we'd love to hear from you.
We offer a base salary of $350,000-$500,000 USD and a meaningful equity grant, depending on experience and background, along with competitive benefits.