About the Role:
In the Senior Engineer role, you will own meaningful subsystems of Stack AV's inference platform and drive them from design through production. You will be the go-to engineer for one or more areas such as model onboarding, serving APIs, metering, observability, performance optimization, or tenant isolation. The role requires strong hands-on implementation, production debugging, thoughtful design, and the ability to mentor engineers while keeping delivery moving.
Responsibilities:
- Own technical design and delivery of subsystems in a high-throughput, low-latency inference platform capable of handling multi-tenant, enterprise-grade inference workloads.
- Develop robust API layers (gRPC, WebSockets, REST, etc.) and developer SDKs that abstract complex distributed inference orchestration into seamless, reliable token streams.
- Build and harden a multi-tenant control plane to enable accurate metering, rate limiting, quotas, tenant isolation and noisy-neighbor fairness across the platform.
- Optimize inference performance across the entire system stack, including the model engine layer.
- Build observability and SLOs to gain insights into system economics, cache-hit rates, GPU utilization and cost accounting per model and per tenant.
- Partner with product and infrastructure teams on model onboarding, capacity planning, external API contracts and customer adoption.
- Decompose ambiguous work, drive issues to closure, and raise the engineering bar through code quality, reviews, testing, and mentoring.
Qualifications:
- Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
- Experience: 4+ years of experience building and operating backend distributed systems end to end.
- Strong data and ML systems fundamentals: data-intensive distributed systems, concurrency, networking, and performance profiling.
- Hands-on experience with large-scale inference services on GPUs, including KV caches, prefill/decode stages and throughput/latency trade-offs.
- Direct experience with inference engines (TensorRT, vLLM, etc.) or serving frameworks (Dynamo, Triton, or equivalent).
- Technical Skills:
  - Strong programming skills in C++, Go, Rust or Python.
  - Familiarity with deep learning frameworks (PyTorch, etc.) as well as model parallelism.
  - Familiarity with GPU computing primitives such as CUDA, NCCL, NVLink, and hardware-specific optimizations.
  - Practical understanding of high-performance networking architectures, including InfiniBand, RoCE, and low-latency cluster communication.
- Problem-Solving: Strong analytical and problem-solving skills.
- Autonomous vehicles (AV) experience is a bonus.