Staff Software Engineer, Machine Learning Inference Platform

Stack AV

$120K - $150K
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • 7+ years of experience building and operating backend distributed systems end to end.
  • Cross-team technical leadership in backend distributed systems, ML infrastructure, or high-performance compute platforms.
  • Strong fundamentals in data-intensive distributed systems, concurrency, and performance profiling.
  • Hands-on experience running large-scale inference services on GPUs.
  • Direct experience with inference engines like TensorRT or serving frameworks such as Dynamo or Triton.
  • Strong programming skills in C++, Go, Rust, or Python.

Responsibilities

  • Design platform architecture for multi-tenant inference workloads covering serving, orchestration, and APIs.
  • Develop robust API layers (gRPC, WebSockets, REST) and SDKs for distributed inference orchestration.
  • Build and maintain a multi-tenant control plane with metering, rate limiting, and tenant isolation features.
  • Optimize inference performance across the system stack including the model engine layer.
  • Implement observability and SLOs that give insight into system economics and resource utilization.
  • Collaborate with product and infrastructure teams for model onboarding and capacity planning.
  • Promote and maintain a culture of engineering excellence within the team.

Benefits

  • Comprehensive health and wellness programs.
  • Flexible work hours and remote work options.
  • Opportunities for professional development and training.
  • Collaborative work environment with a strong emphasis on innovation.
  • Access to cutting-edge technology and tools for performance optimization.

Full Job Description

About the Role:

In the Staff Engineer role, you will define and drive architecture for a high-throughput, low-latency, multi-tenant ML inference platform. You will balance hands-on coding with long-term technical direction, operate across ML Platform, infrastructure, MLE, and external-facing API needs, and establish principled architecture for serving, control plane, observability, capacity, tenant isolation, system economics, and model-engine integration.

Responsibilities:
  • Design platform architecture for multi-tenant inference workloads across serving, orchestration, control plane, APIs, SDKs, observability, and model-engine integration.
  • Develop robust API layers (gRPC, WebSockets, REST, etc.) and developer SDKs that abstract complex distributed inference orchestration into seamless, reliable token streams.
  • Build and harden a multi-tenant control plane to enable accurate metering, rate limiting, quotas, tenant isolation and noisy-neighbor fairness across the platform.
  • Optimize inference performance across the entire system stack, including the model engine layer.
  • Build observability and SLOs to gain insights into system economics, cache-hit rates, GPU utilization and cost accounting per model and per tenant.
  • Partner with product and infrastructure teams on model onboarding, capacity planning, external API contracts and customer adoption.
  • Promote engineering excellence: maintain a high bar in your own work and set a culture of engineering excellence within the team.

Qualifications:
  • Education: Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
  • Experience: 7+ years of experience building and operating backend distributed systems end to end.
  • Demonstrated cross-team technical leadership in backend distributed systems, ML infrastructure, inference serving, or high-performance compute platforms.
  • Strong data and ML systems fundamentals: data-intensive distributed systems, concurrency, networking, and performance profiling.
  • Hands-on experience running large-scale inference services on GPUs, including KV caches, prefill/decode stages and throughput/latency trade-offs.
  • Direct experience with inference engines (TensorRT, vLLM, etc.) or serving frameworks (Dynamo, Triton, or equivalent).
  • Technical Skills:
    • Strong programming skills in C++, Go, Rust or Python.
    • Familiarity with deep learning frameworks (PyTorch, etc.) as well as model parallelism.
    • Familiarity with GPU computing primitives such as CUDA, NCCL, NVLink, and hardware-specific optimizations.
    • Practical understanding of high-performance networking architectures, including InfiniBand, RoCE, and low-latency cluster communication.
  • Communication: Excellent verbal and written communication skills, with the ability to convey complex technical concepts to non-technical stakeholders.
  • Autonomous vehicles (AV) experience is a bonus.
