Senior Software Engineer, Engine & Distributed Systems

StackAI

$130K — $180K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years of experience building production backend systems, specifically in distributed systems.
  • Hands-on experience with durable execution or workflow orchestration tools (e.g., Temporal, Cadence, Airflow).
  • Strong understanding of concurrency, queueing, retries, and fault tolerance under load.
  • Proficient in Python and modern backend frameworks (such as FastAPI) with a solid grasp of database fundamentals.
  • Interest in solving complex correctness problems essential for system reliability.

Responsibilities

  • Own the execution engine, including runtime, scheduling, and sub-agent parallelization.
  • Build systems for long-running work to ensure durability and recovery after failures.
  • Shape the execution model for efficient workload scheduling and responsiveness under load.
  • Engineer the system for scale and reliability, holding it to health targets (worker freshness, deploy safety, drain time) while keeping latency and throughput strong.
  • Ensure the engine remains open to integrating new agent harnesses and orchestration frameworks.

Benefits

  • Join a lean, high-impact team.
  • Ship impactful work quickly across the product.
  • Exposure to cutting-edge distributed systems challenges in AI.
  • Own the core engine that every customer's agents run on.
  • Work in an environment conducive to rapid personal and professional growth.

Full Job Description

The role

Enterprises run real work on AI agents, and at Stack AI that work runs on a single engine. Some agents finish in a second. Others run for days, fan out into dozens of sub-agents, pause, resume, and recover from failures without losing a step. We're hiring a Senior Software Engineer, Engine & Distributed Systems to own that engine: the durable runtime at the core of the platform that has to be correct every time, at any scale.

This is deep systems work at the heart of the product. When the engine is solid, agents simply run - and getting it there is one of the more interesting distributed-systems problems in AI today. You'll own it end to end, from the execution model to how it behaves in production.

What you'll do
  • Own the execution engine. The runtime, scheduling, and sub-agent parallelization that run every agent on the platform.
  • Make long-running work durable. Build checkpointing, resumption, and recovery so agents survive failures and restarts and pick up exactly where they left off.
  • Shape the execution model. Decide how work is scheduled, queued, and moved from synchronous to asynchronous, so the platform stays correct and responsive as load grows.
  • Engineer for scale and reliability. Hold the engine to strict health targets for worker freshness, deploy safety, and drain time, and keep latency and throughput strong as volume grows.
  • Keep the engine open to the ecosystem. Make it straightforward to bring new agent harnesses, orchestration frameworks, and model capabilities into the runtime.

What we're looking for
  • 5+ years building backend systems in production, with real depth in distributed systems.
  • Hands-on experience with durable execution or workflow orchestration (Temporal, Cadence, Airflow, or equivalent), with a way of thinking rooted in idempotency, state machines, and failure recovery.
  • Strong command of concurrency, queueing, retries, and fault tolerance under load.
  • Strong in Python and modern backend frameworks (FastAPI or similar), with sound database fundamentals (Postgres or similar).
  • You're drawn to the correctness problems that everything else quietly depends on.

Distributed systems is broad. If you're strong on most of this and excited to grow into the rest, we'd like to hear from you, even if you don't check every box.

Bonus points
  • Operating Temporal at scale.
  • Event-driven architectures and message queues.
  • Experience with PydanticAI, LangGraph, or similar.
  • AI or agent runtimes: tool-calling, sub-agent orchestration, streaming.
  • Performance and cost optimization of high-throughput backends.
  • Startup or growth-stage experience.

You'll join a lean, high-impact team and own the engine that every customer's agents run on. Your work ships fast and is felt across the whole product.
