Member of Technical Staff, Infrastructure

Chakra

$120K — $150K *
Enterprise Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Experience with Kubernetes or similar container orchestration in a production environment.
  • Familiarity with message-driven architectures like SQS or Kafka.
  • Background in running large-scale LLM workloads including token instrumentation and prompt caching.
  • 3-5 years of relevant experience preferred, with flexibility based on skill set.

Responsibilities

  • Manage agent orchestration across multiple runs and environments.
  • Design and create realistic evaluation environments for agent tasks.
  • Stay current with advances in the agent evaluation space and support new environment modalities.
  • Implement observability tools using Prometheus, Grafana, and structured logging.

Benefits

  • Ownership of entire systems rather than just tasks.
  • Opportunity to work directly with AI researchers and labs.
  • Dynamic, early-stage team environment with rapid iteration and scaling opportunities.
Full Job Description
What You'd Work On
  • Agent orchestration at scale. Hundreds of agent runs at once, each with its own stateful environment. 100M tokens per minute across the fleet. You own the dispatch layer: SQS, concurrency control, failure handling.
  • Environment and task design. We need environments that feel real and scenarios that actually push agents to their limits. You'd figure out how to build new evaluations and design the tasks that test what matters, not just what's easy to measure.
  • New frontiers. The agent evaluation space is moving fast. You'd stay on that edge, supporting new environment modalities and shipping integrations with external orchestration frameworks.
  • Observability. Prometheus and OpenTelemetry across services, Grafana dashboards, structured logging.
About You
  • Container orchestration. You're comfortable running Kubernetes or similar in production. Auto-scaling, pod lifecycle, persistent storage, networking. You can figure out why something won't schedule and reason about resource contention.
  • Distributed systems. You've built or maintained message-driven architectures. SQS, Kafka, or similar. You know how to keep jobs moving when things back up, retry without duplicating, and fail without losing work.
  • LLM infrastructure. You've run LLM workloads at scale. Token instrumentation, rate limit handling, prompt caching, multi-provider routing. You've built the plumbing between models and external tools, and you know what it takes to keep it all running under load.
  • Experience. No hard rule. Roughly 3-5 years at this level, but more or less works if the above sounds like you.
What Makes This Different
  • It's infra, but the workload is AI agents. You're monitoring model behavior alongside pod health, debugging token throughput alongside network throughput.
  • Our customers are AI researchers and labs. You'd work directly with the people pushing the frontier of what agents can do, and build the infrastructure they run their agents on.
  • Early-stage team. You own whole systems, not tickets in a queue. One week you're shipping a new environment type, the next you're scaling the dispatch layer to handle 10x the throughput.
