Member of Technical Staff, Infrastructure

Chakra

$120K — $150K *
Information Technology
Less than 5 years of experience
Job Overview by Ladders

Qualifications

  • Roughly 3-5 years of experience (no hard rule), including running Kubernetes or similar orchestration tools in production.
  • Proven experience building or maintaining message-driven architectures on SQS, Kafka, or similar.
  • Experience running large language model (LLM) workloads at scale.
  • Ability to troubleshoot production systems involving container orchestration and distributed computing.
  • Familiarity with observability tools such as Grafana and OpenTelemetry.

Responsibilities

  • Design and oversee agent orchestration at scale with a focus on concurrency control and failure handling.
  • Create realistic environments and challenging tasks that push AI agents to their limits.
  • Stay current with developments in agent evaluation methods and modalities.
  • Implement structured logging and monitoring using tools like Prometheus and Grafana.
  • Integrate external orchestration frameworks to enhance system capabilities.

Benefits

  • Work in an early-stage team with a high level of autonomy.
  • Direct collaboration with AI researchers and labs pushing the boundaries of technology.
  • Opportunity to own and architect whole systems rather than just executing ticketed tasks.
  • Engage with cutting-edge technology in a rapidly evolving field.

Full Job Description

What You'd Work On
  • Agent orchestration at scale. Hundreds of agent runs at once, each with its own stateful environment. 100M tokens per minute across the fleet. You own the dispatch layer: SQS, concurrency control, failure handling.
  • Environment and task design. We need environments that feel real and scenarios that actually push agents to their limits. You'd figure out how to build new evaluations and design the tasks that test what matters, not just what's easy to measure.
  • New frontiers. The agent evaluation space is moving fast. You'd stay on that edge, supporting new environment modalities and shipping integrations with external orchestration frameworks.
  • Observability. Prometheus and OpenTelemetry across services, Grafana dashboards, structured logging.
About You
  • Container orchestration. You're comfortable running Kubernetes or similar in production. Auto-scaling, pod lifecycle, persistent storage, networking. You can figure out why something won't schedule and reason about resource contention.
  • Distributed systems. You've built or maintained message-driven architectures. SQS, Kafka, or similar. You know how to keep jobs moving when things back up, retry without duplicating, and fail without losing work.
  • LLM infrastructure. You've run LLM workloads at scale. Token instrumentation, rate limit handling, prompt caching, multi-provider routing. You've built the plumbing between models and external tools, and you know what it takes to keep it all running under load.
  • Experience. No hard rule. Roughly 3-5 years at this level, but more or less works if the above sounds like you.
What Makes This Different
  • It's infra, but the workload is AI agents. You're monitoring model behavior alongside pod health, debugging token throughput alongside network throughput.
  • Our customers are AI researchers and labs. You'd work directly with the people pushing the frontier of what agents can do, and build the infrastructure their agents run on.
  • Early-stage team. You own whole systems, not tickets in a queue. One week you're shipping a new environment type, the next you're scaling the dispatch layer to handle 10x the throughput.
