Member of Technical Staff, Infrastructure

Chakra

• $120K — $150K *

Enterprise Technology

Less than 5 years of experience

1 month ago


Qualifications

Experience with Kubernetes or similar container orchestration in a production environment.
Familiarity with message-driven architectures built on SQS, Kafka, or similar.
Background in running large-scale LLM workloads including token instrumentation and prompt caching.
3-5 years of relevant experience preferred, with flexibility depending on skill set.

Responsibilities

Manage agent orchestration across multiple runs and environments.
Design and create realistic evaluation environments for agent tasks.
Stay updated with advancements in the agent evaluation space and integrate new modalities.
Implement observability across services using Prometheus, OpenTelemetry, Grafana dashboards, and structured logging.

Benefits

Ownership of entire systems rather than just tasks.
Opportunity to work directly with AI researchers and labs.
Dynamic, early-stage team environment with rapid iteration and scaling opportunities.

What You'd Work On

Agent orchestration at scale. Hundreds of agent runs at once, each with its own stateful environment. 100M tokens per minute across the fleet. You own the dispatch layer: SQS, concurrency control, failure handling.
Environment and task design. We need environments that feel real and scenarios that actually push agents to their limits. You'd figure out how to build new evaluations and design the tasks that test what matters, not just what's easy to measure.
New frontiers. The agent evaluation space is moving fast. You'd stay on that edge, supporting new environment modalities and shipping integrations with external orchestration frameworks.
Observability. Prometheus and OpenTelemetry across services, Grafana dashboards, structured logging.

About You

Container orchestration. You're comfortable running Kubernetes or similar in production. Auto-scaling, pod lifecycle, persistent storage, networking. You can figure out why something won't schedule and reason about resource contention.
Distributed systems. You've built or maintained message-driven architectures. SQS, Kafka, or similar. You know how to keep jobs moving when things back up, retry without duplicating, and fail without losing work.
LLM infrastructure. You've run LLM workloads at scale. Token instrumentation, rate limit handling, prompt caching, multi-provider routing. You've built the plumbing between models and external tools, and you know what it takes to keep it all running under load.
Experience. No hard rule. Roughly 3-5 years at this level, but more or less works if the above sounds like you.

What Makes This Different

It's infra, but the workload is AI agents. You're monitoring model behavior alongside pod health, debugging token throughput alongside network throughput.
Our customers are AI researchers and labs. You'd work directly with the people pushing the frontier of what agents can do, and build the infrastructure they run it on.
Early-stage team. You own whole systems, not tickets in a queue. One week you're shipping a new environment type, the next you're scaling the dispatch layer to handle 10x the throughput.
