Member of Technical Staff, Infrastructure

Chakra

• $120K — $150K *

Information Technology

Less than 5 years of experience

2 months ago

Qualifications

Roughly 3-5 years of experience running Kubernetes or similar orchestration tools in production.
Proven experience building or maintaining message-driven architectures on SQS, Kafka, or similar.
Experience running large language model (LLM) workloads at scale.
Ability to troubleshoot production systems involving container orchestration and distributed computing.
Familiarity with observability tools such as Grafana and OpenTelemetry.

Responsibilities

Design and oversee agent orchestration at scale with a focus on concurrency control and failure handling.
Create realistic environments and challenging tasks that push AI agents to their limits.
Stay current with developments in agent evaluation methods and modalities.
Implement structured logging and monitoring using tools like Prometheus and Grafana.
Ship integrations with external orchestration frameworks.

Benefits

Work in an early-stage team with a high level of autonomy.
Direct collaboration with AI researchers and labs pushing the boundaries of technology.
Opportunity to own and architect whole systems rather than just executing ticketed tasks.
Engage with cutting-edge technology in a rapidly evolving field.

What You'd Work On

Agent orchestration at scale. Hundreds of agent runs at once, each with its own stateful environment. 100M tokens per minute across the fleet. You own the dispatch layer: SQS, concurrency control, failure handling.
Environment and task design. We need environments that feel real and scenarios that actually push agents to their limits. You'd figure out how to build new evaluations and design the tasks that test what matters, not just what's easy to measure.
New frontiers. The agent evaluation space is moving fast. You'd stay on that edge, supporting new environment modalities and shipping integrations with external orchestration frameworks.
Observability. Prometheus and OpenTelemetry across services, Grafana dashboards, structured logging.

About You

Container orchestration. You're comfortable running Kubernetes or similar in production. Auto-scaling, pod lifecycle, persistent storage, networking. You can figure out why something won't schedule and reason about resource contention.
Distributed systems. You've built or maintained message-driven architectures. SQS, Kafka, or similar. You know how to keep jobs moving when things back up, retry without duplicating, and fail without losing work.
LLM infrastructure. You've run LLM workloads at scale. Token instrumentation, rate limit handling, prompt caching, multi-provider routing. You've built the plumbing between models and external tools, and you know what it takes to keep it all running under load.
Experience. No hard rule. Roughly 3-5 years at this level, but more or less works if the above sounds like you.

What Makes This Different

It's infra, but the workload is AI agents. You're monitoring model behavior alongside pod health, debugging token throughput alongside network throughput.
Our customers are AI researchers and labs. You'd work directly with the people pushing the frontier of what agents can do, and build the infrastructure they run it on.
Early-stage team. You own whole systems, not tickets in a queue. One week you're shipping a new environment type, the next you're scaling the dispatch layer to handle 10x the throughput.
