THE OPPORTUNITY
One exceptional engineer. AI as the team.
This is not a standard DevOps posting. We are looking for one unusually capable, AI-native engineer to own our entire platform engineering and SRE function - using autonomous agents, LLM-powered pipelines, and MCP-based tooling as force multipliers to do the work of a team, on-site, in close partnership with our engineering leadership.
You will inherit a mature, fully containerized AWS estate (9 EKS clusters, 27 accounts, 228 Kubernetes nodes), an Akamai CDN layer managing live traffic splits, GitHub Actions + Jenkins CI/CD pipelines for a Webpack 5 micro-frontend monorepo, and an operational AI agent platform - OpsWhisperer - already in production, monitoring 25 AWS accounts with a 91% autonomous resolution rate.
Your job is to extend all of it, automate what remains manual, and be the person who makes every deployment, incident, and infrastructure change happen with speed, precision, and intelligence.
SCOPE OF OWNERSHIP
What you'll own
AWS Multi-Account Infrastructure
- EKS clusters across dedicated AWS accounts
- EC2 worker nodes via Auto Scaling Groups
- SQS pipelines
- AWS Bedrock (Claude) for AI agent workloads
Kubernetes & Containerization
- EKS clusters
- Node group management
- Kops clusters alongside EKS
- Multiple environment tiers with full blast-radius isolation
CI/CD & Release Management
- Multiple repositories
- GitHub Actions workflows + Jenkins pipeline management
- Turbo build system across multiple micro-frontend packages
- Canary release gating and rollback automation
CDN & Traffic Management
- Akamai Property Manager config
- Phased Release Cloudlet for Canary and Production split
- Security, Throttling and Monitoring
- Jenkins-driven cache invalidation
Observability & Incident Response
- Elastic/Kibana
- CloudWatch across all AWS accounts
- Business performance monitoring
- SQS backlog + pipeline health alerting
- On-call ownership with proactive, AI-assisted triage
NON-NEGOTIABLE
The AI-native expectation
This is a role where AI fluency is not a bonus - it is how you do the job. We expect you to build, operate, and improve autonomous agents that handle monitoring, alerting, triage, and routine operational work. You are not just a consumer of AI tools; you are the person who builds them, deploys them into production, and iterates on them based on real operational data.
You will extend OpsWhisperer (AI Platform and Observability agent), contribute to the Axle platform, build MCP servers that give agents new capabilities, and apply LLM-powered reasoning to infrastructure problems that previously required multiple humans. If you've never built an agent that runs in production unsupervised, this is not the right role.
WHAT YOU'LL INHERIT & EXTEND
The tech stack
Category: Technologies
Cloud & Orchestration: AWS EKS • Kubernetes • Kops • AWS Organizations • Auto Scaling Groups • AWS SQS • AWS Bedrock • CloudWatch
CDN & Networking: Akamai Property Manager • Phased Release Cloudlet • Fast Purge • Content Protector
CI/CD & Frontend: GitHub Actions • Jenkins • Turbo (monorepo) • Webpack 5 Module Federation • Canary / Blue-Green Deployments
AI & Agentic: MCP (Model Context Protocol) • Claude API / AWS Bedrock • Azure Bot Service • Microsoft Entra ID • Operational AI Agents
Observability & Data: Elastic / Kibana • BlueTriangle • Databricks • Cloudinary • New Relic
Languages: Node.js / TypeScript • Python • Bash / Shell • SQL • PowerShell
REQUIREMENTS
What we're looking for
- 10+ years of hands-on DevOps, SRE, or platform engineering experience in production AWS cloud environments
- Deep AWS expertise: EKS, EC2, SQS, CloudWatch, IAM, Organizations, and multi-account architectures
- Strong Kubernetes skills: cluster operations, node group management, workload isolation, taints/tolerations, auto-scaling
- Experience with Akamai or equivalent enterprise CDN - configuration, purge operations, traffic routing rules
- CI/CD ownership: GitHub Actions and/or Jenkins pipeline design, monorepo build systems, release gating
- Production experience building or operating AI agents - LLM integration, autonomous workflow design, prompt engineering
- Proficiency in Node.js and/or Python for automation, tooling, and MCP server development
- Observability stack ownership: Elastic/Kibana, log analysis, alerting design, SLO/SLI instrumentation
- Comfortable owning on-call responsibility for a production e-commerce platform with significant revenue exposure
- Strong written and verbal communication - you will interface with engineering leadership and present findings to executives
- Based in or willing to relocate to the Los Angeles / Long Beach area for on-site work