Senior Software Engineer - Core Team

Userpilot

$130K — $160K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years of experience in designing and operating distributed systems in production environments
  • Proficient in software engineering fundamentals including data structures, algorithms, and system design
  • Strong architectural judgment with experience writing Architecture Decision Records (ADRs)
  • Instinctive ability to identify failure modes and bottlenecks in complex systems
  • Calm and methodical approach to incident response and problem-solving
  • Hands-on experience with AWS (EKS, EC2, S3, RDS) and production Kubernetes
  • Experience with monitoring tools such as Grafana, Prometheus, and CloudWatch
  • Excellent communication skills to drive technical direction across teams

Responsibilities

  • Lead system design for complex, cross-functional projects and ensure implementation follows architectural decisions
  • Collaborate with application squads to translate product needs into scalable designs while allowing teams autonomy
  • Ensure production reliability through effective monitoring, alerting, and incident response practices
  • Act as the primary responder during incidents, managing diagnosis, coordination of fixes, and root-cause analysis
  • Design and operate infrastructure on AWS using Terraform and Kubernetes with attention to high availability
  • Build and enhance CI/CD processes that enable reliable deployment and maintainability for all changes
  • Document the technical context and operational guidelines to facilitate understanding for future engineers and AI tools

Benefits

  • Opportunity to impact system architecture and foundational design
  • Work with cutting-edge technologies and frameworks
  • Be part of a dynamic engineering team focused on AI integration
  • Autonomy to shape technical direction without managing a team
  • Engage in collaborative problem-solving and incident management
  • Continuous improvement of engineering best practices and processes

Full Job Description

The Role

This is the most senior individual-contributor engineering role at Userpilot, and it is a different kind of role. Core Team engineers are the closest thing we have to software architects. They don't own a single feature area; they own how the system fits together, how it behaves under load, and how it recovers when something breaks.

They are a rare breed: equally at home in a Terraform module, an application lifecycle, a high-volume database query plan, and an architecture review. They set the technical direction the rest of engineering builds on, they are the first responders when production is on fire, and they design the guardrails that stop a class of problem from ever happening twice. Application squads move fast on features precisely because the Core Team keeps the ground underneath them solid.

And they do all of this in an AI-native way. Coding agents extend their reach across the stack, but the judgment about what is safe, what will scale, and what must never break stays with them.

Where You'll Have Impact
  • Technical direction and system design. Decide how non-trivial work should be built before a squad writes the first line. Write the ADRs, choose the patterns, and make durability, extensibility, robustness, observability, and scalability properties of the system rather than afterthoughts bolted on later.
  • Scale and reliability. Keep a distributed, real-time system healthy as traffic grows: event pipelines from Kafka into ClickHouse, real-time delivery over hundreds of thousands of connections, caching, backpressure, and the failure modes that only appear at scale or during a deploy.
  • Firefighting and incident response. Be the first call when production breaks. Diagnose under pressure, restore service, find the real root cause, and then turn that incident into a guardrail so the squads don't keep hitting it.
  • Infrastructure and foundations. Own infrastructure provisioning end to end: AWS (EKS, EC2, S3, RDS) and the Terraform and Kubernetes that tie it together. This is one of the things you do, not the whole job.
  • Enabling the squads. Raise the architectural bar across teams you don't manage. Review for architectural consistency, drive adoption of patterns that actually stick, and keep application engineers focused on shipping product.
  • Agentic engineering infrastructure. Make the system safe for a team that ships with AI agents: CI/CD quality gates every PR must pass regardless of author, AGENTS.md and runbooks that teach agents the topology and operational constraints, and Infrastructure as Code clean enough that an agent's change proposal is safe to reason about.


What You'll Do
  • Lead system design for cross-cutting and high-risk work, and write and shepherd ADRs the org actually follows.
  • Partner with application squads to turn product requirements into designs that hold up under load and over time, then get out of their way.
  • Own production reliability: monitoring, alerting, and on-call practices that surface real problems without drowning the team in noise (Grafana, Prometheus, CloudWatch).
  • Be first-in on incidents: run the diagnosis, coordinate the fix, write the postmortem, and ship the change that prevents a recurrence.
  • Design, provision, and operate infrastructure on AWS with Terraform and Kubernetes, with both high availability and cost in mind.
  • Build and improve CI/CD pipelines and validation gates that make every change trustworthy, whether a human or an agent wrote it.
  • Write the technical context (ADRs, runbooks, AGENTS.md) that makes the system understandable to new engineers and safe for AI tools.
  • Keep an eye on infrastructure cost and find the optimizations that actually matter.
  • Provide technical direction and mentorship across the engineering org.


What We're Looking For

Required
  • Senior experience designing and operating distributed systems in production, with a track record of being the person who owns how the whole system fits together.
  • Strong software-engineering and CS fundamentals (data structures, algorithms, system design). You can go deep in application and backend code, not just infrastructure.
  • Architectural judgment: you reason explicitly about durability, extensibility, robustness, observability, and scalability, and the tradeoffs between them, and can write an ADR others can follow.
  • Distributed-systems instincts: you can break down a complex system to find its failure modes, bottlenecks, and the one change that actually moves the needle.
  • Calm, methodical incident response: you root-cause under pressure and instinctively turn an incident into prevention.
  • Hands-on infrastructure: AWS (EKS, EC2, S3, RDS) and the networking that connects them, production Kubernetes and Docker (operating clusters, not just deploying to them), and solid Terraform / Infrastructure as Code.
  • Observability in practice: Grafana, Prometheus, CloudWatch, and alerting that signals real problems.
  • Strong communication and influence: this role touches every team, and you drive adoption of patterns across people who don't report to you.
  • An AI-native workflow: you use AI coding agents (Claude Code, Cursor) as a real part of how you work, and you have a point of view on how to review and trust their output.


Bonus Points
  • Elixir, Erlang, or BEAM systems (our backend runs on them) and OTP patterns: supervision trees, GenServers, distribution.
  • Scaling highly available distributed systems in a fast-moving product environment.
  • Kafka, RabbitMQ, ClickHouse, Broadway, or similar high-throughput data tooling (we run both Kafka and RabbitMQ as message brokers).
  • Building and operating CI/CD that supports high-frequency deployments.
  • Cloud cost optimization through caching, right-sizing, or more efficient data processing.
  • Experience as a tech lead, staff engineer, or architect setting direction for an engineering org.
  • A point of view on the trust model for automated and agent-generated change: automated PRs, agent-triggered deploys, and the gates that make them safe.
  • Interest in AI-powered observability: anomaly detection, automated runbook execution, or self-healing infrastructure.
  • Writing technical context documentation (runbooks, ADRs, AGENTS.md-style files) that makes systems understandable to the people and agents joining them.


Our Stack
  • Cloud: AWS (EKS, EC2, S3, RDS, CloudFront)
  • Orchestration: Kubernetes, Docker, Terraform
  • Backend: Elixir / Phoenix, OTP
  • Data: ClickHouse (analytics), MySQL (primary)
  • Messaging: Kafka, RabbitMQ, Broadway
  • Observability: Grafana, Prometheus, CloudWatch
  • CI/CD: GitHub Actions
  • AI: Claude Code / Cursor for agentic development; AGENTS.md, CLAUDE.md, and Infrastructure as Code as shared context for humans and agents alike


What "Agentic Engineering" Means Here

We are shifting toward spec-driven, AI-assisted development, and the Core Team is what makes that safe.
  • Every PR, human or agent, passes the same quality gates. Our CI/CD has to be reliable, fast, and unambiguous in its feedback, regardless of who (or what) wrote the change.
  • Agents need to understand where they're operating. We maintain AGENTS.md and operational context so an agent doesn't make a dangerous assumption about topology, service contracts, or operational constraints.
  • Infrastructure as Code is the single source of truth, for humans and for agents proposing changes. The cleaner and more expressive it is, the safer agent-assisted work becomes.
  • Agents do a lot of the typing; the Core Team owns the architecture, the judgment, and the boundaries that keep fast-moving, non-deterministic development from compounding into risk.

You don't need to have built agentic infrastructure before. But you should find the challenge genuinely interesting.
