Senior Platform & Reliability Engineer

OpenArt

$130K — $180K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • 5+ years experience in building and operating reliable production systems
  • Strong software engineering skills with a focus on production code development
  • Experience with cloud-native systems, preferably AWS or GCP
  • Understanding of observability and reliability practices including metrics and incident response
  • Experience designing systems resilient to external dependencies
  • Ability to communicate technical tradeoffs effectively
  • Comfortable in fast-paced, ambiguous environments, taking ownership of challenges

Responsibilities

  • Define and operationalize SLOs/SLIs for critical user journeys
  • Participate in on-call rotation and enhance incident response processes
  • Develop strategies for system resilience at external boundaries
  • Strengthen deployment safety through CI/CD improvements
  • Contribute to infrastructure architecture evolution for scaling
  • Enhance cost visibility and efficiency strategies
  • Act as a technical contributor to improve engineering practices

Benefits

  • Meaningful ownership through equity in the company
  • Work in a high autonomy, high growth environment
  • Option for hybrid work setup with Bay Area preference
  • Visa sponsorship available for eligible candidates
  • Remote work consideration depending on candidate
Full Job Description
Senior Platform & Reliability Engineer

About the Role

We're looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization.

This is not a traditional ops or DevOps role. You'll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency in a fast-moving, AI-native environment.

You'll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users, all while raising the overall engineering bar.

What You'll Do
  • Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.
  • Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.
  • Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.
  • Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand "what broke" and "why."
  • Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.
  • Contribute to the evolution of our infrastructure architecture, helping evaluate when to extend serverless patterns vs. adopt containerized or more managed approaches as we scale.
  • Improve cost visibility and efficiency, including per-request cost attribution, caching strategies, and capacity planning.
  • Act as a strong technical contributor, helping improve engineering practices, tooling, and system design decisions across the team.


What We're Looking For

Core Requirements
  • 5+ years building and operating production systems where reliability and scaling are important.
  • Strong software engineering skills - you can build and ship production code, not just configure infrastructure.
  • Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).
  • Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.
  • Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).
  • Ability to communicate technical tradeoffs clearly to engineers across different domains.
  • Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.

Nice to Have
  • Experience building internal platform abstractions (e.g., job orchestration, API layers, workflow systems) that improve team velocity.
  • Track record of improving reliability metrics (e.g., MTTR, SLO attainment, latency) or reducing infrastructure cost.
  • Experience working in a startup or high-growth environment, with broad ownership across systems.


Tech Stack You'll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React/Next.js, Node.js, TypeScript, Python, etc.

Compensation
  • Competitive base salary and bonus program
  • Equity - meaningful ownership in what you build
  • High autonomy, high growth environment


Work Setup
  • Bay Area preferred (hybrid allowed)
  • Visa sponsorship available
  • We'll consider remote
