Senior Platform & Reliability Engineer

OpenArt

• $130K — $180K *

San Francisco, CA 94112 (In-Person)

Information Technology

5 - 7 years of experience

Reposted 1 week ago


Job Overview by Ladders

Qualifications

5+ years experience in building and operating reliable production systems
Strong software engineering skills with a focus on production code development
Experience with cloud-native systems, preferably AWS or GCP
Understanding of observability and reliability practices including metrics and incident response
Experience designing systems resilient to external dependencies
Ability to communicate technical tradeoffs effectively
Comfortable in fast-paced, ambiguous environments, taking ownership of challenges

Responsibilities

Define and operationalize SLOs/SLIs for critical user journeys
Participate in on-call rotation and enhance incident response processes
Develop strategies for system resilience at external boundaries
Strengthen deployment safety through CI/CD improvements
Contribute to infrastructure architecture evolution for scaling
Enhance cost visibility and efficiency strategies
Act as a technical contributor to improve engineering practices

Benefits

Meaningful ownership through equity in the company
Work in a high autonomy, high growth environment
Option for hybrid work setup with Bay Area preference
Visa sponsorship available for eligible candidates
Remote work considered depending on the candidate

Full Job Description

Senior Platform & Reliability Engineer

About the Role

We're looking for a Senior Platform & Reliability Engineer to help design, scale, and improve the reliability of our infrastructure, from architectural decisions to hands-on implementation, observability, and cost optimization.

This is not a traditional ops or DevOps role. You'll work across cloud infrastructure, distributed systems, backend services, and developer tooling, making pragmatic decisions that balance product velocity, system reliability, and cost efficiency in a fast-moving, AI-native environment.

You'll partner closely with product engineers to evolve the platform that powers OpenArt, contributing to key decisions around infrastructure architecture, improving multi-provider AI reliability, and helping us scale systems to millions of users, all while raising the overall engineering bar.

What You'll Do

Define and operationalize SLOs/SLIs across critical user journeys (generation, editing, payments/credits, uploads), and use them to guide prioritization and tradeoffs.
Participate in an on-call rotation and improve incident response (alert quality, runbooks, escalation paths), including leading blameless postmortems and driving follow-through on action items.
Improve system resilience at external boundaries (AI providers, storage, etc.), including timeouts, retries, circuit breakers, and fallback strategies.
Build and maintain end-to-end observability (logs, metrics, traces, dashboards) so engineers can quickly understand "what broke" and "why."
Strengthen deploy safety through CI/CD improvements, automated rollbacks, canary releases, and feature flag patterns.
Contribute to the evolution of our infrastructure architecture, helping evaluate when to extend serverless patterns vs. adopt containerized or more managed approaches as we scale.
Improve cost visibility and efficiency, including per-request cost attribution, caching strategies, and capacity planning.
Act as a strong technical contributor, helping improve engineering practices, tooling, and system design decisions across the team.

What We're Looking For

Core Requirements

5+ years building and operating production systems where reliability and scaling are important.
Strong software engineering skills - you can build and ship production code, not just configure infrastructure.
Experience with cloud-native systems (AWS or GCP), including serverless/event-driven architectures and at least one container-based approach (e.g., ECS/Fargate, Cloud Run, Kubernetes).
Solid understanding of observability and reliability practices: metrics, alerting, tracing, and incident response.
Experience designing resilient systems with external dependencies (timeouts, retries/backoff, idempotency, circuit breakers).
Ability to communicate technical tradeoffs clearly to engineers across different domains.
Comfortable operating in ambiguous, fast-moving environments and taking ownership of problems.

Nice to Have

Experience building internal platform abstractions (e.g., job orchestration, API layers, workflow systems) that improve team velocity.
Track record of improving reliability metrics (e.g., MTTR, SLO attainment, latency) or reducing infrastructure cost.
Experience working in a startup or high-growth environment, with broad ownership across systems.

Tech Stack You'll Work With

GCP, Cloud Run, Modal, Upstash, Sentry, Amplitude, Firebase, Redis, React/Next.js, Node.js, TypeScript, Python, etc.

Compensation

Competitive base salary and bonus program
Equity - meaningful ownership in what you build
High autonomy, high growth environment

Work Setup

Bay Area preferred (hybrid allowed)
Visa sponsorship available
We'll consider remote

* Ladders Estimates
