Domino Data Lab

Staff Site Reliability Engineer

Domino Data Lab
$200K-$230K
US-Anywhere, Remote in United States
Information Technology
5-7 years of experience
Job Overview by Ladders

Qualifications

  • 5-7 years in Site Reliability Engineering or platform/software engineering with operational ownership
  • Expertise in Kubernetes, Linux, and cloud platforms, applied to investigating production problems
  • Ability to identify and rectify reliability issues in products and processes
  • Proficiency in Python or Go, with a history of developing widely-used internal tools
  • Capacity to influence project direction in ambiguous settings
  • Experience in enhancing reliability through engineering rather than manual interventions
  • Strong mentoring and communication capability in technical environments
  • Sound judgment about where AI and LLM tooling helps in operational workflows and where it adds noise

Responsibilities

  • Lead the creation of AI-assisted reliability tools to expedite incident resolution
  • Enhance observability of critical customer systems for better troubleshooting
  • Oversee end-to-end incident response, improving documentation and understanding
  • Guide development of observability tools for customer-facing products
  • Establish SLO/SLI frameworks for key services to set actionable reliability standards
  • Optimize cloud operations for SaaS offerings and improve deployment reliability
  • Mentor engineers and shape SRE practices, including incident response and post-incident learning

Benefits

  • Opportunities for equity in the company
  • 401(k) plan available
  • Comprehensive medical, dental, and vision benefits
  • Wellness stipends to support employee health
  • Potential for company bonuses or commissions

Full Job Description
What we are building

As our infrastructure and customer footprint grow, we're investing in a new kind of SRE practice where the people who respond to incidents also build the systems that make future incidents shorter, rarer, and less painful. We're developing AI-assisted tooling that helps our support and engineering teams diagnose problems faster, learn from outages more deeply, and automate away the toil that slows everyone down. This role sits at the center of that: equal parts hands-on operator, software engineer, and technical leader. If you believe that operational experience and engineering craft make each other stronger, you'll feel right at home here.

What your impact will be
  • Lead the development of Domino's internal AI-assisted reliability tooling, including systems that analyze tickets, logs, traces, and documentation to help teams resolve outages faster with less recurring toil
  • Improve the observability coverage and signal quality for our most critical customer-facing systems, so engineers have more to work with throughout the development and support lifecycle
  • Own incident response end-to-end, from detection to remediation, and leave each problem space better documented, better understood, and less likely to recur
  • Guide the development of customer and user-facing observability tools within our products
  • Define and mature SLO/SLI frameworks for priority services, turning abstract reliability goals into measurable, actionable standards
  • Scale cloud operations practices for Domino's single-tenant SaaS offering, and work with engineering teams to improve the reliability and repeatability of customer deployments and upgrades
  • Mentor other engineers and shape how SRE is practiced at Domino, including incident response workflows, operational readiness expectations, and post-incident learning culture

What we look for in this role
  • Deep experience in Site Reliability Engineering, platform engineering, or a software engineering role with genuine, hands-on operational ownership
  • Fluency with Kubernetes, Linux, cloud platforms, and observability tooling, and the ability to use them to investigate complex, real-world production problems
  • A strong ability to spot and close reliability gaps in technical products, tools, and processes
  • Strong software engineering skills in Python or Go, with a track record of building internal tools or services that people actually rely on
  • Comfort leading technically ambiguous work and influencing direction across teams without needing direct authority to get things done
  • A history of improving reliability through engineering and automation, not just putting out fires manually
  • Strong communication skills and real experience mentoring engineers or shaping technical decision-making on your team
  • Sound judgment about AI/LLM tooling: you know where it genuinely helps in operational workflows and where it adds noise instead of signal
  • Bonus: Experience with LLM-based systems, retrieval workflows, SaaS platform operations, or building tooling for support or developer teams

What we value
  • We strongly believe in the value of growing a diverse team and encourage people of all backgrounds, genders, ethnicities, abilities, and sexual orientations to apply
  • We value a growth mindset: high-performing, creative individuals who dig into problems and see the opportunities for success
  • We believe in individuals who seek truth and speak the truth and can be their whole selves at work.
  • We value everyone who believes that improvement is always possible. At Domino, everything is a work in progress; we can do better at everything.
  • We emphasize an environment of teaching and learning to equip employees with the tools they need to succeed in their function and in the company.

The annual US base salary range for this role is listed below. For sales roles, the range provided is the role's On Target Earnings ("OTE") range, meaning that the range includes both the sales commissions/sales bonuses target and annual base salary for the role. This salary range will be narrowed during the interview process based on a number of factors, including the candidate's experience, qualifications, and location. Additional benefits for this role may include: equity, company bonus or sales commissions/bonuses; 401(k) plan; medical, dental, and vision benefits; and wellness stipends.

Compensation Range

$200,000-$230,000 USD

About Domino Data Lab

Domino Data Lab is a software company that provides a platform for data science teams to collaborate and build models. The company was founded in 2013 and is headquartered in San Francisco, California. Its platform lets data scientists work together on projects, share code and data, and track experiments. The company also offers tools for model deployment and management. Domino Data Lab aims to help organizations make better decisions through data science.
Size: 200 employees
Founded: 2013