About the Role
We're looking for a Site Reliability Engineer to drive the reliability of Tinker end-to-end. You'll work alongside the engineers building the platform and research teams to make every layer of the system more robust and resilient.
What You'll Do
- Define and own end-to-end reliability, from CI/CD flows to production observability and incident response.
- Develop appropriate Service Level Objectives for distributed training systems, balancing job completion reliability and scheduling latency with development velocity.
- Design and implement monitoring and observability across the full training path.
- Drive incident response for Tinker platform issues, ensuring rapid recovery, thorough incident reviews, and systematic improvements that prevent recurrence.
- Harden multi-tenant isolation and resource scheduling so that LoRA-based workload co-scheduling maximizes utilization without compromising reliability or data separation.
- Collaborate with security teams to address production vulnerabilities.
Skills and Qualifications
Minimum qualifications:
- Bachelor's degree or equivalent experience in computer science, engineering, or similar.
- Experience in distributed systems, cloud infrastructure, or site reliability engineering.
- Proficiency writing software to solve reliability problems, including building tooling and automation.
- Experience with production incident response, postmortems, and systematic reliability improvement.
- Strong communication skills and a track record of coordinating across engineering and research teams.
Preferred qualifications - we encourage you to apply if you meet some but not all of these:
- Deep experience operating production cloud services at scale (e.g., public cloud platforms, internal cloud services).
- Background in distributed training frameworks and how infrastructure failures surface in training behavior.
- Track record of building checkpoint and recovery systems for long-running distributed jobs.
- Expertise in Kubernetes at scale: deploying, operating, debugging, and tuning clusters handling heterogeneous GPU workloads.
Logistics
- Location: This role is based in San Francisco, California.
- Compensation: Depending on background, skills and experience, the expected annual salary range for this position is $350,000 - $475,000 USD.
- Visa sponsorship: We sponsor visas. While we can't guarantee success for every candidate or role, if you're the right fit, we're committed to working through the visa process together.
- Benefits: Thinking Machines offers generous health, dental, and vision benefits, unlimited PTO, paid parental leave, and relocation support as needed.