Interested in building and managing platforms at massive scale that enable billions of dollars of revenue? Does living in the Kubernetes world, running hundreds of Kubernetes clusters that scale beyond 1,000 nodes, sound like a fun challenge? Intuit is seeking a Platform Site Reliability Engineer (SRE) for its Modern SaaS platform that powers TurboTax, QuickBooks and Mint. Platform SREs operate right at the intersection of Software Engineering and Infrastructure Engineering to build and operate large-scale systems that are secure, fault-tolerant, highly available, affordable, and scalable. They apply industry best practices, tools, and principles from software engineering, architecture, and security, building them into our software tools to solve operations problems.
What you'll bring
- Bachelor's degree in Computer Science (or related technical field involving programming), or equivalent work experience
- Expertise with building, upgrading and managing Kubernetes clusters
How you will lead
- Design and develop observability components for massive-scale platforms to detect issues quickly and isolate problems for fast recovery.
- Build tools to enable platform consumers to troubleshoot and triage issues in a self-serve mode.
- Enable progressive rollout for platform changes via canary deployment and auto-rollbacks based on platform health.
- Conduct performance testing for the platform, focusing on responsiveness and optimal resource usage.
- Contribute to FMEA (Failure Mode and Effects Analysis) and Chaos Engineering for critical platform components, identifying resiliency gaps and preparing the team for faster recovery from production incidents.
- Contribute to cost and capacity management for various platform components, uncovering cost-saving opportunities and building automation to enforce them.
- Troubleshoot complex issues and manage stakeholders' expectations during incidents.
- Contribute to open-source projects (Kubernetes, Keiko, Argo etc.)
- Participate in 12/7 on-call rotations along with the dev team
- Support and coach other engineers, pair program, and peer review code, helping to ensure that all engineers are growing and feel part of a community
- Drive and own Root Cause Analysis (RCA) for specific applications.