Come join the Identity SRE Team as a Staff Site Reliability Engineer.
We are looking for a strong Staff engineer with a passion for automation who is ready to tackle the complex problems of scale, unique to Intuit, in AWS and traditional data centers, drawing on expertise in operations, automation, big data, and large-scale systems. We are looking for someone who can design, code, and maintain high-performance, highly available infrastructure.
- Applies full understanding of the business, the customer, and the solutions that a business offers to effectively design, develop, and implement operational capabilities, tools and processes that enable highly available, scalable & reliable customer experiences.
- Utilizes their deep knowledge of operations engineering, connected services, and information technology plus their knowledge of industry best practices to innovate and influence operational approaches and solutions
- Acts as the Technical Lead and works on significant assignments that are broad in scope and complexity, may cross several functional and organizational boundaries, and cover a wide range of issues.
- Exercises independent judgment in the selection of methods and techniques used to deliver operational solutions.
- Creates formal internal and external networks outside of own area of expertise to leverage and adopt ideas, technologies, and best practices that help the organization move fast.
- Support a lively and fast-paced group with bi-weekly/monthly agile releases.
- Support the migration to AWS with end-to-end automation and no manual intervention anywhere in the flow.
- Manage all operational aspects of Production and Pre-Production environments in AWS and traditional data centers, including creating, developing, and managing the deployment architecture for the Identity Data Stores.
- Work with development on designing, testing, and implementing data objects in support of mission-critical applications.
- Implement, monitor, and test backup and resiliency methods for Identity Data Stores
- Develop the monitoring architecture and implement monitoring agents, dashboards, escalations, and alerts.
- Work closely with Product development on the operational aspects of each release and on resiliency patterns, ensuring that the customer experience is monitored, measured, and improved release over release.
- Provide Tier-2 support and participate in 24x7 on-call incident escalation rotations.
- Utilize proven skills and knowledge to troubleshoot and resolve application, performance, systems, and infrastructure incidents in a timely manner.
- Develop and drive incident management processes, playbooks, and stakeholder communication mechanisms.
- Coaches and mentors other Site Reliability engineers.
- B.S. or higher in Computer Science
- 7+ years of systems administration experience managing enterprise Unix/Linux environments
- 4+ years of strong experience in configuring and managing services in AWS.
- 2+ years of experience in a DevOps role that goes beyond traditional operations to build the frameworks that automate deployment and restacking on AWS.
- Scripting and programming experience (e.g., Python or Ruby)
- Experience with the web tier (httpd, Nginx, or equivalent), app tier (Tomcat, JBoss, Mule, or equivalent), and data tier (RDS, DynamoDB, Cassandra, or equivalent)
- Experience with configuration management (e.g., Chef, CloudFormation, or equivalent) and automated deployments.
- Experience with metrics, monitoring, and alerting tools such as New Relic, AppDynamics, Graphite, and Splunk.
- Good understanding of web traffic load balancing and HTTP health checks.
- Good communication skills (verbal and written)