About the Role:
As part of Splunk's Cloud-First mission, Site Reliability Engineering (SRE) is accountable for the overall reliability of services running in our cloud production environments. We are systems and software engineers who engage with product and infrastructure teams at every level, from embedding directly in their teams to tagging in for the gnarliest of production challenges. Our goal is to make Splunk's production environments more transparent, more predictable, and less cognitively demanding for the service owners who operate their services in them.
You Will:
- Lead a team of tight-knit, super smart engineers passionate about large-scale distributed systems
- Seek out every path to support and improve your team's happiness, engagement, and effectiveness
- Champion a culture of learning, continuous improvement, and blameless retrospection within your team and across the company
- Mentor and grow your junior engineers, and empower and unblock your senior ones
- Partner with our Talent Acquisition team as we recruit, interview, and hire the best engineering talent to join Splunk’s growing SRE team!
You Are:
- An experienced manager familiar with the challenges of herding both sheep and cats in large-scale production environments.
- Conversant with a wide range of relevant tools and technologies! We don't expect managers to be writing code in their day-to-day job, but your interests should include some or all of: AWS, GCP, C++, Go, Kubernetes, CI/CD, distributed systems, Terraform, and Puppet. Some familiarity with compliance frameworks like SOC 2 and FedRAMP would be ideal.