Summary
At SPG, Site Reliability Engineers (SREs) are responsible for keeping all our developer-facing services and production systems running smoothly. Our SREs are a blend of pragmatic operators and software craftspeople who apply sound engineering principles, operational discipline, and mature automation to our operating environments, which are multi-platform and span multiple public and private clouds. Our platforms and tools are heterogeneous and have strong ties to both the Apple ecosystem and Linux. Join Apple and help us leave the world better than we found it!
Key Qualifications
- Act as service owner for a combination of off-the-shelf and custom software and tools targeting software development.
- Be on call (PagerDuty) rotation to respond to incidents that impact our developer tooling and services.
- Use your on-call shift to prevent incidents from ever happening or at least ever happening again.
- Run our infrastructure with custom Go code, Bazel, Pulumi, Ansible and Kubernetes.
- Build monitoring that proactively alerts before outages happen.
- Focus on documentation to turn actions into repeatable process and finally automation.
- Improve operational processes (such as deployments, backup and restore procedures and upgrades) to make them as boring as possible.
- Design, build and maintain core infrastructure that enables a large development team.
- Debug production issues across services and levels of the stack.
- Think about systems: edge cases, failure modes, behaviors, specific implementations.
- Think of solutions to operation problems as being primarily software driven.
- Collaborate and communicate extensively and asynchronously.
- Find ways to not have to learn the same thing twice.
- Have a hard-working, go-for-it attitude. When you see something broken, you can't help but fix it.
- Understand how to deliver quickly and effectively while maintaining a high quality bar.
- Know your way around Linux, macOS and the Unix Shell.
- Have built complete systems in Java, Go & Python.
- Have a strong understanding of canonical Web Application design and scaling patterns.
- Use configuration management systems like Ansible and Pulumi, and know when to use which.
- Have a strong tendency to write custom tools or wrappers around OSS tools to patch, work around and simplify them.
- Have experience with using and extending Docker, Kubernetes, Nginx, Containerd, Pulumi, or similar technologies.
- Have experience with operating off-the-shelf systems reliably and at scale.
Description
Specifically, you will need general knowledge of most of the following technical areas, with deep knowledge in at least 3:
- Ansible: Basic syntax, tasks, playbooks.
- Build tools: Jenkins, CI/CD configuration, jobs/pipelines, Bazel.
- Clouds: Provisioning and configuration using Pulumi/Terraform.
- Kubernetes: Deep understanding of the platform, sidecars, custom operators and plugins.
- Monitoring: Prometheus, Thanos, Grafana, FluentBit, Splunk.
- Networking: VPCs, proxies, SDNs and CDNs.
- Operating systems: Linux and macOS configuration, package management, startup and troubleshooting.
- Programming: Java, Golang, Python and optionally Jsonnet.
- System types: Multi-tier Web Applications (from LAMP to modern React systems) and CLIs.
- Third-party developer tooling ecosystem: Operating GitHub, Gerrit, Jenkins, Artifactory etc.