About the Role
As a Site Reliability Engineer for Punchh, you will work with our developers and DevOps teams to implement our next-generation infrastructure. We are looking for self-motivated, responsible team players who love designing systems that scale. Punchh provides a rich engineering environment where you can be creative, learn new technologies, and solve engineering problems, all while delivering business objectives. The DevOps culture here is one of immense trust and responsibility. You will be given the opportunity to make an impact, as there are no silos here.
What You'll Do
- Deliver SLA and business objectives through whole-lifecycle design of services, from inception to implementation.
- Scale our systems and services through continuous integration, infrastructure as code, and gradual refactoring in an agile environment.
- Maintain services once a project is live by monitoring and measuring availability, latency, and overall system and application health.
- Write and maintain software that runs the infrastructure that powers the Loyalty and Data platform for some of the world's largest brands.
- 12x7 on-call for Level 2 and higher escalations. We have fellow DevOps engineers distributed globally and a 24x7 NOC team, so you can sleep soundly at night.
- Respond to incidents and write blameless RCAs/postmortems.
- Implement and practice proper security controls and processes.
What You'll Need
- Bachelor of Science degree in Computer Science, Mathematics, Engineering, or equivalent practical experience
- 7-10 years of experience
- Proficiency in at least one language from this list: Python, Ruby, Golang
- Proficiency in shell scripting and, most importantly, knowing when to stop scripting and start developing
- Production usage of one of the following configuration management frameworks: Ansible, SaltStack, or Chef
- Knowledge of fundamental systems engineering principles such as CAP Theorem, Concurrency Control, etc.
- In-depth knowledge of the Linux operating system and administration
- Understanding of network fundamentals: OSI, TCP/IP, topologies, etc.
- Experience with various load balancing technologies such as Amazon ALB/ELB, HAProxy, F5, or NetScaler
- Production experience with a major cloud provider such as AWS or Google Cloud Platform
- Understanding of web standards (REST APIs, OWASP, HTTP, TLS)
- Experience with Docker in a production environment using cloud-native tooling (Amazon ECS or Google GCE) or one of the mainstream orchestration engines
- Usage of Terraform or CloudFormation in AWS
- Knowledge of web server technologies such as Nginx or Apache
- Knowledge of Redis, Memcached, or one of the many other in-memory data stores
- Comfortable with large-scale, highly available distributed systems
Bonus Points If You Have
- Familiarity with CI/CD tooling such as Jenkins, CircleCI, or Travis CI
- Production experience with Kubernetes, Mesos, or Amazon ECS
- Production experience with HashiCorp products such as Terraform, Vault, and Consul
- Expertise in designing, analyzing, and troubleshooting large-scale distributed systems
- Experience in a PCI environment
- Experience with Big Data distributions from Cloudera, MapR, or Hortonworks
- Experience maintaining and scaling database applications