A global data-based financial firm is seeking a Site Reliability Engineer to join their team in New York. In this role you will help build large-scale distributed systems and develop mission-critical system infrastructure. You will be part of a team that builds the foundation to support a multi-cloud environment.
- Identify and automate developer workflows. Provide development teams with self-service tools to provision infrastructure, deploy and manage applications, and manage their operational environments.
- Implement industry-wide best practices around public and private cloud infrastructure. Adopt tools and technologies such as Terraform and Kubernetes that help abstract the underlying infrastructure.
- Develop and maintain documentation, training and SLAs for managed infrastructure and systems, socializing them across teams and acting as agents of change.
- Work closely with development teams to evolve legacy systems with modern, Internet-scale design patterns. One example the team is currently working on is the move to Kubernetes for stateless services.
- 3+ years of experience working on highly available, fault-tolerant distributed systems
- A strong understanding of operating systems and the nuances of Linux
- Experience with datacenter network troubleshooting, including IP fundamentals, DNS, load balancing, proxies and firewalls
- Familiarity with configuration management systems such as Chef, Puppet or Ansible
- Proficiency in at least one of the following languages: Python, Ruby, C/C++, Go or Java
- A solid understanding of modern software development lifecycle (SDLC) processes such as continuous integration and delivery
- Expertise in analyzing and troubleshooting large-scale distributed systems
- A deep understanding of web operations and cloud infrastructure (AWS, Azure, Google Cloud)
- Knowledge of network and application performance analysis using standard UNIX tools
- Experience with maintaining and managing a community around open source software