The Site Reliability Engineer works with a cross-functional team of highly skilled technology professionals to design, implement, administer, and support a robust and stable service architecture. This requires advanced application of best practices and techniques in programming, automation, security, and optimization across servers, firewalls, cloud services, and custom applications. The successful candidate will have exceptional technical and interpersonal skills.
Who You Are
- You immerse yourself in all aspects of technology management.
- You are looking for an opportunity that will test your skills and help you grow and learn in the ever-changing world of technology.
- You are ready to take on operational support issues and lead their resolution.
- You welcome and accept challenges.
What You’ll Be Doing
- Maintain and troubleshoot Microsoft Windows Server 2012 R2/2016 in a large-scale server environment.
- Assist with configuration and administration of IIS, DNS, SQL Server 2012/2016, WSUS, F5 BIG-IP load balancing, and Active Directory.
- Provide support for Microsoft Azure and Amazon Cloud services.
- Provide Tier 3 Customer/User Technical Support.
- Coordinate resolution of identified system problems, working with the Technical Support team, vendors, and the development team.
- Proactively identify operational support issues and lead their resolution, working closely with Development and QA to ensure end-to-end quality.
- Assist with the implementation of new technologies in coordination with other staff members.
- Work closely with the Network Engineer and other members of the Web Server Administrator team to resolve technical support issues within the production environment.
- Work with technical writers to develop system diagrams and documentation that support troubleshooting efforts.
- Develop tooling, alerts, and processes to identify and mitigate reliability risks.
- Proactively identify and initiate resolution of production service issues.
- Work closely with Operations and Development staff to ensure high levels of security, stability, and automation.
- Provide on-call support for critical systems.
- Take part in disaster recovery practice, ensuring failover and recovery scenarios work as expected.
- Collaborate with developers to debug and optimize applications.
- Design and implement automation of routine/recurring tasks, especially in support of reliable and rapid deployment across development, QA, and production environments.
- Coordinate and collaborate with staff and vendors to achieve rapid resolution of production issues.
- Identify and spread knowledge of new technologies that enhance production services and architecture.
- Maintain multiple, consistent environments (Development, Test, QA and Production).
- Work with technical writers to develop accurate and detailed system diagrams and documentation.
Experience We Are Looking For
- BS in Computer Science or a related field, or an equivalent combination of education and experience.
- 3+ years of professional experience in a technology role.
- Proficiency in one or more programming languages is required.
- Experience with cloud and virtualization technologies is preferred.
- Experience with any of the following is a plus:
  - Linux internals and administration (inodes, system calls, systemd)
  - Networking systems (TCP/IP, routing, network topologies, traffic managers)
  - Statistical analysis, Big O notation, and monitoring/alerting tools
  - Systems automation tools (e.g., Ansible, Chef, Puppet, SaltStack)
  - The HashiCorp stack (e.g., Terraform, Consul, Nomad, Vault)
  - Setting up and maintaining continuous integration/deployment processes
  - Designing, analyzing, and troubleshooting distributed systems
  - Full-stack web development (database, backend, and/or frontend)
- Knowledge of information security best practices is preferred. Industry certifications such as CISSP or CISM are a plus.