ABOUT THE ROLE
Reporting to the Director of Cloud Operations, you will be responsible for developing and maintaining automation, tools, and configurations, and for system and application service uptime in a high-availability, customer-facing, business-critical 24x7 SaaS environment that requires immediate response to service-impacting issues. You have, or will develop, skills in assessing the tradeoffs in installation, configuration, and diagnostics of open source Linux systems in a large-scale DevOps environment. The right candidate has excellent verbal and written communication skills and a demonstrated ability to work across departments toward a common goal. We require a passion for implementing open source tools, system, network, and application diagnostics frameworks, and CI/CD environments for a SaaS enterprise, with a structured approach to achieving high-quality, sustainable production operations. You will also know how to deploy applications built on Java, Node.js, or other typical enterprise application frameworks and languages.
RESPONSIBILITIES
- Take on new DevOps projects, prototype, and manage execution to completion.
- Develop and manage consistent and coherent DevOps processes and practices to support software development, testing, builds and deployment.
- Guide and develop infrastructure and tools architecture design to enable high uptime, minimize failures, ensure application and data security, and expedite diagnostics.
- Identify, diagnose, and quickly resolve complex technical issues in a live production environment, and use those incidents to improve current technology and processes so that such issues do not recur.
- Work closely with the Engineering teams to escalate and/or triage issues to resolution.
- Review tickets and diagnostics in post-mortems to identify trends and chronic issues.
- Implement and upgrade tools for monitoring, trending, and diagnostics, working hands-on.
- Audit proactive monitoring of all systems so that problems are detected and resolved before they interrupt the operation of any infrastructure system.
- Keep documentation on installation processes and configurations up to date.
- Consider security concerns with all work.
- Automate, automate, automate everything.
SKILLS AND REQUIREMENTS
- Five-year technical degree or equivalent job-specific work experience.
- Solid knowledge of cloud architecture concepts and practices.
- Knowledge of architectural design patterns such as immutable production, fail fast, and stateless design.
- Strong understanding of application release management and configuration, upgrades and patches, and support of Unix/Linux systems, for applications on Node.js or similar in a SaaS environment.
- Passion for troubleshooting and triage of incidents, bringing issues to rapid resolution.
- Ability to apply detailed knowledge of organizational procedures to make independent decisions and serve as a credible resource for technology teams.
- Strong verbal and written communication skills, with the ability to work effectively across organizations.
- Excellent problem-solving skills, with the ability to analyze situations, identify existing or potential problems, and recommend solutions.
- Software engineering skills and computer science knowledge.
- Excellent understanding of scalable, microservice-based architectures and experience in applying them to real-world problems.
- Ability to join the on-call escalation rotation and coordinate work in production-critical situations is essential.
- Knowledge of the use and maintenance of continuous integration and continuous deployment systems.