As a Senior Service Reliability Engineer, you will own and improve the reliability and availability of DataRobot's AI platform. You will be tasked with making DataRobot's AI/ML platform more reliable, efficient, and scalable. You will play a key role in how DataRobot's tools and practices enable seamless scale while preventing failures. As an SRE, you will be part of the team that builds and enables the DevSecOps toolchain while continuously improving our ML/AI platform at scale. You will contribute to the full service lifecycle, from service development to live service response, as we continuously deploy new and innovative functionality for our customers.
Responsibilities:
- Must be familiar with AWS, GCP, and Azure architecture patterns and capabilities
- Well versed in software-defined networking (SDN) definitions, capabilities, and limitations
- Handle high-pressure situations in a calm and professional manner
- Lead the resolution of complex service problems, from the network layer to the application, at scale
- Motivate, encourage, and provide technical leadership to team members
- Work hand-in-hand with software developers to facilitate the adoption of "Paved Road" solutions
- Build and support large-scale services across multiple platforms (Azure, AWS, and GCP)
- Diagnose and repair issues by editing Node.js code, modifying MongoDB, Postgres, and Redis, and making configuration changes in cloud service providers
- Create, edit, and maintain ad hoc scripts to resolve issues quickly with minimal user impact
- Contribute to the development of new tools and automation that ensure the service can be optimized and tuned with minimal human intervention
- Support periodic on-call duty
Technologies:
- MongoDB, Mongo MMS, Node.js/IIS on AWS/GCP/Azure
- Demonstrable experience in one or more of Python, Perl, or PHP is a plus
- Strong knowledge of TCP/IP networking, SMTP, HTTP, load balancers, and highly available network servers
- GitHub/Artifactory/RabbitMQ, Application Performance Monitoring principles, CDN, DNS
- Knowledge of IP networking and of analyzing network, performance, and application issues using tools like Fiddler and Wireshark
Requirements:
- A passion for automating everything
- A passion for collaborating and tearing down communication silos
- Experience maintaining large-scale infrastructure (100+ servers)
- 5+ years of experience with AWS
- 3+ years of experience with Terraform or CloudFormation
- 5+ years of experience with Linux (Ubuntu, Red Hat, or similar)
Qualifications:
- Bachelor's degree in CS, MIS, or equivalent experience
- 6+ years of relevant experience with Windows/Unix systems fundamentals, monitoring, cloud services, networking, storage, databases, and applications
- Solid communication skills