Senior Site Reliability Engineer in Mountain View, CA

$100K - $150K (Ladders Estimates)

Array Health Solutions

Mountain View, CA 94035

Industry: Technical Services

5 - 7 years

Posted 51 days ago

At GetInsured, we're on a mission to improve the way people shop for and enroll in health insurance. And we're building world-class software to do it. Our drive and expertise are what make GetInsured a leading provider of health insurance e-commerce technology in the U.S. Our team players have earned their stripes at leading companies, such as Amazon, Accenture, WebMD, Microsoft, Alere, General Electric, McKesson, Avanade, and Group Health Cooperative. Our customers are consumers, employers, benefits consultants and health insurers, and our solution incorporates the best from the retail e-commerce industry to power their online marketplaces.

Our operations stack includes Cloudflare, HAProxy, Tomcat, Node.js, Postgres, Couchbase, Solr, and Redis running on CentOS on VMware. We have multiple data centers in Rackspace, Azure, and AWS as well as on-prem VMware. Our tools include Puppet, Jenkins, Splunk, Icinga, PagerDuty, and various Atlassian services such as Jira, Confluence, and Bitbucket.

We have been strongly focused on DevOps cultural and organizational changes for several years now and have seen great success. We still have much to do, so we are looking for a Site Reliability Engineer (also known as a DevOps Engineer) to join our team to continue building out our infrastructure, bring in new tools, and improve our processes.

Responsibilities

  • Identify new technologies, tools, and processes. Actively pursue learning and prototyping.
  • Identify, diagnose, and quickly resolve complex technical issues in the live production environment, and use those events to improve current technology and processes so that such issues are prevented.
  • Help ensure that production systems are always up and running.
  • Work closely with the Engineering team to escalate issues for triage and resolution.
  • Routinely review tickets and diagnostics to identify trends and chronic issues, then put processes and tools in place to prevent problems.
  • Hands-on implementation and upgrade of tools for monitoring.
  • Audit proactive monitoring of all systems so that problems are detected and resolved, ensuring uninterrupted operation of all infrastructure systems.

Requirements

  • Strong background in Linux/Unix administration.
  • Strong technical systems and application operations/release management experience with a passion for troubleshooting and triage of incidents, bringing issues to rapid resolution.
  • Experience with automation/configuration management using Puppet, Chef, or an equivalent.
  • Knowledge of Jenkins and Java builds is a plus.
  • Knowledge of AWS is a plus.
  • Ready and willing to participate in production systems support.
  • Ability to use a wide variety of open source technologies and cloud services.
  • Good experience with SQL and with Postgres or similar RDBMS.
  • Good experience with networking.
  • Good understanding of code and Bash scripting.
  • Knowledge of best practices and IT operations in an always-up, always-available service.
  • 5+ years of experience working in operations.
  • Experience working in a DevOps group environment.


Valid Through: 2019-10-21