Senior Site Reliability Engineer

New Relic

Portland, OR

Industry: Technology


Posted 180 days ago

This job is no longer available.

Responsibilities of the team include:

  • Enhancing an ecosystem that enables product teams to build their products/features quickly and without friction
  • Partnering with product engineering teams to design and deliver tools that support our product reliability
  • Identifying and advocating for opportunities to use our own products and minimize the use of third-party tools and services

Examples of what you’ll work on:

  • Reviewing designs with an eye toward increasing the holistic stability of our platform and identifying potential risks
  • Running “game days” to test assumptions about reliability and learn what will break before it matters to customers
  • Improving our monitoring and alerting systems to make sure engineers get paged when it matters (and don’t get paged when it doesn’t)
  • Operationalizing horizontally scalable data stores and configuring systems for high reliability
  • Improving our deployment and testing automation pipelines to ensure we can continue to move quickly and with confidence
  • Writing runbooks and improving documentation
  • Troubleshooting OS and network issues
  • Mentoring other engineers in reliability-related skills

What skills will be helpful?

  • Experience working in a SaaS environment at scale
  • Troubleshooting in a complex environment
  • Fluency in coding in Go, Python, Ruby, or Java
  • Experience administering Linux systems
  • Foundation in systems knowledge including some of the following:
    • Jenkins or other continuous integration/deployment tools
    • Configuration management through Ansible/Chef/Puppet
    • Service-oriented architecture or microservices
    • AWS or other large-scale network provisioning and architecture
    • Docker/Kubernetes/Mesos or other containerization solutions
    • Kafka or other message queues
    • Cassandra, MySQL, Postgres, or Elasticsearch
    • Load balancing, storage, and clustering technologies
    • System-level monitoring and alerting tools such as Nagios