Site Reliability Engineer - OpenStack Private Cloud

Rackspace

Austin, TX

Industry: Internet Services


5 - 7 years

Posted 290 days ago

This job is no longer available.

Overview & Responsibilities

Rackspace is seeking a Site Reliability Engineer - OpenStack Private Cloud to join our team. This is a full-time, remote (work-from-home) position.

Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. SRE ensures that Rackspace's managed service offerings & customer deployments have reliability and uptime appropriate to users' needs and a fast rate of improvement, while monitoring and validating capacity and performance. An engineer in this role acts as a Technical Lead for a portfolio of managed service deployments, focusing on reliability, scalability, and the development of automation to manage repetitive tasks at scale.

As a Rackspace Private Cloud (RPC) Operational Fabric SRE, you will help support our organizational vision: To become the preferred brand for consuming complex open source infrastructure, as-a-Service, delivering industry-leading simplicity, efficiency, and reliability.

Our team supports this mission by improving the observability (i.e. insight into) and controllability (i.e. state mutation) of cloud deployments located in datacenters throughout the world.

About Us

  • Team Size: Currently 8 senior developers/engineers and a technical manager

  • Hours: We have flexible working hours

  • Location: US/UK work from home with offices in US/UK (most states supported)

  • Travel: About once or twice a year for a conference or team gathering

Responsibilities:


  • Help architect, develop, and maintain observability of customer environments

  • Help drive zero-downtime deployment (0DD) CI/CD for our team's infrastructure and app stack

  • Work directly with support engineers, operations, product management, and end-users

  • Own automation to improve efficiency, reliability, and simplicity of clouds

  • Take high-level problems and autonomously deliver solutions supported by data

  • Develop solutions that leverage and integrate existing open source technologies

  • Engage asynchronously with team members via Slack, Jira, GitHub, and email

  • Engage synchronously with team members via video calls

Job Complexity:

  • Support high-complexity deployments and internal teams on an as-needed basis

  • Responsible for the roll-out and operations of large-scale, complex systems automation

  • Collaborate with other teams on tools for systems automation

  • Work in conjunction with multiple teams to ensure uptime and reliability of customer deployments


Knowledge, Skills, Ability:

  • Deep experience in one or more of: C, C++, Java, Perl, Python, Bash, or Go

  • Deep experience working with Unix/Linux systems from kernel to shell and beyond, with experience working with system libraries, file systems, and client-server protocols

  • Networking: TCP/IP, UDP, ICMP, MAC addresses, IP packets, DNS, SDN, OSI layers, and load balancing

  • Strong expertise in designing, analyzing and troubleshooting large-scale distributed systems

  • Familiarity with algorithms, data structures and complexity analysis

  • Deep experience designing complex SaaS applications for cloud reliability and scalability

  • Expert-level experience with GCP, AWS, or OpenStack APIs

  • Deep experience with cloud infrastructure automation and CI/CD pipeline design

  • Experience with deployment orchestration (Mesos, Kubernetes, Docker Swarm, etc.)

  • Expertise in operational monitoring and management tools (Datadog, Prometheus, Terraform, etc.)

  • Advanced written & verbal communication skills, both highly technical and non-technical

  • Ability to work closely with non-technical stakeholders and executives

  • Capable of providing strategic technical advice

  • Systematic problem-solving approach, coupled with a strong sense of ownership and drive

  • Additional skills may be required depending on role (Ceph, Helm, etc.)

Experience & Education:

  • High school diploma or equivalent required

  • Bachelor's degree in Computer Science or equivalent experience

  • 5+ years of information systems design/architecture/development experience

  • May require additional certifications depending on specialization

The ideal candidate will have the following (NOT hard requirements):

  • Experience defining, developing, deploying, and operating systems that run at scale

  • Experience working on distributed teams

  • Experience with 0DD CI/CD on Kubernetes using Helm

  • Experience with multiple cloud computing technologies (e.g. GCP, OpenStack)

  • Experience working in a polyglot environment (e.g. Python + Go + CoffeeScript)

  • Good working knowledge of network security and secure coding fundamentals

  • Good working knowledge of technologies supporting event-sourced architecture (e.g. Kafka)

  • Good working knowledge of document, relational, and time-series databases (e.g. Mongo, MySQL, Influx)

  • Good working knowledge of monitoring technologies (e.g. Prometheus, Telegraf)