Site Reliability Engineering at Collective Health is a discipline combining software and systems engineering skills. We apply modern infrastructure, systems, software, architecture, and development practices to give our customers a more reliable healthcare management experience.
Partnering with engineering teams, Site Reliability Engineers build on public cloud services to deliver a comprehensive platform that enables our developers to rapidly deliver high-quality, impactful, scalable, and reliable services. As a broader team of Site Reliability Engineers, we collaborate to identify themes and solutions that benefit Collective Health at large, engage in regular knowledge-sharing activities and retrospectives, and relentlessly support one another in order to gain knowledge, remove barriers, and grow as individuals and as a team.
At Collective Health, we care about creating a culture of diversity, openness, and transparency, while engaging our intellectual curiosity, problem-solving, and software engineering skills. This is vital to maintaining an agile engineering culture while putting a robust user experience front and center. We bring together people with a wide variety of backgrounds and perspectives, while creating an environment where their passions can be supported and mentored so they can learn and grow. Together, we’re building the next-generation healthcare platform. Join us!
On any given day, you may need to:
- Collaborate on and/or lead engineering efforts from requirements to production, solving problems of developer productivity and presenting complex technical concepts to the team, engineering org, and leadership audiences
- Write code that is well-tested, easily understood, and maintainable by others
- Troubleshoot and fix complex production issues related to availability or performance, even if they are outside your comfort zone
- Work independently and autonomously
- Advise, critique, or comment on engineering designs
- Help our internal customers solve their problems in as efficient and future-proof a manner as possible
You’ll be successful in this role if you have:
- 2+ years of work experience in DevOps, Site Reliability Engineering, or Software Engineering
- Experience in supporting customer-facing production systems and responding to incidents
- Knowledge of data structures, algorithms, distributed systems, and information retrieval
- Experience with at least one of the following or similar technologies: Kubernetes, Docker, Postgres, etcd, Elasticsearch, or related scheduling and persistence services; Apache Kafka or related eventing systems
- Experience in at least one of the following areas of software development: refactoring code, test-driven development, build infrastructure, debugging, building tools and testing frameworks
- Understanding of networking concepts such as routing, firewalls, load balancers, and secure communication—especially in the context of cloud infrastructure
- Methodical problem-solving approach, coupled with strong communication skills and an ability to own and drive projects to completion
Imposter syndrome is real. If you are hesitant to apply because of not checking all the boxes, or you’ve had a less-traditional pathway into Site Reliability Engineering, we encourage you to still apply and mention why you’re interested in the role.
It’s a plus if you also have:
- 5+ years of work experience in DevOps, Site Reliability Engineering, or Software Engineering, or an advanced degree in Computer Science or related technical field
- Good understanding of private and public cloud design considerations and limitations in the areas of infrastructure, distributed systems, data storage, Linux-based operating systems, and security