ABOUT THIS POSITION
Operates the company's complex, high-traffic, business-critical internet site communications and/or network-based (cloud) product systems. Plans, designs, and implements scalable local and wide-area network solutions across multiple platforms and protocols (including IP and VoIP). Responsible for system performance; supports and troubleshoots network issues and coordinates the installation of equipment such as routers and switches with the appropriate vendors. Develops tools to automate the deployment, administration, and monitoring of network systems. Provides training and assists with proposal writing. Conducts project planning, cost analysis, and vendor comparisons, and works on project implementation. Works with development teams to improve system operability. Tests network redundancy, resilience, and failover of network elements to ensure uptime standards are met. May be required to provide on-call service coverage with other department employees.
WHAT YOU'LL DO
* Design, implement, and maintain automation for infrastructure provisioning, configuration management, and application deployments across various environments (on-premise and cloud).
* Proactively monitor system health, performance, and availability, utilizing a range of observability tools and defining key performance indicators (KPIs) and service level objectives (SLOs).
* Lead the investigation and resolution of complex production incidents, perform root cause analysis, and implement preventative measures to minimize future occurrences.
* Collaborate with development teams to ensure software is designed for reliability, scalability, and operational efficiency, participating in architectural reviews and providing expert guidance.
* Develop and maintain robust incident response procedures, runbooks, and disaster recovery plans.
* Contribute to the evolution of our SRE practices, tooling, and standards, driving continuous improvement and knowledge sharing within the team.
* Participate in an on-call rotation to provide 24/7 support for critical production systems.
* Mentor junior SREs and contribute to the growth and development of the team.
* Evaluate and implement new technologies and solutions to enhance system reliability and operational efficiency.
WHAT YOU'LL NEED
* Bachelor's degree in Computer Science, Engineering, or a related technical field, or equivalent practical experience.
* 5+ years of experience in a Site Reliability Engineering, DevOps, or closely related infrastructure engineering role.
* Strong proficiency in at least one scripting/programming language (e.g., Python, Go, Java, Ruby, Bash).
* Extensive experience with cloud platforms (AWS, Azure, GCP) including services related to compute, networking, storage, and databases.
* Deep understanding of Linux operating systems and networking fundamentals.
* Proven experience with infrastructure as code tools (e.g., Terraform, CloudFormation, Ansible).
* Solid experience with CI/CD pipelines and related tools (e.g., Jenkins, GitLab CI, GitHub Actions).
* Demonstrable expertise in monitoring and alerting systems (e.g., Prometheus, Grafana, Datadog, Splunk).
* Strong problem-solving skills with a methodical approach to debugging complex distributed systems.
* Excellent communication and collaboration skills, with the ability to work effectively across cross-functional teams.
* Experience with containerization technologies (Docker, Kubernetes) is highly desirable.
* Familiarity with database technologies (relational and NoSQL) and their operational challenges.