Full Job Description
About You:
Versana is seeking a motivated SRE/DevOps Engineer with strong observability experience to join
our growing Platform Engineering squad. The squad's goal is to manage public cloud
infrastructure, improve DevOps practices, and monitor Versana's real-time syndicated loan
data platform. The ideal candidate will have a deep understanding of cloud-native
applications, distributed computing, CI/CD implementation, and observability tools and practices.
Key Responsibilities:
• Design, implement, and enhance system observability and monitoring tools.
• Monitor system performance, create incident response plans, and implement observability
practices to gain insights into system behavior.
• Implement and monitor service-level objectives (SLOs) and service-level indicators (SLIs).
• Improve system reliability and resiliency.
• Conduct post-incident reviews and implement necessary changes to prevent system
failures.
• Assist teams in implementing observability tools and leveraging available telemetry data to
troubleshoot and resolve incidents and problems.
• Leverage observability and event management to improve key incident management
metrics, such as mean time to detect and mean time to restore services.
• Continually optimize systems and workflows by improving architecture, infrastructure,
automation, CI/CD, and observability.
• Collaborate with developers to ensure applications are designed with DevOps best
practices in mind.
• Participate in a rotating on-call schedule for weekend releases and be available to
respond to production issues outside of regular working hours, including weekends and
holidays.
Must Have:
• 5+ years of experience as a Site Reliability Engineer or similar role.
• 3+ years of work experience with public cloud (Azure, AWS or GCP).
• 3+ years of direct experience with observability tools such as Datadog, Elasticsearch, or
Grafana Labs.
• 3+ years of experience with containerization and orchestration technologies like Docker
and Kubernetes.
• 2+ years of experience in development and management of CI/CD pipelines (e.g., Azure
DevOps, GitLab CI/CD, GitHub Actions, Jenkins).
• 2+ years of experience with infrastructure-as-code tools such as Terraform, Azure Bicep, or
AWS CloudFormation.
• 1+ years of experience with site reliability and chaos engineering tools such as Gremlin, Chaos Mesh, or similar.
• Proven track record of applying core observability concepts, end-user monitoring, and
infrastructure monitoring with SaaS solutions.
• Experience with messaging services like Kafka or Azure Event Hubs.
• Good understanding of the Linux operating system.
Nice to Have:
• Experience with at least one programming language or platform, such as Java, JavaScript, Python, Go, or .NET.
• Certifications in cloud technologies.
• Experience with Azure cloud or Azure DevOps.
• Experience with Datadog or similar modern observability tools.