Position Summary
We are seeking a highly skilled and hands-on Disaster Recovery Engineer to support and enhance enterprise resiliency, business continuity, and infrastructure recovery capabilities across a large-scale global environment. This role is responsible for coordinating and executing disaster recovery planning, testing, and recovery operations while also serving as a technical contributor across server, infrastructure, cloud, and platform engineering functions.
The ideal candidate will possess a strong blend of operational coordination, infrastructure engineering expertise, and technical troubleshooting capabilities. This individual must be comfortable working across cross-functional IT teams, participating in infrastructure modernization initiatives, and supporting both on-premises and cloud-based environments.
Key Responsibilities
Disaster Recovery & Business Continuity
- Develop, maintain, and continuously improve enterprise disaster recovery plans, runbooks, and recovery procedures.
- Coordinate and lead disaster recovery testing activities, including tabletop exercises, failover testing, and full recovery simulations.
- Validate backup integrity, recovery point objectives (RPO), and recovery time objectives (RTO).
- Partner with application, infrastructure, security, networking, and business teams to ensure recovery readiness.
- Identify gaps, risks, and dependencies within infrastructure and application recovery processes.
- Maintain documentation related to DR architecture, recovery workflows, and operational standards.
- Participate in incident response and major outage coordination efforts when required.
- Assist with audit, compliance, and governance activities related to business continuity and disaster recovery.
Infrastructure Engineering & Administration
- Perform administration and engineering support for enterprise infrastructure environments, including servers, virtualization platforms, storage, cloud, and data center technologies.
- Support Windows and/or Linux server environments including provisioning, patching, performance monitoring, and troubleshooting.
- Assist with infrastructure modernization, automation, and resiliency initiatives.
- Support virtualization technologies such as VMware, Hyper-V, or Nutanix environments.
- Participate in infrastructure lifecycle management, capacity planning, and operational support activities.
- Troubleshoot infrastructure performance, replication, backup, and connectivity issues.
- Collaborate with networking, security, cloud, and operations teams to resolve complex technical problems.
- Support infrastructure monitoring, alerting, and operational reporting platforms.
Required Qualifications
- 5+ years of experience in Disaster Recovery, Infrastructure and Systems Administration, or related IT disciplines.
- Strong understanding of disaster recovery methodologies, high availability architectures, and business continuity planning.
- Experience supporting enterprise server infrastructure in large-scale environments.
- Hands-on experience with Windows Server and/or Linux administration.
- Experience with virtualization platforms such as Nutanix, VMware vSphere, or Hyper-V.
- Deep understanding of backup and replication technologies.
- Familiarity with cloud platforms such as AWS, Azure, or Google Cloud.
- Knowledge of infrastructure monitoring and operational support processes.
- Strong troubleshooting and root cause analysis skills.
- Excellent communication, coordination, and documentation abilities.
Preferred Qualifications
- Experience supporting retail, distributed enterprise, or large multi-site environments that have Point-of-Sale endpoints.
- Experience with automation and scripting technologies such as PowerShell, Python, or Ansible.
- Familiarity with ITIL operational practices and change management processes.
- Experience with data center operations and infrastructure resiliency design.
- Knowledge of storage platforms, networking fundamentals, and cybersecurity best practices.
- Relevant certifications such as:
  - VMware VCP
  - Microsoft Certified
  - AWS or Azure certifications
  - CBCP, ISO 22301, or DR-related certifications
Key Competencies
- Strong operational ownership and accountability
- Ability to remain calm and organized during critical incidents
- Cross-functional collaboration and leadership
- Process-oriented mindset with attention to detail
- Ability to balance strategic resiliency planning with hands-on technical execution
Work Environment
- On-site role depending on business needs
- Participation in after-hours recovery testing and major incident support may be required
- Occasional travel may be required for data center, office, or recovery site support