Full Job Description
Define and lead the operational strategy for observability, monitoring, incident management, and reliability engineering across the VCF platform.
Establish enterprise-level standards for service health modeling, operational telemetry, dashboard architecture, alert governance, and SLO adoption.
Own the major incident operational model for platform services, including escalation design, command structure, stakeholder communication, and recovery accountability.
Drive long-term reliability improvement initiatives by identifying systemic issues, recurring failure points, and architectural opportunities.
Lead design decisions related to integration of VCF Operations with enterprise observability, ITSM, CMDB, automation, and reporting ecosystems.
Oversee operational readiness for additional VCF platform tools and shared services, ensuring consistent support models and telemetry coverage.
Direct capacity and performance strategies for platform growth, resiliency, and service sustainability.
Embed cloud security operations, monitoring controls, policy adherence, and compliance reporting into platform operations practices.
Partner with architecture, cloud, security, compliance, and service management teams to align platform operations with enterprise standards and risk controls.
Establish operational governance for alert quality, incident trends, post-incident action closure, and service performance reporting.
Provide leadership, coaching, and technical direction to Platform Operations Engineers across all levels.
Influence roadmap priorities for automation, resilience engineering, self-healing capabilities, and platform operational maturity.
Schedule & Presence: This on-site role supports 24/7 operations through real-time collaboration. Standard shifts fall within a 6:00 AM to 6:00 PM window, Monday through Friday. The position also requires scheduled on-call flexibility and the ability to remain reasonably reachable during off-hours for critical business continuity.
Preferred Qualifications
Deep hands-on experience with VMware Cloud Foundation Operations, Aria Operations, Aria Operations for Logs, and adjacent VCF platform tools.
Experience supporting hybrid cloud environments and integrating public cloud operations with on-premises platforms.
Strong familiarity with Dynatrace, ServiceNow, CMDB/ITOM, and enterprise event management ecosystems.
Experience establishing operational controls aligned to internal audit, security policy, and compliance frameworks.
Experience with infrastructure resilience, service restoration planning, and operational risk reduction initiatives.
Relevant advanced certifications in VMware, Azure, security, ITIL, or observability disciplines.
Required Qualifications
Bachelor's degree in Information Technology, Computer Science, Engineering, or a related field, or equivalent experience.
Minimum of 7 years of experience in platform operations, infrastructure engineering, observability, site reliability, or related enterprise operations roles.
Deep experience leading monitoring, incident management, and reliability programs for complex enterprise infrastructure.
Expert knowledge of VMware vSphere and strong operational knowledge of VMware Cloud Foundation platforms and dependencies.
Demonstrated experience designing service reliability frameworks, operational governance models, and metrics-based improvement programs.
Strong experience with enterprise observability architecture, integration strategy, and automation design.
Experience leading cross-functional technical initiatives involving operations, security, cloud, and compliance stakeholders.
Strong understanding of cloud security practices, operational controls, and regulatory/compliance considerations.