Job Description:
The Principal Cloud and Production Operations Engineer serves as the senior technical authority responsible for architecting, automating, and optimizing hybrid and cloud-native production environments that power critical customer-facing services and enterprise applications.
This role combines deep cloud infrastructure expertise with strong production reliability and operational engineering skills. The Principal Engineer acts as both architect and hands-on builder, ensuring scalability, resilience, and security across multi-cloud and on-prem environments.
Reporting to the Associate Director of IT and Infrastructure, this position will collaborate closely with Engineering, DevOps, Security, and IT Operations to drive a culture of automation, observability, and continuous improvement across the production ecosystem.
Key Responsibilities:
Cloud Architecture and Engineering
• Design, implement, and maintain cloud and hybrid infrastructure supporting production workloads, enterprise systems, and CI/CD pipelines
• Lead the adoption of infrastructure-as-code (IaC) using Terraform, CloudFormation, or similar tools to enable repeatable, auditable, and secure deployments
• Architect scalable and fault-tolerant solutions across OCI, AWS, Azure, and on-prem data centers, ensuring high availability and cost efficiency
• Evaluate emerging cloud services and technologies for applicability to business needs and long-term scalability goals
Production Operations and Reliability
• Serve as the technical lead for production operations, ensuring uptime, performance, and reliability of customer-facing and internal systems
• Develop and maintain observability frameworks leveraging metrics, logs, and traces to enable proactive issue detection and rapid incident response
• Partner with engineering teams to implement SRE-inspired practices, including service level objectives (SLOs), error budgets, and post-incident reviews
• Drive root cause analysis, performance tuning, and continuous improvement of production services
Automation and CI/CD Enablement
• Collaborate with DevOps and application engineering teams to build and optimize automated deployment pipelines supporting frequent, low-risk releases
• Integrate security and compliance checks into CI/CD workflows to ensure production readiness and alignment with internal standards
• Design self-healing infrastructure and automated rollback mechanisms to reduce operational risk
• Ensure secure and reliable configuration management and environment orchestration using tools such as Ansible, Chef, or Puppet
Operational Governance and Collaboration
• Establish and enforce operational best practices for monitoring, patching, and change management across production systems
• Lead production readiness reviews for new releases and large-scale changes
• Collaborate with the Security and Compliance teams to ensure systems adhere to policy, hardening standards, and regulatory requirements
• Participate in and occasionally lead on-call rotations for critical production systems, ensuring rapid triage and resolution
Leadership and Mentorship
• Act as a technical mentor to cloud and infrastructure engineers, fostering a culture of knowledge sharing and engineering excellence
• Lead architectural reviews, design sessions, and capacity planning discussions
• Serve as a trusted advisor to management on cloud modernization, resilience engineering, and cost optimization strategies
Qualifications:
• Bachelor's degree in Computer Science, Information Systems, or a related field; Master's preferred
• 10+ years of experience in cloud and infrastructure engineering, including 3+ years in a senior or principal role
• Expertise with OCI (preferred), AWS, and/or Azure cloud services, including networking, compute, storage, and identity management
• Proven experience managing production-scale environments supporting mission-critical applications and services
• Strong proficiency in:
- Infrastructure-as-code (Terraform, CloudFormation)
- CI/CD and DevOps toolchains (Jenkins, GitLab, ArgoCD)
- Containers and orchestration (Docker, Kubernetes)
- Monitoring and observability platforms (Prometheus, Grafana, Datadog, ELK)
- Scripting and automation (Python, Bash, PowerShell)
• Solid understanding of security, compliance, and networking principles in hybrid environments
• Exceptional analytical, problem-solving, and incident management skills
• Demonstrated ability to lead complex, cross-functional initiatives from concept to execution
Preferred Experience:
• Experience in high-availability SaaS or networking environments
• Knowledge of FinOps, cost optimization, and multi-cloud governance frameworks
• Familiarity with Zero Trust, identity federation, and cloud access security models
• Exposure to AI/ML infrastructure or data-driven pipelines