Zscaler

Principal Production Engineer

Zscaler: $164K - $235K
US-Anywhere
Remote in San Jose, CA
Enterprise Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years managing reliability and scalability for large-scale production services
  • Deep expertise in programming languages such as Python, Go, or C/C++
  • Strong background in networking protocols and Linux/RHEL systems
  • Experience with high-stakes incident management and 24/7 on-call rotations
  • Proficiency in ITIL frameworks and systematic problem management

Responsibilities

  • Design and implement scalable infrastructure across AWS, GCP, and bare-metal environments
  • Drive an 'automation-first' culture by coding to eliminate manual processes
  • Implement and maintain sophisticated observability metrics and error budgets
  • Lead incident response efforts and develop response playbooks
  • Collaborate on operability reviews with engineering teams

Benefits

  • Various health plans
  • Time off for vacation and sick leave
  • Parental leave options
  • Retirement plans
  • Education reimbursement
  • In-office perks, and more!
Full Job Description
Role

We are looking for a Principal Production Engineer to join our team. This role is available either as a hybrid position (3 days a week in San Jose, CA) or as a remote position, and reports to Production Engineering in the Cloud Infrastructure & Operations department. Join Zscaler to be a force multiplier for the reliability of a global platform processing 200+ billion transactions daily across tens of millions of enterprise users.

In this role, you will provide the technical vision and hands-on execution to drive an "automation-first" culture across the company. By maturing our observability and architectural standards, you will directly reduce our Mean Time to Mitigate (MTTM) and shape the scalability of our globally distributed, multi-cloud infrastructure.

What You'll Do (Role Expectations)
  • Design and implement highly available, scalable infrastructure across AWS, GCP, and bare-metal environments
  • Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems
  • Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets
  • Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses
  • Work with Engineering and partner teams to conduct operability reviews

Who You Are (Success Profile)
  • You act like an owner with a bias for action and integrity.
  • You are a pragmatic builder obsessed with creating, iterating, and shipping.
  • You champion simplicity by distilling complex problems into clear, actionable plans.
  • You are data-driven, valuing evidence over assumptions.
  • You think at scale, building solutions and processes that last in a high-growth global organization.

What We're Looking For (Minimum Qualifications)
  • 10+ years of experience managing reliability, scalability, and availability for large-scale production services
  • Deep expertise in programming (e.g., Python, Go, or C/C++)
  • Strong background in networking protocols, Linux/RHEL systems, and distributed architecture
  • Experience in high-stakes incident management and participation in a 24/7 on-call rotation
  • Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews

What Will Make You Stand Out (Preferred Qualifications)
  • Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform, Helm, Temporal)
  • Experience with chaos engineering and disaster recovery planning at scale
  • Expertise in global routing (BGP) and traffic tunneling (GRE, IPsec) with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals

Zscaler's salary ranges are benchmarked and are determined by role and level. The range displayed on each job posting reflects the minimum and maximum target for new hire salaries for the position across all US locations and could be higher or lower based on a multitude of factors, including job-related skills, experience, and relevant education or training.

The base salary range listed for this full-time position excludes commission/bonus/equity (if applicable) + benefits.

Base Pay Range

$164,500-$235,000 USD

Our Benefits program is one of the most important ways we support our employees. Zscaler proudly offers comprehensive and inclusive benefits to meet the diverse needs of our employees and their families throughout their life stages, including:

  • Various health plans
  • Time off plans for vacation and sick time
  • Parental leave options
  • Retirement options
  • Education reimbursement
  • In-office perks, and more!


About Zscaler

Zscaler is a cloud-based information security company that provides Internet security, web security, firewalls, sandboxing, SSL inspection, antivirus, vulnerability management, and granular control of user activity in cloud computing, mobile, and Internet of Things environments. The company is headquartered in San Jose, California, and has offices in Australia, India, Japan, Singapore, the United Kingdom, and the United States.
Size: 3,153 employees
Market Cap: $15.5 billion
Industry
Net Income: -$191.4 million
Founded: 2008
5 Year Trend: +54.1%
Revenue: $536 million
NASDAQ
