Site Reliability Engineer, DNS

Optimum

$83K — $137K *
Information Technology
5 - 7 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree in Computer Science, Telecommunications, or a related field, or equivalent experience
  • Minimum 5 years in networking or systems engineering focusing on SRE principles
  • Hands-on experience with DNS technologies like BIND, Unbound, or PowerDNS
  • Solid understanding of TCP/IP and DNS protocols including DNSSEC, DoH, and DoT
  • Strong Linux/Unix administration skills and scripting knowledge (Python, Bash, or Go)
  • Experience with Grafana and OpenTelemetry for performance monitoring

Responsibilities

  • Lead architectural design and evolution of global DNS systems
  • Serve as primary technical authority in vendor relations
  • Oversee DNS platform lifecycle and capacity management
  • Define and enforce DNS management standards and policies
  • Translate strategic design into operational reality through SLOs and error budgets
  • Manage UDP/TCP DNS protocols and handle security mitigations
  • Automate workflows to reduce manual management efforts

Benefits

  • Hybrid work environment with both remote and on-site opportunities
  • Participation in on-call rotations for continuous service reliability
  • Opportunity to engage in high-pressure production environments
  • Focus on continuous improvement and skill development in DNS technologies

Full Job Description

Job Summary

The Role: The DNS Engineer - SRE is a high-impact role responsible for the architecture, scalability, and reliability of the mission-critical DNS infrastructure powering our ISP and core network services. This position is designed for an engineer who views infrastructure through the lens of Site Reliability Engineering (SRE), prioritizing automation, observability, and self-healing systems over manual intervention. You will combine deep IP networking and DNS expertise with modern security protocols to ensure our platforms remain resilient against evolving threats and perform at the highest level for millions of users.
The Impact: This is a collaborative and influential role. Beyond core engineering, you will serve as a technical authority, leading cross-functional initiatives with Product, Security, and Service Assurance teams. Your goal is to deliver a carrier-grade DNS ecosystem that balances cutting-edge privacy standards (DoH/DoT) with the uncompromising availability required by Tier-1 network operations.

Responsibilities

Core Platform Strategy & Leadership
• Architectural Ownership: Lead the design and evolution of global DNS architectures, ensuring high availability through Anycast routing, multi-provider redundancy, and automated failover mechanisms.
• Strategic Vendor Relations: Act as the primary technical authority in engagements with DNS and infrastructure vendors, driving roadmaps that align with our long-term reliability and security goals.
• Lifecycle & Capacity Management: Oversee the full lifecycle of DNS platforms, including automated software deployments, hardware refreshes, and proactive capacity planning, to stay ahead of traffic growth.
• Standardization & Policy: Define, enforce, and optimize organization-wide standards for DNS record management, security protocols (DNSSEC), and traffic steering policies to reduce user latency.
• Reliability Engineering: Convert "Strategic Design" into "Operational Reality" by defining Service Level Objectives (SLOs) and Error Budgets for all core name services.

Cross-Domain DNS Operations & SRE
• Protocol Management: Manage the nuances of UDP/TCP port 53, recursion vs. iteration, and complex record types (A, AAAA, CNAME, MX, TXT, SRV).
• Security & Mitigation: Implement and manage DNSSEC to prevent cache poisoning; act as a subject matter expert in mitigating DDoS and DNS amplification attacks.
• Automation (Eliminating Toil): Replace manual updates and "pool" management with automated workflows using Python, Go, Ansible, or Terraform.
• Performance Tuning: Perform Linux kernel tuning for high-performance network throughput and conduct deep-dive log analysis on systems like BIND, Unbound, or PowerDNS.
• Observability: Utilize Prometheus, Grafana, and dnstap to monitor query rates and latency, providing actionable insights into error codes (NXDOMAIN, SERVFAIL).

Qualifications

Minimum Qualifications
• Education: Bachelor's degree in Computer Science, Telecommunications, or a related field (or equivalent practical experience in networking and security)
• Experience: 5+ years in a networking or systems engineering role, with a focus on SRE principles (automation, reliability, and monitoring) in production environments
• DNS Fundamentals: Hands-on experience configuring and maintaining at least two of the following: BIND, Unbound, PowerDNS, AWS Route 53, or Azure DNS
• Networking Protocols: Functional understanding of TCP/IP (IPv4/v6) and DNS-specific protocols including DNSSEC and encrypted transport (DoH/DoT)
• Systems & Automation: Strong Linux/Unix administration skills and proficiency in at least one scripting language (Python, Bash, or Go) for task automation
• Observability: Experience using Grafana and OpenTelemetry (or similar tools) to monitor service health and performance

Preferred Qualifications
• DNS Systems: Hands-on experience managing BIND, Unbound, or PowerDNS in high-traffic environments, alongside cloud-native solutions (AWS Route 53, Azure DNS, Google Cloud DNS)
• Protocol Expertise: Mastery of DNS-specific protocols including DNSSEC, DoT, and DoH, with a firm grasp of underlying transport layers (UDP/TCP) and dual-stack (IPv4/IPv6) networking
• Observability: Experience building dashboards and alerts using Prometheus, ELK, or OpenTelemetry to monitor DNS query latency and error rates
• Automation: Proven ability to manage "DNS as Code" using Terraform or Ansible and to write scripts (Python/Go) that automate routine zone updates
• Scale & Security: Background in Tier-1/Tier-2 service provider environments with a focus on service resilience, Anycast distribution, and DDoS protection

Working Conditions
• Hybrid remote/on-site, with participation in 24/7 on-call rotations
• Availability for after-hours maintenance and urgent service restoration activities
• Ability to work in high-pressure, high-reliability production environments

The Ideal Candidate
The ideal candidate is a proactive engineer who values precision and operational excellence. You don't just manage systems; you architect for reliability, anticipating bottlenecks before they impact the user. We are looking for someone who balances deep technical mastery with an organized approach to delivery, consistently driving improvements in performance, monitoring, and overall service resilience.

Pay is competitive and based on a number of job-related factors, including skills and experience. The starting pay rate/range at time of hire for this position in New York is $83,538.00 - $137,241.00 / year. For other locations, please inquire with your recruiter. The rates/ranges provided herein are the anticipated pay at the time of hire, and do not reflect future job opportunity.
