Full Job Description
GEICO's Cyber Security Engineering & Analytics, Automation (SEA) team is seeking a Staff Cyber Site Reliability Engineer (SRE) - a hands-on, engineering-minded practitioner who is passionate about building reliable, observable, and scalable systems at the intersection of security and infrastructure. This is a strong individual contributor role for someone who bridges the gap between software development and infrastructure engineering, thrives writing code and automation to solve operational problems, and takes pride in keeping mission-critical security platforms running at their best. If you love making systems more reliable through engineering - not just process - this role is for you.
Position Description
As a Staff Cyber SRE, you will be embedded in the Cybersecurity Engineering & Analytics team, partnering directly with software developers and infrastructure engineers to improve the reliability, performance, and operability of GEICO's security platforms and tooling. You will write production-quality code and automation, own observability and incident response practices, and continuously drive improvements that reduce toil and increase system resilience. Python expertise is required; Golang experience is strongly preferred. You will operate in a high-velocity agile environment with a bias toward shipping working software and measurable reliability improvements. Experience with AI/ML and working knowledge of LLMs are meaningful differentiators.
Position Responsibilities
As a Staff Cyber SRE, you will:
- Own Reliability Engineering: Define and drive reliability standards for cybersecurity platforms - establishing SLIs, SLOs, and error budgets; identifying systemic weaknesses; and engineering solutions that improve uptime, latency, and fault tolerance.
- Write Code and Build Automation: Develop production-quality software in Python (required) and Golang (preferred) to automate operational workflows, build internal tooling, eliminate toil, and improve the day-to-day velocity of security engineering teams.
- Partner with Developers and Infrastructure Engineers: Work closely with software engineers and infrastructure teams to review system designs for reliability, provide feedback on deployability and operability, and ensure that what gets built can be confidently operated and maintained in production.
- Drive Observability: Instrument security platforms and pipelines with meaningful metrics, logs, and traces; build dashboards and alerting that give the team real operational visibility using tools like Grafana, Prometheus, and similar observability stacks.
- Lead Incident Response and Post-Mortems: Be a first responder for production issues affecting security systems; drive structured incident response, coordinate resolution, and produce blameless post-mortems with actionable follow-through to prevent recurrence.
- Build and Maintain CI/CD & Infrastructure as Code: Develop and own deployment pipelines (GitHub Actions, Jenkins) and infrastructure automation (Terraform, Ansible) that enable safe, repeatable, and fast delivery of security platform changes.
- Improve Security Platform Performance: Profile, benchmark, and tune security services, detection pipelines, and data ingestion workflows - identifying bottlenecks and shipping targeted improvements that matter.
- Contribute Actively in Agile: Be a high-output contributor in a fast-moving agile squad: write code every sprint, engage in design and architecture reviews, participate in code reviews, and help the team maintain quality and momentum.
- Apply Object-Oriented Engineering Fundamentals: Write clean, testable, and maintainable code using strong OOP principles and SOLID patterns - because operability starts with code quality.
- Explore AI/ML & LLMs (Plus): Apply knowledge of AI/ML development, large language models, or generative AI to identify practical opportunities in anomaly detection, alert triage automation, or operational intelligence.
- Share Knowledge: Contribute to technical discussions, participate in code reviews, and share operational insights with developers and infrastructure partners - not as a formal mandate, but as a natural part of working on a great engineering team.
Qualifications
- Python Expertise (Required): Demonstrated production-level Python development - used for automation, tooling, and operational software. This is a non-negotiable requirement for consideration.
- Golang Proficiency (Preferred): Hands-on Golang experience, especially in systems tooling, infrastructure software, or performance-sensitive services.
- SRE / Platform Engineering Foundation: Proven background in site reliability engineering, platform engineering, or DevOps with a strong software development component - not purely operations.
- Object-Oriented Design: Applied knowledge of OOP design patterns and SOLID principles demonstrated through production code and tooling.
- Observability & Monitoring: Hands-on experience with Grafana, Prometheus, or equivalent; able to design meaningful SLIs/SLOs, build useful dashboards, and write alerts that reduce noise rather than add to it.
- Incident Response: Experience leading structured incident response, conducting blameless post-mortems, and driving systemic follow-through on reliability improvements.
- CI/CD & Infrastructure as Code: Proficiency with CI/CD pipelines (GitHub Actions, Jenkins) and IaC tooling (Terraform, Ansible); experience enabling fast, safe, and repeatable deployments.
- Cloud Proficiency: Hands-on experience with AWS, Azure, or GCP; familiarity with cloud-native reliability and infrastructure patterns.
- Agile Team Contributor: Comfortable delivering consistently within a high-velocity agile team; strong bias toward iterative delivery and fast feedback.
- Security Domain Familiarity (Preferred): Exposure to security platforms, SIEMs, EDRs, detection pipelines, or vulnerability management tooling; DevSecOps experience is a strong plus.
- AI/ML & LLM Experience (Plus): Working knowledge of AI/ML development or applied experience with LLMs and generative AI, particularly for operational intelligence or anomaly detection use cases.
- Communication: Able to communicate clearly with both developers and infrastructure engineers and to bridge technical disciplines without jargon overload.
Experience
- 8+ years of professional engineering experience spanning software development and site reliability / platform engineering.
- 5+ years in SRE, DevOps, or platform engineering roles with a strong software development component.
- 4+ years working in cloud-native environments (AWS, Azure, or GCP).
- 3+ years delivering within agile teams in a high-velocity environment.
- Production Python development is required; Golang experience is a strong differentiator.
- Experience with AI/ML development, LLMs, or generative AI tooling is a meaningful plus.
- Cybersecurity platform experience, security engineering, or DevSecOps background is a plus.
- Experience working with audit or compliance teams is a plus.
Education
- Bachelor's degree in Computer Science, Software Engineering, Cybersecurity, or a related field (or equivalent practical experience).
Annual Salary
$110,000.00 - $230,000.00
The above annual salary range is a general guideline. Multiple factors are taken into consideration to arrive at the final hourly rate/annual salary to be offered to the selected candidate. Factors include, but are not limited to, the scope and responsibilities of the role; the selected candidate's work experience, education, and training; the work location; and market and business considerations.
At this time, GEICO will not sponsor a new applicant for employment authorization for this position.