Senior Software Engineer - Application Reliability, Hybrid

Cisco • $199K — $254K *

San Jose, CA 95123 (Hybrid)

Information Technology

8 - 10 years of experience

2 weeks ago

Job Overview by Ladders

Qualifications

10+ years in software engineering focusing on reliability and observability
Strong Python skills for production tooling and automation
GCP experience deploying on Kubernetes, extensive SQL expertise with BigQuery
Proven design and operation of application-level SLI/SLO frameworks
Strong debugging skills at the application layer

Responsibilities

Define and enforce SLIs, SLOs, and error budgets for user-facing features
Build application observability systems using Looker on BigQuery
Design LangGraph-based agents for automated issue identification
Develop agent evaluation harnesses for benchmarking and regression testing
Analyze application usage trends to identify reliability risks
Partner with development teams to embed reliability practices into the development lifecycle
Lead application-level incident response and postmortems

Benefits

Hybrid work model with flexibility in location
Opportunities for professional growth and development
Engagement with cutting-edge AI technologies
Collaboration across diverse teams and expertise
Access to advanced tools and frameworks for reliability enhancement

Full Job Description

The application window is expected to close on: 06/20/2026
Job posting may be removed earlier if the position is filled or if a sufficient number of applications are received.

This position is based in San Jose, CA or North Carolina and operates under a hybrid work model.

As a Senior Software Engineer in Application Reliability, you will own the reliability of our AI-powered applications and features from the user's perspective.

While our infrastructure SRE team ensures the platform is healthy, your focus will be on feature uptime, usage trends, automated issue identification, and self-healing remediation at the application layer. You will build LangGraph-based agents for automated diagnostics, Looker dashboards for observability, and evaluation harnesses for agent quality - all powered by BigQuery, BigTable, and Python. You will partner closely with application developers, data engineers, and infrastructure SREs to ensure our APIs, RAG systems, agents, and user-facing features are reliable, observable, and continuously improving.

Your Impact

Define, implement, and enforce feature-level SLIs, SLOs, and error budgets for APIs, RAG systems, AI agents, and user-facing applications.
Build and maintain application observability systems using Looker dashboards on BigQuery and BigTable - providing real-time visibility into feature health, error patterns, and usage trends for developers, PMs, and leadership.
Design and build LangGraph-based agents for automated issue identification and remediation: anomaly detection on BQ logs, root cause diagnosis, auto-rollback, feature flag kill switches, and self-healing workflows.
Develop agent evaluation harnesses to benchmark agent performance, test multi-step workflows, handle non-deterministic outputs, and run regression testing as agents evolve.
Write complex SQL (BigQuery) for usage trend analysis, anomaly detection, and operational analytics; design BQ table schemas optimized for observability and debugging.
Analyze application usage trends and adoption metrics to proactively identify reliability risks, capacity needs, and degraded user experiences before they become incidents.
Partner with application development teams to embed reliability practices into the development lifecycle: deployment safety (canary, progressive rollout), structured logging standards, and distributed tracing.
Lead application-level incident response, root cause analysis, and blameless postmortems focused on feature impact rather than infrastructure symptoms.
Build Python-based tooling and automation to reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for application-layer issues.
Stay current with the rapidly evolving AI landscape (new frameworks, tools, and paradigms) and apply emerging techniques to improve platform reliability and developer productivity.

Minimum Qualifications

10+ years of experience in software engineering with significant focus on reliability, observability, or production operations; Bachelor's or Master's Degree in Computer Science, Engineering, or a related technical discipline.
Strong Python development skills, with experience building production tooling, automation, and agent-based systems.
Production GCP experience - deploying and managing applications on GKE (Kubernetes), deep SQL expertise with BigQuery (complex queries, window functions, schema design, cost optimization), and hands-on experience with BigTable (or equivalent) for high-throughput operational data.
Proven experience designing and operating application-level SLI/SLO frameworks, burn-rate alerting, and error budget policies.
Strong debugging skills at the application layer - distributed tracing, profiling, structured log analysis, and dependency mapping.

Preferred Qualifications

Experience building agent evaluation harnesses (benchmarking, regression testing, guardrail validation for AI agents).
Familiarity with A2A protocols, streaming architectures, and event-driven systems.
Experience with deployment safety patterns: feature flags, canary deployments, progressive rollouts, and automated rollback.
Experience with GCP observability services (Cloud Logging, Cloud Trace, Cloud Monitoring).
Exposure to AIOps concepts: ML-driven anomaly detection, automated root cause analysis, intelligent alerting.
Experience driving reliability culture across engineering teams - SLO adoption, postmortem processes, and reliability reviews.
Active engagement with the evolving AI ecosystem; awareness of emerging tools and frameworks.
Hands-on experience with GenAI application development: LangGraph, agent engineering, prompt design, and agentic workflows.
Experience building Looker dashboards and LookML models for operational observability.

About Cisco

Cisco Careers

Join the vibrant team at Cisco, a global leader in networking and cybersecurity solutions, where innovation and leadership thrive. Cisco offers a plethora of job opportunities that cater to a range of skills and experiences, making it an ideal place for both seasoned professionals and those seeking an internship to jumpstart their career.

Work You’ll Do

At Cisco, you’ll be part of a culture that values diversity, leadership, and professional growth. Engage in work that matters with a team that combines technology, creativity, and the power of human connection to redefine networking. Cisco’s commitment to innovation isn’t just about technology, but also about transforming the way we work and collaborate. Cisco’s employment philosophy supports career advancement and nurtures a leadership pipeline that is equipped with diversity training and opportunities for growth. Whether you’re applying your skills to drive our latest innovations or using our vast networking capabilities to solve complex problems, at Cisco, every role is impactful.

Join Our Dynamic Team

Explore job opportunities in areas ranging from engineering to marketing, sales to cybersecurity. Cisco is hiring individuals who are passionate, curious, and ready to drive change. Positions at Cisco offer competitive benefits, a supportive culture, and the chance to work with cutting-edge technology.

Internship Programs

Kickstart your career with a Cisco internship. Gain invaluable industry experience, enhance your resume, and build professional networks that last a lifetime. Our internships provide hands-on experience and the chance to work on projects that matter.

Leadership and Development

Cisco is committed to fostering leadership skills and providing employees with the training needed to succeed. Our leadership programs help you develop new skills, manage teams effectively, and lead with confidence. Cisco’s commitment to professional development ensures that your career path is as dynamic as our technologies.

Benefits and Culture

Cisco understands the importance of a balanced life. Our benefits package is designed to ensure that our team members are healthy, happy, and secure. At Cisco, you’ll find a supportive culture that encourages open communication, teamwork, and mutual respect.

Stay Connected

Join Cisco’s Talent Network to stay informed about new positions that match your skills and interests. At Cisco, we value the curiosity and unique perspectives of our team members. Subscribe to receive personalized job alerts and insider tips directly from our hiring managers.

Explore Cisco Jobs

Ready to advance your career at Cisco? Search open positions, prepare your resume, and get ready for an interview that could lead to your next big opportunity. At Cisco, we’re not just filling positions; we’re investing in leaders.

Keep Up to Date

Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today, all from the people who work here.

Job Alert Emails

Customize your subscription to receive job alerts, the latest news, and insider tips tailored to your preferences. Discover the exciting and rewarding career opportunities that await you at Cisco.

Learn more about Cisco

Size

79,500 employees

Market Cap

$194.5 billion

Industry

Telecommunications & Hardware

Net Income

$10.1 billion

Founded

1984

5 Year Trend

+1.4%

Revenue

$48 billion

NASDAQ

CSCO

* Ladders Estimates
