Cisco

Senior Software Engineer - Application Reliability, Hybrid

Cisco: $199K to $254K *
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • 10+ years in software engineering focusing on reliability and observability
  • Strong Python skills for production tooling and automation
  • GCP experience deploying on Kubernetes and extensive SQL expertise with BigQuery
  • Proven design and operation of application-level SLI/SLO frameworks
  • Strong debugging skills at the application layer

Responsibilities

  • Define and enforce SLIs, SLOs, and error budgets for user-facing features
  • Build application observability systems using Looker on BigQuery
  • Design LangGraph-based agents for automated issue identification
  • Develop agent evaluation harnesses for benchmarking and regression testing
  • Analyze application usage trends to identify reliability risks
  • Partner with development teams to embed reliability practices in the development lifecycle
  • Lead application-level incident response and postmortems

Benefits

  • Hybrid work model with flexibility in location
  • Opportunities for professional growth and development
  • Engagement with cutting-edge AI technologies
  • Collaboration across diverse teams and expertise
  • Access to advanced tools and frameworks for reliability enhancement

Full Job Description

The application window is expected to close on: 06/20/2026
Job posting may be removed earlier if the position is filled or if a sufficient number of applications are received.

This position is based in San Jose, CA or North Carolina and operates under a hybrid work model.

As a Senior Software Engineer in Application Reliability, you will own the reliability of our AI-powered applications and features from the user's perspective.

While our infrastructure SRE team ensures the platform is healthy, your focus will be on feature uptime, usage trends, automated issue identification, and self-healing remediation at the application layer. You will build LangGraph-based agents for automated diagnostics, Looker dashboards for observability, and evaluation harnesses for agent quality - all powered by BigQuery, BigTable, and Python. You will partner closely with application developers, data engineers, and infrastructure SREs to ensure our APIs, RAG systems, agents, and user-facing features are reliable, observable, and continuously improving.

Your Impact

  • Define, implement, and enforce feature-level SLIs, SLOs, and error budgets for APIs, RAG systems, AI agents, and user-facing applications.
  • Build and maintain application observability systems using Looker dashboards on BigQuery and BigTable - providing real-time visibility into feature health, error patterns, and usage trends for developers, PMs, and leadership.
  • Design and build LangGraph-based agents for automated issue identification and remediation: anomaly detection on BQ logs, root cause diagnosis, auto-rollback, feature flag kill switches, and self-healing workflows.
  • Develop agent evaluation harnesses to benchmark agent performance, test multi-step workflows, handle non-deterministic outputs, and run regression testing as agents evolve.
  • Write complex SQL (BigQuery) for usage trend analysis, anomaly detection, and operational analytics; design BQ table schemas optimized for observability and debugging.
  • Analyze application usage trends and adoption metrics to proactively identify reliability risks, capacity needs, and degraded user experiences before they become incidents.
  • Partner with application development teams to embed reliability practices into the development lifecycle: deployment safety (canary, progressive rollout), structured logging standards, and distributed tracing.
  • Lead application-level incident response, root cause analysis, and blameless postmortems focused on feature impact rather than infrastructure symptoms.
  • Build Python-based tooling and automation to reduce mean time to detect (MTTD) and mean time to resolve (MTTR) for application-layer issues.
  • Stay current with the rapidly evolving AI landscape (new frameworks, tools, and paradigms) and apply emerging techniques to improve platform reliability and developer productivity.


Minimum Qualifications

  • 10+ years of experience in software engineering with significant focus on reliability, observability, or production operations; Bachelor's or Master's Degree in Computer Science, Engineering, or a related technical discipline.
  • Strong Python development skills, with experience building production tooling, automation, and agent-based systems.
  • Production GCP experience - deploying and managing applications on GKE (Kubernetes), deep SQL expertise with BigQuery (complex queries, window functions, schema design, cost optimization), and hands-on experience with BigTable (or equivalent) for high-throughput operational data.
  • Proven experience designing and operating application-level SLI/SLO frameworks, burn-rate alerting, and error budget policies.
  • Strong debugging skills at the application layer - distributed tracing, profiling, structured log analysis, and dependency mapping.


Preferred Qualifications

  • Experience building agent evaluation harnesses (benchmarking, regression testing, guardrail validation for AI agents).
  • Familiarity with A2A protocols, streaming architectures, and event-driven systems.
  • Experience with deployment safety patterns: feature flags, canary deployments, progressive rollouts, and automated rollback.
  • Experience with GCP observability services (Cloud Logging, Cloud Trace, Cloud Monitoring).
  • Exposure to AIOps concepts: ML-driven anomaly detection, automated root cause analysis, intelligent alerting.
  • Experience driving reliability culture across engineering teams - SLO adoption, postmortem processes, and reliability reviews.
  • Active engagement with the evolving AI ecosystem; awareness of emerging tools and frameworks.
  • Hands-on experience with GenAI application development: LangGraph, agent engineering, prompt design, and agentic workflows.
  • Experience building Looker dashboards and LookML models for operational observability.

About Cisco

Cisco Careers

Join the vibrant team at Cisco, a global leader in networking and cybersecurity solutions, where innovation and leadership thrive. Cisco offers a plethora of job opportunities that cater to a range of skills and experiences, making it an ideal place for both seasoned professionals and those seeking an internship to jumpstart their career.

Work You’ll Do

At Cisco, you’ll be part of a culture that values diversity, leadership, and professional growth. Engage in work that matters with a team that combines technology, creativity, and the power of human connection to redefine networking. Cisco’s commitment to innovation isn’t just about technology, but also about transforming the way we work and collaborate. Cisco’s employment philosophy supports career advancement and nurtures a leadership pipeline that is equipped with diversity training and opportunities for growth. Whether you’re applying your skills to drive our latest innovations or using our vast networking capabilities to solve complex problems, at Cisco, every role is impactful.

Join Our Dynamic Team

Explore job opportunities in areas ranging from engineering to marketing, sales to cybersecurity. Cisco is hiring individuals who are passionate, curious, and ready to drive change. Positions at Cisco offer competitive benefits, a supportive culture, and the chance to work with cutting-edge technology.

Internship Programs

Kickstart your career with a Cisco internship. Gain invaluable industry experience, enhance your resume, and build professional networks that last a lifetime. Our internships provide hands-on experience and the chance to work on projects that matter.

Leadership and Development

Cisco is committed to fostering leadership skills and providing employees with the training needed to succeed. Our leadership programs help you develop new skills, manage teams effectively, and lead with confidence. Cisco’s commitment to professional development ensures that your career path is as dynamic as our technologies.

Benefits and Culture

Cisco understands the importance of a balanced life. Our benefits package is designed to ensure that our team members are healthy, happy, and secure. At Cisco, you’ll find a supportive culture that encourages open communication, teamwork, and mutual respect.

Stay Connected

Join Cisco’s Talent Network

Stay informed about new positions that match your skills and interests. At Cisco, we value the curiosity and unique perspectives of our team members. Subscribe to receive personalized job alerts and insider tips directly from our hiring managers.

Explore Cisco Jobs

Ready to advance your career at Cisco? Search open positions, prepare your resume, and get ready for an interview that could lead to your next big opportunity. At Cisco, we’re not just filling positions; we’re investing in leaders.

Keep Up to Date

Stay ahead with career tips, insider perspectives, and industry-leading insights you can put to use today, all from the people who work here.

Job Alert Emails

Customize your subscription to receive job alerts, the latest news, and insider tips tailored to your preferences. Discover the exciting and rewarding career opportunities that await you at Cisco.

Learn more about Cisco
Size: 79,500 employees
Market Cap: $194.5 billion
Net Income: $10.1 billion
Founded: 1984
5 Year Trend: +1.4%
Revenue: $48 billion
Listed on: NASDAQ
