Stanford University

Senior Site Reliability Engineer

Stanford University, $137K to $194K
Information Technology
8 - 10 years of experience
Job Overview by Ladders

Qualifications

  • Bachelor's degree and eight years of relevant experience, or equivalent education and experience in distributed systems.
  • Experience in SRE, DevOps, or data-intensive systems roles with service building responsibility.
  • Familiarity with production infrastructure like containers, messaging systems, and databases.
  • Knowledge of distributed service architectures and common failure modes under varying loads.
  • Fluency in at least one programming language, preferably Python, with a blend of software engineering and operations experience.
  • Experience handling large-scale datasets or high-throughput data processing systems.
  • Strong communication skills for interacting with engineers and scientists from diverse backgrounds.

Responsibilities

  • Ensure reliable operation of the near-real-time data processing pipeline and timely alert delivery.
  • Design and develop software that enhances system resilience, scalability, and usability.
  • Apply optimizations to improve system performance, increasing throughput and reducing latency.
  • Manage DevOps continuous deployment using modern distributed systems tools and practices.
  • Develop monitoring dashboards and set up a sustainable on-call rotation for prompt processing services.
  • Define KPIs and metrics for pipeline observability and accountability.
  • Participate in team activities like code reviews, troubleshooting, and documentation.

Benefits

  • Flexible work options including on-site, hybrid, and remote opportunities.
  • Commitment to employee accommodation for disabilities.
  • Collaborative team environment with shared ownership of project success.

Full Job Description

Join the Data Management (DM) team at the Vera C. Rubin Observatory, one of modern astronomy's defining missions. The Rubin Observatory is a new astronomy facility in Chile designed to create a 10-year time-lapse map of the southern sky through the Legacy Survey of Space and Time (LSST).

As part of this team, you'll design, operate, and sustain the systems that process Rubin's data in near real time. LSST will generate 15 TB of raw pixels per night with its 8-meter mirror and 3.2 gigapixel camera, creating one of the most demanding petascale data challenges in science.

The Data Management System's Prompt Processing Framework identifies and distributes Alerts for every astrophysical object that moves, changes, or appears in the sky within minutes of observation. These alerts include potentially hazardous asteroids, supernovae, and entirely new classes of transient phenomena. Your work will directly enable astrophysical discoveries by keeping Rubin's alerts flowing.

You will join a distributed team of roughly 80 scientists and engineers building and operating Rubin's petascale data management systems. Our work spans large-scale image processing, distributed databases, and production services. Python is our lingua franca, and we develop our software openly on GitHub under an open-source license.

Your role:

You will own the reliability and robustness of Rubin Observatory's Prompt Processing Framework, the system responsible for detecting and distributing near-real-time alerts for transient and moving objects in the night sky. The Prompt Processing Framework runs on Kubernetes, with event-driven scaling using Kubernetes Event-Driven Autoscaling (KEDA) integrated with Redis Streams. It interfaces with PostgreSQL databases and Kafka to ingest data and publish alerts to the global astronomy community.

Your responsibilities:
  • Ensure, through both architecture and practice, the reliable operation of the near-real-time data processing pipeline and timely delivery of alerts to downstream brokers.
  • Design and develop software that reduces operational risk and improves system resilience, scalability, and usability, including addressing failure modes, error handling, and contention in shared resources.
  • Improve system performance and resilience by applying architectural and systems-level optimizations to increase throughput and reduce end-to-end latency.
  • Operate DevOps-oriented continuous deployment of services using modern distributed systems tooling and development practices (e.g., Kubernetes, Helm, ArgoCD, Kafka, Redis).
  • Develop monitoring dashboards and alerts for the prompt processing service and work with teammates to design and implement a sustainable on-call rotation that provides coverage during the start of observing hours in Chile (typically 2-5pm Pacific Time), with limited off-hours responsibility.
  • Define KPIs and metrics for observability and accountability of the pipeline.
  • Participate in the collective engineering activities of the team, including performing code reviews, acting as a troubleshooting buddy, participating in design discussions, and writing documentation to effectively capture and communicate architectural and implementation choices.
  • Collaborate with members of the Data Management team to identify opportunities to improve tools, workflows, and operational practices.
  • Share responsibility with the broader team for the overall success of the Data Management system, beyond the Prompt Processing Framework.

Tech Stack

The Prompt Processing Framework is built on a modern, cloud-native foundation. It runs on Kubernetes, with deployments managed via Helm and ArgoCD, and uses event-driven scaling through KEDA and Redis Streams. The system integrates with PostgreSQL and Kafka to ingest data and distribute alerts, with additional databases including Cassandra and InfluxDB. Our primary development language is Python, and our code is developed openly under an open-source model.

To be successful in this position you will bring:
  • Bachelor's degree and eight years of relevant experience, or a combination of education and relevant experience designing and operating distributed systems at scale in production environments.
  • Experience working in an SRE, DevOps, or data-intensive systems role, with responsibility for building, operating, and improving robust services.
  • Experience engaging with modern production infrastructure (e.g., containerized services, messaging systems, and databases; see above for our current tech stack), with the ability to learn and apply new tools quickly in a production environment.
  • Familiarity with contemporary distributed service architectures, including service-to-service communication patterns, common failure modes, and system behavior under load and scale.
  • Fluency in at least one modern programming language (Python preferred) with experience working across the boundary between software engineering and operations.
  • Experience working with large-scale datasets or high-throughput data processing systems, and an understanding of the operational challenges that come with data volume and velocity.
  • Ability to communicate clearly with engineers and scientists from diverse backgrounds, including explaining technical concepts, participating in design discussions, and documenting systems and decisions.
  • Comfort working with a high degree of autonomy, taking ownership of technical decisions and execution, while being supported by an experienced team with clear priorities and goals.

We expect candidates to bring strength in some of these areas and curiosity to grow in others.

SLAC Employee competencies:
  • Effective Decisions: Uses job knowledge and solid judgment to make quality decisions in a timely manner.
  • Self-Development: Pursues a variety of venues and opportunities to continue learning and developing.
  • Dependability: Can be counted on to deliver results with a sense of personal responsibility for expected outcomes.
  • Initiative: Pursues work and interactions proactively with optimism, positive energy, and motivation to move things forward.
  • Adaptability: Flexes as needed when change occurs, maintains an open outlook while adjusting and accommodating changes.
  • Communication: Ensures effective information flow to various audiences and creates and delivers clear, appropriate written, spoken, and presented messages.
  • Relationships: Builds relationships to foster trust, team collaboration, and a positive climate to achieve common goals.

Physical requirements and working conditions:
  • Consistent with its obligations under the law, the University will provide reasonable accommodation to any employee with a disability who requires accommodation to perform the essential functions of his or her job.
  • Given the nature of this position, SLAC is open to on-site, hybrid, and remote work options.

Work standards:
  • Interpersonal Skills: Demonstrates the ability to work well with Stanford colleagues and clients and with external organizations.
  • Promote Culture of Safety: Demonstrates commitment to personal responsibility and value for environment, safety, and security; communicates related concerns; uses and promotes safe behaviors based on training and lessons learned. Meets the applicable roles and responsibilities as described in the ESH Manual, Chapter 1, General Policy and Responsibilities: http://www-group.slac.stanford.edu/esh/eshmanual/pdfs/ESHch01.pdf
  • Subject to and expected to comply with all applicable University policies and procedures, including but not limited to the personnel policies and other policies found in the University's Administrative Guide, http://adminguide.stanford.edu.


Classification Title: Software Developer 3

Duration: Regular Continuing

Job code: 4823

The expected pay range for this position is $137,773 to $194,585 per annum. SLAC National Accelerator Laboratory/Stanford University provides pay ranges representing its good faith estimate of the salary the university reasonably expects to pay for a position upon hire. The pay offered to a selected candidate will be determined based on factors such as (but not limited to) the scope and responsibilities of the position, the qualifications of the selected candidate, departmental budget availability, internal equity, geographic location, and external market pay for comparable jobs. At SLAC/Stanford, base pay represents only one aspect of the comprehensive rewards package.

About Stanford University

Stanford University is a private research university located in Stanford, California. The university was founded in 1885 by Leland and Jane Stanford in memory of their son, Leland Stanford Jr. Stanford is known for its academic excellence and research programs, particularly in the fields of engineering, computer science, and the sciences. The university has a diverse student body and offers undergraduate and graduate programs in a wide range of disciplines. Stanford is also home to several research centers and institutes, including the Hoover Institution. The university is committed to advancing knowledge and improving the world through education and research.
Size
14,945 employees