Senior Cluster Site Reliability Engineer

The Voleon Group • $130K — $180K *

Berkeley, CA 94704 • In-Person

Information Technology

5 - 7 years of experience

2 weeks ago


Job Overview by Ladders

Qualifications

5+ years experience in SRE or DevOps roles, ideally as a senior engineer or tech lead.
Knowledge of HPC/batch compute frameworks (e.g., Slurm, AWS Batch) and/or machine learning training systems (e.g., Kubeflow).
Ability to develop scripts in a common scripting language like Python or Ruby.
Familiarity with infrastructure-as-code tools (e.g., Terraform, Ansible).
Experience with cloud infrastructure, particularly AWS or GCP.
Proficient in designing observability stacks (e.g., Prometheus, Grafana).
Experience with distributed storage technologies (e.g., Lustre, Ceph).
Bachelor's degree in computer science.

Responsibilities

Respond immediately to cluster outages or issues, ensuring quick resolution.
Maintain high cluster uptime, defining and tracking SLAs for reliability.
Identify and address recurring issues through precise engineering solutions.
Create robust metrics for cluster health and develop custom observability mechanisms as necessary.
Advise software and research teams on cluster usage policies and enforce them.
Forecast cluster growth and optimize operational strategies for cost and usability.

Benefits

Opportunity to work on cutting-edge machine learning projects.
Collaborative environment with engineering teams to tackle systemic issues.
Role is central to high-performance computing and research advancements.
Flexibility in working with both on-prem and cloud infrastructure.

Full Job Description

As a Senior Cluster Site Reliability Engineer (SRE), you will help scale our research compute cluster to meet our growing needs, and you will apply your engineering skills to ensure high uptime, reliability, and robustness. Our research clusters are at the core of our R&D, and you will be directly responsible for keeping this key resource available and performant. Your work will provide a world-class HPC platform that lets researchers focus on cutting-edge machine learning problems at scale. You will support both on-prem and cloud infrastructure, and work to provide the best experience to our technical staff. You will use IaC, automation, and SRE principles to refine a product that operates 24/7 to support Voleon.

The Cluster Operations team works on the frontline to triage and mitigate real-time operational issues. You will be an integral member of this team, solving day-to-day issues with high urgency, while also engineering systemic improvements and architectural fixes to prevent recurring issues. You will collaborate with engineering teams to develop improvements to monitoring/telemetry. You will help design and oversee operational frameworks to ensure the cluster operates within a set of rigorous SLAs.

Responsibilities

Be a first responder in the event of cluster outages or issues. Triage and resolve urgent issues as they arise
Ensure a high degree of cluster uptime (measured in multiple nines), and define and track SLAs to quantify reliability
Diagnose systemic/recurring patterns of problems, and engineer precision solutions to them in collaboration with engineering teams
Develop robust metrics and observability for cluster health and use those metrics to inform your work. Build out custom observability mechanisms when off-the-shelf ones won't do
Help software and research teams design policies around fair cluster usage, and help develop enforcement mechanisms for said policies
Assist in forecasting cluster growth, and help select appropriate scale-up strategies. Help optimize operations across dimensions of cost and usability

Requirements

5+ years of experience in SRE or DevOps roles, preferably working as a senior engineer or tech lead
Knowledge of HPC/batch compute frameworks (Slurm, Kueue, AWS/GCP Batch) and/or machine learning training systems (Kubeflow, MLflow, Horovod)
Ability to develop scripts and utilities of moderate complexity in a common scripting language (Python, Ruby, etc.)
Familiarity with infrastructure-as-code and configuration management tools (Terraform, Ansible)
Experience with cloud infrastructure (AWS or GCP)
Familiarity with designing and implementing modern observability stacks (Prometheus, Grafana, Loki, ELK, OpenTelemetry)
Experience with distributed storage technologies (Lustre, Ceph, S3)
Embodies a "system engineer" rather than "system administrator" mindset, thinking systematically and leveraging automation
Bachelor's degree in computer science

Preferred Qualifications

Hands-on experience with HPC frameworks (Slurm, Grid Engine) and Kubernetes-based job orchestrators (Airflow, Kueue, Kubeflow Pipelines), along with other distributed computing frameworks (Ray, Modin, Dask, Spark)
Familiarity with ML frameworks (PyTorch/TensorFlow, JAX, Horovod, DeepSpeed)
Familiarity with hybrid/on-prem environments
Experience with containerization (Docker, Podman, Singularity), particularly for HPC/batch compute environments
Experience with HPC networking (InfiniBand, RDMA)
Solid security/IAM foundations (Identity management systems, AWS/GCP IAM, Zero Trust)

"Friends of Voleon" Candidate Referral Program

If you have a great candidate in mind for this role, you could earn $15,000 if your referred candidate is successfully hired and employed by The Voleon Group. Please use this form to submit your referral. For details on eligibility, terms, and conditions, please review the Voleon Referral Bonus Program.

About The Voleon Group

The Voleon Group is a quantitative investment management firm that uses advanced mathematical and statistical techniques to identify and exploit market inefficiencies. The company was founded in 2007 by Michael Kharitonov and Jon McAuliffe and is based in Berkeley, California. Voleon's investment strategies are based on machine learning and artificial intelligence, and the company has a team of over 200 researchers and engineers working to develop and improve its algorithms. Voleon manages several funds, including a long/short equity fund and a futures fund, and has a strong track record of performance. The company is known for its rigorous approach to research and its commitment to transparency and ethical behavior.

Size

200 employees

Industry

Finance & Insurance

Founded

2007

* Ladders Estimates
