About the Role
We're looking for a Staff Database Engineer to join our team during an exciting phase of growth. In this role, you'll be responsible for database architecture, database reliability, infrastructure-adjacent database platforms, performance engineering, observability, automation, operational maturity, incident response, and engineering leadership, working closely with cross-functional partners to support business objectives while upholding our standards for excellence, collaboration, and impact.
What You'll Do
Database Architecture & Platform Ownership
- Design and own database architecture for critical infrastructure and platform services, including PostgreSQL-backed internal platforms; Slurm accounting and operational databases; NetBox and infrastructure source-of-truth databases; custom internal applications and automation services; observability, inventory, and platform metadata systems; and future database-backed control plane services.
- Define standard database patterns for high availability, replication, failover, backup and restore, point-in-time recovery, performance baselining, capacity planning, upgrade lifecycle management, access control, and operational security.
- Establish database design standards for new internal platforms, including schema review, indexing strategy, query design, service ownership boundaries, and production readiness requirements.
PostgreSQL, MySQL, and Database Reliability
- Operate and improve production database environments across PostgreSQL, MySQL, Percona, and adjacent systems.
- Own the lifecycle of database systems, including provisioning, configuration, version upgrades, replication topology design, performance tuning, backup validation, disaster recovery testing, decommissioning, and documentation and runbook creation.
- Troubleshoot and resolve production database issues involving query latency, lock contention, replication lag, storage I/O bottlenecks, connection exhaustion, poor indexing, schema design problems, database capacity constraints, and backup or restore failures.
- Drive root cause analysis for database-related incidents and convert findings into durable engineering improvements.
Slurm, NetBox, and Infrastructure Data Systems
- Serve as the senior database engineering owner for infrastructure-adjacent database platforms, including Slurm and NetBox.
- For Slurm environments, support and improve database architecture related to SlurmDBD, accounting data, job history, reporting queries, performance and retention strategy, database scaling, backup and recovery, and long-term operational reliability.
- For NetBox and source-of-truth systems, support PostgreSQL performance, database lifecycle planning, backup and restore validation, data integrity, schema-impact review, and integration patterns with automation systems.
- Partner with DevOps, Infrastructure, MLOps, and Platform Engineering teams to ensure database-backed systems are designed to scale as the environment grows.
Performance Engineering & Observability
- Build deep database observability beyond basic dashboards.
- Develop and maintain visibility into query performance, execution plans, index usage, replication health, locking behavior, buffer/cache efficiency, storage latency, connection pool behavior, and OS-level database bottlenecks.
- Use tools such as PostgreSQL native statistics, MySQL/Percona tooling, Prometheus, Grafana, PMM, query logs, slow query logs, eBPF/BCC or equivalent low-level profiling tools, and Linux performance tooling.
- Create performance baselines and alerting standards for critical database platforms.
- Identify recurring database failure patterns and build preventive monitoring, automation, and operational guardrails.
Automation, Standards, and Operational Maturity
- Create database automation patterns that can be integrated with existing infrastructure tooling.
- Partner with DevOps and Infrastructure Engineering to automate database provisioning, configuration standards, backup verification, health checks, replication checks, user and permission management, upgrade workflows, monitoring deployment, and runbook-driven recovery procedures.
- Contribute database-specific modules, roles, or workflows to Ansible, CI/CD pipelines, or internal automation platforms where appropriate.
- Define production database readiness standards for new services before they are promoted into critical environments.
Incident Response & Engineering Leadership
- Act as the technical lead for major database incidents.
- Own or support triage, root cause analysis, cross-team coordination, customer or stakeholder impact analysis, postmortems, corrective action plans, and long-term remediation.
- Mentor L4 and L5 engineers on database operations, SQL troubleshooting, HA design, incident response, and performance analysis.
- Provide senior technical review for database-impacting changes across infrastructure and platform teams.
Who You Are
Required Qualifications
- 8+ years of production database engineering, database administration, or database architecture experience
- Strong hands-on experience with PostgreSQL in production environments
- Strong hands-on experience with MySQL, Percona, or equivalent relational database platforms
- Experience designing and operating highly available database systems
- Experience with replication, failover, backup, restore, and disaster recovery validation
- Deep SQL performance tuning experience, including execution plan analysis, index design, query rewrite, schema optimization, lock contention troubleshooting, and storage and I/O analysis
- Strong Linux systems knowledge
- Experience supporting production incidents and performing root cause analysis
- Experience building or improving database monitoring and observability
- Ability to work across infrastructure, DevOps, platform, and application engineering teams
- Ability to define standards, influence architecture, and mentor other engineers without requiring direct management authority
Preferred Qualifications
- Experience with SlurmDBD, Slurm accounting databases, or HPC/AI infrastructure database workloads
- Experience with NetBox or other infrastructure source-of-truth platforms
- Experience with Percona XtraDB Cluster, ProxySQL, or advanced MySQL/Percona architectures
- Experience with PostgreSQL HA tooling and replication architectures
- Experience with Prometheus, Grafana, PMM, Splunk, or similar observability platforms
- Experience with eBPF/BCC, perf, strace, or other low-level Linux diagnostic tooling
- Experience supporting databases for SaaS, cloud, HPC, AI infrastructure, or large multi-tenant platforms
- Experience with MongoDB, Oracle, SQL Server, or other secondary database platforms
- Experience with database automation using Ansible, Terraform, CI/CD systems, or internal tooling
- Experience with zero-downtime migrations, major version upgrades, and production database consolidation
What We Offer
- 100% paid Medical, Dental, and Vision insurance for Employees
- Company Health Savings Account Contributions
- 100% paid Short Term and Long Term Disability Insurance for Employees
- Life and Voluntary Supplemental Insurance Options
- Other Insurance Options, such as Pet & Legal Insurance
- Various Supplementary Health Benefits, such as discounted Virtual Healthcare Appointments and Serious Illness Support
- Flexible Spending Account
- Employee Assistance Program