Staff+ Data Engineer (ML Infrastructure)

Sanas

• $130K — $180K *

Palo Alto, CA 94303 (In-Person)

Information Technology

8 - 10 years of experience

Reposted 1 month ago

Job Overview by Ladders

Qualifications

10+ years of experience in data engineering, infrastructure, or ML systems, with at least 2 years in a technical leadership role.
Expertise in building distributed batch and real-time data systems.
Proficient in databases (such as Postgres) and data platforms (such as Snowflake, Databricks, and ClickHouse).
Experience using data processing frameworks such as Spark, Flink, and Ray.
Deep experience with cloud platforms (AWS, GCP), object storage (e.g., S3), and orchestration tools such as Airflow and Dagster.
Strong knowledge of data lifecycle management, including privacy, security, compliance, and reproducibility.
Comfortable in a fast-paced startup environment.

Responsibilities

Architect and lead the development of large-scale data pipelines and data lakes for AI model training.
Drive long-term data infrastructure strategy, including metadata management and lakehouse evolution.
Optimize compute fleets and streaming stacks for efficient data processing.
Collaborate with cross-functional teams to align data architecture with business needs.
Promote best practices in data governance and observability across data systems.
Mentor a growing team, elevate their expertise, and support hiring processes.
Make strategic decisions between building and buying data quality tools.

Benefits

Opportunity to work on cutting-edge Voice AI technology.
Joining a rapidly scaling company with high growth potential.
Collaboration with industry-leading experts and Fortune 100 companies.
Participation in shaping the future of human communication.
Access to a dynamic, fast-paced startup environment.

Full Job Description

About the Role

Our models are only as good as the data that trains them. As a Staff Data Engineer, you'll own the infrastructure that takes raw audio - millions of hours across accents, languages, noise conditions, and recording environments - and turns it into clean, reproducible, training-ready data at scale. You'll work directly with AI research scientists and ML engineers to design systems that move fast without breaking the data quality guarantees our models depend on.

Job Description

Data pipeline & lakehouse architecture

Design and implement large-scale data pipelines that ingest, transform, validate, and serve high-quality audio and metadata for AI model training, evaluation, and product telemetry.
Own the lakehouse architecture - table format choices (Iceberg vs. Delta Lake), partitioning strategies, metadata management, and schema evolution - with a bias toward reproducibility and auditability.
Build and maintain batch and streaming pipelines using Spark, Flink, and orchestration tooling (Airflow or Dagster), with a clear-eyed view of when each is the right tool.
Extend and maintain feature store infrastructure to serve low-latency, versioned features for both training and real-time inference.

Audio data at scale

Develop and maintain pipelines purpose-built for the unique challenges of audio data: large file volumes, time-series feature extraction, speaker and language metadata, and annotation versioning.
Build tooling that supports the full audio data lifecycle - from raw ingestion and quality filtering through augmentation, segmentation, and training split generation - with reproducibility guarantees at every stage.
Partner with ML engineers and research scientists to design data schemas, sampling strategies, and evaluation datasets that accurately reflect production conditions.
Own data pipelines that feed human-in-the-loop annotation workflows - ensuring clean round-trips between raw data, labeling platforms, and training-ready outputs.

Platform reliability & governance

Instrument pipelines with observability, data quality checks, lineage tracking, and alerting - so failures surface fast and root causes are traceable.
Drive build vs. buy decisions for data quality, observability, and cataloging tooling with a clear framework grounded in Sanas's scale and roadmap.
Own disaster recovery design for critical data assets - training datasets, evaluation benchmarks, and model checkpoints.

Technical leadership

Set the technical bar for the data engineering team - review designs and code, establish patterns, and document decisions in a way that raises the floor for everyone.
Work cross-functionally with AI research, infrastructure, product, and legal to align data architecture with business needs and regulatory requirements.
Contribute to hiring - identify strong candidates, conduct technical interviews, and help define what great looks like for data engineering at Sanas.

Qualifications

5+ years of experience in data engineering, ML infrastructure, or data platform roles.
Deep expertise building distributed batch and streaming data systems in production.
Strong command of data processing frameworks (Spark, Flink, and Ray) and orchestrators (Airflow or Dagster).
Hands-on experience with cloud data platforms - Snowflake, Databricks, or ClickHouse - and object storage (S3, GCS) on AWS or GCP.
Solid understanding of data lifecycle management: privacy, security, compliance, and reproducibility from ingestion through model training.
Proven ability to work directly with ML researchers and engineers to translate model requirements into data infrastructure decisions.

Bonus

Direct experience with audio data pipelines - file handling at scale, time-series features, speaker metadata, or audio annotation tooling.
Familiarity with ASR, TTS, or speech enhancement model training workflows and the data requirements specific to each.
Experience with MLOps tooling - experiment tracking, dataset versioning (DVC, LakeFS), and training pipeline orchestration.

* Ladders Estimates
