Staff+ Data Engineer (ML Infrastructure)

Sanas

$130K - $180K
Information Technology
8 - 10 years of experience

Job Overview by Ladders

Qualifications

  • 10+ years of experience in Data Engineering, Infrastructure, or ML Systems, with at least 2 years in a technical leadership role.
  • Expertise in building distributed batch and real-time data systems.
  • Proficient with databases (such as Postgres) and data platforms (such as Snowflake, Databricks, and ClickHouse).
  • Experience using Data Processing frameworks like Spark, Flink, and Ray.
  • Deep experience with cloud platforms (AWS or GCP), object storage (e.g., S3), and orchestration tools like Airflow and Dagster.
  • Strong knowledge of data lifecycle management, including privacy, security, compliance, and reproducibility.
  • Comfortable in a fast-paced startup environment.

Responsibilities

  • Architect and lead the development of large-scale data pipelines and data lakes for AI model training.
  • Drive long-term data infrastructure strategy, including metadata management and lakehouse evolution.
  • Optimize compute fleets and streaming stacks for efficient data processing.
  • Collaborate with cross-functional teams to align data architecture with business needs.
  • Promote best practices in data governance and observability across data systems.
  • Mentor a growing team, elevate their expertise, and support hiring processes.
  • Make strategic build vs. buy decisions for data quality tools.

Benefits

  • Opportunity to work on cutting-edge Voice AI technology.
  • Joining a rapidly scaling company with high growth potential.
  • Collaboration with industry-leading experts and Fortune 100 companies.
  • Participation in shaping the future of human communication.
  • Access to a dynamic, fast-paced startup environment.

Full Job Description
About the Role

Our models are only as good as the data that trains them. As a Staff Data Engineer, you'll own the infrastructure that takes raw audio - millions of hours across accents, languages, noise conditions, and recording environments - and turns it into clean, reproducible, training-ready data at scale. You'll work directly with AI research scientists and ML engineers to design systems that move fast without breaking the data quality guarantees our models depend on.

Job Description

Data pipeline & lakehouse architecture

  • Design and implement large-scale data pipelines that ingest, transform, validate, and serve high-quality audio and metadata for AI model training, evaluation, and product telemetry.
  • Own the lakehouse architecture - table format choices (Iceberg vs. Delta Lake), partitioning strategies, metadata management, and schema evolution - with a bias toward reproducibility and auditability.
  • Build and maintain batch and streaming pipelines using Spark, Flink, and orchestration tooling (Airflow or Dagster), with a clear-eyed view of when each is the right tool.
  • Extend and maintain feature store infrastructure to serve low-latency, versioned features for both training and real-time inference.

Audio data at scale

  • Develop and maintain pipelines purpose-built for the unique challenges of audio data: large file volumes, time-series feature extraction, speaker and language metadata, and annotation versioning.
  • Build tooling that supports the full audio data lifecycle - from raw ingestion and quality filtering through augmentation, segmentation, and training split generation - with reproducibility guarantees at every stage.
  • Partner with ML engineers and research scientists to design data schemas, sampling strategies, and evaluation datasets that accurately reflect production conditions.
  • Own data pipelines that feed human-in-the-loop annotation workflows - ensuring clean round-trips between raw data, labeling platforms, and training-ready outputs.

Platform reliability & governance

  • Instrument pipelines with observability, data quality checks, lineage tracking, and alerting - so failures surface fast and root causes are traceable.
  • Drive build vs. buy decisions for data quality, observability, and cataloging tooling with a clear framework grounded in Sanas's scale and roadmap.
  • Own disaster recovery design for critical data assets - training datasets, evaluation benchmarks, and model checkpoints.

Technical leadership

  • Set the technical bar for the data engineering team - review designs and code, establish patterns, and document decisions in a way that raises the floor for everyone.
  • Work cross-functionally with AI research, infrastructure, product, and legal to align data architecture with business needs and regulatory requirements.
  • Contribute to hiring - identify strong candidates, conduct technical interviews, and help define what great looks like for data engineering at Sanas.

Qualifications

  • 5+ years of experience in data engineering, ML infrastructure, or data platform roles.
  • Deep expertise building distributed batch and streaming data systems in production.
  • Strong command of data processing frameworks (Spark, Flink, and Ray) and orchestrators (Airflow or Dagster).
  • Hands-on experience with cloud data platforms - Snowflake, Databricks, or ClickHouse - and object storage (S3, GCS) on AWS or GCP.
  • Solid understanding of data lifecycle management: privacy, security, compliance, and reproducibility from ingestion through model training.
  • Proven ability to work directly with ML researchers and engineers to translate model requirements into data infrastructure decisions.

Bonus

  • Direct experience with audio data pipelines - file handling at scale, time-series features, speaker metadata, or audio annotation tooling.
  • Familiarity with ASR, TTS, or speech enhancement model training workflows and the data requirements specific to each.
  • Experience with MLOps tooling - experiment tracking, dataset versioning (DVC, LakeFS), and training pipeline orchestration.
