Research Scientist, Data

Pika

$120K - $160K *

Palo Alto, CA 94303

In-Person

Information Technology

5 - 7 years of experience

Today


Job Overview by Ladders

Qualifications

5+ years building and scaling data pipelines for ML applications at staff or lead engineer level
Strong background in data engineering and ML data curation for large-scale multimodal models
Expertise in distributed data systems like Spark or Hadoop
Proven ability to create scalable, production-grade data infrastructure for ML workflows
Experience with data labeling, filtering, and management tools
Strong programming skills in Python and familiarity with cloud platforms
Knowledge of privacy and compliance in data management
Excellent collaboration and communication skills

Responsibilities

Own large-scale data pipeline architecture to support model training and research workflows
Partner with teams to curate, clean, and manage diverse sensory-rich datasets
Develop strategies for scalable data ingestion and augmentation
Ensure data quality, reliability, and compliance throughout the data lifecycle
Optimize data processing for large-scale distributed training pipelines
Prototype new methods for dataset creation and management based on researcher needs
Contribute to integrating research-driven data advancements into production systems
Stay updated on data engineering and ML data management best practices

Benefits

Competitive salary and substantial equity in a high-growth startup
Full health benefits and 401k matching
Collaborative, mission-driven team environment with growth opportunities
Flexible on-site/remote hybrid working arrangement

Full Job Description

About the Role

At Pika, we are pioneering the next generation of creative infrastructure built around real-time, multimodal generation and intelligent agentic platforms. We are looking for a staff or lead-level Research Engineer, Data to architect and scale the data engineering systems that support training of our advanced multimodal foundation models. This pivotal role will strengthen our research teams by building, optimizing, and owning large-scale data pipelines and robust ML data curation, so that our foundation models have access to the highest-quality and most diverse datasets. If you are passionate about powerful data infrastructure and innovative research engineering, join us to make an impact for millions of creators.

What You'll Do

Take ownership of large-scale data pipeline architecture and implementation to support model training and research workflows for text, image, audio, and video datasets
Partner with research and engineering teams to curate, clean, and manage diverse, sensory-rich datasets for pre-training and mid-training of multimodal models
Develop strategies and tools for scalable data ingestion, labeling, filtering, augmentation, and storage
Ensure data quality, reliability, and compliance, including managing privacy and ethical considerations throughout the data lifecycle
Optimize data processing, transformation, and delivery for large-scale distributed training pipelines
Prototype and productionize new methods for dataset creation, management, and continuous improvement in response to researcher needs
Contribute to the integration of research-driven data advancements into production-ready systems
Stay informed on emerging data engineering and ML data management developments, bringing best practices to our systems

What We're Looking For

5+ years of experience building and scaling data pipelines for machine learning applications at staff or lead engineer level, ideally in research or model training environments
Strong background in data engineering and ML data curation for LLMs, VLMs, or other large-scale multimodal models
Expertise in distributed data systems (e.g., Spark, Hadoop, Ray, or similar) and efficient large dataset processing/ETL workflows
Proven ability to build robust, scalable, and production-grade data infrastructure for ML pipelines
Experience developing tools for data labeling, filtering, deduplication, quality assurance, and dataset management
Strong programming skills (Python, SQL, PySpark, or similar) and familiarity with cloud data platforms (AWS, GCP, Azure)
Knowledge of privacy, compliance, ethics, and best practices in data collection and management
Excellent cross-functional collaboration, problem-solving, and communication skills
Passion for enabling cutting-edge generative AI and creative technology through data excellence

What We Offer

Competitive salary and substantial equity in a high-growth startup
Full health benefits, 401k matching, and more
Collaborative, mission-driven team environment with major growth opportunities
Flexible on-site/remote hybrid (HQ in Palo Alto, CA)

If you are a data-driven research engineer excited to lead and scale the data infrastructure powering real-time multimodal foundation models, we want to hear from you.

* Ladders Estimates




