Research Engineer - Data

Periodic Labs

• $350K — $400K *

Menlo Park, CA 94025 (In-Person)

Information Technology

Less than 5 years of experience

3 weeks ago

Job Overview by Ladders

Qualifications

5+ years of experience building large-scale data pipelines for LLM pretraining or midtraining.
Expertise in data quality techniques and systems for deduplication, filtering, and normalization.
Background in handling diverse scientific data formats for model consumption.
Proficiency with distributed data processing frameworks like Apache Spark or Dask.
Strong Python engineering skills for production-quality tooling in research.
Experience evaluating and sourcing third-party datasets, considering licensing and quality.
Ability to collaborate directly with ML researchers to translate data needs.

Responsibilities

Own and drive data strategy across training stacks, collaborating with research leads.
Source and procure external datasets from various scientific domains.
Build and maintain data pipelines for large-scale data ingestion and processing.
Design systems to enhance data quality through deduplication and normalization.
Integrate experimental data into training stacks efficiently.
Develop tools for researchers to query and understand their data.
Implement metadata tracking for reproducibility and auditing of experiments.

Benefits

Flexible working locations near Menlo Park or San Francisco.
Visa sponsorship provided with legal support for international candidates.

Full Job Description

About the Role

You will build and drive the data foundation for our research efforts. This means owning data strategy end-to-end: sourcing and procuring external datasets, integrating internally generated experimental data into the training stack, and ensuring the team always has the right data - in the right shape - to train and improve frontier models.

This role sits at the intersection of data engineering, research infrastructure, and strategy. You will work closely with pretraining, midtraining, and RL researchers to understand what data the models need, then build the pipelines and systems to get it there. The work spans collecting and organizing diverse data sources, improving data quality through deduplication and preprocessing, and ensuring that new experimental results are incorporated in a structured, repeatable way that makes them useful for model development.

What You'll Do

Own data strategy across the training stack - identifying gaps, evaluating new sources, and shaping the overall data roadmap in collaboration with research leads
Source, evaluate, and procure external datasets across scientific domains including chemistry, physics, materials science, mathematics, and lab instrumentation
Build and maintain robust pipelines for ingesting, processing, and versioning large-scale datasets from heterogeneous sources
Design and implement data quality systems including deduplication, domain classification, quality filtering, and format normalization at scale
Integrate internally generated experimental data - from lab instrumentation, simulations, and model outputs - into the training stack in a structured and repeatable way
Build tooling that makes it easy for researchers to inspect, query, and understand the data that goes into training runs
Instrument data pipelines with metadata, lineage tracking, and versioning so experiments are reproducible and data decisions are auditable
Collaborate with pretraining and midtraining engineers on token budget management, data mixing ratios, and curriculum design
Stay current with research on data-efficient training, synthetic data generation, and data selection methods - and bring relevant ideas into production

You Will Thrive in This Role If You Have

Experience building large-scale data pipelines for LLM pretraining or midtraining, including web-scale or scientific corpora
Expertise in data quality techniques such as exact and fuzzy deduplication (MinHash, SimHash), perplexity filtering, classifier-based quality scoring, and PII scrubbing
Experience working with diverse scientific data formats - papers, patents, structured databases, simulation outputs, lab instrument exports - and normalizing them for model consumption
Experience with distributed data processing frameworks such as Apache Spark, Ray, or Dask at multi-terabyte to petabyte scale
Familiarity with dataset versioning, lineage tracking, and reproducibility tooling such as DVC, Delta Lake, or custom solutions
Experience sourcing and evaluating third-party datasets, including licensing considerations and quality assessment
Strong Python engineering skills and comfort building production-quality tooling in a research environment
Experience collaborating directly with ML researchers to translate data needs into pipeline requirements and back again
A research-oriented mindset - you run experiments on data, measure outcomes, and iterate with rigor

Especially Strong Candidates May Also Have

Experience curating scientific datasets specifically for domain-adaptive continued pretraining or instruction tuning
Familiarity with synthetic data generation methods, including model-generated data pipelines and quality verification
A background in a physical science or engineering discipline that informs how you think about scientific data quality and structure
Experience with multimodal data - integrating text, structured numerical data, molecular representations, or spectral data into unified training pipelines

Mechanics

Minimum education: Bachelor's degree or an equivalent combination of education and training or experience

Location: Our lab is in Menlo Park. We prefer candidates based in Menlo Park or San Francisco, but can be flexible depending on the role.

Compensation: The annual base compensation range for this role is $350,000-$400,000, commensurate with experience.

Visa sponsorship: Yes, we sponsor visas and will do everything we can to assist in this process with our legal support.

* Ladders Estimates
