sapientinc / data_io

Data pipeline for HRM-Text pretraining

85% credibility
Found May 19, 2026 at 20 stars.
AI Analysis
Python
AI Summary

Data IO is an open-source data preprocessing pipeline that prepares training data for AI language models. It takes raw datasets from various sources (including math problems, reasoning tasks, and educational questions) and transforms them into clean, standardized question-answer pairs. The pipeline has four main stages: cleaning raw data into a uniform format, optionally training a custom text tokenizer, converting text to numerical tokens using a high-performance Rust engine, and creating balanced training datasets through stratified sampling. The project is used to train the HRM-Text model. It includes detailed documentation and contribution guidelines, references an academic paper, and is released under the Apache 2.0 open-source license.

How It Works

1. 🔍 Discovering the project

You find this open-source data pipeline while researching how AI models are trained with question-answer datasets.

2. 📚 Understanding the pipeline

You learn this tool transforms messy raw data into clean, standardized question-answer pairs ready for AI training.

3. 🧹 Cleaning your datasets

You run the cleaning scripts to convert raw datasets from various sources into a uniform format with instructions and responses.
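As a rough illustration of what "a uniform format with instructions and responses" can mean, here is a minimal sketch. It is not taken from the repository: the candidate field names (`question`, `problem`, `solution`, and so on) and the `clean_record` helper are assumptions made for this example.

```python
# Hypothetical field names; the real cleaning scripts may use different ones.
QUESTION_KEYS = ("question", "problem", "prompt", "input")
ANSWER_KEYS = ("answer", "solution", "completion", "output")


def first_present(record, keys):
    """Return the first non-empty string found under any of `keys`, else None."""
    for key in keys:
        value = record.get(key)
        if isinstance(value, str) and value.strip():
            return value
    return None


def clean_record(record):
    """Map a raw record to {"instruction", "response"}, or None if unusable."""
    instruction = first_present(record, QUESTION_KEYS)
    response = first_present(record, ANSWER_KEYS)
    if instruction is None or response is None:
        return None
    # Only trim the ends so the inner formatting of math or code is preserved.
    return {"instruction": instruction.strip(), "response": response.strip()}


raw = [
    {"question": "What is 2 + 2?", "answer": " 4 "},
    {"problem": "Solve x + 1 = 3.", "solution": "x = 2"},
    {"prompt": "No answer here"},  # dropped: no response field
]
cleaned = [r for r in map(clean_record, raw) if r is not None]
```

Records that lack either side of the pair are dropped rather than repaired, which keeps the sketch short; a real pipeline may apply source-specific rules instead.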

4. Two paths for processing

🎓 Train your own tokenizer

You train a custom text tokenizer that learns to break down text into meaningful pieces for your specific use case.

🚀 Use an existing tokenizer

You connect a pre-trained tokenizer to immediately start converting your text into numbers.

5. 🔢 Converting text to numbers

The high-performance Rust engine transforms all your text into sequences of numbers that AI models can understand.
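To make "sequences of numbers" concrete, the toy sketch below maps whitespace-separated words to integer IDs. It only illustrates the idea: the project's Rust engine and its tokenizer are not shown here, and `build_vocab`, `encode`, and the `<unk>` placeholder are hypothetical names chosen for this example.

```python
def build_vocab(texts):
    """Assign an integer ID to each distinct whitespace-separated token."""
    vocab = {"<unk>": 0}  # ID 0 is reserved for tokens never seen in training
    for text in texts:
        for token in text.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(text, vocab):
    """Convert text to a list of token IDs, falling back to <unk>."""
    return [vocab.get(token, vocab["<unk>"]) for token in text.split()]


corpus = ["what is two plus two", "two plus three"]
vocab = build_vocab(corpus)
ids = encode("two plus four", vocab)  # "four" is unknown, so it maps to 0
```

Production tokenizers usually split text into subword pieces rather than whole words, so unknown words are far rarer than in this toy version.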

6. ⚖️ Balancing your training data

You configure how much data to pull from each source, ensuring your AI learns evenly from all your datasets.
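One way to picture per-source balancing is the sketch below, which caps large sources and repeats small ones before shuffling. The `limit` and `repeat` settings and the `stratified_sample` function are assumptions for illustration, not the project's actual configuration keys or sampler.

```python
import random


def stratified_sample(sources, config, seed=0):
    """Build a mixed list of (source_name, example) pairs.

    sources: dict mapping source name to a list of examples.
    config:  dict mapping source name to hypothetical settings:
             "limit"  caps how many distinct examples are drawn,
             "repeat" upsamples by repeating each drawn example.
    """
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    mixed = []
    for name, examples in sources.items():
        settings = config.get(name, {})
        limit = settings.get("limit", len(examples))
        repeat = settings.get("repeat", 1)
        if len(examples) > limit:
            picked = rng.sample(examples, limit)
        else:
            picked = list(examples)
        mixed.extend((name, ex) for ex in picked for _ in range(repeat))
    rng.shuffle(mixed)
    return mixed


sources = {
    "math": [f"m{i}" for i in range(100)],
    "reasoning": ["r0", "r1"],
}
config = {"math": {"limit": 4}, "reasoning": {"repeat": 2}}
mixed = stratified_sample(sources, config)
```

With this configuration the large math source is cut down to 4 examples and the small reasoning source is doubled to 4, so both contribute equally to the final mix.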

🎉 Training data is ready

You now have a perfectly balanced, tokenized dataset that you can use to train your AI assistant.

AI-Generated Review

What is data_io?

Data IO is a production-grade data pipeline for preparing instruction-tuning datasets. It takes raw data from sources like GSM8K, FLAN, and OpenMathInstruct, cleans and standardizes it into instruction-response pairs, then converts everything into token arrays ready for training. The pipeline runs in four stages: cleaning, optional tokenizer training, tokenization, and stratified sampling. Python drives the orchestration, while a Rust-based tokenizer does the heavy lifting of converting text to token IDs at high throughput. Output is written as numpy arrays with metadata so training jobs can load data quickly.

Why is it gaining traction?

The Rust tokenizer is the standout feature. It processes data incrementally and prunes stale outputs automatically when source data changes, which saves hours of manual cleanup. The stratified sampler lets you configure per-dataset limits, upsampling ratios, and context truncation behavior through a simple YAML file. Instead of writing custom sampling logic for every training run, you point the sampler at your tokenized data and get balanced epochs. The pipeline also handles both small JSONL datasets and large Parquet clusters, so you can start with a few datasets and scale up without rewriting your data loading code.
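Incremental processing with stale-output pruning can be sketched with content fingerprints, as below. This is an assumption about how such a scheme might work, not a description of the project's Rust implementation; `fingerprint`, `plan_update`, and the manifest layout are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content hash used to detect whether a source has changed."""
    return hashlib.sha256(data).hexdigest()


def plan_update(sources, manifest):
    """Decide what to re-tokenize and which old outputs to delete.

    sources:  dict mapping source name to its current raw bytes.
    manifest: dict mapping source name to the fingerprint recorded
              the last time that source was processed.
    Returns (to_process, to_prune, new_manifest).
    """
    current = {name: fingerprint(data) for name, data in sources.items()}
    # New or modified sources need (re)processing.
    to_process = sorted(n for n, fp in current.items() if manifest.get(n) != fp)
    # Outputs whose source no longer exists are stale and can be removed.
    to_prune = sorted(n for n in manifest if n not in current)
    return to_process, to_prune, current


manifest = {
    "a.jsonl": fingerprint(b"old"),
    "b.jsonl": fingerprint(b"same"),
    "gone.jsonl": fingerprint(b"x"),
}
sources = {"a.jsonl": b"new", "b.jsonl": b"same", "c.jsonl": b"fresh"}
to_process, to_prune, manifest = plan_update(sources, manifest)
```

Here `b.jsonl` is skipped because its fingerprint is unchanged, which is where the time saving comes from on large corpora.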

Who should use this?

ML engineers building pretraining or instruction-tuning pipelines will get the most value. If you are spending time writing one-off scripts to clean datasets, filter responses, and sample balanced batches, this pipeline replaces that glue code. Researchers replicating data workflows from papers will find the cleaning scripts useful as reference implementations. Data engineers supporting model training teams will appreciate the automated incremental processing. This is not for casual experimentation with small datasets; the 512GiB RAM requirement alone signals that this is infrastructure for serious training runs.

Verdict

Data IO is a well-structured pipeline from a credible research team, but with only 19 stars it remains early-stage and opinionated toward HRM-Text workflows. The documentation is thorough and the Rust tokenizer is genuinely fast, but test coverage is unclear and the 512GiB RAM requirement makes local experimentation difficult. The 85% credibility score reflects solid engineering practices rather than community validation. Consider it if you are building pretraining infrastructure and want reference implementations for data cleaning and stratified sampling; otherwise, wait for broader adoption and ecosystem tooling.
