Leey21 / data-lineage

Public

Trace origins, shared sources, and contamination risk

19
1
100% credibility
Found Apr 16, 2026 at 19 stars.
AI Analysis
Python
AI Summary

A research tool that automatically reconstructs the origins and relationships of datasets used in training large language models by analyzing their documentation from Hugging Face, papers, blogs, and GitHub.

How It Works

1
🔍 Discover the tracing tool

You find this helpful tool online that uncovers the hidden origins of data used to train AI models, like a family tree for datasets.

2
🌐 Try the quick online demo

Jump into the ready-to-use web version to instantly trace a dataset and see its connections light up on screen.

3
Choose your way
🖥️ Use web demo

Enter a dataset name and watch results appear right away.

💻 Run locally

Prepare a simple list of datasets and launch the analysis.

4
📝 List your datasets

Jot down the names of datasets you want to explore, one per line in a text file.
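As a rough illustration of the one-name-per-line input described above, here is a minimal Python sketch for reading such a file. The function name `read_dataset_list`, the comment-line handling, and the de-duplication are assumptions made for this example; the project's own loader may behave differently.

```python
from pathlib import Path


def read_dataset_list(path):
    """Return dataset names from a one-per-line text file.

    Assumption for this sketch: blank lines and lines starting with '#'
    are ignored, and duplicates are dropped in first-seen order.
    """
    names = []
    seen = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        name = raw.strip()
        if not name or name.startswith("#"):
            continue
        if name not in seen:
            seen.add(name)
            names.append(name)
    return names
```

Dropping duplicates up front keeps a batch run from tracing the same dataset twice.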

5
🚀 Start the tracing

Launch the run; the tool reads Hugging Face dataset cards, papers, blogs, and GitHub repos to find connections between datasets.

6
Watch it work

Datasets that were already processed are skipped, and the lineage history is built up step by step as the run proceeds.
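One common way to implement this kind of skip-if-done behaviour is to check the existing output file before starting work. The sketch below assumes a JSONL results file in which each record carries a `dataset` field; that field name and both helper functions are hypothetical and not taken from the project.

```python
import json
from pathlib import Path


def load_done(results_path, key="dataset"):
    """Collect dataset names already recorded in a JSONL results file."""
    done = set()
    path = Path(results_path)
    if not path.exists():
        return done
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # tolerate a partially written last line
            if isinstance(record, dict) and key in record:
                done.add(record[key])
    return done


def pending(names, results_path):
    """Return the names that still need processing, in input order."""
    done = load_done(results_path)
    return [name for name in names if name not in done]
```

Appending one JSON record per finished dataset means an interrupted batch can be restarted without redoing earlier work.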

📊 Get your data map

The run produces JSONL files containing the complete lineage graph and supporting details, ready to explore or share.

AI-Generated Review

What is data-lineage?

This Python tool reconstructs data lineage for post-training LLM datasets: it traces origins, shared sources, and contamination risks using Hugging Face dataset cards, GitHub repos, blogs, and papers. Give it a list of Hugging Face dataset names via the CLI or the run.sh script, and it builds lineage graphs showing reuse, refinement, or derivation relationships, written as JSONL files that are easy to diagram. It is built on LangChain and LangGraph with OpenAI models, supports text-only or multimodal tracing, and skips already-processed items to keep batch runs efficient.
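To show how JSONL lineage output could be turned into something diagrammable, here is a minimal sketch that groups edges by child dataset. The record layout (`child`, `parent`, `relation` keys) and the dataset names are invented for this example; the tool's actual output schema is not documented here and may differ.

```python
import json
from collections import defaultdict


def build_graph(jsonl_lines):
    """Map each child dataset to a list of (parent, relation) pairs.

    Assumes one JSON object per line with hypothetical keys
    'child', 'parent', and 'relation'.
    """
    parents = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        edge = json.loads(line)
        parents[edge["child"]].append((edge["parent"], edge["relation"]))
    return dict(parents)


# Made-up sample records, not real tool output.
sample = [
    '{"child": "org/sft-mix", "parent": "org/raw-chat", "relation": "refinement"}',
    '{"child": "org/sft-mix", "parent": "org/math-qa", "relation": "reuse"}',
]
graph = build_graph(sample)
```

A dictionary like `graph` can then be passed to whatever graph or diagramming library you prefer.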

Why is it gaining traction?

Unlike manual audits or basic data lineage tools, it automates multi-agent tracing across sources and produces ready-to-visualize graphs with confidence scores and supporting evidence. The hosted web demo lets you test interactively without setup, while local runs support custom depth limits and incremental outputs. It is an open-source Python project accepted to ACL 2026, and it stands out for LLM-specific tracking such as contamination risk.
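The depth limit and confidence scores mentioned above can be pictured as a bounded breadth-first walk over the lineage edges. The sketch below is an illustration only: the `(child, parent, confidence)` tuple format, the function name `trace_upstream`, and the threshold parameter are assumptions, not the project's API.

```python
from collections import deque


def trace_upstream(edges, start, max_depth, min_confidence=0.0):
    """Breadth-first walk from `start` towards its sources.

    edges: iterable of (child, parent, confidence) tuples (hypothetical format).
    Returns {ancestor: depth} for ancestors within `max_depth` hops,
    ignoring edges whose confidence is below `min_confidence`.
    """
    parents = {}
    for child, parent, confidence in edges:
        if confidence >= min_confidence:
            parents.setdefault(child, []).append(parent)

    depths = {}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for parent in parents.get(node, []):
            if parent not in depths and parent != start:
                depths[parent] = depth + 1
                queue.append((parent, depth + 1))
    return depths
```

Raising `min_confidence` trades recall for precision: weakly supported links drop out of the traced history.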

Who should use this?

ML engineers auditing the lineage of Hugging Face datasets used in their pipelines. Researchers mapping the origins of synthetic training data to spot overlaps. Teams that need lineage evidence for compliance reports.
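Spotting overlaps between two datasets amounts to intersecting their sets of upstream sources. The following sketch assumes a simple `{child: [parents]}` mapping and uses invented dataset names; it illustrates the idea and is not the tool's own overlap logic.

```python
def ancestors(parents, node):
    """Return every dataset reachable upstream of `node` (iterative DFS).

    parents: dict mapping a dataset name to a list of its direct sources.
    """
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    seen.discard(node)  # a cycle must not make a node its own ancestor
    return seen


def shared_sources(parents, a, b):
    """Sources that appear upstream of both `a` and `b`."""
    return ancestors(parents, a) & ancestors(parents, b)


# Made-up lineage for illustration.
lineage = {
    "train-a": ["raw-1", "raw-2"],
    "train-b": ["raw-2"],
    "raw-2": ["web-crawl"],
}
overlap = shared_sources(lineage, "train-a", "train-b")
```

The same intersection can flag contamination risk if one of the two inputs is an evaluation benchmark and the other is a training set.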

Verdict

Solid for early exploration despite only 19 stars: the listing shows a 100% credibility score, and strong docs and a web demo offset light tests. Try the demo first; adopt locally if you batch-trace LLM data lineage regularly.
