LemonTea03 / XTF

Public

[ICLR 2026] Repository of "Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets"

70 stars · 100% credibility
Found Mar 11, 2026 at 69 stars.
Python
AI Summary

This repository provides tools to filter noisy tokens out of training data, improving how language models learn domain-specific skills in math, coding, medicine, and finance.

How It Works

1. 📚 Discover XTF

You find XTF while looking for better ways to train language models on real-world data such as math problems or medical questions.

2. 💻 Prepare your workspace

Set up a Python environment on your machine so the tool is ready to run.

3. 📋 Choose your data and AI

Pick a dataset from math, coding, medicine, finance, or a similar domain, and select a base model to fine-tune.

4. ✨ Clean the data

The tool scans your data, spots and removes the confusing or noisy parts, leaving it ready for training.

5. 🚀 Train your improved AI

Fine-tune your model on the freshly cleaned data and watch it learn more effectively.

6. 📊 Test and compare results

Evaluate the fine-tuned model on held-out test questions and compare its scores against the original.

7. 🎉 Celebrate better AI

Your model now solves problems and answers questions more accurately and reliably!
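The workflow above can be sketched in a few lines of Python. Every function name and the scoring heuristic here are illustrative assumptions for explanation only, not XTF's actual API:

```python
# Minimal sketch of the filter-then-train workflow; names are
# illustrative assumptions, not the repo's real interface.

def token_scores(tokens):
    """Score each token's noisiness in [0, 1] (stub heuristic:
    flag obvious junk tokens, trust the rest)."""
    junk = {"um", "??", "<garbled>"}
    return [0.9 if t in junk else 0.1 for t in tokens]

def filter_tokens(tokens, threshold=0.5):
    """Drop tokens whose noise score meets or exceeds the threshold."""
    return [t for t, s in zip(tokens, token_scores(tokens)) if s < threshold]

sample = ["Solve", "2+2", "um", "step", "by", "step", "??"]
clean = filter_tokens(sample)
print(clean)  # ['Solve', '2+2', 'step', 'by', 'step']
```

The cleaned token list would then feed into an ordinary fine-tuning loop in place of the raw data.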

AI-Generated Review

What is XTF?

XTF cleans LLM fine-tuning datasets at the token level by scoring and masking noisy tokens based on reasoning importance, knowledge novelty, and task relevance. It turns sentence-level data into optimized inputs that boost model performance on math, code, medical QA, and finance tasks like GSM8K or HumanEval. Built in Python with ModelScope datasets and LoRA fine-tuning, you get CLI pipelines to augment data, train models, and evaluate results directly.
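A toy sketch of that scoring-and-masking idea: combine the three per-token signals named above (reasoning importance, knowledge novelty, task relevance) into one score, then mask low scorers. The weights, threshold, and `<mask>` symbol are assumptions, not the repo's actual implementation:

```python
# Illustrative weighted combination of three per-token signals and a
# masking step; all constants here are assumptions, not XTF's code.

def combined_score(reasoning, novelty, relevance, w=(0.4, 0.3, 0.3)):
    """Weighted sum of the three per-token signals."""
    return w[0] * reasoning + w[1] * novelty + w[2] * relevance

def mask_tokens(tokens, scores, keep_threshold=0.5):
    """Mask noisy tokens so a fine-tuning loss can skip them."""
    return [t if s >= keep_threshold else "<mask>"
            for t, s in zip(tokens, scores)]

tokens = ["The", "answer", "is", "asdfgh", "42"]
signals = [(0.8, 0.5, 0.9), (0.9, 0.6, 0.9), (0.7, 0.4, 0.8),
           (0.1, 0.1, 0.1), (0.9, 0.9, 0.9)]
scores = [combined_score(*s) for s in signals]
print(mask_tokens(tokens, scores))  # ['The', 'answer', 'is', '<mask>', '42']
```

Masking (rather than deleting) noisy tokens is one common way to keep sequence alignment intact while excluding those positions from the training loss.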

Why is it gaining traction?

This ICLR 2026 repo stands out for its explainable scoring: developers see exactly why each token gets filtered, unlike with black-box cleaners. It delivers measurable gains across domains via simple flags for score weights and thresholds, plus one-command end-to-end runs from data prep to evaluation. Amid ICLR 2026 buzz on OpenReview and Reddit, it is drawing researchers who want token-level improvements without retraining from scratch.

Who should use this?

ML engineers fine-tuning small LLMs on noisy instruction datasets for math reasoning or code generation. Domain specialists in medical or financial QA prepping PubMedQA/FIQA data. Academic teams replicating ICLR 2026 papers or benchmarking noise filtering before LoRA runs.

Verdict

Grab it if you're experimenting with LLM data quality: solid research code with clear docs and supported datasets, though its modest star count signals early maturity. Test it on your own pipeline; expect tweaks for production scale.


