Terminal tool for inspecting and cleaning large LLM training datasets. Handles JSONL, Parquet, and CSV with memory-mapped I/O, near-duplicate detection, token visualization, dataset linting, and an MCP server.
Caret is a terminal tool for viewing, searching, cleaning, and analyzing large datasets used to train AI language models.
How It Works
You hear about a handy tool that lets you explore and tidy up massive files full of training examples for AI chatbots.
You download it and open it in your terminal, ready to dive into your data.
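Point it to a file on your computer, and it opens quickly even when the file is very large, because it reads through memory-mapped I/O instead of loading everything into memory first.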
Type a web link to an online dataset, and it pulls just what you need without a full download.
Scroll through examples with duplicates highlighted, broken entries flagged, and text split into color-coded AI 'tokens'.
Press a key to scan for near-duplicates, fix formatting glitches, or lint the dataset for problems.
Switch on sharing mode (the MCP server) so your AI assistant can jump around, search, or analyze right alongside you.
You end up with a spotless, duplicate-free collection ready to train even smarter AI models.
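To make two of the ideas above more concrete (memory-mapped reading and near-duplicate detection), here is a minimal Python sketch. It is not Caret's implementation: the function names, the "text" field, the 3-word shingles, and the 0.6 threshold are all assumptions chosen for illustration. The pairwise comparison is fine for a demo but would not scale to very large files without a technique such as MinHash.

```python
import json
import mmap
import os
import tempfile


def index_lines(mm):
    """Return (start, end) byte spans of every non-blank line in the mapping."""
    spans, pos, size = [], 0, len(mm)
    while pos < size:
        end = mm.find(b"\n", pos)
        if end == -1:
            end = size
        if mm[pos:end].strip():
            spans.append((pos, end))
        pos = end + 1
    return spans


def shingles(text, k=3):
    """Split text into a set of overlapping k-word tuples."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_near_duplicates(path, field="text", threshold=0.6):
    """Return index pairs of records whose shingle sets are similar enough.

    The file is memory-mapped, so the OS pages data in on demand rather
    than the program reading the whole file up front.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            spans = index_lines(mm)
            sets = [shingles(json.loads(mm[s:e])[field]) for s, e in spans]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    rows = [
        {"text": "The quick brown fox jumps over the lazy dog"},
        {"text": "The quick brown fox jumps over the lazy cat"},
        {"text": "Completely unrelated sentence about training data"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.jsonl")
        with open(path, "w", encoding="utf-8") as out:
            for row in rows:
                out.write(json.dumps(row) + "\n")
        print(find_near_duplicates(path))  # prints [(0, 1)]
```

The first two sample records share six of their eight distinct 3-word shingles (similarity 0.75), so they are reported as a near-duplicate pair, while the third record shares none.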