A toolkit for synthesizing high-quality code training data using LLM agents. It provides three independent pipelines, each producing a different type of training data from real open-source repositories. Technical Report: https://arxiv.org/abs/2603.00575
A toolkit that uses AI agents to automatically create diverse, high-quality synthetic code datasets from open-source repositories for training software engineering models.
How It Works
The toolkit is a collection of pipelines for creating code examples used to train AI coding assistants.
Download the toolkit and set up your workspace.
Connect an LLM service, which the agents use to generate realistic code scenarios.
The three pipelines produce different kinds of data:
Guides for getting projects set up and running.
Realistic errors with matching reports.
Clear notes paired with code changes.
Feed in some open-source projects and let the toolkit automatically generate batches of high-quality examples.
Download folders full of ready-to-use training data, complete with tests and descriptions.
Your AI coding assistant now has realistic practice material to learn from real-world scenarios.