zhenglw02 / SWE-Hub

Public

A toolkit for synthesizing high-quality code training data using LLM agents. It provides three independent pipelines, each producing a different type of training data from real open-source repositories. Technical Report: https://arxiv.org/abs/2603.00575

15 stars · 2 forks · Shell · 100% credibility

AI Summary

A toolkit that uses AI agents to automatically create diverse, high-quality synthetic code datasets from open-source repositories for training software engineering models.

How It Works

1
📚 Discover the toolkit

You stumble upon a helpful collection of recipes for creating practice code examples to train smart coding assistants.

2
🔧 Set up your kitchen

Download the free tools and prepare your workspace so everything is ready to cook.

3
🧠 Connect a smart helper

Link a thinking AI service that powers the magic behind creating realistic code scenarios.
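The page doesn't show SWE-Hub's actual wiring, but for OpenAI-compatible services the usual pattern is a base URL plus an API key. A minimal sketch, where the endpoint, key, and model name are all placeholders:

```python
# Minimal sketch: connecting an OpenAI-compatible LLM service.
# The endpoint, key, and model below are placeholders, not SWE-Hub's actual config.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # any OpenAI-compatible server
    api_key="YOUR_API_KEY",               # placeholder credential
)

response = client.chat.completions.create(
    model="your-model-name",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarize this repository's build steps."}],
)
print(response.choices[0].message.content)
```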

4
Choose your recipe
🏠
Environment setups

Guides for getting projects running smoothly.

🐛
Bug challenges

Realistic errors with matching reports (sketched after this list).

📖
Code explanations

Clear notes paired with code changes.
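To make the bug-challenges recipe concrete, here is a hypothetical illustration (not actual SWE-Hub output) of the kind of subtle, test-observable defect and matching report such a pipeline aims to produce:

```python
# Hypothetical illustration of a "subtle bug" pair; not actual SWE-Hub output.

# Original (correct) function:
def moving_average(values, window):
    """Return the mean of the last `window` items."""
    tail = values[-window:]
    return sum(tail) / len(tail)

# Subtly bugged variant a pipeline might synthesize: off-by-one window slice.
def moving_average_buggy(values, window):
    tail = values[-window + 1:]  # BUG: drops one element from the window
    return sum(tail) / len(tail)

# GitHub-style issue report the pipeline would pair with the patch:
ISSUE = """
Title: moving_average returns wrong result for window >= 2
Steps to reproduce: moving_average([1, 2, 3, 4], 2) -> expected 3.5, got 4.0
"""
```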

5
Start cooking

Feed in some open-source projects and let the toolkit automatically generate batches of high-quality examples.
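SWE-Hub's real entry points and flags aren't shown on this page, so the driver below is purely hypothetical; it only sketches the batch shape of this step:

```python
# Purely hypothetical driver; SWE-Hub's actual CLI, module names, and flags
# are not documented on this page.
import subprocess

repos = [
    "https://github.com/pallets/flask",
    "https://github.com/psf/requests",
]
for repo in repos:
    subprocess.run(
        ["python", "-m", "swe_hub.bug_pipeline",      # hypothetical module name
         "--repo", repo, "--out", "data/bugs.jsonl"], # hypothetical flags
        check=True,
    )
```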

6
📥 Collect your results

Download folders full of ready-to-use training data, complete with tests and descriptions.
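The review below notes JSONL output. A minimal sketch for loading such a file, with the field names assumed for illustration:

```python
# Minimal sketch: loading a JSONL dataset file.
# The field names ("repo", "patch", "issue") are assumptions for illustration.
import json

with open("data/bugs.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f if line.strip()]

for ex in examples[:3]:
    print(ex.get("repo"), "-", ex.get("issue", "")[:60])
```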

🎉 Training data ready!

Your AI coding assistant now has realistic practice material drawn from real-world scenarios.

AI-Generated Review

What is SWE-Hub?

SWE-Hub is an AI toolkit on GitHub that leverages LLM agents to synthesize high-quality code training data from real open-source repositories. It runs three independent pipelines: reproducible install/test scripts with Docker images, subtle bug patches paired with GitHub-style issue reports, and function-level natural-language docs matched to code diffs, with bug generation designed to scale across languages. Python-based with Kubernetes sandboxes for safe execution, it delivers verified, executable datasets ready for LLM fine-tuning.
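The repo's sandbox setup isn't detailed here, but as a rough sketch of what a Kubernetes sandbox means in practice, one isolated pod could be launched with the official kubernetes Python client (image and names are placeholders):

```python
# Minimal sketch of an isolated execution pod, using the official kubernetes client.
# SWE-Hub's actual sandbox setup is not shown here; image and names are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="swe-sandbox-example"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="runner",
            image="python:3.11-slim",  # placeholder task image
            command=["python", "-c", "print('sandbox ok')"],
        )],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```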

Why is it gaining traction?

Unlike raw scrapers or simple mutators, it uses ReAct agents in isolated environments to ensure data realism, such as PASS-to-FAIL test regressions validated against pytest oracles. The hook is plug-and-play CLI pipelines with OpenAI-compatible APIs that output JSONL for easy integration, plus a cited arXiv report demonstrating production-scale viability. It stands apart from general-purpose agent frameworks such as LangChain's GitHub toolkits by focusing on SWE-specific tasks like bug synthesis.
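The PASS-to-FAIL oracle is simple to sketch: a synthesized bug is kept only if the test suite passes on the clean tree and fails once the patch lands. A minimal version, with repo layout and patch handling as simplified assumptions:

```python
# Minimal sketch of a PASS-to-FAIL oracle: keep a bug patch only if tests
# pass on the clean tree and fail once the patch is applied.
# Repo layout and patch handling are simplified assumptions.
import subprocess

def tests_pass(repo_dir: str) -> bool:
    result = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    return result.returncode == 0

def is_valid_bug(repo_dir: str, patch_file: str) -> bool:
    if not tests_pass(repo_dir):           # oracle must PASS pre-patch
        return False
    subprocess.run(["git", "apply", patch_file], cwd=repo_dir, check=True)
    try:
        return not tests_pass(repo_dir)    # and FAIL post-patch
    finally:
        # restore the clean tree by reversing the patch
        subprocess.run(["git", "apply", "-R", patch_file], cwd=repo_dir, check=True)
```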

Who should use this?

ML engineers training code LLMs who need diverse bug-fix pairs; benchmark builders generating evaluations from real repositories; and researchers preparing nl2repo-style docs for instruction tuning. It is also a practical starting point for anyone building SWE data pipelines who wants task-specific agent tooling rather than a general-purpose agent framework.

Verdict

Grab it if you're working on code data synthesis: solid pipelines and docs outweigh the repo's early-stage traction (found at 14 stars) and its Kubernetes prerequisites. Early but constructive for fork-and-extend; pair it with lighter sandboxes for broader appeal.
