facebookresearch/ProgramBench

Can Language Models Rebuild Programs From Scratch?

AI Summary

ProgramBench is a benchmark for testing if AI agents can recreate open-source programs' functionality from only their compiled binaries and documentation.

How It Works

1. 🔍 Discover ProgramBench
You find this benchmark on GitHub or its website, curious whether AI can rebuild real programs from only their compiled binaries and help docs.

2. 📥 Get it ready
You set it up on your machine with a simple install, so everything is ready to test AI-built code.

3. 📦 Grab test programs
You download bundles of real-world programs, each with its source code hidden but full documentation of what it does.

4. 🚀 Test your AI's work
You feed in your AI agent's rebuilt code and run the evaluation; it checks whether the rebuild matches the original program's behavior (see the sketch after this list).

5. 📊 See the scores
You get clear reports on pass/fail rates, warnings, and per-test details, showing exactly how well the rebuild did.

6. 🏆 Join the leaderboard
Your results help rank AI tools worldwide, advancing smarter software-building assistants for everyone.
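
A minimal sketch of steps 3 through 5, assuming the `programbench` CLI is installed and on your PATH. Only the `eval` and `info` subcommands are mentioned in the review below; the submission tarball path is a hypothetical placeholder for whatever your agent produced.

```python
# Sketch: drive the two documented subcommands from Python. The tarball
# path is a hypothetical placeholder for your agent's rebuilt code.
import subprocess
import sys

SUBMISSION = "rebuilds/my_tool_rebuild.tar.gz"  # hypothetical agent output

def run(cmd: list[str]) -> str:
    """Run a CLI command and return stdout, failing loudly on error."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(f"{' '.join(cmd)} failed:\n{result.stderr}")
    return result.stdout

print(run(["programbench", "info"]))              # step 3: task summaries
print(run(["programbench", "eval", SUBMISSION]))  # step 4: score the rebuild
```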

AI-Generated Review

What is ProgramBench?

ProgramBench tests whether language models can reverse-engineer real CLI tools from their compiled binaries and docs alone, rebuilding full codebases that pass the original tests. It pulls tasks from GitHub repos spanning top-ranked languages like Rust, Go, and C, with Dockerized environments for fair evaluation. Run `programbench eval` on a submission tarball to score a rebuild against pytest suites, or `programbench info` for task summaries.
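
To make "score rebuilds via pytest suites" concrete, here is the kind of behavioral-equivalence check such a suite might contain. This is a sketch, not ProgramBench's actual test code: the binary paths and the `--version` flag are assumptions.

```python
# Sketch of a behavioral-equivalence pytest check: the rebuilt binary
# should produce the same output as the original for the same inputs.
# Binary paths and flags here are hypothetical.
import subprocess

REFERENCE = "./bin/original"  # compiled binary shipped with the task bundle
REBUILD = "./bin/rebuilt"     # binary built from the agent's submitted code

def output_of(binary: str, args: list[str]) -> str:
    """Capture stdout of a binary invoked with the given arguments."""
    return subprocess.run(
        [binary, *args], capture_output=True, text=True, check=True
    ).stdout

def test_version_output_matches():
    assert output_of(REBUILD, ["--version"]) == output_of(REFERENCE, ["--version"])
```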

Why is it gaining traction?

Unlike code-completion benchmarks, it demands end-to-end architecture from black-box inputs, echoing papers like "Language Models are Few-Shot Learners" amid the SWE-agent hype. The CLI handles parallelism, retries flaky tests, and merges partial results, yielding precise pass rates without manual setup (a toy sketch of that idea follows below). It also tracks shifts in GitHub language statistics, like Rust's rise, through realistic rebuild scenarios.
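
As a toy illustration of the retry-and-merge behavior described above (not ProgramBench's implementation, just the idea under assumed semantics: a flaky test counts as passing if any attempt succeeds):

```python
# Toy illustration: retry flaky tests in parallel and merge the outcomes
# into a single pass rate. Not ProgramBench's code, just the concept.
from concurrent.futures import ThreadPoolExecutor

def run_with_retries(test, attempts: int = 3) -> bool:
    """A flaky test counts as passing if any attempt succeeds."""
    return any(test() for _ in range(attempts))

def pass_rate(tests, workers: int = 4) -> float:
    """Run tests concurrently and merge results into one pass rate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_with_retries, tests))
    return sum(results) / len(results)

print(pass_rate([lambda: True, lambda: True, lambda: False]))  # ~0.67
```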

Who should use this?

LM researchers benchmarking agentic coding beyond prompt completion, for instance probing whether models know their own limits, per "Language Models (Mostly) Know What They Know." Devs building autonomous SWE tools who need reverse-engineering baselines. Teams tracking GitHub language trends who want to stress-test rebuilds of language-detection and stats tooling.

Verdict

Grab it if you're evaluating LM software-building skills: the solid CLI, Docker isolation, and leaderboard make experiments fast. A low credibility score and just 88 stars signal early maturity, but Meta backing, clear docs, and an accompanying paper add trust; scale up as agent evals mature.
