facebookresearch/ProgramBench

Can Language Models Rebuild Programs From Scratch?

AI Summary

ProgramBench is a benchmark for testing if AI agents can recreate open-source programs' functionality from only their compiled binaries and documentation.

How It Works

1. 🔍 Discover ProgramBench
You find this benchmark on GitHub or its website, curious whether AI can rebuild real programs from only their compiled binaries and help docs.

2. 📥 Get it ready
You set it up on your machine with a simple install, so everything is ready to test AI-built code.

3. 📦 Grab test programs
You download bundles of real-world programs, each with its source code hidden but full documentation of what it does.

4. 🚀 Test your AI's work
You feed in your AI agent's rebuilt code and run the evaluation; it checks whether the rebuild matches the original program's behavior (see the sketch after this list).

5. 📊 See the scores
You get clear reports on pass/fail rates, warnings, and per-test details, showing exactly how well the rebuild did.

6. 🏆 Join the leaderboard
Your results help rank AI tools worldwide, advancing smarter software-building assistants for everyone.
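
A minimal sketch of steps 3 through 5, assuming the `programbench` CLI is installed and on your PATH. Only the `eval` and `info` subcommands are mentioned in the review below; the submission tarball path is a hypothetical placeholder for whatever your agent produced.

```python
# Sketch: drive the two documented subcommands from Python. The tarball
# path is a hypothetical placeholder for your agent's rebuilt code.
import subprocess
import sys

SUBMISSION = "rebuilds/my_tool_rebuild.tar.gz"  # hypothetical agent output

def run(cmd: list[str]) -> str:
    """Run a CLI command and return stdout, failing loudly on error."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        sys.exit(f"{' '.join(cmd)} failed:\n{result.stderr}")
    return result.stdout

print(run(["programbench", "info"]))              # step 3: task summaries
print(run(["programbench", "eval", SUBMISSION]))  # step 4: score the rebuild
```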

AI-Generated Review

What is ProgramBench?

ProgramBench tests whether language models can reverse-engineer real CLI tools from their compiled binaries and docs alone, rebuilding full codebases that pass the original tests. It pulls tasks from GitHub repos spanning top-ranked languages like Rust, Go, and C, with Dockerized environments for fair evaluation. Run `programbench eval` on a submission tarball to score a rebuild against pytest suites, or `programbench info` for task summaries.
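
To make "score rebuilds via pytest suites" concrete, here is the kind of behavioral-equivalence check such a suite might contain. This is a sketch, not ProgramBench's actual test code: the binary paths and the `--version` flag are assumptions.

```python
# Sketch of a behavioral-equivalence pytest check: the rebuilt binary
# should produce the same output as the original for the same inputs.
# Binary paths and flags here are hypothetical.
import subprocess

REFERENCE = "./bin/original"  # compiled binary shipped with the task bundle
REBUILD = "./bin/rebuilt"     # binary built from the agent's submitted code

def output_of(binary: str, args: list[str]) -> str:
    """Capture stdout of a binary invoked with the given arguments."""
    return subprocess.run(
        [binary, *args], capture_output=True, text=True, check=True
    ).stdout

def test_version_output_matches():
    assert output_of(REBUILD, ["--version"]) == output_of(REFERENCE, ["--version"])
```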

Why is it gaining traction?

Unlike code-completion benchmarks, it demands end-to-end architecture from black-box inputs, echoing papers like "Language Models are Few-Shot Learners" amid the SWE-agent hype. The CLI handles parallelism, retries flaky tests, and merges partial results, yielding precise pass rates without manual setup (a toy sketch of that idea follows below). It also tracks shifts in GitHub language statistics, like Rust's rise, through realistic rebuild scenarios.
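
As a toy illustration of the retry-and-merge behavior described above (not ProgramBench's implementation, just the idea under assumed semantics: a flaky test counts as passing if any attempt succeeds):

```python
# Toy illustration: retry flaky tests in parallel and merge the outcomes
# into a single pass rate. Not ProgramBench's code, just the concept.
from concurrent.futures import ThreadPoolExecutor

def run_with_retries(test, attempts: int = 3) -> bool:
    """A flaky test counts as passing if any attempt succeeds."""
    return any(test() for _ in range(attempts))

def pass_rate(tests, workers: int = 4) -> float:
    """Run tests concurrently and merge results into one pass rate."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_with_retries, tests))
    return sum(results) / len(results)

print(pass_rate([lambda: True, lambda: True, lambda: False]))  # ~0.67
```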

Who should use this?

LM researchers benchmarking agentic coding beyond prompt completion, for instance probing whether models know their own limits, per "Language Models (Mostly) Know What They Know." Devs building autonomous SWE tools who need reverse-engineering baselines. Teams tracking GitHub language trends who want to stress-test rebuilds of language-detection and stats tooling.

Verdict

Grab it if you're evaluating LM software-building skills: the solid CLI, Docker isolation, and leaderboard make experiments fast. A low credibility score and just 88 stars signal early maturity, but Meta backing, clear docs, and an accompanying paper add trust; scale up as agent evals mature.
