akitaonrails/llm-coding-benchmark

Simple benchmark to test the most popular open source and commercial LLMs with automated OpenCode

AI Summary

This project benchmarks various AI language models on their ability to autonomously generate a complete Ruby on Rails web application from a fixed prompt, including validation of runtime functionality.

How It Works

1
🔍 Discover the Benchmark

You stumble upon this handy tool that lets everyday folks compare how well different AI helpers can build a full website on their own.

2
⚙️ Get Your AI Helpers Ready

You prepare your local AI brains by warming them up so they can handle big tasks without hiccups.

3
📝 Pick Your Challengers

You choose which AI models from the list to test, mixing local ones on your computer and cloud ones.

4
🚀 Launch the Coding Race

With one simple command, you kick off the challenge and each AI tries to create a complete website step by step (a minimal runner sketch follows this list).

5
👀 Watch Them Work

You sit back as the AIs generate code, files, and even test if everything runs smoothly.

6
📊 Review the Scorecard

Automatic reports pop up showing times, file counts, success rates, and which AI built the best site.

7
🎉 Celebrate Insights Gained

You now know exactly which AI is the champion coder, with working websites to explore and share.
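
To make step 4 concrete, here is a minimal sketch of what such a runner loop could look like. The `MODELS` list, the `generate_app` helper, the `prompt.md` file, and the `opencode run` invocation are stand-ins for illustration; the real repo drives OpenCode through its own `scripts/run_benchmark.py`.

```python
import json
import subprocess
import time
from pathlib import Path

# Hypothetical model list mixing a local (Ollama) and a cloud (OpenRouter) entry;
# the real benchmark keeps its own model registry.
MODELS = ["qwen2.5-coder:32b", "openrouter/anthropic/claude-opus-4"]


def generate_app(model: str, workdir: Path) -> bool:
    """Ask one model to build the Rails app via the OpenCode CLI.

    The `opencode run --model <id> <prompt>` invocation is an approximation
    for illustration, not the repo's exact command.
    """
    prompt = Path("prompt.md").read_text()
    result = subprocess.run(
        ["opencode", "run", "--model", model, prompt],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return result.returncode == 0


def main() -> None:
    report = []
    for model in MODELS:
        workdir = Path("runs") / model.replace("/", "_").replace(":", "_")
        workdir.mkdir(parents=True, exist_ok=True)
        start = time.monotonic()
        ok = generate_app(model, workdir)
        elapsed = time.monotonic() - start
        # Count every file the agent produced inside this run's directory.
        files = sum(1 for p in workdir.rglob("*") if p.is_file())
        report.append(
            {"model": model, "ok": ok, "seconds": round(elapsed, 1), "files": files}
        )
    Path("report.json").write_text(json.dumps(report, indent=2))


if __name__ == "__main__":
    main()
```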

AI-Generated Review

What is llm-coding-benchmark?

This Python tool runs a simple coding benchmark against open-source and commercial LLMs, tasking them via OpenCode to build a full Rails app with real LLM integration, tests, Docker setup, and runtime validation. It compares local models on Ollama or llama-swap against cloud providers like OpenRouter, capturing metrics such as tokens/sec, file counts, and boot success in JSON and Markdown reports. Developers get a benchmark dataset and leaderboard that spot models producing actually runnable code, not just file scaffolds.
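
As a rough illustration of what those reports could contain, here is a sketch that turns per-model JSON records into a Markdown leaderboard. The field names (`tokens_per_sec`, `boot_ok`, and so on) are guesses for illustration, not the repo's actual schema.

```python
import json
from pathlib import Path

# Load per-model records, e.g.:
# [{"model": "...", "tokens_per_sec": 42.0, "files": 87, "boot_ok": true}, ...]
# Field names are illustrative guesses, not the repo's real schema.
results = json.loads(Path("report.json").read_text())

# Rank apps that boot first, then faster generation within each group.
rows = sorted(results, key=lambda r: (r["boot_ok"], r["tokens_per_sec"]), reverse=True)

lines = ["| Model | tok/s | Files | Boots? |", "|---|---|---|---|"]
for r in rows:
    boots = "yes" if r["boot_ok"] else "no"
    lines.append(f"| {r['model']} | {r['tokens_per_sec']:.1f} | {r['files']} | {boots} |")

Path("report.md").write_text("\n".join(lines) + "\n")
```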

Why is it gaining traction?

Unlike generic coding leaderboards on Reddit or Hugging Face, it exposes the gap between "looks complete" outputs and runtime failures: most models hallucinate APIs despite producing a perfect file structure. The repo makes it dead simple to add models, run subsets via CLI (e.g., `python scripts/run_benchmark.py --model claude_opus`), and analyze runs with warmup scripts for context limits or runtime probes for Docker boots. Results like Claude Opus topping the viability rankings hook devs chasing reliable autonomous coding.
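
The runtime probe is the interesting part, so a sketch helps: instead of trusting a complete-looking file tree, boot the generated app in Docker and check that it answers HTTP at all. The compose invocation, port 3000, and timeout below are assumptions for illustration, not details taken from the repo.

```python
import subprocess
import time
import urllib.error
import urllib.request
from pathlib import Path


def probe_boot(app_dir: Path, url: str = "http://localhost:3000", timeout: int = 120) -> bool:
    """Boot the generated Rails app with Docker Compose and poll until it answers.

    Assumes the generated app ships a compose file and serves on port 3000.
    """
    subprocess.run(["docker", "compose", "up", "-d", "--build"], cwd=app_dir, check=False)
    deadline = time.monotonic() + timeout
    try:
        while time.monotonic() < deadline:
            try:
                with urllib.request.urlopen(url, timeout=5):
                    return True  # any 2xx/3xx response: the app booted
            except urllib.error.HTTPError:
                return True  # an error page still proves the server is up
            except OSError:
                time.sleep(3)  # connection refused: not up yet, retry
        return False
    finally:
        # Always tear the containers down so the next model starts clean.
        subprocess.run(["docker", "compose", "down"], cwd=app_dir, check=False)
```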

Who should use this?

AI engineers benchmarking 2025 model candidates for agentic tools, Rails teams evaluating LLMs for app scaffolding with RubyLLM integration, or local inference users tweaking llama.cpp GGUFs on Linux. Ideal for anyone tired of flaky OpenCode runs that need normalized metadata across providers.

Verdict

Grab this tool if you're deep into LLM coding benchmarks: its runtime audits deliver real insights despite the 12 stars and 1.0% credibility score signaling early maturity. Docs are solid for setup, but expect tweaks for your hardware; skip it unless Rails/OpenCode fits your stack.
