pku-liang / hwe-bench (Public)

Benchmarking LLM agents on real-world hardware bug repair tasks

13 stars · 1 fork · 100% credibility · Python
AI Summary

HWE-bench evaluates AI coding agents on real bug fixes from open-source hardware projects written in Verilog, SystemVerilog, and Chisel.
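To make the task format concrete, here is a minimal sketch of what one such bug case could look like in Python; the class and field names are hypothetical, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class BugCase:
    """One hypothetical HWE-bench-style case: a real hardware repo, a commit
    where known tests fail, the human fix, and the tests that expose the bug."""
    project: str                      # e.g. "ibex" or "opentitan"
    buggy_commit: str                 # revision on which the listed tests fail
    fixed_commit: str                 # revision on which the same tests pass
    failing_tests: list[str] = field(default_factory=list)

    def is_resolved(self, passing_tests: set[str]) -> bool:
        # A candidate patch counts as "resolved" only if every originally
        # failing test now passes in simulation.
        return all(t in passing_tests for t in self.failing_tests)
```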

How It Works

1
🔍 Discover HWE-bench

You find this benchmark, which tests AI coding agents on fixing real bugs in chip designs from popular open-source hardware projects.

2
🛠️ Get everything ready

You set up your environment with the provided tooling and grab the example bug cases to start testing.

3
📥 Download bug examples

You pull in the real-world bug data and test setups from hardware projects like Ibex or Rocket-Chip (see the dataset-loading sketch after these steps).

4
🚀 Run your AI fixer

You launch your favorite AI coding assistant to tackle the bugs, watching it generate fixes step by step.

5
📝 Collect the fixes

Your assistant creates patches; you gather them up for checking.

6
✅ Score the results

The system automatically re-runs each bug's tests to check whether the fix truly resolves it (see the scoring sketch after these steps).

7
🎉 Celebrate insights

You get a clear report on your agent's hardware bug-fixing performance, ready to compare against other models or improve on.
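For step 3, the review below notes that the datasets are published on Hugging Face, so loading the cases could look roughly like the sketch here; the dataset id and split are placeholders rather than the benchmark's real ones.

```python
# Requires: pip install datasets
from datasets import load_dataset

# Placeholder dataset id -- check the hwe-bench README for the actual one.
cases = load_dataset("pku-liang/hwe-bench-cases", split="test")

# Peek at the first few cases to see which project and commit each one targets.
for case in cases.select(range(3)):
    print(case)
```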
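For step 6, the core scoring idea (apply a candidate patch to a clean buggy checkout, then re-run the failing tests) can be sketched as below; the `run_tests.sh` script and the one-patch-per-checkout layout are assumptions for illustration, not hwe-bench's actual harness.

```python
import subprocess
from pathlib import Path


def resolves(repo_dir: Path, patch_file: Path) -> bool:
    """Apply one candidate patch to a clean buggy checkout and re-run its tests."""
    # Start from a clean tree so earlier attempts don't leak into this one.
    subprocess.run(["git", "reset", "--hard"], cwd=repo_dir, check=True)
    if subprocess.run(["git", "apply", str(patch_file)], cwd=repo_dir).returncode != 0:
        return False
    # Placeholder for the project's simulation-based regression tests.
    return subprocess.run(["bash", "run_tests.sh"], cwd=repo_dir).returncode == 0


def resolved_rate(checkouts: list[Path], patches: list[Path]) -> float:
    """Fraction of cases whose patch makes the originally failing tests pass."""
    results = [resolves(repo, patch) for repo, patch in zip(checkouts, patches)]
    return sum(results) / len(results) if results else 0.0
```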

AI-Generated Review

What is hwe-bench?

hwe-bench lets you benchmark LLM agents on fixing real hardware bugs in RTL code written in Verilog, SystemVerilog, and Chisel. It pulls 417 verified cases from six open-source projects such as ibex and OpenTitan, where tests fail on the buggy commit but pass after the fix, verified via simulation. The workflow: download datasets from Hugging Face, generate Dockerized tasks, run agents like Claude Code or Codex CLI, extract patches, and score resolved rates automatically.
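Since the tasks are Dockerized, a bare-bones way to run an agent inside a per-task container might look like the sketch below; the image name, mount point, and agent command are placeholders, not commands hwe-bench actually ships.

```python
import subprocess


def run_agent_in_container(task_image: str, workdir: str, agent_cmd: list[str]) -> int:
    """Launch a (hypothetical) agent command inside an isolated task container."""
    return subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{workdir}:/workspace",   # expose the buggy checkout to the agent
            "-w", "/workspace",
            task_image,
            *agent_cmd,
        ]
    ).returncode


# Placeholder names: substitute the real task image and agent CLI invocation.
run_agent_in_container("hwe-bench/task-ibex:example", "/tmp/ibex", ["echo", "agent goes here"])
```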

Why is it gaining traction?

Unlike software-only benchmarks, it tackles consequential real-world hardware tasks with end-to-end verification, plus an evolving leaderboard showing top models like Kimi K2.6 hitting 66.9% on the full set. Harbor integration means parallel evals across repos, with recipes for agents including Kimi CLI and DeepSeek, making it simple to compare LLMs on bug repair without custom setups.
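The parallel-evaluation idea is easy to picture even without Harbor; the sketch below uses Python's standard `concurrent.futures` rather than Harbor's actual API, and `evaluate_case` is a stand-in for whatever per-case runner you wire up.

```python
from concurrent.futures import ProcessPoolExecutor


def evaluate_case(case_id: str) -> bool:
    """Stand-in for one end-to-end run: agent, patch extraction, test re-run."""
    ...
    return False


def evaluate_all(case_ids: list[str], workers: int = 8) -> float:
    """Run cases in parallel and report the overall resolved rate."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(evaluate_case, case_ids))
    return sum(results) / len(results) if results else 0.0
```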

Who should use this?

Hardware engineers testing LLM agents for RTL debugging on projects like RISC-V cores. AI researchers benchmarking LLMs on hardware-specific code repair or unit-test generation from real code. Teams evaluating LLM-powered assistants or multi-agent workflows for Verilog/SystemVerilog development.

Verdict

Solid starter benchmark for LLM agents on hardware bugs: strong paper, clear quickstart, and datasets on Hugging Face. But 13 stars and a 1.0% credibility score show it's early days. Grab it if you work in hardware engineering; otherwise stick with more mature alternatives until adoption grows.

