lilakk

Official code for "How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs"

AI Summary

A toolkit for mining structured how-to procedures from web documents to benchmark and improve instruction-following in large language models.

How It Works

1
📖 Discover How2Everything

You stumble upon a helpful toolkit that pulls real step-by-step guides from everyday web articles to test and train AI assistants.

2
🛠️ Get it ready

You install it locally and connect your preferred AI provider (an API key or a local model) in a few minutes.

3
⛏️ Mine how-to procedures

You feed in web documents and it extracts structured goals, steps, and tools from thousands of pages across topics like cooking or taxes.

4
🎯 Test AI assistants

You run benchmarks to score how well different AIs create working instructions, spotting flaws like missing steps.

5
🚀 Build training sets

You turn the mined guides into ready-to-use datasets for improving AI performance.

Smarter AI instructions

Your AI now delivers reliable, complete how-to plans that people can actually follow without frustration.
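The mining step above produces structured procedure records. As a rough illustration, here is a minimal sketch of what one mined record might look like, with a trivial completeness check; the field names (`goal`, `steps`, `tools`) are assumptions for illustration, not the project's actual schema:

```python
# Hypothetical sketch of a mined how-to record. Field names (goal/steps/tools)
# are illustrative assumptions; the real How2Everything schema may differ.
procedure = {
    "goal": "Make cold-brew coffee",
    "tools": ["coarse-ground coffee", "jar", "fine strainer"],
    "steps": [
        "Combine 1 part coffee with 4 parts cold water in the jar.",
        "Steep covered at room temperature for 12 to 18 hours.",
        "Strain out the grounds and refrigerate the concentrate.",
    ],
}

def is_well_formed(proc: dict) -> bool:
    """Basic sanity check: a usable record needs a goal and at least one step."""
    return bool(proc.get("goal")) and len(proc.get("steps", [])) > 0

print(is_well_formed(procedure))  # True
```

Records of this shape are what make downstream benchmarking and dataset construction possible: each one pairs a concrete goal with a checkable sequence of steps.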

AI-Generated Review

What is how2everything?

This Python package provides a CLI-driven pipeline to mine structured how-to procedures (goals, steps, resources) from web documents, turning ~1M pages into 351K examples across 14 topics. It powers How2Bench, a 7K-example dataset for testing LLMs on real-world instructions, using How2Score to flag critical failures like missing prerequisites via an 8B open judge model. Developers get ready-to-run tools for evaluation and RL training data, all backed by HuggingFace datasets in this official GitHub repository.
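To make the "critical failures" idea concrete, here is a toy illustration of flagging a missing prerequisite step. This is not the actual How2Score metric (which uses an 8B judge model); it only shows the kind of error such a judge is meant to catch, using a naive string comparison:

```python
# Toy illustration only: How2Score itself uses an LLM judge, not string
# matching. This sketch shows the failure mode it targets -- a generated
# procedure that silently drops a prerequisite step.
def missing_prerequisites(reference_steps: list[str],
                          generated_steps: list[str]) -> list[str]:
    """Return reference steps absent from the generated procedure
    (case-insensitive exact match)."""
    generated = {s.lower() for s in generated_steps}
    return [s for s in reference_steps if s.lower() not in generated]

reference = ["unplug the appliance", "remove the back panel", "replace the fuse"]
generated = ["Remove the back panel", "Replace the fuse"]

print(missing_prerequisites(reference, generated))  # ['unplug the appliance']
```

A fluency metric would score the generated steps highly; a procedure-level check like this surfaces the omitted safety prerequisite.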

Why is it gaining traction?

Unlike generic benchmarks, it catches subtle errors in LLM outputs that fluency metrics miss, with reproducible leaderboards and distillation from frontier models for cheap judging. The end-to-end flow—from mining via `h2e mine run` to benchmarking via `h2e bench run`—supports hosted APIs (OpenAI, Anthropic) or local vLLM, plus deduplicated training splits. Official releases mirror the artifacts on HuggingFace, appealing to teams pursuing agentic improvements without building custom scraping pipelines.

Who should use this?

LLM researchers benchmarking instruction generation on practical tasks like recipes or tax filing. Model trainers seeking web-scale how-to data for RLHF without manual annotation. AI agent builders validating procedure reliability before deployment.

Verdict

Worth forking for LLM evals—strong paper, datasets, and CLI, though 15 stars and 1.0% credibility signal early maturity. Add test coverage before production use; otherwise it's research-grade gold.
