bakrianoo / tweety

Public

Readable LLMs Evaluations

Python · 19 stars · 100% credibility

AI Summary

Tweety is a command-line toolkit that evaluates AI language models on 14 structured tasks spanning text comprehension, reasoning, vision, structured outputs, and safety, and produces detailed reports that include performance metrics.

How It Works

1
🐦 Discover Tweety

You hear about Tweety, a friendly bird that helps test how smart AI helpers are at understanding text, thinking, seeing pictures, staying safe, and running fast.

2
πŸ’» Set up on your computer

Download and install Tweety so it's ready to use right away (the command sketch after these steps shows one way to do it).

3
πŸ”— Connect a smart checker

Link Tweety to a thinking service like GPT so it can fairly score the AI's answers.

4
πŸ“š Prepare the test questions

Tweety gathers stories, puzzles, pictures, and challenges once, ready for any AI.

5
πŸš€ Test your AI model

Pick your AI and watch Tweety run it through fun challenges, measuring smarts and speed.

6
πŸ“Š Get easy reports

Open colorful charts and summaries showing strengths, weaknesses, and scores.

πŸŽ‰ Celebrate insights

You now know exactly how well your AI performs and where to make it even better!
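
Put together, the steps above boil down to a short command-line session. The sketch below is hedged: only the `tweety preprocess --all` and `tweety run --model Llama-3-8B` commands appear in the review further down this page, while the install line and the API key line are assumptions about a typical Python CLI that uses an OpenAI-hosted judge.

```bash
# Assumption: the package can be installed straight from the GitHub repo
pip install git+https://github.com/bakrianoo/tweety.git

# Assumption: the gpt-4o-mini judge reads an OpenAI key from the environment
export OPENAI_API_KEY="sk-..."

# From the review below: prepare the task datasets once
tweety preprocess --all

# From the review below: evaluate a model and generate the reports
tweety run --model Llama-3-8B
```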

AI-Generated Review

What is tweety?

Tweety is a Python CLI for running structured evaluations on LLMs across 14 tasks in five groups: text comprehension (QA, summarization, needle-in-haystack), reasoning (math, instructions), vision (QA, scenes, OCR), structured outputs (JSON, function calling), and safety (refusals, injections). Point it at a model via backends like transformers, vllm, ollama, or litellm; it generates scored reports with HTML/markdown summaries, per-task breakdowns, and performance profiling such as latency and memory usage. Preprocess once with `tweety preprocess --all`, then `tweety run --model Llama-3-8B` delivers readable LLM evaluations out of the box.
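
In practice the whole loop is the two commands quoted above; everything else (dataset handling, judging, report generation) happens behind them:

```bash
# One-time dataset preparation for the evaluation tasks
tweety preprocess --all

# Evaluate a model and write scored HTML/markdown reports with per-task breakdowns
tweety run --model Llama-3-8B
```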

Why is it gaining traction?

Unlike verbose frameworks, tweety delivers a single-command pipeline with hybrid LLM-as-judge scoring (deterministic checks plus gpt-4o-mini) and resume support for long runs. Multi-backend flexibility allows quick switches between local inference and cloud APIs, and the vision tasks set it apart from text-only benchmarks. Developers also like the standalone HTML reports, which can be shared without any serving setup.

Who should use this?

ML engineers benchmarking open models like Gemma or Qwen before deployment. Teams iterating on fine-tunes that need fast safety and reasoning checks. Researchers comparing vision capabilities across backends without writing custom scripts.

Verdict

Solid for quick, readable LLM evaluations in Python: install, preprocess, run, report. At 19 stars it's early-stage, with thin docs and no visible tests, so expect tweaks before production use. Worth forking if you need a simple, readable eval suite now.
