Tencent-Hunyuan

CL-bench: A Benchmark for Context Learning

442 stars · Found Feb 03, 2026 at 112 stars
AI Summary

CL-bench is a high-quality benchmark dataset and set of evaluation tools for testing AI language models' ability to learn novel knowledge from context, across realistic tasks in various domains.

How It Works

1
🔍 Discover CL-bench

You stumble upon this smart set of tests designed to check how well AI chatbots pick up and use brand new information given right in the conversation.

2
📥 Grab the test pack

You download the bundle of ready-to-use challenges, each with questions, new facts to learn, and clear scoring rules.

3
🔗 Link your AI buddy

You connect your chosen AI service so it can chat and respond to these fresh learning tests.

4
🚀 Send tests to AI

You let your AI tackle all the challenges one by one, gathering its answers as it tries to learn and solve them.

5
⚖️ Review the answers

A reliable checker goes through each response, matching it strictly against the rules to decide if it's spot on or not.

6
📊 Get your results

You see a simple score showing how often your AI fully nailed the challenges by learning from the new info provided.
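The six steps above can be sketched as a minimal harness. This is a sketch under assumptions, not CL-bench's actual CLI: the JSONL field names `prompt` and `rubric`, and the `model_fn`/`judge_fn` hooks, are hypothetical placeholders.

```python
import json

def load_tasks(path):
    """Step 2: read ready-to-use tasks from a JSONL file (one JSON object per line)."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def run_benchmark(tasks, model_fn, judge_fn):
    """Steps 4-6: query the model on each task, judge each answer
    strictly against the task's rubric, and return the overall solving rate."""
    verdicts = []
    for task in tasks:
        answer = model_fn(task["prompt"])                   # step 4: model attempts the task
        verdicts.append(judge_fn(answer, task["rubric"]))   # step 5: strict rubric check
    return sum(verdicts) / len(verdicts) if verdicts else 0.0  # step 6: simple score
```

In a real run, `model_fn` would wrap a chat call to your OpenAI-compatible API and `judge_fn` would prompt a second model with the scoring rubric.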

AI-Generated Review

What is CL-bench?

CL-bench is a Python benchmark for evaluating language models' context learning—how well they pick up entirely new knowledge from prompts alone, like fictional rules or niche simulations absent from training data. It delivers a 1,899-task dataset in JSONL format on Hugging Face, plus CLI tools for running inference via OpenAI-compatible APIs and automatic rubric-based scoring with another LM as judge. Developers get solving rates across categories like domain reasoning or procedural execution, exposing reliance on memorization versus true adaptation.
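The per-category solving rates the review describes can be aggregated along these lines. The result-record fields `category` and `solved` are assumptions for illustration, not CL-bench's documented schema.

```python
from collections import defaultdict

def rates_by_category(results):
    """Group judged results by task category (e.g. domain reasoning,
    procedural execution) and compute each category's solving rate:
    the fraction of tasks the model fully solved."""
    totals = defaultdict(int)
    solved = defaultdict(int)
    for r in results:
        totals[r["category"]] += 1
        solved[r["category"]] += bool(r["solved"])  # True counts as 1
    return {cat: solved[cat] / totals[cat] for cat in totals}
```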

Why is it gaining traction?

The benchmark stands out from similarly named efforts by being contamination-free, with expert-crafted tasks that demand learning from context rather than recall of pre-training data. Even GPT-5.1 scores just 23.7%, making it a brutal test of real capability, while rubric-based verifiers keep multi-turn evaluation objective. The hook: plug-and-play scripts for concurrent runs against any OpenAI-compatible API, plus a public leaderboard at clbench.com, drawing developers who want honest context-learning metrics.

Who should use this?

AI researchers benchmarking LLMs for research papers, model teams at startups validating in-context learning before deployment, or RAG engineers stress-testing pipelines on procedural tasks and empirical discovery. It's ideal for anyone ditching static benchmarks for scenarios that mimic production knowledge gaps.

Verdict

Solid pick for context-learning evaluation—mature docs and the Hugging Face dataset make it dead simple to adopt, despite its early, niche status. Use it to cut through hype on context learning; skip it if you're not chasing that specific edge.
