hkust-nlp / LOCA-bench

Benchmarking Language Agents Under Controllable and Extreme Context Growth

AI Summary

LOCA-bench is a testing playground that measures how well AI agents manage growing amounts of information across games, math, coding, and real-world tasks.

How It Works

1. 🔍 Discover LOCA-bench

You hear about a helpful tool that tests how well AI assistants handle really long chats and big piles of info without forgetting details.

2. 📥 Get it ready

Download the tool and set it up on your computer with a simple script that installs everything you need.

3. 🔗 Link your AI

Connect your favorite AI service, such as Claude, by plugging in its API key so it can join the tests.

4. 🎯 Pick your challenge

Choose a context preset, from short 8K-token tests up to super long 256K-token ones (a few novels' worth of info), to see how your AI copes.

5. ▶️ Watch it work

Hit start and see your AI tackle puzzles, games, math problems, and real tasks while the context grows huge (a minimal scripted version of this workflow is sketched after these steps).

6. 📊 See the magic

Check colorful charts and step-by-step replays showing exactly where your AI shines or struggles as the context gets longer.

🏆 Master long contexts

Now you know your AI's superpower for handling endless details, ready to build better smart helpers!
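
If you want to script steps 4-6 rather than click through them, here is a minimal sketch. The `loca run` CLI, the 8K-256K presets, and the results.json output are all mentioned in the review below; the exact flag names, preset labels, and JSON fields are assumptions, so check `loca run --help` before copying.

```python
"""Hypothetical sweep: run LOCA-bench at several context presets, then summarize.

The `loca run` command and results.json output are documented; the flags,
preset labels, and JSON shape here are assumptions for illustration.
"""
import json
import subprocess
from pathlib import Path

# Illustrative labels within the documented 8K-256K preset range.
PRESETS = ["8k", "32k", "128k", "256k"]

for preset in PRESETS:
    out_dir = Path("runs") / preset
    out_dir.mkdir(parents=True, exist_ok=True)
    # `--preset` and `--output` are assumed flag names, not verified against the CLI.
    subprocess.run(
        ["loca", "run", "--preset", preset, "--output", str(out_dir)],
        check=True,
    )

# Summarize each run from its results.json (field names assumed).
for preset in PRESETS:
    results = json.loads((Path("runs") / preset / "results.json").read_text())
    tasks = results.get("tasks", {})
    solved = sum(1 for t in tasks.values() if t.get("success"))
    print(f"{preset:>5}: {solved}/{len(tasks)} tasks solved")
```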

AI-Generated Review

What is LOCA-bench?

LOCA-bench is a Python benchmarking suite for testing language agents (LLMs paired with scaffolds such as ReAct or programmatic tool calling) under controllable context growth, up to extreme lengths of 128K+ tokens, while keeping task semantics fixed. Developers run evals via a simple CLI (`loca run`) on presets from 8K to 256K tokens across 15+ tasks, including games, math problems, code generation, QA, and real-world scenarios such as A/B-testing analysis or academic warnings handled through mock services for Google Cloud, Canvas, and email. It outputs detailed trajectories, token stats, and web visualizations for dissecting agent behavior in long-context settings.
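
To make "scaffold" concrete: a ReAct-style agent alternates model reasoning with tool calls, appending every thought and observation to a transcript that grows step by step, which is exactly the axis LOCA-bench stretches. The sketch below is a generic illustration of that loop, not LOCA-bench's actual code; `call_model` is a stub standing in for a real LLM call, and the ACTION/FINAL conventions are invented for the example.

```python
"""Generic ReAct-style loop; a stand-in for the scaffolds LOCA-bench evaluates."""
from typing import Callable

def call_model(transcript: str) -> str:
    """Stub for a real LLM call (e.g., via the Anthropic SDK).
    It issues one tool call, then answers, so the loop runs as-is."""
    if "OBSERVATION:" in transcript:
        return "FINAL: done (stubbed answer)"
    return "ACTION: search: long-context agent benchmarks"

TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: f"(stub) results for {query!r}",
}

def react_loop(task: str, max_steps: int = 20) -> str:
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        reply = call_model(transcript)   # model emits a thought plus an action
        transcript += reply + "\n"       # the transcript grows on every step
        if reply.startswith("FINAL:"):   # invented stop convention
            return reply.removeprefix("FINAL:").strip()
        if reply.startswith("ACTION:"):  # invented convention: "ACTION: tool: argument"
            name, _, arg = reply.removeprefix("ACTION:").strip().partition(":")
            tool = TOOLS.get(name.strip())
            observation = tool(arg.strip()) if tool else f"unknown tool {name!r}"
            transcript += f"OBSERVATION: {observation}\n"
    return "(no answer within the step budget)"

if __name__ == "__main__":
    print(react_loop("demo task"))
```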

Why is it gaining traction?

Unlike standard LLM benchmarks built around fixed-length inputs, LOCA-bench lets you scale context dynamically to probe where agents break down as their transcript grows, whether the bottleneck is following diversified instructions, retrieving facts buried earlier in the context, or keeping track of code, which are key pain points as models push 1M+ token windows. Built on a solid agent framework with Claude/Anthropic support and context reset strategies, it delivers an aggregated results.json plus per-task breakdowns, making it dead simple to compare models like DeepSeek or Sonnet on agentic long-context performance.
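
On "context reset strategies": a transcript that grows without bound eventually overflows the model window, so a harness needs a rule for shedding history. The sketch below shows one common pattern, keeping the task spec pinned and dropping the oldest turns once a token budget is exceeded; it is a generic illustration under a crude token heuristic, not necessarily how LOCA-bench implements it.

```python
"""Generic context-reset sketch: drop the oldest turns once a token budget is hit."""

def rough_tokens(text: str) -> int:
    # Crude heuristic (~4 characters per token); real harnesses use the model's tokenizer.
    return len(text) // 4

def reset_context(turns: list[str], budget: int = 128_000) -> list[str]:
    """Keep the first turn (the task spec) plus as many recent turns as fit."""
    if sum(rough_tokens(t) for t in turns) <= budget:
        return turns  # still under budget: nothing to do
    head = turns[0]
    kept: list[str] = []
    used = rough_tokens(head)
    for turn in reversed(turns[1:]):  # walk back from the newest turn
        cost = rough_tokens(turn)
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [head, "[earlier turns dropped]"] + kept[::-1]
```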

Who should use this?

NLP researchers benchmarking large language models and agents on extreme context growth or retrieval-augmented generation. Teams evaluating agent scaffolds on realistic, tool-heavy environments at scale. Devs tuning scaffolds for code generation or news summarization under growing inputs.

Verdict

Grab it if you're into LLM agent benchmarking: the solid CLI and structured outputs make it practical despite the low star count signaling early days. Docs are clear, with an arXiv paper, but expect beta rough edges; fork and contribute to help mature this long-context bench.

