mem-eval-suite

LoCoMo Refined: Recalibrating LoCoMo with stricter LLM judging and a cleaned dataset.

Found Apr 16, 2026 at 16 stars · Python
AI Summary

LoCoMo Refined is an enhanced benchmark for rigorously testing AI agents' ability to retain and recall details like times, events, relationships, and preferences from extended conversations.

How It Works

1. 🔍 Discover the memory test

You hear about LoCoMo Refined, a reliable way to check if your AI chat buddy remembers details from super long conversations.

2. 📥 Grab the test kit

You download the ready-made pack of realistic chat scenarios and tricky memory questions to use as your testing ground.

3. 🤖 Quiz your AI

You replay the long chats with your AI and collect its answers to all the memory questions about times, events, and preferences.

4. 📊 Score the answers

You feed your AI's answers into the smart checker, which compares them strictly to the correct ones using fair rules.

5. 📈 See the breakdown

You get a clear report with scores on accuracy, plus details on what went right or wrong in time recall, facts, and more.

6. 🎉 Boost your AI's memory

With honest insights, you tweak your chat system to remember better over endless talks, making it truly reliable.
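The steps above boil down to a score-and-summarize loop over a predictions file. A minimal sketch, assuming each JSONL line holds the model's answer, the gold answer, and a category label (the field names and exact-match scoring here are assumptions for illustration, not the repo's actual schema — the real suite uses F1/BLEU plus an LLM judge):

```python
import json
from collections import defaultdict

def score_predictions(jsonl_lines):
    """Compare predicted answers to gold answers and report per-category accuracy.

    Field names ("prediction", "answer", "category") are assumed, not the
    repo's real schema. Scoring is naive exact match for illustration only.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for line in jsonl_lines:
        rec = json.loads(line)
        cat = rec.get("category", "overall")
        totals[cat] += 1
        if rec["prediction"].strip().lower() == rec["answer"].strip().lower():
            correct[cat] += 1
    return {cat: correct[cat] / totals[cat] for cat in totals}

# Usage: score_predictions(open("predictions.jsonl", encoding="utf-8"))
```

A per-category breakdown like this is what lets the report separate time recall from fact recall, so a model that nails events but drifts on dates is visible at a glance.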

AI-Generated Review

What is LoCoMo_refined?

LoCoMo_refined recalibrates the LoCoMo benchmark on GitHub with a cleaned dataset and stricter LLM judging, delivering more trustworthy scores for long-conversation memory in agents. It probes recall of time, events, relationships, and preferences after marathon chats, using Python scripts to score predictions via lexical metrics like F1 and BLEU, plus an LLM judge tuned for human-aligned verdicts. Developers feed in predictions.jsonl and get detailed summaries, exposing flaws like time drift that looser setups miss.
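The lexical F1 mentioned above is commonly computed as token-overlap F1 (SQuAD-style). A minimal sketch of that metric, not the repo's exact implementation:

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answers (SQuAD-style).

    Precision = overlap / predicted tokens; recall = overlap / gold tokens.
    """
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("the cat sat", "the cat")` gives 0.8: precision 2/3, recall 1. Lexical scores like this are cheap but blind to paraphrase, which is exactly why the suite pairs them with an LLM judge.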

Why is it gaining traction?

It stands out by fixing the original LoCoMo's leniency—new judging demands full coverage without contradictions or extras, boosting human agreement from 44% to 86% on Qwen3-14B. The refined, cleaned LoCoMo dataset on GitHub weeds out 337 buggy samples, making scores predict real-world reliability. Quick CLI evals with original vs refined modes let you benchmark fast without setup hassle.
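The stricter judging described above (full coverage, no contradictions, no extras) can be framed as an explicit rubric in the judge prompt. A hypothetical sketch of how such a prompt might be assembled; the wording and verdict labels are assumptions, not the repo's actual prompt:

```python
def build_judge_prompt(question: str, gold: str, prediction: str) -> str:
    """Assemble a strict-judge prompt: the answer must cover every gold fact,
    contradict nothing, and add no unsupported extras.

    Rubric wording and CORRECT/WRONG labels are hypothetical illustrations.
    """
    return (
        "You are a strict grader for a long-conversation memory benchmark.\n"
        "Mark the answer CORRECT only if ALL of the following hold:\n"
        "1. It covers every fact in the gold answer (full coverage).\n"
        "2. It contradicts nothing in the gold answer.\n"
        "3. It adds no extra claims beyond the gold answer.\n"
        "Otherwise mark it WRONG.\n\n"
        f"Question: {question}\n"
        f"Gold answer: {gold}\n"
        f"Model answer: {prediction}\n"
        "Verdict (CORRECT or WRONG):"
    )
```

Encoding the three conditions as an all-or-nothing checklist is what closes the leniency gap: a partially correct or padded answer fails the rubric instead of sliding through.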

Who should use this?

Agent builders testing memory frameworks for production chats, where vague recall kills trust. LLM researchers comparing long-context systems head-to-head. Teams iterating on retrieval-augmented agents needing strict, recalibrated metrics beyond fluffy leaderboards.

Verdict

Grab it if you're serious about memory evals—a solid README and public dataset make it dead simple, even though 16 stars and a 1.0% credibility score signal early days. Skip it for casual tests; it's raw, but it punches above its weight for precision work.


