kmad

Benchmark harness for evaluating DSPy RLMs on data analysis tasks (InfiAgent-DABench)

AI Summary

This repository is a benchmark tool for evaluating AI models on data analysis tasks, where models generate and execute Python code iteratively on CSV datasets to answer questions.

How It Works

1. 🔍 Discover the benchmark

You find this tool, which tests how well AI can solve real-world data puzzles using spreadsheets.

2. 📥 Get ready to test

You download the questions and spreadsheets, then set up your computer so everything works smoothly.

3. 🧠 Choose your AI helper

You pick a smart AI model to see how it tackles the data challenges.

4. 🚀 Launch the tests

You run tests on easy, medium, or hard questions, watching the AI think and code step by step.

5. 📊 Review the scores

You see detailed results like accuracy per difficulty level, average time per question, and which questions it nailed (a minimal scoring sketch follows this list).

6. 🔧 Improve and compare

You retry tough ones, compare different AIs, or tweak the approach to boost performance.

🎉 Unlock AI insights

You get clear scores showing your AI's data analysis superpowers, ready to share or build on.
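
To make step 5 concrete, here is a minimal, hypothetical sketch of how per-difficulty scores could be tallied from a list of result records. The record fields (question_id, difficulty, correct, seconds) are assumptions for illustration, not the harness's actual output format.

```python
from collections import defaultdict

def summarize(results):
    """Aggregate hypothetical result records into per-difficulty stats."""
    buckets = defaultdict(list)
    for record in results:
        buckets[record["difficulty"]].append(record)

    summary = {}
    for level, records in buckets.items():
        solved = [r["question_id"] for r in records if r["correct"]]
        summary[level] = {
            "accuracy": len(solved) / len(records),
            "avg_seconds": sum(r["seconds"] for r in records) / len(records),
            "solved": solved,
        }
    return summary

# Example with made-up records:
results = [
    {"question_id": 1, "difficulty": "easy", "correct": True, "seconds": 12.0},
    {"question_id": 2, "difficulty": "hard", "correct": False, "seconds": 95.3},
]
print(summarize(results))
```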

AI-Generated Review

What is dabench-rlm-eval?

This Python project is a benchmark harness for testing DSPy Recursive Language Models (RLMs) on data analysis tasks from the InfiAgent-DABench dataset. It feeds LLMs a pandas DataFrame and natural language questions—covering stats, ML pipelines, and outlier detection—letting models iteratively write and execute code in a secure REPL sandbox to produce exact answers. Developers get CLI tools to run parallel evals, compare model results, retry failures, and even auto-optimize prompts via GEPA, with automated scoring on 257 questions across easy, medium, and hard levels.
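
The core pattern behind such a harness is an iterative loop: the model proposes code, the harness runs it against the DataFrame, and the output is fed back until the model commits to an answer. The sketch below illustrates that loop in plain Python under assumed interfaces; generate_step stands in for the LLM call and a plain exec namespace stands in for the sandboxed REPL, so this is not the project's actual DSPy RLM implementation.

```python
import contextlib
import io

import pandas as pd

def run_episode(df: pd.DataFrame, question: str, generate_step, max_turns: int = 8):
    """Illustrative generate-execute loop (hypothetical interface, not the repo's API).

    generate_step(question, transcript) stands in for the LLM call; it returns
    either {"code": "..."} to run another snippet or {"answer": "..."} to finish.
    """
    namespace = {"df": df, "pd": pd}  # stand-in for the sandboxed REPL state
    transcript = []
    for _ in range(max_turns):
        step = generate_step(question, transcript)
        if "answer" in step:
            return step["answer"], transcript
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(step["code"], namespace)  # the real harness sandboxes execution
            output = buffer.getvalue()
        except Exception as exc:
            output = f"Error: {exc!r}"
        transcript.append({"code": step["code"], "output": output})
    return None, transcript  # no final answer within the turn budget
```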

Why is it gaining traction?

Unlike generic LLM benchmarks, this delivers systematic, robust evaluation of RLMs specifically for data analysis, with baselines showing Qwen 3.5 and MiniMax hitting 86% on hard tasks using minimal prompts. The GitHub Action-friendly CLI spits out JSON results for easy comparison, verbose traces for debugging, and one-command setup for Pyodide sandboxes with sklearn/scipy—making it a practical LLM benchmark harness that skips manual data wrangling.
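
As a rough illustration of comparing those JSON outputs, the snippet below loads two hypothetical result files and prints overall accuracy for each; the file names and the correct field are assumptions, not the CLI's documented schema.

```python
import json
from pathlib import Path

def load_accuracy(path):
    """Read a hypothetical results file (a JSON list of records) and compute accuracy."""
    records = json.loads(Path(path).read_text())
    return sum(1 for r in records if r.get("correct")) / len(records)

# Hypothetical per-model output files produced by separate eval runs:
for result_file in ["results_qwen.json", "results_minimax.json"]:
    print(result_file, f"{load_accuracy(result_file):.1%}")
```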

Who should use this?

DSPy users fine-tuning RLMs for data workflows, AI researchers benchmarking LLMs on real CSV analysis (like feature engineering or stats tests), or teams evaluating models like GitHub Copilot for programmatic data tasks before production.

Verdict

Grab it if you're in the DSPy/RLM niche—solid docs and user-facing scripts make it instantly usable, even though its 14 stars and 1.0% credibility score signal an early-stage project. Worth forking for custom analysis benchmarks, but watch for upstream DSPy merges.

