kmad / dabench-rlm-eval
Benchmark harness for evaluating DSPy RLMs on data analysis tasks (InfiAgent-DABench)
This repository is a benchmark tool for evaluating AI models on data analysis tasks, where models generate and execute Python code iteratively on CSV datasets to answer questions.
How It Works
You find this tool, which tests how well an AI model answers data analysis questions about CSV datasets.
You download the questions and datasets, then set up your environment.
You pick an AI model to evaluate.
You run the benchmark on easy, medium, or hard questions and watch the model reason and write code step by step.
You see detailed results: accuracy per difficulty level, average time, and which questions the model answered correctly.
You retry the hard questions, compare different models, or adjust the approach to improve performance.
You get clear scores for your model's data analysis performance, ready to share or build on.
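The reporting step above (accuracy per difficulty level, average time, and which questions were solved) can be sketched in a few lines of Python. This is only an illustrative aggregation, not the harness's own code: the record fields (`id`, `difficulty`, `correct`, `seconds`), the sample values, and the `summarize` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-question records. The field names and values are
# assumptions for illustration, not the harness's actual output schema.
results = [
    {"id": 1, "difficulty": "easy", "correct": True, "seconds": 4.0},
    {"id": 2, "difficulty": "easy", "correct": False, "seconds": 6.0},
    {"id": 3, "difficulty": "medium", "correct": True, "seconds": 10.0},
    {"id": 4, "difficulty": "hard", "correct": False, "seconds": 20.0},
]


def summarize(records):
    """Group records by difficulty and compute per-level statistics."""
    by_level = defaultdict(list)
    for record in records:
        by_level[record["difficulty"]].append(record)

    summary = {}
    for level, rows in by_level.items():
        summary[level] = {
            # Fraction of questions answered correctly at this level.
            "accuracy": sum(r["correct"] for r in rows) / len(rows),
            # Mean wall-clock time per question at this level.
            "avg_seconds": mean(r["seconds"] for r in rows),
            # IDs of the questions the model answered correctly.
            "solved_ids": [r["id"] for r in rows if r["correct"]],
        }
    return summary


summary = summarize(results)
for level, stats in summary.items():
    print(level, stats)
```

Grouping by difficulty before averaging keeps a large pool of easy questions from hiding weak performance on the hard ones.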