jelllott

Speech-aware KV cache pruning for long-form speech LLMs (Qwen2-Audio, SALMONN). Token/head/chunk-level pruners + eval on LibriSpeech-long & GigaSpeech.

89% credibility
Found May 25, 2026 at 26 stars.
Python
AI Summary

Hush KV is a research tool that helps speech AI systems handle long audio recordings more efficiently. When processing audio longer than about 30 seconds, these AI systems normally run out of memory. Hush KV solves this by intelligently deciding which parts of the AI's internal memory to keep and which to discard—similar to how you might take notes instead of remembering every word. The tool offers multiple strategies: some focus on keeping recent information, others score each piece by how important it seems, and one uses a trained helper to distinguish actual speech from silence or filler words. Users can connect different speech models (Qwen2-Audio, SALMONN, Whisper) and test different trimming approaches to find what works best for their needs. The project includes evaluation tools to measure whether trimming hurts accuracy on tasks like transcription and spoken question answering.

How It Works

1. 🎧 You have a long audio file

You need to transcribe or analyze a recording that's several minutes long, but most speech AI tools struggle with anything beyond 30 seconds.

2. 📦 You install the tool

You download and set up Hush KV, a tool that helps speech AI work efficiently with long recordings by intelligently trimming unnecessary data.

3. 🤖 You pick a speech model

You choose from available speech models like Qwen2-Audio, SALMONN, or Whisper depending on whether you need transcription only or full conversation understanding.

4. You choose how to trim the data

Time-based trimming

Keep recent tokens plus important anchor points like sentence starts and silence boundaries

🎯 Importance-based trimming

Let the AI score each piece by how much attention it receives and keep the most important ones

🧠 Smart saliency trimming

Use a trained helper that identifies which audio parts contain actual speech versus silence or filler words

5. ▶️ You run the tool on your audio

With one command, you process your long recording and watch as the tool intelligently manages the AI's memory while preserving accuracy.

You get your results

The AI produces accurate transcription or answers while using far less memory than it would have needed without trimming.
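
To make the first two trimming options in step 4 concrete, here is a minimal Python sketch. It is not taken from the repo: the function names, the `window` and `budget` parameters, and the idea of passing anchor positions as a plain set are all assumptions made for illustration. It only shows which cache positions each strategy would keep.

```python
from typing import List, Sequence, Set


def keep_recent_plus_anchors(num_tokens: int, window: int, anchors: Set[int]) -> List[int]:
    """Time-based trimming (hypothetical helper, not the repo's API).

    Keeps the last `window` positions plus any anchor positions
    (e.g. sentence starts or silence boundaries) that fall inside the cache.
    """
    recent_start = max(0, num_tokens - window)
    keep = set(range(recent_start, num_tokens))
    keep.update(a for a in anchors if 0 <= a < num_tokens)
    return sorted(keep)


def keep_top_attention(attention_mass: Sequence[float], budget: int) -> List[int]:
    """Importance-based trimming (hypothetical helper, not the repo's API).

    Keeps the `budget` positions that received the most attention and
    returns them in their original temporal order.
    """
    if budget <= 0:
        return []
    ranked = sorted(
        range(len(attention_mass)),
        key=lambda i: attention_mass[i],
        reverse=True,
    )
    return sorted(ranked[:budget])


if __name__ == "__main__":
    # 10 cached positions, keep the last 3 plus anchors at 0 and 4.
    print(keep_recent_plus_anchors(10, 3, {0, 4}))
    # Keep the 2 positions with the highest attention mass.
    print(keep_top_attention([0.1, 0.5, 0.05, 0.3, 0.05], 2))
```

The third option (saliency trimming) would plug in the same way: a trained scorer would supply the per-position values in place of `attention_mass`.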

AI-Generated Review

What is speechkv-trim?

speechkv-trim is a Python library that targets a real pain point with speech LLMs: their KV caches outgrow available memory once you push past about 30 seconds of audio. The project implements "Hush KV", a family of pruning strategies that drop the least useful entries from the attention cache without tanking your ASR or spoken QA accuracy. It ships with five pruner strategies spanning token-level, head-level, and chunk-level approaches, and works with Qwen2-Audio and SALMONN backbones. You run it via a CLI that lets you prune a single audio file or batch-evaluate across a manifest. The eval harness covers LibriSpeech-long, GigaSpeech, and an in-house spoken QA dataset.

Why is it gaining traction?

The hook is straightforward: speech tokenization is dense (at a 50 Hz frame rate, one minute of audio is 3,000 tokens before your text prompt even starts), and naive sliding-window eviction destroys important "anchor" frames at silence and sentence boundaries. This project scores cache entries using a weighted mix of attention recency and acoustic saliency, then evicts per layer against a configurable budget. The results are reproducible via shell scripts, and the CLI makes it simple to swap pruners and budgets. It also supports a streaming evictor for incremental autoregressive decode, which is where the real memory pressure hits.
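
The token arithmetic and the weighted scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: the names `audio_token_count`, `evict_to_budget`, and `evict_per_layer`, the mixing weight `alpha`, and the assumption that both scores are already normalized to [0, 1] are all hypothetical.

```python
from typing import List, Sequence, Tuple


def audio_token_count(seconds: float, frame_rate_hz: float = 50.0) -> int:
    """Number of audio tokens produced at a fixed frame rate."""
    return int(seconds * frame_rate_hz)


def evict_to_budget(
    recency: Sequence[float],
    saliency: Sequence[float],
    budget: int,
    alpha: float = 0.5,
) -> List[int]:
    """Keep the `budget` best positions under a weighted score.

    score = alpha * recency + (1 - alpha) * saliency
    `alpha` is an assumed mixing weight; inputs are assumed to lie in [0, 1].
    """
    if len(recency) != len(saliency):
        raise ValueError("recency and saliency must have the same length")
    scores = [alpha * r + (1.0 - alpha) * s for r, s in zip(recency, saliency)]
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return sorted(ranked[: max(0, budget)])


def evict_per_layer(
    layers: Sequence[Tuple[Sequence[float], Sequence[float]]],
    budget: int,
    alpha: float = 0.5,
) -> List[List[int]]:
    """Apply the same budget independently to each layer's scores."""
    return [evict_to_budget(r, s, budget, alpha) for r, s in layers]


if __name__ == "__main__":
    print(audio_token_count(60))  # one minute at 50 Hz
    # Position 0 is old but acoustically salient; position 3 is the newest.
    recency = [0.0, 0.25, 0.5, 1.0]
    saliency = [1.0, 0.0, 0.0, 0.0]
    print(evict_to_budget(recency, saliency, budget=2))
```

In the toy example, a pure sliding window of size 2 would keep positions 2 and 3 and lose the salient frame at position 0; the mixed score keeps it.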

Who should use this?

If you're building applications with long-form speech LLMs and running into OOM errors or latency spikes, this is worth a look. Researchers working on speech model efficiency will find the eval harness and paper draft useful for benchmarking. Teams deploying speech models in production who need to cut memory usage without retraining will benefit most. If you're just doing short utterances, the overhead probably isn't worth it -- but for anything multi-minute, the tradeoffs are compelling.

Verdict

The 89% credibility score reflects a small but active project with solid fundamentals: clean CLI, entry-point-based plugin system for pruners, and reproducible eval scripts. At 19 stars it's early-stage, and the paper is still in draft form, so treat it as research code with production potential rather than a finished tool. If you're working in this space, it's worth bookmarking: the approach is sound and the implementation is well-organized.
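
For readers unfamiliar with entry-point-based plugins, the following Python sketch shows how such a system typically discovers registered implementations through the standard library's `importlib.metadata`. The group name `speechkv_trim.pruners` is a made-up placeholder; the review does not state the group the project actually uses.

```python
from importlib.metadata import entry_points


def discover_pruners(group: str = "speechkv_trim.pruners") -> dict:
    """Return {name: entry point} for plugins registered under `group`.

    The default group name is a hypothetical placeholder, not the
    project's documented value.
    """
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+: selectable entry points.
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: entry_points() returns a dict keyed by group.
        selected = eps.get(group, [])
    return {ep.name: ep for ep in selected}


if __name__ == "__main__":
    # Prints the names of whatever pruners are installed under the group;
    # each entry point can be turned into an object with ep.load().
    print(sorted(discover_pruners()))
```

A third-party package would register a pruner by declaring an entry point under the same group in its packaging metadata, which is what lets new strategies be added without touching the core library.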
