jina-ai/embedding-fingerprints

Identify which embedding model produced a vector using digit-level tokenization and a tiny transformer

Found Mar 09, 2026 at 14 stars.
AI Analysis
Python
AI Summary

A research tool that trains a small neural network to identify which text embedding model generated a given vector by analyzing patterns in its numerical values.

How It Works

1
🔍 Discover the tool

You come across a tool that figures out which AI service produced a mysterious list of numbers (an embedding) from text.

2
📝 Gather sample texts

You create a simple list of everyday sentences, like quotes or questions, to use for testing.

3
Collect number patterns

You feed your sentences to lots of different AI services and save their unique number outputs as training examples.

4
🏋️ Train the identifier

You run a quick training session where a smart little helper learns to spot the differences in each AI's number style.

5
📊 Review the progress

You check colorful charts that show how accurately it's learning to recognize each one.

6

🎉 Spot any AI's work

Now you can take any unknown numbers and instantly know which AI made them, like a detective solving a mystery!
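The "number style" idea above hinges on digit-level tokenization: each float in an embedding vector is split into its sign, decimal point, and individual digits, which become the token sequence fed to the small classifier. A minimal sketch of that step, with an illustrative function name rather than the repo's actual API:

```python
def digit_tokenize(vector, precision=6):
    """Turn a list of floats into a flat sequence of character-level tokens.

    Each float is formatted to a fixed precision, then split into one
    token per character (sign, digits, decimal point). This is a sketch
    of the general technique, not the repo's exact tokenizer.
    """
    tokens = []
    for x in vector:
        s = f"{x:+.{precision}f}"   # e.g. -0.123456 -> "-0.123456"
        tokens.extend(s)            # one token per character
    return tokens

vec = [0.12345, -0.9876]
print(digit_tokenize(vec, precision=4))
```

The resulting token sequences are what a tiny transformer can learn from: different models round, scale, and distribute their float values differently, and those habits show up in the digit stream.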

Star Growth

Grew from 14 to 16 stars.
AI-Generated Review

What is embedding-fingerprints?

This Python tool lets you feed it any embedding vector and identify which model generated it—think BGE, Jina, E5, or dozens more—using digit-level tokenization of the floats as unique fingerprints. It solves the black-box problem of tracing embeddings back to their source without metadata, via a CLI workflow: generate training data from text files, train a tiny classifier, and plot results. Check the live demo to test vectors instantly.

Why is it gaining traction?

Unlike generic classifiers, it exploits numerical quirks in embeddings (digit frequencies, value ranges) for ~86% accuracy across 68 model variants, with a sub-1M-parameter transformer that trains fast on a single GPU. Developers like the out-of-the-box support for 40+ popular models via simple flags like `--models bge-m3,jina-v3`, plus multilingual data handling. It's a clever way to fingerprint embeddings without heavy compute.
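The "numerical quirks" claim can be illustrated with a toy experiment: two fake models emitting embeddings from different distributions leave visibly different leading-digit profiles. This is a hedged sketch with made-up distributions, not the repo's method or data:

```python
import random
from collections import Counter

def leading_digit(x):
    """First significant decimal digit of |x|, as a character."""
    for ch in f"{abs(x):.6f}":
        if ch in "123456789":
            return ch
    return None  # value rounds to zero at this precision

def digit_profile(vectors):
    """Normalized histogram of leading digits across all components."""
    counts = Counter(d for vec in vectors for x in vec
                     if (d := leading_digit(x)) is not None)
    total = sum(counts.values())
    return {d: counts[d] / total for d in sorted(counts)}

random.seed(0)
# Fake "model A": components uniform in (-1, 1).
# Fake "model B": narrow Gaussian around zero.
model_a = [[random.uniform(-1, 1) for _ in range(64)] for _ in range(100)]
model_b = [[random.gauss(0, 0.02) for _ in range(64)] for _ in range(100)]
print(digit_profile(model_a))
print(digit_profile(model_b))
```

The uniform "model" spreads its leading digits fairly evenly, while the narrow Gaussian skews toward small leading digits; simple features like these, combined with the full digit sequences, give a classifier plenty of signal to separate sources.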

Who should use this?

ML engineers debugging production pipelines where embeddings from unknown sources get mixed together, researchers comparing model outputs via fingerprints, or RAG builders who need to detect and route vectors by origin (e.g., query vs. passage prefixes). Ideal for teams generating embeddings at scale who want to identify vector sources or validate digit-level consistency.

Verdict

Promising proof-of-concept for embedding forensics, but with just 13 stars and 1.0% credibility score, it's raw—docs are solid via README and blog, but expect tweaks for your models. Try it if you're experimenting; skip for mission-critical unless you train custom.


