laoposkj / multi-modal-agent-ts
TypeScript multimodal AI agent: GPT-4o / Claude / Gemini + Whisper + Ollama (LLaVA). REST API, streaming, Docker. Vision, audio & text in one flow.
A TypeScript library and server for creating AI agents that process and respond to combined text, image, and audio inputs using various multimodal AI models.
How It Works
You hear about this cool tool that lets an AI understand pictures, voice notes, and text questions all in one go.
Download the files and set them up so the agent is ready to use right away.
Link it to a model provider, either a cloud AI such as GPT-4o, Claude, or Gemini, or a local one running on your machine through Ollama (LLaVA).
Type a question with a photo or voice clip and watch the answer stream back.
Start the server so it accepts inputs from anywhere through its REST API.
Use it as a library inside your own program to add multimodal understanding.
Share images from files or the web, audio recordings, or typed text that you want analyzed.
It transcribes your voice, describes the images, blends with your question, and streams back clever insights.
You get a complete, helpful answer that ties together sight, sound, and text effortlessly.
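The flow described in the steps above (transcribe the audio, describe the images, blend both with the question, stream the answer) can be sketched roughly as follows. This is a minimal illustration, not the repository's actual API: `AgentInput`, `AgentDeps`, `buildPrompt`, and `runAgent` are hypothetical names, and the three model calls are passed in as plain functions so that any provider could be plugged in.

```typescript
// Hypothetical shapes for illustration only; the real library may differ.
interface AgentInput {
  text: string;          // the typed question
  imageUrls?: string[];  // images from files or the web
  audioClips?: string[]; // paths or URLs of voice recordings
}

// Stand-ins for provider calls (assumed: a speech-to-text model,
// a vision model, and a chat model that yields text chunks).
interface AgentDeps {
  transcribe: (clip: string) => Promise<string>;
  describeImage: (url: string) => Promise<string>;
  chat: (prompt: string) => AsyncIterable<string>;
}

// Blend image descriptions, transcripts, and the question into one prompt.
function buildPrompt(
  question: string,
  transcripts: string[],
  descriptions: string[],
): string {
  const parts: string[] = [];
  descriptions.forEach((d, i) => parts.push(`Image ${i + 1}: ${d}`));
  transcripts.forEach((t, i) => parts.push(`Audio ${i + 1}: ${t}`));
  parts.push(`Question: ${question}`);
  return parts.join("\n");
}

// Run the whole flow and yield the answer chunk by chunk.
async function* runAgent(
  input: AgentInput,
  deps: AgentDeps,
): AsyncGenerator<string> {
  // Audio and images are independent, so process them concurrently.
  const [transcripts, descriptions] = await Promise.all([
    Promise.all((input.audioClips ?? []).map((c) => deps.transcribe(c))),
    Promise.all((input.imageUrls ?? []).map((u) => deps.describeImage(u))),
  ]);
  const prompt = buildPrompt(input.text, transcripts, descriptions);
  for await (const chunk of deps.chat(prompt)) {
    yield chunk;
  }
}

// Usage with stubbed models, so the sketch runs without any API keys.
async function demo(): Promise<void> {
  const deps: AgentDeps = {
    transcribe: async (clip) => `transcript of ${clip}`,
    describeImage: async (url) => `description of ${url}`,
    chat: async function* (prompt) {
      // A real chat model would stream tokens; this stub echoes the prompt.
      for (const line of prompt.split("\n")) yield line + "\n";
    },
  };
  let answer = "";
  const input = { text: "What is shown here?", imageUrls: ["photo.jpg"] };
  for await (const chunk of runAgent(input, deps)) answer += chunk;
  console.log(answer);
}

demo().catch(console.error);
```

Passing the model calls in as functions keeps the sketch independent of any one provider, which matches the idea of switching between a cloud model and a local one.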