An AI-powered video analysis system using multi-agent architecture with local small language models (SLMs) for video Q&A through Retrieval-Augmented Generation (RAG) and MCP tools integration.
- Overview
- Key Features
- Architecture
- Prerequisites
- Getting Started
- Project Structure
- Contributing
- Contact
This project implements an end-to-end, full-stack Video RAG (Retrieval-Augmented Generation) application that enables users to:
- Upload and process video files
- Extract and transcribe audio content
- Analyze video frames with vision-language models
- Ask questions about video content using natural language
- Generate comprehensive PDF reports with insights
- Perform semantic search across video content using hybrid retrieval (BM25 + dense vectors)
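The hybrid retrieval step can be sketched as score fusion over two ranked result lists. The snippet below uses Reciprocal Rank Fusion (RRF) with made-up chunk ids purely for illustration; in the real pipeline the two lists would come from BM25 (keyword) and Qdrant (dense vector) searches, and RRF is an assumed fusion method, not necessarily the one the repository implements.

```python
# Minimal sketch of hybrid retrieval via Reciprocal Rank Fusion (RRF).
# The ranked lists below are illustrative placeholders.

def rrf_fuse(rankings, k=60):
    """Merge ranked lists of document ids with RRF.

    Each document's fused score is the sum over lists of 1 / (k + rank),
    where rank is its 1-based position in that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: chunk ids ranked by BM25 and by dense similarity.
bm25_hits = ["chunk3", "chunk1", "chunk7"]
dense_hits = ["chunk1", "chunk5", "chunk3"]

fused = rrf_fuse([bm25_hits, dense_hits])
print(fused[0])  # a chunk ranked highly by both retrievers wins
```

A chunk that appears near the top of both lists outranks one that tops only a single list, which is exactly the behavior hybrid search is after.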
- Multi-format video upload support
- Automatic frame extraction at configurable intervals
- Audio extraction and transcription using Whisper
- Frame analysis with vision-language models
- Multi-agent architecture with intelligent routing
- Hybrid search (BM25 + dense vector search)
- Context-aware Q&A using RAG
- Automatic video summarization
- PDF report generation
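"Frame extraction at configurable intervals" boils down to sampling frame indices from the video's frame rate. A small sketch, with an illustrative function name and arguments (not the project's actual API); the selected indices would then be decoded with OpenCV or ffmpeg:

```python
# Sketch: choose which frame numbers to sample, one every `interval_s` seconds.

def frame_indices(fps: float, duration_s: float, interval_s: float):
    """Return frame numbers sampled every `interval_s` seconds."""
    step = max(1, round(fps * interval_s))  # clamp so step is never zero
    total = int(fps * duration_s)
    return list(range(0, total, step))

# A 10-second clip at 30 fps, sampled every 2 seconds.
print(frame_indices(30, 10, 2))  # [0, 60, 120, 180, 240]
```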
- Frontend: React + Vite + Tauri (optional desktop app)
- Backend: FastAPI + LangChain + LangGraph
- Database: PostgreSQL + Qdrant vector database
- AI Models: local Hugging Face models (SLMs)
- Tools: MCP integration for extensible capabilities
The architecture below shows the end-to-end flow of the application, from the React frontend to the FastAPI backend and its specialized local-SLM agents.
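The "intelligent routing" idea can be illustrated with a deliberately naive router that dispatches a query to a specialized agent. Everything here is a hypothetical sketch (the agent names, the keyword matching, the return strings); the actual system builds its routing graph with LangGraph.

```python
# Toy router: pick a specialized agent based on the query text.
# All names and behaviors are illustrative placeholders.

def summarize_agent(query: str) -> str:
    return "summary of the video"

def qa_agent(query: str) -> str:
    return "answer grounded in retrieved chunks"

def report_agent(query: str) -> str:
    return "PDF report generated"

ROUTES = {
    "summarize": summarize_agent,
    "report": report_agent,
}

def route(query: str) -> str:
    """Keyword-based dispatch; falls back to RAG Q&A."""
    for keyword, agent in ROUTES.items():
        if keyword in query.lower():
            return agent(query)
    return qa_agent(query)

print(route("Please summarize this lecture"))
```

In the real application the router would typically be an LLM-driven node in a LangGraph state graph rather than string matching, but the control flow is the same: classify the request, then hand it to the right specialist.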
| Software | Version | Purpose |
|---|---|---|
| Python | 3.11 | Backend runtime |
| Node.js | 18+ | Frontend development |
| PostgreSQL | 15+ | Database |
| Qdrant | Latest | Vector database |
- GPU: NVIDIA GPU with CUDA support (Compute Capability ≥ 7.0)
- VRAM: 8 GB minimum, 16 GB+ recommended
- RAM: 8 GB minimum, 16 GB recommended
- Storage: 10 GB+ free space for models
- Driver: NVIDIA Driver ≥ 530 and CUDA ≥ 12.0
*Note: This setup was tested on a RunPod VM with an RTX 4090.*
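A quick way to check the GPU prerequisites above is a small PyTorch probe (this assumes PyTorch is the runtime used for the local models, and the function name is just illustrative):

```python
# Sketch: report whether the machine meets the GPU requirements listed above.

def gpu_status() -> str:
    try:
        import torch
    except ImportError:
        return "PyTorch not installed yet"
    if not torch.cuda.is_available():
        return "CUDA not available - check NVIDIA driver >= 530 / CUDA >= 12.0"
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    if vram_gb < 8:
        return f"GPU found but only {vram_gb:.1f} GB VRAM (8 GB minimum)"
    return f"OK: {torch.cuda.get_device_name(0)}, {vram_gb:.1f} GB VRAM"

print(gpu_status())
```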
```bash
git clone https://github.com/JennyTan5522/End-to-End-Video-RAG-Understanding-with-Local-Agentic-SLM.git
```

Refer to the backend README for detailed installation instructions:
- Install Python dependencies
- Set up PostgreSQL database
- Configure Qdrant vector database
- Download required AI models
Step 1: Start the audio processing MCP server (Terminal 1)

```bash
cd backend
source env/bin/activate  # or env\Scripts\activate (Windows)
python -m web.mcp_tools.audio_extractor
```

Step 2: Start the video processing MCP server (Terminal 2)

```bash
cd backend
source env/bin/activate  # or env\Scripts\activate (Windows)
python -m web.mcp_tools.video_frames_extractor
```

Step 3: Start the FastAPI backend (Terminal 3)

```bash
cd backend
python -m venv env
source env/bin/activate  # or env\Scripts\activate (Windows)
python -m web.app
```

Services will be running on:
- FastAPI Backend: http://localhost:8000
- Audio MCP Server: http://localhost:8002
- Video MCP Server: http://localhost:8003
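Once all three terminals are running, you can verify the services are reachable with a standard-library-only probe (the endpoints match the default ports listed above; this helper is a sketch, not part of the repository):

```python
# Check that the backend and both MCP servers respond on their default ports.
import urllib.request

SERVICES = {
    "FastAPI backend": "http://localhost:8000",
    "Audio MCP server": "http://localhost:8002",
    "Video MCP server": "http://localhost:8003",
}

def is_up(url: str, timeout: float = 2.0) -> bool:
    """True if the URL accepts a connection (any HTTP response counts)."""
    try:
        urllib.request.urlopen(url, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # server responded, just with an error status
    except OSError:
        return False  # refused, unreachable, or timed out

for name, url in SERVICES.items():
    print(f"{name}: {'up' if is_up(url) else 'down'}")
```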
Refer to the frontend README for detailed installation instructions:
- Install npm dependencies
- Configure environment variables
- Start development server
```bash
cd frontend
npm run dev
```

The frontend runs on http://localhost:5173.

To access it from any device on your network at http://<your-ip>:5173, run:

```bash
npm run dev -- --host 0.0.0.0 --port 5173 --strictPort
```

```
your_project_root_folder/
├── backend/           # FastAPI backend - AI logic, multi-agent system, and API
│   └── README.md      # Backend documentation
│
├── frontend/          # React frontend - UI for chat, video upload, and processing
│   └── README.md      # Frontend documentation
│
├── sample_videos/     # Sample mp4 videos for experiments
│
└── README.md
```
For the detailed structure of each component:
- Backend: `backend/README.md`
- Frontend: `frontend/README.md`
Contributions are welcome! If you find any issues or have suggestions for improvements:
- Fork the repository
- Create a new branch (`git checkout -b feature/improvement`)
- Commit your changes (`git commit -m 'Add new feature'`)
- Push to the branch (`git push origin feature/improvement`)
- Open a Pull Request
Found a bug? Please create an issue with:
- Description of the problem
- Steps to reproduce
- Expected vs actual behavior
- Screenshots (if applicable)
Thank you for helping improve this project!
- GitHub: @JennyTan5522
- Email: [email protected]
Feel free to reach out for questions, suggestions, or collaboration opportunities!

*Note: This project is for educational and research purposes.*
Built with Vite + React + Tauri

Backend powered by FastAPI + PostgreSQL + Qdrant + local SLM multi-agents + MCP