This repository contains the complete code and tutorials for implementing a multimodal retrieval-augmented generation (RAG) system capable of processing, storing, and retrieving video content. The system uses BridgeTower for multimodal embeddings, LanceDB as the vector store, and Pixtral as the conversation LLM.
To install the necessary dependencies, run the following command:

`pip install -r requirements.txt`
- `mm_rag.ipynb`: Complete end-to-end implementation of the multimodal RAG system
- `embedding_creation.ipynb`: Deep dive into generating multimodal embeddings using BridgeTower
- `vector_store.ipynb`: Detailed guide on setting up and populating LanceDB for vector storage (a short end-to-end ingestion sketch follows this list)
- `preprocessing_video.ipynb`: Comprehensive coverage of video preprocessing techniques, including:
  - Frame extraction
  - Transcript processing
  - Handling videos without transcripts
  - Transcript optimization strategies
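As a quick orientation before opening those notebooks, the sketch below walks one segment through the full ingestion path: extract a frame with OpenCV, embed the frame/transcript pair with BridgeTower via Hugging Face `transformers`, and write the result to LanceDB. This is a hedged illustration, not the notebooks' exact code: the checkpoint, the use of the joint `cross_embeds` output, the table schema, and all file names, captions, and timestamps are assumptions.

```python
import cv2
import lancedb
import torch
from PIL import Image
from transformers import BridgeTowerProcessor, BridgeTowerForContrastiveLearning

# --- 1. Frame extraction (see preprocessing_video.ipynb) ---------------------
# Grab one frame near the midpoint of a transcript segment; the file name and
# timestamp below are hypothetical.
cap = cv2.VideoCapture("space_expedition.mp4")
cap.set(cv2.CAP_PROP_POS_MSEC, 12_000)          # seek to 12 s into the video
ok, frame_bgr = cap.read()
cap.release()
frame = Image.fromarray(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
frame.save("frame_0001.jpg")

# --- 2. Multimodal embedding (see embedding_creation.ipynb) ------------------
# Checkpoint choice and the use of the joint cross-modal embedding are assumptions.
ckpt = "BridgeTower/bridgetower-large-itm-mlm-itc"
processor = BridgeTowerProcessor.from_pretrained(ckpt)
model = BridgeTowerForContrastiveLearning.from_pretrained(ckpt)

caption = "The crew prepares for the spacewalk."  # transcript chunk overlapping the frame
inputs = processor(images=frame, text=caption, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
vector = outputs.cross_embeds[0].numpy().tolist()

# --- 3. Vector storage (see vector_store.ipynb) ------------------------------
# Store the vector alongside the metadata needed at answer time.
db = lancedb.connect("lancedb_store")
table = db.create_table(
    "video_segments",
    data=[{
        "vector": vector,
        "frame_path": "frame_0001.jpg",
        "transcript": caption,
        "start_time": 12.0,   # seconds into the video
    }],
)
print(table.count_rows())
```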
You'll need to set up the following API key:
- `MISTRAL_API_KEY` for Pixtral model access
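For reference, the sketch below shows roughly how the key and the `mistralai` Python client could be used to send one retrieved frame plus its transcript chunk to Pixtral. The model identifier, file path, and prompt layout are assumptions; the notebooks may structure this differently.

```python
import base64
import os

from mistralai import Mistral

# The key is read from the environment, e.g. after `export MISTRAL_API_KEY=...`
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# Encode one retrieved frame (hypothetical path) for the vision prompt.
with open("frame_0001.jpg", "rb") as f:
    frame_b64 = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",  # assumed Pixtral model identifier
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Transcript: The crew prepares for the spacewalk.\n"
                     "Question: What is happening in this scene?"},
            {"type": "image_url",
             "image_url": f"data:image/jpeg;base64,{frame_b64}"},
        ],
    }],
)
print(response.choices[0].message.content)
```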
The tutorial uses a sample video about a space expedition. You can replace it with any video of your choice, but make sure to do one of the following:
- Include a transcript file (`.vtt` format)
- Generate transcripts using Whisper (see the sketch after this list)
- Use vision language models for caption generation
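For the Whisper route, a minimal sketch of producing a `.vtt` transcript could look like the following (the `openai-whisper` package, the file names, and the hand-rolled WebVTT formatting are assumptions; `preprocessing_video.ipynb` covers this step in detail):

```python
import whisper

def to_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT HH:MM:SS.mmm timestamp."""
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{int(hours):02d}:{int(minutes):02d}:{secs:06.3f}"

# Transcribe the video (Whisper extracts the audio track via ffmpeg).
model = whisper.load_model("base")
result = model.transcribe("space_expedition.mp4")   # hypothetical file name

# Write the timed segments out in WebVTT format.
with open("space_expedition.vtt", "w") as f:
    f.write("WEBVTT\n\n")
    for segment in result["segments"]:
        f.write(f"{to_timestamp(segment['start'])} --> {to_timestamp(segment['end'])}\n")
        f.write(segment["text"].strip() + "\n\n")
```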
Contributions are welcome! Some areas for improvement include:
- Adding chat history support
- Prompt engineering refinements
- Alternative retrieval strategies
- Testing different VLMs and embedding models
To contribute:
- Fork the repository.
- Create a new branch (`git checkout -b feature-branch`).
- Commit your changes (`git commit -am 'Add new feature'`).
- Push to the branch (`git push origin feature-branch`).
- Create a new Pull Request.
This project is licensed under the MIT License. See the LICENSE file for details.