Multimodal-Outpost is a collection of Colab notebooks designed for image inference and multimodal vision-language model (VLM) experimentation. It provides tools for OCR, image captioning, video understanding and generating DOCX or PDF documents containing both images and extracted text.
This repository contains a curated collection of notebooks for implementing state-of-the-art multimodal Vision-Language Models (VLMs).
| Notebook Name | Link ↗ |
|---|---|
| Aya-Vision-8B-VideoUnderstanding | Link |
| Behemoth-3B-070225-post.1 | Link |
| Camel-Doc-OCR-080125 | Link |
| Florence-2-Models-Image-Caption | Link |
| Gemma3-VL-VideoUnderstanding | Link |
| Imgscope-OCR-2B-0527-VideoUnderstanding | Link |
| Inkscope-Captions-2B-0526-VideoUnderstanding | Link |
| LFM2-VL-1.6B-LiquidAI | Link |
| LFM2-VL-450M-LiquidAI | Link |
| Lumian-VLR-7B-Thinking-Demo-Notebook | Link |
| Lumian2-VLR-7B-Thinking-Demo-Notebook | Link |
| Megalodon-OCR-Sync-0713-ColabNotebook | Link |
| MiMo-VL-7B-RL-VideoUnderstanding | Link |
| MiMo-VL-7B-SFT-VideoUnderstanding | Link |
| MonkeyOCR-0709 | Link |
| OCRFlux3B | Link |
| Qwen2-VL-MessyOCR-VideoUnderstanding | Link |
| Qwen2-VL-OCR-2B-Instruct | Link |
| Qwen2-VL-VideoUnderstanding | Link |
| Qwen2.5-VL-3B-Abliterated-Caption-it(caption) | Link |
| Qwen2.5-VL-3B-Instruct | Link |
| Qwen2.5-VL-7B-Abliterated-Caption-it | Link |
| Qwen2.5-VL-VideoUnderstanding | Link |
| RolmOCR-Qwen2.5-VL-VideoUnderstanding | Link |
| SmolDocling-256M-preview | Link |
| monkey-OCR | Link |
| moondream2-2025-06-21 | Link |
| nanonets-OCR | Link |
| olmOCR-Qwen2-VL-VideoUnderstanding | Link |
| typhoon-OCR | Link |
| typhoon-ocr-7b-Qwen2.5VL-VideoUnderstanding | Link |
- Extracts text from images using various OCR models
- Supports image captioning and multimodal inference
- Embeds images and extracted text into DOCX or PDF formats
- Designed for quick deployment via Google Colab
- Python
- PyTorch
- Hugging Face Transformers
- ReportLab
- Gradio (for UI)
- (Qwen2.5-VL based) / Others
All dependencies are automatically installed in the Colab environment.
Created and maintained by PRITHIVSAKTHIUR