Multimodal-Outpost-Notebooks

Multimodal-Outpost is a collection of Colab notebooks for image inference and multimodal vision-language model (VLM) experimentation. It provides tools for OCR, image captioning, video understanding, and generating DOCX or PDF documents that contain both images and extracted text.

Notebooks List 📘

This repository contains a curated collection of notebooks for implementing state-of-the-art multimodal Vision-Language Models (VLMs).

Notebook Name	Link ↗
Aya-Vision-8B-VideoUnderstanding	Link
Behemoth-3B-070225-post0.1	Link
Camel-Doc-OCR-080125	Link
Florence-2-Models-Image-Caption	Link
Gemma3-VL-VideoUnderstanding	Link
Imgscope-OCR-2B-0527-VideoUnderstanding	Link
Inkscope-Captions-2B-0526-VideoUnderstanding	Link
LFM2-VL-1.6B-LiquidAI	Link
LFM2-VL-450M-LiquidAI	Link
Lumian-VLR-7B-Thinking-Demo-Notebook	Link
Lumian2-VLR-7B-Thinking-Demo-Notebook	Link
Megalodon-OCR-Sync-0713-ColabNotebook	Link
MiMo-VL-7B-RL-VideoUnderstanding	Link
MiMo-VL-7B-SFT-VideoUnderstanding	Link
MonkeyOCR-0709	Link
OCRFlux3B	Link
Qwen2-VL-MessyOCR-VideoUnderstanding	Link
Qwen2-VL-OCR-2B-Instruct	Link
Qwen2-VL-VideoUnderstanding	Link
Qwen2.5-VL-3B-Abliterated-Caption-it(caption)	Link
Qwen2.5-VL-3B-Instruct	Link
Qwen2.5-VL-7B-Abliterated-Caption-it	Link
Qwen2.5-VL-VideoUnderstanding	Link
RolmOCR-Qwen2.5-VL-VideoUnderstanding	Link
SmolDocling-256M-preview	Link
monkey-OCR	Link
moondream2-2025-06-21	Link
nanonets-OCR	Link
olmOCR-Qwen2-VL-VideoUnderstanding	Link
typhoon-OCR	Link
typhoon-ocr-7b-Qwen2.5VL-VideoUnderstanding	Link

Features

Extracts text from images using various OCR models
Supports image captioning and multimodal inference
Embeds images and extracted text into DOCX or PDF formats
Designed for quick deployment via Google Colab
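
The document-export feature can be sketched roughly as follows. This is a minimal illustration, not the notebooks' actual code: it assumes ReportLab's canvas API is used for PDF output, and the helper names `wrap_for_page` and `build_pdf`, the A4 page size, and the layout constants are all hypothetical choices made for the example.

```python
import textwrap


def wrap_for_page(text, max_chars=90):
    """Wrap extracted text into lines short enough to draw one per PDF row."""
    lines = []
    for paragraph in text.splitlines():
        # textwrap.wrap returns [] for blank input, so keep blank lines explicitly.
        lines.extend(textwrap.wrap(paragraph, width=max_chars) or [""])
    return lines


def build_pdf(image_path, text, out_path):
    """Draw the image at the top of an A4 page, then the extracted text below it."""
    # Imported lazily so the wrapping helper works without ReportLab installed.
    from reportlab.lib.pagesizes import A4
    from reportlab.pdfgen import canvas

    page_w, page_h = A4
    margin, line_h, img_h = 50, 14, 300  # layout values are assumptions
    pdf = canvas.Canvas(out_path, pagesize=A4)

    # Fit the image into a fixed-height box under the top margin.
    pdf.drawImage(image_path, margin, page_h - margin - img_h,
                  width=page_w - 2 * margin, height=img_h,
                  preserveAspectRatio=True, anchor="n")

    y = page_h - margin - img_h - 2 * line_h
    for line in wrap_for_page(text):
        if y < margin:  # start a new page when the current one is full
            pdf.showPage()
            y = page_h - margin
        pdf.drawString(margin, y, line)
        y -= line_h
    pdf.save()
```

Calling `build_pdf("page.png", ocr_text, "result.pdf")` with the text returned by any OCR model would produce a PDF that pairs the source image with its extracted text; the file names here are placeholders.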

Dependencies

Python
PyTorch
Hugging Face Transformers
ReportLab
Gradio (for UI)
Model-specific packages (for example, for the Qwen2.5-VL based notebooks) and others

All dependencies are automatically installed in the Colab environment.
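
When running a notebook outside Colab, it can help to confirm that the packages listed above are importable before loading a model. The sketch below uses only the standard library; the module names are the usual import names for PyTorch, Transformers, ReportLab, and Gradio, and the helper `check_dependencies` is a hypothetical name introduced for this example.

```python
import importlib.util

# Usual import names for the listed packages (an assumption about what
# a given notebook actually needs).
DEFAULT_MODULES = ("torch", "transformers", "reportlab", "gradio")


def check_dependencies(modules=DEFAULT_MODULES):
    """Map each top-level module name to True if it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}
```

Printing the result of `check_dependencies()` shows which packages are present, so any missing ones can be installed with `pip` before running the remaining cells.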

Author

Created and maintained by PRITHIVSAKTHIUR

Folders and files (96 commits)

Apple-FastVLM-0.5B-Live-Cam
Apple-FastVLM-0.5B
Apple-FastVLM-1.5B
Aya-Vision-8B-VideoUnderstanding
Behemoth-3B-070225-post0.1
Behemoth-3B-070225-post0.1_Traffic_Analysis
Camel-Doc-OCR-080125
Camel-Doc-OCR-Multi-Image-4bit
Caption3o-XL-2B-Qwen2VL
DeepCaption-VLA-7B[4bit - notebook demo]
DeepCaption_VLA_V2_0_7B
Dots.OCR-Notebook
Florence-2-Models-Image-Caption
Gemma3-VL-VideoUnderstanding
Gliese-OCR-7B-Post1.0(4-bit)-reportlab
Gliese-OCR-7B-Post1.0-(4bit)-notebook
Holo1.5-3B
Imgscope-OCR-2B-0527--VideoUnderstanding
Inkscope-Captions-2B-0526-VideoUnderstanding
InternVL-3.5-Notebook
LFM2-VL-1.6B-LiquidAI
LFM2-VL-450M-LiquidAI
LiquidAI-LFM2-VL-Live-Cam
Logics-Parsing-4bit
Lumian-VLR-7B-Thinking-Demo-Notebook
Lumian2-VLR-7B-Thinking(4bit)
Lumian2-VLR-7B-Thinking-Demo-Notebook
Megalodon-OCR-Sync-0713-ColabNotebook
MiMo-VL-7B-RL-VideoUnderstanding
MiMo-VL-7B-SFT-VideoUnderstanding
Microsoft-Kosmos-2.5-Demo
MinerU2.5-2509-1.2B
MonkeyOCR-0709
OCRFlux3B
Perseus_Doc_VL_0712
Qwen-2VL-MessyOCR-VideoUnderstanding
Qwen2-VL-2B-Abliterated-Caption-it
Qwen2-VL-OCR-2B-Instruct
Qwen2-VL-VideoUnderstanding
Qwen2.5-VL-3B-Abliterated-Caption-it(caption)
Qwen2.5-VL-3B-Instruct
Qwen2.5-VL-7B-Abliterated-Caption-it
Qwen2.5-VL-VideoUnderstanding
Qwen3-VL-4B-Instruct-abliterated
Qwen3_VL_4B_Thinking_abliterated
R-4B-Multimodal-Demo
RolmOCR-Qwen2.5-VL-VideoUnderstanding
SmolDocling-256M-preview
deepattricap-vla-3b-colab-notebook-demo
monkey-OCR
moondream2 -2025-06-21
nanonets-1.5b
nanonets-OCR
olmOCR-Qwen2-VL-VideoUnderstanding
tencent-POINTS-Reader
typhoon-OCR
typhoon-ocr-7b-Qwen2.5VL-VideoUnderstanding
LICENSE
README.md
