Tapio

Tapio is a RAG (Retrieval Augmented Generation) tool for extracting, processing, and querying information from websites such as Migri.fi (the Finnish Immigration Service). It provides a complete workflow: web crawling, content parsing, vectorization, and an interactive chatbot interface.

Features

  • Multi-site support - Configurable site-specific crawling and parsing
  • End-to-end pipeline - Crawl → Parse → Vectorize → Query workflow
  • Local LLM integration - Uses Ollama for private, local inference
  • Semantic search - ChromaDB vector database for relevant content retrieval
  • Multiple interfaces - Choose between Gradio (simple) or Google ADK (production-ready)
  • Agent-based architecture - Built on Google's Agent Development Kit for scalable AI workflows
  • Flexible crawling - Configurable depth and domain restrictions
  • Comprehensive testing - Full test suite for reliability

Target Use Cases

Primary Users: EU and non-EU citizens navigating Finnish immigration processes

  • Students seeking education information
  • Workers exploring employment options
  • Families pursuing reunification
  • Refugees and asylum seekers needing guidance

Core Needs:

  • Finding relevant, accurate information quickly
  • Practicing conversations on specific topics (family reunification, work permits, etc.)

Installation and Setup

Prerequisites

  • Python 3.10 or higher
  • uv - Fast Python package installer
  • Ollama - For local LLM inference
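
You can verify each prerequisite from a shell before continuing (the version flags shown are the standard ones for these tools):

python --version    # should report 3.10 or newer
uv --version
ollama --version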

Quick Start

  1. Clone and set up:
git clone https://github.com/Finntegrate/tapio.git
cd tapio
uv sync
  2. Install required Ollama model:
ollama pull llama3.2
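
To confirm the model downloaded successfully, you can list the models available locally:

ollama list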

Usage

CLI Overview

Tapio provides a comprehensive workflow with multiple interface options:

  1. crawl - Collect HTML content from websites
  2. parse - Convert HTML to structured Markdown
  3. vectorize - Create vector embeddings for semantic search
  4. adk-server - Launch the production-ready ADK agent server (recommended)
  5. tapio-app - Launch the simple Gradio interface (legacy)

Use uv run -m tapio.cli --help to see all commands or uv run -m tapio.cli <command> --help for command-specific options.

Quick Example

Complete workflow for the Migri website:

# 1. Crawl content (uses site configuration)
uv run -m tapio.cli crawl migri --depth 2

# 2. Parse HTML to Markdown
uv run -m tapio.cli parse migri

# 3. Create vector embeddings
uv run -m tapio.cli vectorize

# 4a. Launch ADK agent server (recommended)
uv run -m tapio.cli adk-server

# 4b. OR launch simple Gradio interface
uv run -m tapio.cli tapio-app

Interface Options

Google ADK Server (Recommended)

The ADK implementation provides a production-ready agent interface:

# Start the ADK server
uv run -m tapio.cli adk-server --model-name llama3.2

# Development mode with auto-reload
uv run -m tapio.cli dev

Features:

  • Professional, web-based development UI at http://localhost:8000
  • Full REST API with documentation at http://localhost:8000/docs (see the quick check below)
  • Event tracking and debugging capabilities
  • Production-ready deployment options
  • Multi-agent workflow support
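
As a quick sanity check after starting the server, you can confirm the documented endpoints respond (curl -I fetches only the response headers):

# Verify the API docs endpoint is reachable
curl -I http://localhost:8000/docs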

Gradio Interface (Legacy)

Simple web interface for quick testing:

# Start Gradio interface
uv run -m tapio.cli tapio-app --model-name llama3.2

Note: The Gradio interface is maintained for backward compatibility, but the ADK server is recommended for new deployments.

Model Management

List available LLM models:

uv run -m tapio.cli list-models

Supported models include local Ollama models (llama3.2, mistral) and cloud models (Gemini).
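
For example, to switch to another local model, pull it with Ollama and pass it to the interface of your choice (mistral is used here purely as an illustration; any model reported by list-models should work):

# Pull an alternative local model
ollama pull mistral

# Run the ADK server against it
uv run -m tapio.cli adk-server --model-name mistral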

Available Sites

To list configured sites:

uv run -m tapio.cli list-sites

To view detailed site configurations:

uv run -m tapio.cli list-sites --verbose

Site Configurations

Site configurations define how to crawl and parse specific websites. They're stored in tapio/config/site_configs.yaml and used by both crawl and parse commands.

Configuration Structure

sites:
  migri:
    base_url: "https://migri.fi"                # Used for crawling and converting relative links
    description: "Finnish Immigration Service website"
    crawler_config:                            # Crawling behavior
      delay_between_requests: 1.0              # Seconds between requests
      max_concurrent: 3                        # Concurrent request limit
    parser_config:                              # Parser-specific configuration
      title_selector: "//title"                # XPath for page titles
      content_selectors:                       # Priority-ordered content extraction
        - '//div[@id="main-content"]'
        - "//main"
        - "//article"
        - '//div[@class="content"]'
      fallback_to_body: true                   # Use <body> if selectors fail
      markdown_config:                         # HTML-to-Markdown options
        ignore_links: false
        body_width: 0                          # No text wrapping
        protect_links: true
        unicode_snob: true
        ignore_images: false
        ignore_tables: false

Required vs Optional Fields

Required:

  • base_url - Base URL for the site (used for crawling and link resolution)

Optional (with defaults):

  • description - Human-readable description
  • parser_config - Parser-specific settings (uses defaults if omitted)
    • title_selector - Page title XPath (default: "//title")
    • content_selectors - XPath selectors for content extraction (default: ["//main", "//article", "//body"])
    • fallback_to_body - Use full body content if selectors fail (default: true)
    • markdown_config - HTML conversion settings (uses defaults if omitted)
  • crawler_config - Crawling behavior settings (uses defaults if omitted)
    • delay_between_requests - Delay between requests in seconds (default: 1.0)
    • max_concurrent - Maximum concurrent requests (default: 5)
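
Because every field except base_url has a default, a minimal entry can be as small as the following (my_minimal_site is a hypothetical name used only for illustration):

sites:
  my_minimal_site:
    base_url: "https://example.org"
    # description, crawler_config, and parser_config all fall back to the defaults listed above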

Adding New Sites

  1. Analyze the target website's structure
  2. Identify XPath selectors for content extraction
  3. Add configuration to site_configs.yaml:
sites:
  my_site:
    base_url: "https://example.com"
    description: "Example site configuration"
    parser_config:
      content_selectors:
        - '//div[@class="main-content"]'
  4. Use with commands:
uv run -m tapio.cli crawl my_site
uv run -m tapio.cli parse my_site
uv run -m tapio.cli vectorize
uv run -m tapio.cli tapio-app

Configuration

Tapio uses centralized configuration in tapio/config/settings.py:

DEFAULT_DIRS = {
    "CRAWLED_DIR": "content/crawled",   # HTML storage
    "PARSED_DIR": "content/parsed",     # Markdown storage
    "CHROMA_DIR": "chroma_db",          # Vector database
}

DEFAULT_CHROMA_COLLECTION = "tapio"     # ChromaDB collection name

Site-specific configurations are in tapio/config/site_configs.yaml and automatically handle content extraction and directory organization based on the site's domain.
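
Put together, a crawl → parse → vectorize run for the migri configuration leaves content roughly laid out like this (the per-site subdirectory naming is an assumption based on the description above, not a guaranteed layout):

content/
  crawled/
    migri/        # raw HTML collected by the crawl command
  parsed/
    migri/        # Markdown produced by the parse command
chroma_db/        # ChromaDB vector store (collection "tapio")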

Contributing

See CONTRIBUTING.md for development guidelines, code style requirements, and how to submit pull requests.

License

Licensed under the European Union Public License version 1.2. See LICENSE for details.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Brylie Christopher Oxley: 🚇 ⚠️ 📖 🐛 💼 🖋 🤔 🚧 🧑‍🏫 📆 📣 🔬 👀 💻
AkiKurvinen: 🔣 💻
ResendeTech: 💻

This project follows the all-contributors specification. Contributions of any kind welcome!
