Vectorize Iris

Vectorize Iris is a model-based text extraction tool for RAG systems. It combines extraction and chunking into a single step, so you get clean, usable text from PDFs and other complex documents without a separate chunking pass.

Documentation: docs.vectorize.io

Why Iris?

Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to understand document structure and context, delivering:

  • 📄 Universal format support - Works with all unstructured document types (PDFs, images, scans, and more)
  • High accuracy - Handles poor quality scans and complex layouts
  • 📊 Structure preservation - Maintains tables, lists, and formatting
  • 🎯 Smart chunking - Semantic splitting for RAG pipelines
  • 🔍 Metadata extraction - Extract specific fields using natural language
  • 🚀 Simple API - One function call to extract text
  • Parallel processing - Process multiple documents simultaneously
  • 🌐 URL support - Extract directly from HTTP/HTTPS URLs
  • 📂 Batch processing - Process entire directories automatically
  • 🔧 Multiple formats - Output as JSON, YAML, or plain text
  • 🪶 Lightweight - Single binary CLI with no dependencies
  • ☁️ Cloud-native - Serverless-ready APIs
  • 🌍 Multi-lingual - 100+ languages including Hindi, Arabic, Chinese
  • 🔌 Multi-platform - Python, Node.js, and CLI support

Quick Start

Choose your preferred tool:

🐍 Python API

from vectorize_iris import extract_text_from_file

result = extract_text_from_file('document.pdf')
print(result.text)

→ See Python examples

📦 Node.js/TypeScript API

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('document.pdf');
console.log(result.text);

→ See Node.js examples

⚡ CLI

vectorize-iris document.pdf

Installation

CLI:

curl -fsSL https://raw.githubusercontent.com/vectorize-io/vectorize-iris/refs/heads/main/install.sh | sh

Python:

pip install vectorize-iris

Node.js:

npm install @vectorize-io/iris

Features

Basic Text Extraction

Extract clean, structured text from any document format.

Smart Chunking

Split documents into semantic chunks perfect for RAG pipelines:

  • Markdown-aware chunking
  • Configurable chunk sizes
  • Preserves context across chunks
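
To make the idea of Markdown-aware chunking concrete, here is a minimal local sketch. It is not Iris's implementation, which this README does not describe. The function name `chunk_markdown` is hypothetical, and measuring `chunk_size` in characters is an assumption. It splits at headings first, then packs whole paragraphs into chunks up to the size limit. It does not add overlap between chunks, and a single paragraph longer than the limit is kept as one oversized chunk.

```python
import re


def chunk_markdown(text: str, chunk_size: int = 512) -> list[str]:
    """Split Markdown into chunks of roughly chunk_size characters.

    Illustrative only: headings start a new chunk, paragraphs are never cut.
    """
    # Zero-width split before each heading so a heading stays with its section.
    sections = re.split(r"(?m)^(?=#{1,6} )", text)
    chunks: list[str] = []
    for section in sections:
        current = ""
        # Within a section, pack whole paragraphs until the limit is reached.
        for para in section.split("\n\n"):
            para = para.strip()
            if not para:
                continue
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > chunk_size:
                chunks.append(current)
                current = para
            else:
                current = candidate
        if current:
            chunks.append(current)
    return chunks


if __name__ == "__main__":
    doc = "# A\n\nfirst para\n\nsecond para\n\n## B\n\nthird"
    for chunk in chunk_markdown(doc, chunk_size=20):
        print(repr(chunk))
```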

Metadata Extraction

Extract structured data using JSON schemas (OpenAPI spec format recommended):

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file('invoice.pdf', options=ExtractionOptions(
    metadata_schemas=[{
        'id': 'invoice-data',
        'schema': {
            'invoice_number': 'string',
            'date': 'string',
            'total_amount': 'number',
            'vendor_name': 'string'
        }
    }]
))
# Returns structured JSON metadata

Parsing Instructions

Guide the extraction with custom instructions:

from vectorize_iris import extract_text_from_file, ExtractionOptions

result = extract_text_from_file('document.pdf', options=ExtractionOptions(
    parsing_instructions='Focus on extracting tables and ignore headers/footers'
))

CLI Examples

Basic Extraction

By default, the CLI prints formatted terminal output with progress indicators:

vectorize-iris document.pdf

Output:

✨ Vectorize Iris Extraction
──────────────────────────────────────────────────

✓ Upload prepared
✓ File uploaded successfully
✓ Extraction started
✓ Extraction completed in 7s

─────────────────────────────────────────────────────────
📄 Extracted Text
─────────────────────────────────────────────────────────

Stats: 5536 chars • 1245 words • 89 lines

This is the extracted text from your PDF document.
All formatting and structure is preserved.

Tables, lists, and other elements are properly extracted.

Extract from URL

Download and extract files directly from HTTP/HTTPS URLs:

vectorize-iris https://arxiv.org/pdf/2206.01062

JSON Output (for piping)

vectorize-iris document.pdf -o json

Output:

{
  "success": true,
  "text": "This is the extracted text from your PDF document...",
  "chunks": null,
  "metadata": null
}

Pipe to jq:

vectorize-iris document.pdf -o json | jq -r '.text' > output.txt
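
The same JSON can be consumed from Python instead of jq. The sketch below only uses the standard library and the field names shown in the sample output above. The helper name `extract_text_field` is hypothetical, and treating a falsy `success` value as a failure is an assumption, since this README only shows a successful response.

```python
import json


def extract_text_field(raw_json: str) -> str:
    """Return the `text` field from `vectorize-iris ... -o json` output."""
    payload = json.loads(raw_json)
    # Assumption: a failed extraction reports "success": false.
    if not payload.get("success"):
        raise RuntimeError("extraction did not report success")
    return payload["text"]


if __name__ == "__main__":
    sample = '{"success": true, "text": "Hello", "chunks": null, "metadata": null}'
    print(extract_text_field(sample))
```

In a script, `raw_json` would typically be the captured standard output of the CLI command above.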

Plain Text Output

Get only the extracted text:

vectorize-iris document.pdf -o text

Redirect to a file:

vectorize-iris document.pdf -o text > output.txt

Save to File

Use -f to save output directly:

vectorize-iris document.pdf -o json -f output.json

Output:

✨ Vectorize Iris Extraction
──────────────────────────────────────────────────

✓ Upload prepared
✓ File uploaded successfully
✓ Extraction started
✓ Extraction completed in 7s
✓ Output written to output.json

Process Directory

Process all files in a directory automatically:

vectorize-iris ./documents -f ./output

Output:

📦 Processing Directory
──────────────────────────────────────────────────

💡 Found 5 files to process

⚙️  Processing 1/5 - report-q1.pdf
✨ Vectorize Iris Extraction
──────────────────────────────────────────────────
✓ Upload prepared
✓ File uploaded successfully
✓ Extraction started
✓ Extraction completed in 8s
✓ Output written to output/report-q1.txt

⚙️  Processing 2/5 - report-q2.pdf
...

──────────────────────────────────────────────────
✨ Batch Processing Complete

  ✓ Successful: 5

With custom output format:

# Extract every file in ./documents to JSON
vectorize-iris ./documents -o json -f ./output

# Extract all files to plain text
vectorize-iris ./scans -o text -f ./extracted

Chunking for RAG

vectorize-iris long-document.pdf --chunk-size 512

Splits documents at semantic boundaries, perfect for RAG pipelines.

Custom Parsing Instructions

vectorize-iris report.pdf --parsing-instructions "Extract only tables and numerical data, ignore narrative text"

Document Classification

Pass multiple metadata schemas and Iris will automatically classify which schema matches best:

vectorize-iris invoice.pdf \
  --metadata-schema 'invoice:{"invoice_number":"string","date":"string","total_amount":"number","vendor":"string"}' \
  --metadata-schema 'receipt:{"store_name":"string","date":"string","items":"array","total":"number"}' \
  --metadata-schema 'contract:{"parties":"array","effective_date":"string","terms":"string"}' \
  --metadata-schema 'cv:{"name":"string","contact_info":"object","skills":"array","experience":"array"}' \
  -o json

Output:

{
  "success": true,
  "text": "...",
  "metadata": "{\"invoice_number\":\"INV-2024-001\",\"date\":\"2024-01-15\",\"total_amount\":1250.00,\"vendor\":\"Acme Corp\"}",
  "metadataSchema": "invoice"
}

Iris automatically detected this was an invoice and extracted the relevant fields using the matching schema.
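
Note that in the sample output above, `metadata` is a JSON document encoded as a string, so it needs a second decoding step. The following sketch shows one way to handle that with the standard library. The helper name `decode_metadata` is hypothetical, and accepting an already-decoded object is a defensive assumption rather than documented behavior.

```python
import json
from typing import Optional, Tuple


def decode_metadata(payload: dict) -> Tuple[Optional[str], Optional[dict]]:
    """Return (matched schema id, decoded metadata fields) from a response."""
    schema_id = payload.get("metadataSchema")
    raw = payload.get("metadata")
    if raw is None:
        return schema_id, None
    # The sample response stores metadata as a JSON string; decode it again.
    fields = json.loads(raw) if isinstance(raw, str) else raw
    return schema_id, fields


if __name__ == "__main__":
    # Raw string so the escaped quotes reach json.loads unchanged.
    response = json.loads(
        r'{"success": true, "text": "...", '
        r'"metadata": "{\"invoice_number\":\"INV-2024-001\",\"total_amount\":1250.00}", '
        r'"metadataSchema": "invoice"}'
    )
    schema_id, fields = decode_metadata(response)
    print(schema_id, fields)
```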

Advanced Options

# Custom chunk size with metadata extraction
vectorize-iris document.pdf \
  --chunk-size 256 \
  --infer-metadata-schema \
  --parsing-instructions "Focus on extracting structured data" \
  -o yaml -f output.yaml

# Longer timeout for large documents
vectorize-iris large-document.pdf \
  --timeout 600 \
  --poll-interval 5
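
When calling the CLI from a script, it can help to assemble the argument list in one place. The sketch below builds an argv list using only the flags shown in this README (`--chunk-size`, `--parsing-instructions`, `--metadata-schema`, `--timeout`, `--poll-interval`, `-o`, `-f`). The function `build_iris_command` and its parameter names are hypothetical, and the block only constructs the command without running it.

```python
import json
import shlex
from typing import Optional


def build_iris_command(
    source: str,
    output_format: Optional[str] = None,
    output_path: Optional[str] = None,
    chunk_size: Optional[int] = None,
    parsing_instructions: Optional[str] = None,
    metadata_schemas: Optional[dict] = None,
    timeout: Optional[int] = None,
    poll_interval: Optional[int] = None,
) -> list[str]:
    """Build an argv list for the vectorize-iris CLI (nothing is executed)."""
    cmd = ["vectorize-iris", source]
    if chunk_size is not None:
        cmd += ["--chunk-size", str(chunk_size)]
    if parsing_instructions:
        cmd += ["--parsing-instructions", parsing_instructions]
    # Each schema is passed as 'id:{json}', matching the classification example.
    for schema_id, schema in (metadata_schemas or {}).items():
        compact = json.dumps(schema, separators=(",", ":"))
        cmd += ["--metadata-schema", f"{schema_id}:{compact}"]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    if poll_interval is not None:
        cmd += ["--poll-interval", str(poll_interval)]
    if output_format:
        cmd += ["-o", output_format]
    if output_path:
        cmd += ["-f", output_path]
    return cmd


if __name__ == "__main__":
    cmd = build_iris_command(
        "document.pdf", output_format="yaml", output_path="output.yaml", chunk_size=256
    )
    print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`. Passing a list rather than a shell string avoids quoting problems with the JSON schema arguments.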

Configuration

CLI Configuration

The CLI offers multiple ways to configure your credentials:

Interactive Configuration (Recommended)

The easiest way to get started - opens your browser for authentication:

vectorize-iris configure

What happens:

  1. Opens your browser to the Vectorize platform
  2. Click "Authorize" to grant access
  3. Credentials are automatically saved to ~/.vectorize-iris/credentials
  4. Done! You're ready to extract

Manual Configuration

If you prefer not to use the browser, have the CLI prompt you for credentials in the terminal:

vectorize-iris configure --manual

You'll be asked to enter:

  • API Token
  • Organization ID

Get these from platform.vectorize.io → Account → Org Settings → Access Tokens

Non-Interactive Configuration

For scripts and automation, pass credentials directly:

vectorize-iris configure --api-token "your-token" --org-id "your-org-id"

Environment Variables

Alternatively, set credentials via environment variables (works for all clients):

export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"

Python & Node.js Configuration

For Python and Node.js clients, use environment variables or pass credentials programmatically:

Environment variables:

export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"

Python:

from vectorize_iris import VectorizeIrisClient

client = VectorizeIrisClient(
    api_token="your-token",
    org_id="your-org-id"
)
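
If you would rather keep tokens out of source code, one option is to read the environment variables yourself and fail early when one is missing. This is a sketch using only the standard library. The helper `load_credentials` is hypothetical, and its returned keys are chosen to match the `api_token` and `org_id` keyword arguments shown above.

```python
import os
from typing import Mapping


def load_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Read Vectorize credentials from the environment, failing early if unset."""
    required = ("VECTORIZE_TOKEN", "VECTORIZE_ORG_ID")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {"api_token": env["VECTORIZE_TOKEN"], "org_id": env["VECTORIZE_ORG_ID"]}


if __name__ == "__main__":
    # Demo with an explicit mapping instead of the real environment.
    creds = load_credentials({"VECTORIZE_TOKEN": "your-token", "VECTORIZE_ORG_ID": "your-org-id"})
    print(sorted(creds))
```

The returned dictionary can then be unpacked into the constructor shown above, for example `VectorizeIrisClient(**load_credentials())`.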

Node.js:

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('document.pdf', {
    apiToken: 'your-token',
    orgId: 'your-org-id'
});

Documentation

For detailed documentation, API reference, and advanced features:

📚 docs.vectorize.io

License

MIT

Support