TantivyEx


A comprehensive Elixir wrapper for the Tantivy full-text search engine.

TantivyEx provides a complete, type-safe interface to Tantivy, a high-performance full-text search library written in Rust. Build powerful search applications with support for all Tantivy field types, custom tokenizers, schema introspection, and advanced indexing features.

Features

  • High Performance: Built on Tantivy, one of the fastest search engines available
  • Complete Field Type Support: Text, numeric, boolean, date, facet, bytes, JSON, and IP address fields
  • Advanced Text Processing: Comprehensive tokenizer system with 17+ language support, custom regex patterns, n-grams, and configurable text analyzers
  • Intelligent Text Analysis: Stemming, stop word filtering, case normalization, and language-specific processing
  • Dynamic Tokenizer Management: Runtime tokenizer registration, enumeration, and configuration
  • Schema Management: Dynamic schema building with validation and introspection
  • Flexible Storage: In-memory or persistent disk-based indexes
  • Type Safety: Full Elixir typespecs and compile-time safety
  • Advanced Aggregations: Elasticsearch-compatible bucket aggregations (terms, histogram, date_histogram, range) and metric aggregations (avg, min, max, sum, stats, percentiles) with nested sub-aggregations
  • Distributed Search: Multi-node search coordination with load balancing, failover, and configurable result merging
  • Search Features: Full-text search, faceted search, range queries, and comprehensive analytics

Quick Start

# Register tokenizers (recommended for production)
TantivyEx.Tokenizer.register_default_tokenizers()

# Create a schema with custom tokenizers
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field_with_tokenizer("title", :text_stored, "default")
|> TantivyEx.Schema.add_text_field_with_tokenizer("body", :text, "default")
|> TantivyEx.Schema.add_u64_field("id", :indexed_stored)
|> TantivyEx.Schema.add_date_field("published_at", :fast)

# Create an index
# Option 1: Open existing or create new (recommended for production)
{:ok, index} = TantivyEx.Index.open_or_create("/path/to/index", schema)

# Option 2: Create in-memory index for testing
{:ok, index} = TantivyEx.Index.create_in_ram(schema)

# Option 3: Create new persistent index (fails if exists)
{:ok, index} = TantivyEx.Index.create_in_dir("/path/to/index", schema)

# Option 4: Open existing index (fails if doesn't exist)
{:ok, index} = TantivyEx.Index.open("/path/to/index")

# Get a writer
{:ok, writer} = TantivyEx.IndexWriter.new(index, 50_000_000)

# Add documents
doc = %{
  "title" => "Getting Started with TantivyEx",
  "body" => "This is a comprehensive guide to using TantivyEx...",
  "id" => 1,
  "published_at" => "2024-01-15T10:30:00Z"
}

:ok = TantivyEx.IndexWriter.add_document(writer, doc)
:ok = TantivyEx.IndexWriter.commit(writer)

# Search
{:ok, searcher} = TantivyEx.Searcher.new(index)
{:ok, results} = TantivyEx.Searcher.search(searcher, "comprehensive guide", 10)

# Advanced Aggregations (New in v0.2.0)
{:ok, query} = TantivyEx.Query.all()

# Terms aggregation for category analysis
aggregations = %{
  "categories" => %{
    "terms" => %{
      "field" => "category",
      "size" => 10
    }
  }
}

{:ok, agg_results} = TantivyEx.Aggregation.run(searcher, query, aggregations)

# Histogram aggregation for numerical analysis
price_histogram = %{
  "price_ranges" => %{
    "histogram" => %{
      "field" => "price",
      "interval" => 50.0
    }
  }
}

{:ok, histogram_results} = TantivyEx.Aggregation.run(searcher, query, price_histogram)

# Combined search with aggregations
{:ok, search_results, agg_results} = TantivyEx.Aggregation.search_with_aggregations(
  searcher,
  query,
  aggregations,
  20  # search limit
)

# Distributed Search (New in v0.3.0)
{:ok, _supervisor_pid} = TantivyEx.Distributed.OTP.start_link()

# Add multiple search nodes for horizontal scaling
:ok = TantivyEx.Distributed.OTP.add_node("node1", "http://localhost:9200", 1.0)
:ok = TantivyEx.Distributed.OTP.add_node("node2", "http://localhost:9201", 1.5)

# Configure distributed behavior
:ok = TantivyEx.Distributed.OTP.configure(%{
  timeout_ms: 5000,
  max_retries: 3,
  merge_strategy: :score_desc
})

# Perform distributed search
{:ok, results} = TantivyEx.Distributed.OTP.search("search query", 10, 0)

IO.puts("Found #{length(results)} hits across distributed nodes")

Installation

Add tantivy_ex to your list of dependencies in mix.exs:

def deps do
  [
    {:tantivy_ex, "~> 0.3.0"}
  ]
end

TantivyEx requires:

  • Elixir: 1.18 or later
  • Rust: 1.70 or later (for compilation)
  • System: Compatible with Linux, macOS, and Windows

Documentation

Complete Guides

Browse All Guides →

  • Getting Started
  • Development & Production

API Documentation

  • Core Components
  • API Reference

Field Types

TantivyEx supports all Tantivy field types with comprehensive options:

Field Type | Elixir Function | Use Cases | Options
---------- | --------------- | --------- | -------
Text | add_text_field/3 | Full-text search, titles, content | :text, :text_stored, :stored, :fast, :fast_stored
U64 | add_u64_field/3 | IDs, counts, timestamps | :indexed, :indexed_stored, :fast, :stored, :fast_stored
I64 | add_i64_field/3 | Signed integers, deltas | :indexed, :indexed_stored, :fast, :stored, :fast_stored
F64 | add_f64_field/3 | Prices, scores, coordinates | :indexed, :indexed_stored, :fast, :stored, :fast_stored
Bool | add_bool_field/3 | Flags, binary states | :indexed, :indexed_stored, :fast, :stored, :fast_stored
Date | add_date_field/3 | Timestamps, publication dates | :indexed, :indexed_stored, :fast, :stored, :fast_stored
Facet | add_facet_field/2 | Categories, hierarchical data | Always indexed and stored
Bytes | add_bytes_field/3 | Binary data, hashes, images | :indexed, :indexed_stored, :fast, :stored, :fast_stored
JSON | add_json_field/3 | Structured objects, metadata | :text, :text_stored, :stored
IP Address | add_ip_addr_field/3 | IPv4/IPv6 addresses | :indexed, :indexed_stored, :fast, :stored, :fast_stored
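
As an illustration, the sketch below combines several of these field types in one schema and a matching document. The facet path syntax ("/parent/child") and the string form for IP values are assumptions based on common Tantivy conventions, not APIs confirmed above.

# Hedged sketch: schema and document touching facet, IP, and JSON fields
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field("title", :text_stored)
|> TantivyEx.Schema.add_facet_field("category")
|> TantivyEx.Schema.add_ip_addr_field("source_ip", :indexed_stored)
|> TantivyEx.Schema.add_json_field("attributes", :stored)

doc = %{
  "title" => "Wireless headphones",
  "category" => "/electronics/audio",  # assumed facet path syntax
  "source_ip" => "192.168.1.10",       # assumed string form for IP values
  "attributes" => %{"color" => "black", "weight_g" => 250}
}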

Custom Tokenizers

New in v0.2.0: TantivyEx now provides comprehensive tokenizer support with advanced text analysis capabilities:

# Register default tokenizers (recommended starting point)
TantivyEx.Tokenizer.register_default_tokenizers()

# Create custom text analyzers with advanced processing
TantivyEx.Tokenizer.register_text_analyzer(
  "english_full",
  "simple",           # base tokenizer
  true,               # lowercase
  "english",          # stop words language
  "english",          # stemming language
  50                  # remove tokens longer than 50 chars
)

# Register specialized tokenizers for different use cases
TantivyEx.Tokenizer.register_regex_tokenizer("email", "\\b[\\w._%+-]+@[\\w.-]+\\.[A-Za-z]{2,}\\b")
TantivyEx.Tokenizer.register_ngram_tokenizer("fuzzy_search", 2, 4, false)

# Add fields with specific tokenizers
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field_with_tokenizer("content", :text, "english_full")
|> TantivyEx.Schema.add_text_field_with_tokenizer("product_code", :text_stored, "simple")
|> TantivyEx.Schema.add_text_field_with_tokenizer("tags", :text, "whitespace")

Available Tokenizer Types

  • Simple Tokenizer: Basic punctuation and whitespace splitting with lowercase normalization
  • Whitespace Tokenizer: Splits only on whitespace, preserves case and punctuation
  • Regex Tokenizer: Custom pattern-based tokenization for specialized formats
  • N-gram Tokenizer: Character or word n-grams for fuzzy search and autocomplete (see the sketch after this list)
  • Text Analyzer: Advanced processing with configurable filters (stemming, stop words, case normalization)
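
A minimal sketch for the n-gram case, referenced from the list above: it registers a character n-gram tokenizer (2 to 4 characters) and attaches it to a field for autocomplete-style matching. It uses only register_ngram_tokenizer/4 and add_text_field_with_tokenizer/3 as shown earlier; the tokenizer name "autocomplete" is illustrative.

# Register a 2-4 character n-gram tokenizer and use it on a name field
TantivyEx.Tokenizer.register_ngram_tokenizer("autocomplete", 2, 4, false)

schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field_with_tokenizer("name", :text_stored, "autocomplete")

Note that short n-grams grow the index quickly, so keep the size range narrow for large corpora.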

Multi-Language Support

Built-in support for 17+ languages with language-specific stemming and stop words:

# Language-specific analyzers
TantivyEx.Tokenizer.register_language_analyzer("english")  # -> "english_text"
TantivyEx.Tokenizer.register_language_analyzer("french")   # -> "french_text"
TantivyEx.Tokenizer.register_language_analyzer("german")   # -> "german_text"

# Or just stemming
TantivyEx.Tokenizer.register_stemming_tokenizer("spanish") # -> "spanish_stem"

Supported languages: English, French, German, Spanish, Italian, Portuguese, Russian, Arabic, Danish, Dutch, Finnish, Greek, Hungarian, Norwegian, Romanian, Swedish, Tamil, Turkish
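
If an application indexes several languages, the analyzers can be registered in one pass at startup; a minimal sketch using only register_language_analyzer/1 from above.

# Register one analyzer per language the application indexes
Enum.each(["english", "french", "german", "spanish"], fn lang ->
  TantivyEx.Tokenizer.register_language_analyzer(lang)
end)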

Performance Tips

  1. Use appropriate field options: Only index fields you need to search
  2. Leverage fast fields: Use :fast for sorting and aggregations
  3. Batch operations: Commit documents in batches for better performance (see the sketch after this list)
  4. Memory management: Tune writer memory budget based on your use case
  5. Choose optimal tokenizers: Match tokenizers to your search requirements
    • Use "default" for natural language content with stemming
    • Use "simple" for structured data like product codes
    • Use "whitespace" for tag fields and technical terms
    • Use "keyword" for exact matching fields
  6. Language-specific optimization: Use language analyzers for better search quality
  7. Register tokenizers once: Call register_default_tokenizers() at application startup
  8. Distributed search optimization:
    • Weight nodes based on their actual capacity
    • Use the :score_desc merge strategy for best relevance
    • Monitor cluster health and adjust timeouts accordingly
    • Consider health-based load balancing for production environments
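
For tip 3, a minimal batching sketch built only from IndexWriter calls shown in the Quick Start. The batch size of 1_000 is an arbitrary illustration, and docs is assumed to be an enumerable of maps matching the schema.

# Commit once per chunk instead of once per document
{:ok, writer} = TantivyEx.IndexWriter.new(index, 50_000_000)

docs
|> Enum.chunk_every(1_000)
|> Enum.each(fn batch ->
  Enum.each(batch, fn doc ->
    :ok = TantivyEx.IndexWriter.add_document(writer, doc)
  end)
  :ok = TantivyEx.IndexWriter.commit(writer)
end)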

Examples

E-commerce Search

# Product search schema
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field("name", :text_stored)
|> TantivyEx.Schema.add_text_field("description", :text)
|> TantivyEx.Schema.add_f64_field("price", :fast_stored)
|> TantivyEx.Schema.add_facet_field("category")
|> TantivyEx.Schema.add_bool_field("in_stock", :fast)
|> TantivyEx.Schema.add_u64_field("product_id", :indexed_stored)
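
Indexing and querying against this schema follows the Quick Start pattern; a hedged sketch, assuming index was opened with the schema above and that facet values use a "/parent/child" path syntax.

# Index one product and run a basic relevance search
{:ok, writer} = TantivyEx.IndexWriter.new(index, 50_000_000)

:ok = TantivyEx.IndexWriter.add_document(writer, %{
  "name" => "Mechanical keyboard",
  "description" => "Tenkeyless board with hot-swappable switches",
  "price" => 89.99,
  "category" => "/peripherals/keyboards",  # assumed facet path syntax
  "in_stock" => true,
  "product_id" => 42
})
:ok = TantivyEx.IndexWriter.commit(writer)

{:ok, searcher} = TantivyEx.Searcher.new(index)
{:ok, results} = TantivyEx.Searcher.search(searcher, "mechanical keyboard", 10)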

Blog Search

# Setup tokenizers for blog content
TantivyEx.Tokenizer.register_default_tokenizers()
TantivyEx.Tokenizer.register_language_analyzer("english")

# Blog post schema with optimized tokenizers
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field_with_tokenizer("title", :text_stored, "english_text")
|> TantivyEx.Schema.add_text_field_with_tokenizer("content", :text, "english_text")
|> TantivyEx.Schema.add_text_field_with_tokenizer("tags", :text_stored, "whitespace")
|> TantivyEx.Schema.add_date_field("published_at", :fast_stored)
|> TantivyEx.Schema.add_u64_field("author_id", :fast)

Log Analysis

# Application log schema
schema = TantivyEx.Schema.new()
|> TantivyEx.Schema.add_text_field("message", :text_stored)
|> TantivyEx.Schema.add_text_field("level", :text)
|> TantivyEx.Schema.add_ip_addr_field("client_ip", :indexed_stored)
|> TantivyEx.Schema.add_date_field("timestamp", :fast_stored)
|> TantivyEx.Schema.add_json_field("metadata", :stored)
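
A sketch of indexing a single log entry against this schema, assuming a writer as in the Quick Start; the string forms for the IP and timestamp values mirror the earlier date example and are otherwise assumptions.

# One structured log entry; metadata lands in the JSON field
log_entry = %{
  "message" => "GET /health returned 200",
  "level" => "info",
  "client_ip" => "10.0.0.7",  # assumed string form for IP values
  "timestamp" => "2024-01-15T10:30:00Z",
  "metadata" => %{"service" => "api", "request_id" => "abc123"}
}

:ok = TantivyEx.IndexWriter.add_document(writer, log_entry)
:ok = TantivyEx.IndexWriter.commit(writer)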

Distributed Search Setup

# Multi-node search coordinator for horizontal scaling
{:ok, _supervisor_pid} = TantivyEx.Distributed.OTP.start_link()

# Add nodes with different weights based on capacity
:ok = TantivyEx.Distributed.OTP.add_node("primary", "http://search1:9200", 2.0)
:ok = TantivyEx.Distributed.OTP.add_node("secondary", "http://search2:9200", 1.5)
:ok = TantivyEx.Distributed.OTP.add_node("backup", "http://search3:9200", 1.0)

# Configure for production use
:ok = TantivyEx.Distributed.OTP.configure(%{
  timeout_ms: 10_000,
  max_retries: 3,
  merge_strategy: :score_desc
})

# Perform distributed search
{:ok, results} = TantivyEx.Distributed.OTP.search("search query", 50, 0)

# Monitor cluster health
{:ok, cluster_stats} = TantivyEx.Distributed.OTP.get_cluster_stats()
IO.puts("Active nodes: #{cluster_stats.active_nodes}/#{cluster_stats.total_nodes}")

Development Setup

git clone https://github.com/alexiob/tantivy_ex.git
cd tantivy_ex
mix deps.get
mix test

License

TantivyEx is licensed under the Apache License 2.0. See LICENSE for details.

Acknowledgments

  • Tantivy: The powerful Rust search engine that powers TantivyEx
  • Rustler: For excellent Rust-Elixir integration
