Knowledge Graph Extraction

A multi-source knowledge graph extractor using Large Language Models with a two-phase Extract-Build agentic workflow. Extract meaningful relationships from any text source and build a consistent knowledge graph.

This code accompanies the blog post How To Build a Multi-Source Knowledge Graph Extractor from Scratch. See the Colab notebook for example usage.

Key Features

  • Two-phase agentic workflow: Extract relations, then build a consistent knowledge graph
  • Entity disambiguation: Maintains consistency across different text sources
  • Flexible LLM backends: Supports Gemini and local HuggingFace models
  • Custom entity types: Define domain-specific entity categories
  • Source linking: Links relations back to original text passages
  • Rich visualizations: Built-in graph visualization using NetworkX and matplotlib
  • Multiple data sources: Works with Wikipedia, custom text, image metadata, and more

Quick Installation

  1. Clone the repository (if not already done)

  2. Install dependencies:

    pip install -r requirements.txt

  3. Set up environment variables: create a .env file and add your API key:

    GEMINI_API_KEY=your_actual_api_key_here

  4. Run the examples:

    python example_usage.py

Requirements

  • Python 3.12+
  • Dependencies: torch, transformers, google-genai, langchain-huggingface, faiss-cpu, networkx, matplotlib, wikipedia, python-dotenv
  • API Key: Google Gemini API key (get one at Google AI Studio)
  • Optional: CUDA-capable GPU for local HuggingFace models

Quick Start

Run Example Scripts

from example_usage import run_basic_example, run_image_metadata_example

# Extract from Wikipedia article
run_basic_example()

# Extract from personal image metadata
run_image_metadata_example()

Configuration

Edit the .env file to customize:

GEMINI_API_KEY=your_actual_api_key_here
DEFAULT_MODEL=gemini-2.0-flash-exp
CHUNK_SIZE=4096
CHUNK_OVERLAP=128

Available Options:

  • GEMINI_API_KEY: Your Google Gemini API key (required)
  • DEFAULT_MODEL: Model to use (default: gemini-2.0-flash-exp)
  • CHUNK_SIZE: Text chunk size for processing (default: 4096)
  • CHUNK_OVERLAP: Overlap between chunks (default: 128)
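
These values are plain environment variables; a minimal sketch of reading them at runtime with python-dotenv (a listed dependency):

# Load .env from the working directory and read the settings with defaults.
import os
from dotenv import load_dotenv

load_dotenv()

GEMINI_API_KEY = os.environ["GEMINI_API_KEY"]  # required
DEFAULT_MODEL = os.getenv("DEFAULT_MODEL", "gemini-2.0-flash-exp")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "4096"))
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "128"))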

Usage Examples

The repository includes several ready-to-run examples:

1. Wikipedia Knowledge Extraction

Extract entities and relationships from Wikipedia articles:

python example_usage.py
# Choose option 1 when prompted

2. Personal Image Metadata

Extract knowledge graphs from personal photo metadata:

python example_usage.py
# Choose option 2 when prompted

3. Custom Entity Types

Define your own entity categories for domain-specific extraction:

# Finance domain
allowed_entity_types = ["COMPANY", "PERSON", "FINANCIAL_INSTRUMENT", "MARKET", "REGULATION"]

# Medical domain
allowed_entity_types = ["PERSON", "CONDITION", "TREATMENT", "MEDICATION", "ANATOMY"]

# Academic domain
allowed_entity_types = ["PERSON", "INSTITUTION", "RESEARCH_AREA", "PUBLICATION", "CONCEPT"]

System Architecture

The system uses a two-phase agentic workflow:

  1. Extract Phase: LLM identifies and extracts triplets (subject, relation, object) from text chunks
  2. Build Phase: Another LLM consolidates triplets, performs entity disambiguation, and builds the final knowledge graph
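
Putting the two phases together, an end-to-end run might look roughly like the sketch below, using the components described under Key components. This is a hedged illustration: EBWorkflow's constructor, its run method, BuilderPrompting, and the import paths are assumptions about kg_builder's API, not confirmed signatures.

# Hypothetical wiring; actual module paths and signatures in kg_builder may differ.
from kg_builder import EBWorkflow                                       # assumed import path
from kg_builder.engine import GeminiEngine
from kg_builder.prompting import ExtractorPrompting, BuilderPrompting   # assumed import path

text = open("my_notes.txt").read()  # any text source

extractor_engine = GeminiEngine("gemini-2.0-flash-exp", ExtractorPrompting())  # phase 1: extract triplets
builder_engine = GeminiEngine("gemini-2.0-flash-exp", BuilderPrompting())      # phase 2: consolidate the graph

workflow = EBWorkflow(chunk_size=4096, chunk_overlap=128)
knowledge_graph = workflow.run(text, extractor=extractor_engine, builder=builder_engine)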

Key components:

  • EBWorkflow: Orchestrates the two-phase process
  • GeminiEngine: Handles LLM interactions with Google Gemini
  • RelationsData: Manages entity types and relationship storage
  • Embeddings: Uses HuggingFace sentence transformers for semantic similarity
  • FAISS: Vector database for efficient similarity search
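
Entity disambiguation in the build phase relies on semantic similarity between entity names. A minimal sketch of the idea, assuming the listed langchain-huggingface and faiss-cpu dependencies; the model name and the 0.9 threshold are illustrative, not the repository's actual values:

# Match a new mention against known entities via embedding similarity.
import numpy as np
import faiss
from langchain_huggingface import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

known_entities = ["San Francisco", "Tech Company", "Alex"]
vectors = np.array(embedder.embed_documents(known_entities), dtype="float32")
faiss.normalize_L2(vectors)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product == cosine after L2 normalization
index.add(vectors)

# Merge a new mention into an existing entity if it is similar enough.
query = np.array([embedder.embed_query("SF")], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=1)
if scores[0][0] > 0.9:
    print("Merge with:", known_entities[ids[0][0]])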

What Gets Extracted

The system extracts structured knowledge in triplet format:

(subject:entity_type, relation, object:entity_type)

Example triplets from personal data:

(Alex:PERSON, lives_in, San Francisco:LOCATION)
(Alex:PERSON, works_at, Tech Company:ORGANIZATION)
(Birthday Party:EVENT, attended_by, Alex:PERSON)
(Wedding:EVENT, location, Napa Valley:LOCATION)

Supported outputs:

  • JSON knowledge graph with source linking
  • NetworkX graph object for programmatic access
  • PNG visualization with matplotlib
  • Relationship statistics and entity counts
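
As an illustration of the programmatic output, triplets map directly onto a NetworkX graph (both NetworkX and matplotlib are listed dependencies). The triplet list below is illustrative; the builder's actual JSON schema may differ:

# Build a directed graph from example triplets and render it to PNG.
import networkx as nx
import matplotlib.pyplot as plt

triplets = [
    ("Alex", "lives_in", "San Francisco"),
    ("Alex", "works_at", "Tech Company"),
    ("Birthday Party", "attended_by", "Alex"),
]

G = nx.DiGraph()
for subject, relation, obj in triplets:
    G.add_edge(subject, obj, relation=relation)

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue", node_size=2000)
nx.draw_networkx_edge_labels(G, pos, edge_labels=nx.get_edge_attributes(G, "relation"))
plt.savefig("knowledge_graph.png")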

Use Cases

  • Personal Knowledge Management: Extract insights from journals, photos, documents
  • Research & Academia: Build knowledge graphs from papers, notes, research data
  • Business Intelligence: Extract relationships from reports, emails, documents
  • Content Analysis: Understand relationships in articles, books, social media
  • Domain-Specific Extraction: Finance, healthcare, legal, technical documentation

Customization

Custom Entity Types

Define domain-specific entities in your code:

# Medical domain
medical_entities = ["PATIENT", "DOCTOR", "CONDITION", "TREATMENT", "MEDICATION", "HOSPITAL"]

# Legal domain  
legal_entities = ["PERSON", "ORGANIZATION", "LAW", "CASE", "COURT", "CONTRACT"]

# Academic domain
academic_entities = ["RESEARCHER", "INSTITUTION", "PUBLICATION", "CONCEPT", "METHOD", "DATASET"]

Custom Prompts

Modify the system prompts in kg_builder/prompts/:

  • extractor_system_prompt.txt - Controls triplet extraction
  • builder_system_prompt.txt - Controls knowledge graph building
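
For example, you could append a domain-specific instruction to the extractor prompt before a run (a sketch using the paths above; the instruction text is illustrative):

# Append a domain hint to the extraction system prompt.
from pathlib import Path

prompt_path = Path("kg_builder/prompts/extractor_system_prompt.txt")
prompt_path.write_text(prompt_path.read_text() + "\nOnly extract relations involving medical entities.")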

Alternative LLM Backends

Switch to local HuggingFace models:

from kg_builder.engine import HuggingFaceEngine

# Use local model instead of Gemini
extractor_engine = HuggingFaceEngine("microsoft/DialoGPT-medium", ExtractorPrompting())

This code is meant for educational purposes only.

