A multi-source knowledge graph extractor using Large Language Models with a two-phase Extract-Build agentic workflow. Extract meaningful relationships from any text source and build a consistent knowledge graph.
This code accompanies the blog post How To Build a Multi-Source Knowledge Graph Extractor from Scratch. See the Colab notebook for example usage.
- Two-phase agentic workflow: Extract relations, then build a consistent knowledge graph
- Entity disambiguation: Maintains consistency across different text sources
- Flexible LLM backends: Supports Gemini and local HuggingFace models
- Custom entity types: Define domain-specific entity categories
- Source linking: Links relations back to original text passages
- Rich visualizations: Built-in graph visualization using NetworkX and matplotlib
- Multiple data sources: Works with Wikipedia, custom text, image metadata, and more
- Clone the repository (if not already done)
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file and add your API key:

  ```
  GEMINI_API_KEY=your_actual_api_key_here
  ```

- Run the examples:

  ```bash
  python example_usage.py
  ```
- Python 3.12+
- Dependencies: torch, transformers, google-genai, langchain-huggingface, faiss-cpu, networkx, matplotlib, wikipedia, python-dotenv
- API Key: Google Gemini API key (get one at Google AI Studio)
- Optional: CUDA-capable GPU for local HuggingFace models
```python
from example_usage import run_basic_example, run_image_metadata_example

# Extract from a Wikipedia article
run_basic_example()

# Extract from personal image metadata
run_image_metadata_example()
```
Edit the `.env` file to customize:

```
GEMINI_API_KEY=your_actual_api_key_here
DEFAULT_MODEL=gemini-2.0-flash-exp
CHUNK_SIZE=4096
CHUNK_OVERLAP=128
```
- `GEMINI_API_KEY`: Your Google Gemini API key (required)
- `DEFAULT_MODEL`: Model to use (default: gemini-2.0-flash-exp)
- `CHUNK_SIZE`: Text chunk size for processing (default: 4096)
- `CHUNK_OVERLAP`: Overlap between chunks (default: 128)
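For reference, here is a minimal sketch of how these settings could be read at startup with python-dotenv (already in the dependency list); the variable names are the ones from the `.env` file above:

```python
import os
from dotenv import load_dotenv

# Load variables from the .env file into the process environment.
load_dotenv()

api_key = os.environ["GEMINI_API_KEY"]                      # required
model = os.getenv("DEFAULT_MODEL", "gemini-2.0-flash-exp")  # optional, with default
chunk_size = int(os.getenv("CHUNK_SIZE", "4096"))
chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "128"))
```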
The repository includes several ready-to-run examples:
Extract entities and relationships from Wikipedia articles:

```bash
python example_usage.py
# Choose option 1 when prompted
```

Extract knowledge graphs from personal photo metadata:

```bash
python example_usage.py
# Choose option 2 when prompted
```
Define your own entity categories for domain-specific extraction:

```python
# Finance domain
allowed_entity_types = ["COMPANY", "PERSON", "FINANCIAL_INSTRUMENT", "MARKET", "REGULATION"]

# Medical domain
allowed_entity_types = ["PERSON", "CONDITION", "TREATMENT", "MEDICATION", "ANATOMY"]

# Academic domain
allowed_entity_types = ["PERSON", "INSTITUTION", "RESEARCH_AREA", "PUBLICATION", "CONCEPT"]
```
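As a self-contained illustration (not the repo's own code), such a list can be used to keep only triplets whose entity types are allowed:

```python
# Illustrative helper: drop triplets whose entity types fall outside the allowed set.
allowed = {"COMPANY", "PERSON", "FINANCIAL_INSTRUMENT", "MARKET", "REGULATION"}

def is_allowed(triplet):
    (subj, subj_type), relation, (obj, obj_type) = triplet
    return subj_type in allowed and obj_type in allowed

triplets = [
    (("Alex", "PERSON"), "holds", ("AAPL", "FINANCIAL_INSTRUMENT")),   # kept
    (("Alex", "PERSON"), "lives_in", ("San Francisco", "LOCATION")),   # dropped
]
filtered = [t for t in triplets if is_allowed(t)]
```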
The system uses a two-phase agentic workflow:
- Extract Phase: LLM identifies and extracts triplets (subject, relation, object) from text chunks
- Build Phase: Another LLM consolidates triplets, performs entity disambiguation, and builds the final knowledge graph
Key components:
- `EBWorkflow`: Orchestrates the two-phase process
- `GeminiEngine`: Handles LLM interactions with Google Gemini
- `RelationsData`: Manages entity types and relationship storage
- Embeddings: Uses HuggingFace sentence transformers for semantic similarity
- FAISS: Vector database for efficient similarity search
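To make the disambiguation step concrete, here is a minimal sketch of matching a new entity mention against already-known entities with embeddings and FAISS. It uses the listed dependencies but is illustrative, not the repo's exact code (the 0.8 similarity threshold is an arbitrary choice):

```python
import numpy as np
import faiss
from langchain_huggingface import HuggingFaceEmbeddings

# Embed the names of entities already in the graph.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
known_entities = ["San Francisco", "Tech Company", "Alex"]
vectors = np.array(embedder.embed_documents(known_entities), dtype="float32")
faiss.normalize_L2(vectors)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# A newly extracted mention; find its closest known entity.
query = np.array([embedder.embed_query("SF")], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=1)
if scores[0][0] > 0.8:  # merge only if similarity is high enough
    print("Merge 'SF' into", known_entities[ids[0][0]])
```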
The system extracts structured knowledge in triplet format:

```
(subject:entity_type, relation, object:entity_type)
```

Example triplets from personal data:

```
(Alex:PERSON, lives_in, San Francisco:LOCATION)
(Alex:PERSON, works_at, Tech Company:ORGANIZATION)
(Birthday Party:EVENT, attended_by, Alex:PERSON)
(Wedding:EVENT, location, Napa Valley:LOCATION)
```
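A small, self-contained parser for this notation might look like the following (illustrative only; the repo handles LLM output its own way):

```python
import re

# Matches "(subject:TYPE, relation, object:TYPE)".
TRIPLET_RE = re.compile(r"\((.+?):(\w+),\s*(\w+),\s*(.+?):(\w+)\)")

def parse_triplet(line: str):
    m = TRIPLET_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a triplet: {line!r}")
    subj, subj_type, relation, obj, obj_type = m.groups()
    return (subj, subj_type), relation, (obj, obj_type)

print(parse_triplet("(Alex:PERSON, lives_in, San Francisco:LOCATION)"))
# -> (('Alex', 'PERSON'), 'lives_in', ('San Francisco', 'LOCATION'))
```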
Supported outputs:
- JSON knowledge graph with source linking
- NetworkX graph object for programmatic access
- PNG visualization with matplotlib
- Relationship statistics and entity counts
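As an illustration of these output formats using the listed dependencies (a sketch, not the repo's export code):

```python
import json
import networkx as nx
import matplotlib.pyplot as plt

# Build a tiny example graph with relation labels on the edges.
g = nx.DiGraph()
g.add_edge("Alex", "San Francisco", relation="lives_in")
g.add_edge("Alex", "Tech Company", relation="works_at")

# JSON: node-link form preserves edge attributes such as the relation label.
with open("graph.json", "w") as f:
    json.dump(nx.node_link_data(g), f, indent=2)

# PNG: draw nodes and relation labels with matplotlib.
pos = nx.spring_layout(g, seed=42)
nx.draw_networkx(g, pos, node_color="lightblue")
nx.draw_networkx_edge_labels(g, pos, edge_labels=nx.get_edge_attributes(g, "relation"))
plt.axis("off")
plt.savefig("graph.png", dpi=150)

# Simple statistics: entity and relation counts.
print(g.number_of_nodes(), "entities,", g.number_of_edges(), "relations")
```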
- Personal Knowledge Management: Extract insights from journals, photos, documents
- Research & Academia: Build knowledge graphs from papers, notes, research data
- Business Intelligence: Extract relationships from reports, emails, documents
- Content Analysis: Understand relationships in articles, books, social media
- Domain-Specific Extraction: Finance, healthcare, legal, technical documentation
Define domain-specific entities in your code:

```python
# Medical domain
medical_entities = ["PATIENT", "DOCTOR", "CONDITION", "TREATMENT", "MEDICATION", "HOSPITAL"]

# Legal domain
legal_entities = ["PERSON", "ORGANIZATION", "LAW", "CASE", "COURT", "CONTRACT"]

# Academic domain
academic_entities = ["RESEARCHER", "INSTITUTION", "PUBLICATION", "CONCEPT", "METHOD", "DATASET"]
```
Modify the system prompts in `kg_builder/prompts/`:

- `extractor_system_prompt.txt`: Controls triplet extraction
- `builder_system_prompt.txt`: Controls knowledge graph building
Switch to local HuggingFace models:

```python
from kg_builder.engine import HuggingFaceEngine
# The import path for ExtractorPrompting below is an assumption; adjust it to the repo layout.
from kg_builder.prompting import ExtractorPrompting

# Use a local model instead of Gemini
extractor_engine = HuggingFaceEngine("microsoft/DialoGPT-medium", ExtractorPrompting())
```
- Python 3.12+
- CUDA-capable GPU (for local models, optional)
- Gemini API key (for cloud models)

This code is meant for educational purposes only.