A multi-source knowledge graph extractor using Large Language Models with a two-phase Extract-Build agentic workflow. Extract meaningful relationships from any text source and build a consistent knowledge graph.
This code accompanies the blog post How To Build a Multi-Source Knowledge Graph Extractor from Scratch. See the Colab notebook for example usage.
- Two-phase agentic workflow: Extract relations, then build a consistent knowledge graph
- Entity disambiguation: Maintains consistency across different text sources
- Flexible LLM backends: Supports Gemini and local HuggingFace models
- Custom entity types: Define domain-specific entity categories
- Source linking: Links relations back to original text passages
- Rich visualizations: Built-in graph visualization using NetworkX and matplotlib
- Multiple data sources: Works with Wikipedia, custom text, image metadata, and more
- Clone the repository (if not already done)
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables: create a `.env` file and add your API key:

  ```
  GEMINI_API_KEY=your_actual_api_key_here
  ```

- Run the examples:

  ```bash
  python example_usage.py
  ```
- Python 3.12+
- Dependencies: torch, transformers, google-genai, langchain-huggingface, faiss-cpu, networkx, matplotlib, wikipedia, python-dotenv
- API Key: Google Gemini API key (get one at Google AI Studio)
- Optional: CUDA-capable GPU for local HuggingFace models
```python
from example_usage import run_basic_example, run_image_metadata_example

# Extract from a Wikipedia article
run_basic_example()

# Extract from personal image metadata
run_image_metadata_example()
```
Edit the `.env` file to customize:

```
GEMINI_API_KEY=your_actual_api_key_here
DEFAULT_MODEL=gemini-2.0-flash-exp
CHUNK_SIZE=4096
CHUNK_OVERLAP=128
```
- `GEMINI_API_KEY`: Your Google Gemini API key (required)
- `DEFAULT_MODEL`: Model to use (default: gemini-2.0-flash-exp)
- `CHUNK_SIZE`: Text chunk size for processing (default: 4096)
- `CHUNK_OVERLAP`: Overlap between chunks (default: 128)
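For reference, here is a minimal sketch of how these settings could be read at startup with python-dotenv (already in the dependency list); the variable names are the ones from the `.env` file above:

```python
import os
from dotenv import load_dotenv

# Load variables from the .env file into the process environment.
load_dotenv()

api_key = os.environ["GEMINI_API_KEY"]                      # required
model = os.getenv("DEFAULT_MODEL", "gemini-2.0-flash-exp")  # optional, with default
chunk_size = int(os.getenv("CHUNK_SIZE", "4096"))
chunk_overlap = int(os.getenv("CHUNK_OVERLAP", "128"))
```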
The repository includes several ready-to-run examples:
Extract entities and relationships from Wikipedia articles:

```bash
python example_usage.py
# Choose option 1 when prompted
```

Extract knowledge graphs from personal photo metadata:

```bash
python example_usage.py
# Choose option 2 when prompted
```
Define your own entity categories for domain-specific extraction:

```python
# Finance domain
allowed_entity_types = ["COMPANY", "PERSON", "FINANCIAL_INSTRUMENT", "MARKET", "REGULATION"]

# Medical domain
allowed_entity_types = ["PERSON", "CONDITION", "TREATMENT", "MEDICATION", "ANATOMY"]

# Academic domain
allowed_entity_types = ["PERSON", "INSTITUTION", "RESEARCH_AREA", "PUBLICATION", "CONCEPT"]
```
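As a self-contained illustration (not the repo's own code), such a list can be used to keep only triplets whose entity types are allowed:

```python
# Illustrative helper: drop triplets whose entity types fall outside the allowed set.
allowed = {"COMPANY", "PERSON", "FINANCIAL_INSTRUMENT", "MARKET", "REGULATION"}

def is_allowed(triplet):
    (subj, subj_type), relation, (obj, obj_type) = triplet
    return subj_type in allowed and obj_type in allowed

triplets = [
    (("Alex", "PERSON"), "holds", ("AAPL", "FINANCIAL_INSTRUMENT")),   # kept
    (("Alex", "PERSON"), "lives_in", ("San Francisco", "LOCATION")),   # dropped
]
filtered = [t for t in triplets if is_allowed(t)]
```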
The system uses a two-phase agentic workflow:
- Extract Phase: LLM identifies and extracts triplets (subject, relation, object) from text chunks
- Build Phase: Another LLM consolidates triplets, performs entity disambiguation, and builds the final knowledge graph
Key components:
- `EBWorkflow`: Orchestrates the two-phase process
- `GeminiEngine`: Handles LLM interactions with Google Gemini
- `RelationsData`: Manages entity types and relationship storage
- Embeddings: Uses HuggingFace sentence transformers for semantic similarity
- FAISS: Vector database for efficient similarity search
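To make the disambiguation step concrete, here is a minimal sketch of matching a new entity mention against already-known entities with embeddings and FAISS. It uses the listed dependencies but is illustrative, not the repo's exact code (the 0.8 similarity threshold is an arbitrary choice):

```python
import numpy as np
import faiss
from langchain_huggingface import HuggingFaceEmbeddings

# Embed the names of entities already in the graph.
embedder = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
known_entities = ["San Francisco", "Tech Company", "Alex"]
vectors = np.array(embedder.embed_documents(known_entities), dtype="float32")
faiss.normalize_L2(vectors)  # normalize so inner product equals cosine similarity

index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)

# A newly extracted mention; find its closest known entity.
query = np.array([embedder.embed_query("SF")], dtype="float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, k=1)
if scores[0][0] > 0.8:  # merge only if similarity is high enough
    print("Merge 'SF' into", known_entities[ids[0][0]])
```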
The system extracts structured knowledge in triplet format:

```
(subject:entity_type, relation, object:entity_type)
```

Example triplets from personal data:

```
(Alex:PERSON, lives_in, San Francisco:LOCATION)
(Alex:PERSON, works_at, Tech Company:ORGANIZATION)
(Birthday Party:EVENT, attended_by, Alex:PERSON)
(Wedding:EVENT, location, Napa Valley:LOCATION)
```
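A small, self-contained parser for this notation might look like the following (illustrative only; the repo handles LLM output its own way):

```python
import re

# Matches "(subject:TYPE, relation, object:TYPE)".
TRIPLET_RE = re.compile(r"\((.+?):(\w+),\s*(\w+),\s*(.+?):(\w+)\)")

def parse_triplet(line: str):
    m = TRIPLET_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a triplet: {line!r}")
    subj, subj_type, relation, obj, obj_type = m.groups()
    return (subj, subj_type), relation, (obj, obj_type)

print(parse_triplet("(Alex:PERSON, lives_in, San Francisco:LOCATION)"))
# -> (('Alex', 'PERSON'), 'lives_in', ('San Francisco', 'LOCATION'))
```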
Supported outputs:
- JSON knowledge graph with source linking
- NetworkX graph object for programmatic access
- PNG visualization with matplotlib
- Relationship statistics and entity counts
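As an illustration of these output formats using the listed dependencies (a sketch, not the repo's export code):

```python
import json
import networkx as nx
import matplotlib.pyplot as plt

# Build a tiny example graph with relation labels on the edges.
g = nx.DiGraph()
g.add_edge("Alex", "San Francisco", relation="lives_in")
g.add_edge("Alex", "Tech Company", relation="works_at")

# JSON: node-link form preserves edge attributes such as the relation label.
with open("graph.json", "w") as f:
    json.dump(nx.node_link_data(g), f, indent=2)

# PNG: draw nodes and relation labels with matplotlib.
pos = nx.spring_layout(g, seed=42)
nx.draw_networkx(g, pos, node_color="lightblue")
nx.draw_networkx_edge_labels(g, pos, edge_labels=nx.get_edge_attributes(g, "relation"))
plt.axis("off")
plt.savefig("graph.png", dpi=150)

# Simple statistics: entity and relation counts.
print(g.number_of_nodes(), "entities,", g.number_of_edges(), "relations")
```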
- Personal Knowledge Management: Extract insights from journals, photos, documents
- Research & Academia: Build knowledge graphs from papers, notes, research data
- Business Intelligence: Extract relationships from reports, emails, documents
- Content Analysis: Understand relationships in articles, books, social media
- Domain-Specific Extraction: Finance, healthcare, legal, technical documentation
Define domain-specific entities in your code:

```python
# Medical domain
medical_entities = ["PATIENT", "DOCTOR", "CONDITION", "TREATMENT", "MEDICATION", "HOSPITAL"]

# Legal domain
legal_entities = ["PERSON", "ORGANIZATION", "LAW", "CASE", "COURT", "CONTRACT"]

# Academic domain
academic_entities = ["RESEARCHER", "INSTITUTION", "PUBLICATION", "CONCEPT", "METHOD", "DATASET"]
```
Modify the system prompts in `kg_builder/prompts/`:

- `extractor_system_prompt.txt`: Controls triplet extraction
- `builder_system_prompt.txt`: Controls knowledge graph building
Switch to local HuggingFace models:

```python
from kg_builder.engine import HuggingFaceEngine
# The import path for ExtractorPrompting below is an assumption; adjust it to the repo layout.
from kg_builder.prompting import ExtractorPrompting

# Use a local model instead of Gemini
extractor_engine = HuggingFaceEngine("microsoft/DialoGPT-medium", ExtractorPrompting())
```
- Python 3.12+
- CUDA-capable GPU (for local models, optional)
- Gemini API key (for cloud models)

This code is meant for educational purposes only.