CVE Knowledge Graph & Security Intelligence System with Enhanced RAG
This project combines a knowledge graph of structured vulnerability data and relationships with an enhanced Retrieval-Augmented Generation (RAG) system for semantic search. It automates the curation, processing, and correlation of CVE, CPE, CWE, CAPEC, MITRE ATT&CK, ExploitDB, CISA, and other threat intelligence data.
- 190,310 CVEs with rich metadata (CVSS, affected products, CWE, CAPEC, MITRE mappings)
- 124,290 products from 19,692 vendors
- 458 CWEs, 428 CAPECs, 169 MITRE techniques, 37 MITRE tactics
- 2.4M+ relationships between entities
- 80% of CVEs have CVSS v3 scores (152,676 CVEs)
- 1,060 CVEs in the CISA Known Exploited Vulnerabilities (KEV) list
- 335,178 nodes (CVEs, Products, Vendors, CWEs, CAPECs)
- 1,126,306 edges (relationships between entities)
- 246 vulnerability clusters based on CWE and product relationships
- Graph density: 0.000010 (sparse, efficient graph structure)
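The density and coverage figures above can be rechecked from the raw counts. The sketch below assumes the standard density definitions; the fact that the directed formula reproduces 0.000010 suggests, but does not confirm, that the reported value was computed on a directed graph.

```python
def directed_density(nodes: int, edges: int) -> float:
    """Edges divided by the number of possible ordered node pairs."""
    return edges / (nodes * (nodes - 1))


def undirected_density(nodes: int, edges: int) -> float:
    """Edges divided by the number of possible unordered node pairs."""
    return 2 * edges / (nodes * (nodes - 1))


# Counts taken from the statistics listed above.
NODES = 335_178
EDGES = 1_126_306
TOTAL_CVES = 190_310
CVES_WITH_CVSS_V3 = 152_676

print(f"directed density:   {directed_density(NODES, EDGES):.6f}")    # 0.000010
print(f"undirected density: {undirected_density(NODES, EDGES):.6f}")  # 0.000020
print(f"CVSS v3 coverage:   {CVES_WITH_CVSS_V3 / TOTAL_CVES:.1%}")    # 80.2%
```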
# Install Python dependencies
pip install -r requirements.txt
# Install Ollama from https://ollama.ai
# Then pull required models:
ollama pull llama3.1:8b
ollama pull llama3.1:70b # Optional: for higher quality responses
# Download all CVE data from NVD
python scripts/download_all_cves.py
# Collect and process threat intelligence data
python src/collectors/main_collector.py
# Process all CVE data with enrichment
python src/processors/process_all_cves.py
# Parse CPEs and extract products/vendors
python src/constructors/run_cpe_extraction.py
# Build the knowledge graph (JSON-based, no Neo4j required)
python src/constructors/kg_builder_without_neo4j.py
# Build NetworkX graph with similarity features and centrality calculations
python src/constructors/networkx_graph_builder.py
python -m src.generators.export_kg_for_rag_direct --stats
# Build vector database with graph-enhanced embeddings
python -m src.generators.rag_system --build
# Create training dataset from knowledge graph
python src/training/dataset_preparation.py
# Analyze data quality if needed
python src/training/run_data_analysis.py
# Verify system can handle training
python src/training/system_check.py
python src/training/production_training.py
# Test the fine-tuned model
python src/training/hf_inference_engine.py
# Terminal 1: Start API Server
python -m uvicorn src.api.main:app --host 0.0.0.0 --port 8000 --reload
# Terminal 2: Start Gradio UI (Optional)
python src/ui/gradio_app.py
# Test standard RAG
python -m src.generators.rag_system --search "Log4j vulnerability"
# Test API endpoints
curl -X POST http://localhost:8000/api/v1/search \
-H "Content-Type: application/json" \
-d '{"query": "SQL injection vulnerabilities", "top_k": 5}'
- Gradio UI: http://localhost:7860
- API Documentation: http://localhost:8000/docs
- API Health Check: http://localhost:8000/api/v1/health
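For scripted use, the curl search call above can be reproduced with the Python standard library. This is a minimal sketch: the helper names `build_search_request` and `search` are hypothetical, and treating the response as JSON is an assumption, since the response schema is not described here.

```python
import json
import urllib.request


def build_search_request(query, top_k=5, base_url="http://localhost:8000"):
    """Build the same POST request as the curl example (nothing is sent yet)."""
    body = json.dumps({"query": query, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/v1/search",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(query, top_k=5, timeout=30):
    """Send the request and decode the body, assuming the server replies with JSON."""
    request = build_search_request(query, top_k)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the API server running, `search("SQL injection vulnerabilities")` issues the same request as the curl command.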
Raw CVE Data (NVD) → Processed CVE Data → CPE Extraction → Knowledge Graph → NetworkX Graph → Vector DB
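The stages above can be chained in a small runner script. The sketch below is an assumption, not part of the project: the mapping from stage to script is inferred from the setup commands, and the collector, export, and training steps are left out.

```python
import subprocess
import sys

# Stage name -> arguments passed to the Python interpreter, in pipeline order.
# The stage-to-script mapping is inferred from the setup commands (assumption).
PIPELINE = [
    ("Raw CVE Data (NVD)", ["scripts/download_all_cves.py"]),
    ("Processed CVE Data", ["src/processors/process_all_cves.py"]),
    ("CPE Extraction", ["src/constructors/run_cpe_extraction.py"]),
    ("Knowledge Graph", ["src/constructors/kg_builder_without_neo4j.py"]),
    ("NetworkX Graph", ["src/constructors/networkx_graph_builder.py"]),
    ("Vector DB", ["-m", "src.generators.rag_system", "--build"]),
]


def build_commands(python=sys.executable):
    """Return (stage, full command) pairs without running anything."""
    return [(stage, [python, *args]) for stage, args in PIPELINE]


def run_pipeline(dry_run=True):
    """Print each stage; when dry_run is False, run it and stop on the first failure."""
    for stage, command in build_commands():
        print(f"[{stage}] {' '.join(command)}")
        if not dry_run:
            subprocess.run(command, check=True)
```

Calling `run_pipeline()` only prints the commands; `run_pipeline(dry_run=False)` executes them from the repository root.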
Nodes: CVEs, Products, Vendors, CWEs, CAPECs, MITRE Techniques, MITRE Tactics
Relationships: CVE→Product, CVE→CWE, CVE→CAPEC, CVE→MITRE, Product→Vendor
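To make the schema concrete, here is a minimal sketch of a typed graph that only accepts the relationship types listed above. It uses plain dictionaries rather than the project's own builders; the class `TinyKG` is hypothetical, reading "CVE→MITRE" as a link to a MITRE technique is an assumption, and the sample nodes are illustrative rather than taken from the dataset.

```python
from collections import defaultdict

# Allowed (source type, target type) pairs, following the relationship list above.
ALLOWED_EDGES = {
    ("CVE", "Product"),
    ("CVE", "CWE"),
    ("CVE", "CAPEC"),
    ("CVE", "MITRE Technique"),  # assumption: "CVE→MITRE" targets techniques
    ("Product", "Vendor"),
}


class TinyKG:
    """Hypothetical in-memory graph that enforces the schema on insert."""

    def __init__(self):
        self.node_type = {}                 # node id -> node type
        self.out_edges = defaultdict(set)   # node id -> set of target ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, dst):
        pair = (self.node_type[src], self.node_type[dst])
        if pair not in ALLOWED_EDGES:
            raise ValueError(f"edge type {pair[0]}→{pair[1]} is not in the schema")
        self.out_edges[src].add(dst)

    def neighbors(self, node_id, node_type):
        """Targets of node_id that have the requested type, sorted by id."""
        targets = self.out_edges.get(node_id, ())
        return sorted(t for t in targets if self.node_type[t] == node_type)


# Illustrative sample; the real dataset's mappings may differ.
kg = TinyKG()
kg.add_node("CVE-2021-44228", "CVE")
kg.add_node("apache:log4j", "Product")
kg.add_node("apache", "Vendor")
kg.add_node("CWE-502", "CWE")
kg.add_edge("CVE-2021-44228", "apache:log4j")
kg.add_edge("CVE-2021-44228", "CWE-502")
kg.add_edge("apache:log4j", "apache")

# Two-hop lookup: CVE -> Product -> Vendor
for product in kg.neighbors("CVE-2021-44228", "Product"):
    print(product, kg.neighbors(product, "Vendor"))
```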
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Query Input   │───▶│   FastAPI API   │───▶│  Hybrid Search  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    Gradio UI    │    │ Graph Features  │
                       └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │    ChromaDB     │    │ NetworkX Graph  │
                       └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                       ┌─────────────────┐    ┌─────────────────┐
                       │  Vector Search  │    │   Similarity    │
                       └─────────────────┘    │     Matrix      │
                                │             └─────────────────┘
                                ▼
                       ┌─────────────────┐
                       │  LLM Response   │
                       └─────────────────┘
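The diagram shows hybrid search drawing on both vector search and graph features, but it does not say how the two signals are merged. A weighted sum is one common choice; the sketch below illustrates that idea only. The function `hybrid_rank`, the weight `alpha`, and the sample scores are all hypothetical and not taken from the project.

```python
def hybrid_rank(vector_scores, graph_scores, alpha=0.7, top_k=5):
    """Rank ids by alpha * vector score + (1 - alpha) * graph score.

    Both inputs map an id to a score in [0, 1]; an id missing from one
    source contributes 0.0 for that source. Ties are broken by id.
    """
    ids = set(vector_scores) | set(graph_scores)
    combined = {
        doc_id: alpha * vector_scores.get(doc_id, 0.0)
        + (1 - alpha) * graph_scores.get(doc_id, 0.0)
        for doc_id in ids
    }
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))[:top_k]


# Hypothetical scores for three placeholder entries.
vector_scores = {"CVE-A": 0.9, "CVE-B": 0.5}
graph_scores = {"CVE-B": 1.0, "CVE-C": 0.4}
for doc_id, score in hybrid_rank(vector_scores, graph_scores, alpha=0.5):
    print(doc_id, round(score, 3))
```

With equal weights, CVE-B ranks first because it scores in both sources, which is the behavior a graph-enhanced search is meant to encourage.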
- Main config: config.py
- RAG config: src/generators/rag_config.py
- NetworkX config: Built into networkx_graph_builder.py
- Example Cypher queries: src/constructors/neo4j_queries.md
- POST /api/v1/query - Full RAG queries with LLM responses
- POST /api/v1/search - Vector search only
- POST /api/v1/summary - Statistical analysis
- GET /api/v1/health - System health check
You can find more details on this work in the following paper: CyGATE: Game-Theoretic Cyber Attack-Defense Engine for Patch Strategy Optimization