CyGen is a powerful Retrieval-Augmented Generation (RAG) system built with FastAPI, MongoDB, Qdrant, and Groq LLM, featuring a Streamlit frontend for seamless interaction. This system allows you to upload PDF documents, process them intelligently, and have natural language conversations about their content.
Key features:

- **📄 Advanced PDF Document Ingestion**
  - Multi-threaded PDF processing
  - Intelligent text chunking with configurable parameters
  - Background task queue for non-blocking operations
  - Progress tracking for document processing
- **🔍 Smart Vector Search**
  - Semantic similarity search using embeddings
  - Context-aware document retrieval
  - Configurable relevance thresholds
  - Metadata-enhanced document chunks
- **💬 Interactive Chat Interface**
  - Real-time chat with HTTP POST endpoint
  - Context window management
  - Conversation history with MongoDB
  - Automatic conversation title generation
- **🧠 Groq LLM Integration**
  - Fast inference with an 8k context window
  - Optimized prompting strategy
  - Balanced context retrieval
  - Temperature control for response diversity
- **🖥️ User-friendly Web UI**
  - Document upload with progress indicators
  - Conversation management
  - Responsive design
  - Real-time chat updates
The architecture is summarized in the following diagram:

```mermaid
flowchart TD
    subgraph Client
        UI[Streamlit Frontend]
    end

    subgraph Backend
        API[FastAPI Backend]
        TaskQueue[Background Task Queue]
        VectorDB[(Qdrant Vector DB)]
        MongoDB[(MongoDB)]
        LLM[Groq LLM API]
    end

    subgraph Processing
        PDF[PDF Processor]
        Chunker[Text Chunker]
        Embedder[Embedding Model]
    end

    %% Client to Backend interactions
    UI -->|1. Upload PDF| API
    UI -->|5. Send Query| API
    API -->|8. Stream Response| UI

    %% Document Processing Flow
    API -->|2. Process Document| TaskQueue
    TaskQueue -->|3. Extract & Chunk| PDF
    PDF -->|3.1. Split Text| Chunker
    Chunker -->|3.2. Generate Embeddings| Embedder
    Embedder -->|3.3. Store Vectors| VectorDB
    Embedder -->|3.4. Store Metadata| MongoDB

    %% Query Processing Flow
    API -->|6. Retrieve Context| VectorDB
    API -->|6.1. Get History| MongoDB
    API -->|7. Generate Response| LLM
    VectorDB -->|6.2. Relevant Chunks| API
    MongoDB -->|6.3. Conversation History| API

    %% Styles
    classDef primary fill:#4527A0,stroke:#4527A0,color:white,stroke-width:2px
    classDef secondary fill:#7E57C2,stroke:#7E57C2,color:white
    classDef database fill:#1A237E,stroke:#1A237E,color:white
    classDef processor fill:#FF7043,stroke:#FF7043,color:white
    classDef client fill:#00ACC1,stroke:#00ACC1,color:white

    class API,TaskQueue primary
    class PDF,Chunker,Embedder processor
    class VectorDB,MongoDB database
    class LLM secondary
    class UI client
```
The system comprises several key components that work together:
- **FastAPI Backend**
  - RESTful API endpoints and background task processing
  - Asynchronous request handling for high concurrency
  - Dependency injection for clean service management (see the sketch after this list)
  - Error handling and logging
- **MongoDB**
  - Conversation history storage
  - Document metadata and status tracking
  - Asynchronous operations with the Motor client
  - Indexed collections for fast retrieval
- **Qdrant Vector Database**
  - High-performance vector storage and retrieval
  - Scalable embedding storage
  - Similarity search with metadata filtering
  - Optimized for semantic retrieval
- **Groq LLM Integration**
  - Ultra-fast inference for responsive conversation
  - 8k token context window
  - Adaptive system prompts based on query context
  - Clean API integration with error handling
- **Streamlit Frontend**
  - Intuitive user interface for document uploads
  - Conversation management and history
  - Real-time chat interaction
  - Mobile-responsive design
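As an illustration of the dependency-injection pattern, here is a minimal sketch of wiring the Motor and Qdrant clients into a FastAPI route. The client setup, collection name, and document schema are assumptions for illustration, not the project's actual code:

```python
# A minimal sketch (assumed names and schema) of wiring the async clients
# into FastAPI via dependency injection.
from fastapi import Depends, FastAPI
from motor.motor_asyncio import AsyncIOMotorClient
from qdrant_client import QdrantClient

app = FastAPI()
mongo = AsyncIOMotorClient("mongodb://localhost:27017")
qdrant = QdrantClient(url="http://localhost:6333")

def get_db():
    # Database name would come from MONGODB_DB_NAME in practice
    return mongo["rag_system"]

@app.get("/api/v1/chat/conversations")
async def list_conversations(db=Depends(get_db)):
    # Query the (assumed) indexed conversations collection, newest first
    cursor = db["conversations"].find({}, {"_id": 0}).sort("updated_at", -1)
    return await cursor.to_list(length=100)
```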
Our PDF processing pipeline is designed for efficiency and accuracy (a simplified sketch follows the list):

1. **Text Extraction**: Extract raw text from PDF documents using PyPDF2
2. **Text Cleaning**: Remove artifacts and normalize text
3. **Chunking Strategy**: Apply recursive chunking with smart boundary detection
4. **Metadata Enrichment**: Add page numbers, file paths, and other metadata
5. **Vector Embedding**: Generate embeddings for each chunk
6. **Storage**: Store vectors in Qdrant and metadata in MongoDB
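The sketch below illustrates steps 1 and 3 under simplified assumptions: the function names and file path are hypothetical, boundary detection is omitted, and the chunk parameters mirror the `CHUNK_SIZE`/`CHUNK_OVERLAP` defaults:

```python
# Simplified extraction + sliding-window chunking; names are illustrative.
from PyPDF2 import PdfReader

def extract_pages(path: str) -> list[tuple[int, str]]:
    """Extract raw text per page, keeping the page number for metadata."""
    reader = PdfReader(path)
    return [(i + 1, page.extract_text() or "") for i, page in enumerate(reader.pages)]

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Overlapping fixed-size windows; smart boundary detection omitted."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start : start + chunk_size])
        start += chunk_size - overlap
    return chunks

for page_num, text in extract_pages("uploads/example.pdf"):
    for chunk in chunk_text(text):
        record = {"page": page_num, "source": "example.pdf", "text": chunk}
        # ...embed record["text"] and store in Qdrant/MongoDB (steps 5-6)
```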
The RAG system follows a structured approach to content retrieval (a hedged end-to-end sketch follows the list):

1. **Query Analysis**: Analyze the user query for intent and keywords
2. **Context Retrieval**: Retrieve relevant document chunks from the vector store
3. **Threshold Filtering**: Filter results against the similarity score threshold
4. **Context Assembly**: Combine retrieved chunks with conversation history
5. **Prompt Construction**: Build a prompt from system instructions and context
6. **LLM Generation**: Generate the response using the Groq LLM
7. **Response Delivery**: Deliver the response to the user in real time
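Under stated assumptions, the flow might look like the following. The embedding model, collection name, payload schema, and Groq model name are illustrative guesses; only the numeric defaults (`TOP_K`, `RAG_THRESHOLD`, `TEMPERATURE`, `N_LAST_MESSAGE`) come from the configuration table below:

```python
# Hedged sketch of steps 2-6; model and collection names are assumptions.
from groq import Groq
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
qdrant = QdrantClient(url="http://localhost:6333")
llm = Groq(api_key="your_groq_api_key")

def answer(query: str, history: list[dict]) -> str:
    # Steps 2-3: retrieve chunks scoring above the similarity threshold
    hits = qdrant.search(
        collection_name="documents",                  # assumed collection name
        query_vector=embedder.encode(query).tolist(),
        limit=5,                                      # TOP_K
        score_threshold=0.75,                         # RAG_THRESHOLD
    )
    context = "\n\n".join(hit.payload["text"] for hit in hits)  # assumed payload key
    # Steps 4-5: assemble prompt from system instructions, context, and history
    messages = [{"role": "system", "content": f"Answer using this context:\n{context}"}]
    messages += history[-5:]                          # N_LAST_MESSAGE
    messages.append({"role": "user", "content": query})
    # Step 6: generate with Groq (assumed 8k-context model)
    resp = llm.chat.completions.create(
        model="llama3-8b-8192", messages=messages, temperature=0.7
    )
    return resp.choices[0].message.content
```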
You will need the following prerequisites:

- Docker and Docker Compose
- Python 3.11+
- uv package manager (recommended for local development)
- Groq API key
- MongoDB instance (local or Atlas)
- Qdrant instance (local or cloud)
1. Clone the repository:

   ```bash
   git clone https://github.com/NnA301023/cygen.git
   cd cygen
   ```

2. Copy the example environment file:

   ```bash
   cp .env.example .env
   ```

3. Update the following variables in `.env`:

   ```env
   GROQ_API_KEY=your_groq_api_key
   MONGODB_URL=mongodb://username:password@host:port/db_name
   QDRANT_URL=http://qdrant_host:port
   MAX_WORKERS=4
   CHUNK_SIZE=512
   CHUNK_OVERLAP=50
   TOP_K=5
   RAG_THRESHOLD=0.75
   TEMPERATURE=0.7
   N_LAST_MESSAGE=5
   ```
Make the launcher script executable and run it:

```bash
chmod +x start.sh
./start.sh
```
The launcher offers the following options:
- Start both the FastAPI backend and Streamlit frontend with Docker Compose
- Start only the FastAPI backend
- Start only the Streamlit frontend (with Docker or locally)
Alternatively, use Docker Compose directly. Start all services:

```bash
docker-compose up --build
```

Start only specific services:

```bash
docker-compose up --build app        # Backend only
docker-compose up --build streamlit  # Frontend only
```
For local development without Docker:

1. Create and activate a virtual environment:

   ```bash
   uv venv
   source .venv/bin/activate  # Linux/macOS
   .venv\Scripts\activate     # Windows
   ```

2. Install dependencies:

   ```bash
   uv pip install -e .
   ```

3. Start the FastAPI backend:

   ```bash
   uvicorn src.main:app --reload --port 8000
   ```

4. Start the Streamlit frontend (in a separate terminal):

   ```bash
   cd streamlit
   ./run.sh  # or `streamlit run app.py`
   ```
Once running, the services are available at:

- Streamlit Frontend: http://localhost:8501
- FastAPI Swagger Docs: http://localhost:8000/docs
- API Base URL: http://localhost:8000/api/v1
To upload a document:

1. Navigate to the Streamlit web interface
2. Click the "Upload Documents" section in the sidebar
3. Select a PDF file (limit: 200MB per file)
4. Click "Process Document"
5. Wait for processing to complete (progress will be displayed)
- Click "New Conversation" in the sidebar
- A new conversation will be created with a temporary title
- The title will be automatically updated based on your first message
To chat with your documents:

- Type your question in the chat input
- The system will:
  - Retrieve relevant context from your documents
  - Consider your conversation history
  - Generate a comprehensive answer
- Continue the conversation with follow-up questions
To manage your conversations:

- All your conversations are saved and accessible from the sidebar
- Select any conversation to continue where you left off
- Conversation history is preserved between sessions
The system exposes the following key API endpoints:
- `POST /api/v1/documents/upload`: Upload a PDF document
- `GET /api/v1/documents/task/{task_id}`: Check document processing status
- `PUT /api/v1/chat/conversation`: Create a new conversation
- `GET /api/v1/chat/conversations`: List all conversations
- `GET /api/v1/chat/conversations/{conversation_id}`: Get a specific conversation
- `DELETE /api/v1/chat/conversations/{conversation_id}`: Delete a conversation
- `POST /api/v1/chat/{conversation_id}`: Send a message in a conversation
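For reference, here is a hedged example of exercising these endpoints with Python's `requests` library. The request payloads and response field names are assumptions, not the documented schema:

```python
# Hedged example of calling the API; payload/response field names are assumed.
import requests

BASE = "http://localhost:8000/api/v1"

# Upload a PDF and check its processing status
with open("report.pdf", "rb") as f:
    task = requests.post(f"{BASE}/documents/upload", files={"file": f}).json()
status = requests.get(f"{BASE}/documents/task/{task['task_id']}").json()  # assumed field
print(status)

# Create a conversation and send a message in it
conv = requests.put(f"{BASE}/chat/conversation").json()
conv_id = conv["conversation_id"]  # assumed response field
reply = requests.post(
    f"{BASE}/chat/{conv_id}",
    json={"message": "What does the uploaded report conclude?"},  # assumed payload
)
print(reply.json())
```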
The project is organized as follows:

```
.
├── docker/                  # Docker configuration files
│   ├── app/                 # Backend Docker setup
│   └── streamlit/           # Frontend Docker setup
├── logs/                    # Application logs
├── src/                     # Backend source code
│   ├── router/              # API route definitions
│   │   ├── chat.py          # Chat endpoints
│   │   └── documents.py     # Document endpoints
│   ├── utils/               # Utility modules
│   │   ├── llm.py           # LLM integration
│   │   ├── pdf_processor.py # PDF processing
│   │   ├── text_chunking.py # Text chunking
│   │   └── vector_store.py  # Vector database interface
│   ├── main.py              # FastAPI application entry
│   └── settings.py          # Application settings
├── streamlit/               # Streamlit frontend
│   ├── app.py               # Main Streamlit application
│   └── utils.py             # Frontend utilities
├── tests/                   # Test suite
│   ├── unit/                # Unit tests
│   └── integration/         # Integration tests
├── uploads/                 # Uploaded documents storage
├── .env.example             # Example environment variables
├── docker-compose.yml       # Docker Compose configuration
├── Dockerfile               # Backend Dockerfile
├── pyproject.toml           # Python project configuration
├── start.sh                 # Interactive launcher script
└── README.md                # Project documentation
```
The system can be configured through environment variables:
| Variable | Description | Default |
|---|---|---|
| `GROQ_API_KEY` | Groq API key for LLM integration | - |
| `MONGODB_URL` | MongoDB connection string | `mongodb://localhost:27017` |
| `MONGODB_DB_NAME` | MongoDB database name | `rag_system` |
| `QDRANT_URL` | Qdrant server URL | `http://localhost:6333` |
| `MAX_WORKERS` | Maximum worker threads for PDF processing | `4` |
| `CHUNK_SIZE` | Target chunk size for document splitting | `512` |
| `CHUNK_OVERLAP` | Overlap between consecutive chunks | `50` |
| `TOP_K` | Number of chunks to retrieve per query | `5` |
| `RAG_THRESHOLD` | Similarity threshold for relevance | `0.75` |
| `TEMPERATURE` | LLM temperature setting | `0.7` |
| `N_LAST_MESSAGE` | Number of previous messages to include | `5` |
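For orientation, here is a minimal sketch of how `src/settings.py` might load these variables with pydantic-settings; the project's actual implementation may differ:

```python
# A minimal sketch, assuming pydantic-settings; src/settings.py may differ.
from pydantic_settings import BaseSettings, SettingsConfigDict

class Settings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env")

    groq_api_key: str
    mongodb_url: str = "mongodb://localhost:27017"
    mongodb_db_name: str = "rag_system"
    qdrant_url: str = "http://localhost:6333"
    max_workers: int = 4
    chunk_size: int = 512
    chunk_overlap: int = 50
    top_k: int = 5
    rag_threshold: float = 0.75
    temperature: float = 0.7
    n_last_message: int = 5

settings = Settings()  # defaults above are overridden by the environment / .env
```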
Contributions are welcome! Here's how you can help:
1. Fork the repository
2. Create a feature branch: `git checkout -b feature/amazing-feature`
3. Commit your changes: `git commit -m 'Add amazing feature'`
4. Push to the branch: `git push origin feature/amazing-feature`
5. Open a pull request
Please ensure your code follows our style guidelines and includes appropriate tests.
This project is licensed under the MIT License - see the LICENSE file for details.
Project Link: https://github.com/NnA301023/cygen
- Magazine: ITSec Buzz
- Engineering Space: ITSec Asia Tech
Built with ❤️ by RnD Team