md-embed (c) 2024 web3dguy

A Python script for processing Markdown files, generating embeddings, and storing them in a vector store. This tool allows you to clean, split, and embed Markdown documents using various methods and embedding models.
Features

Data Cleaning: Removes duplicates and filters out unwanted content like '404' pages and lines containing the '©' symbol.
Flexible Input: Supports input from JSON files containing URLs and Markdown data, folders of Markdown files, or single Markdown files.
Document Splitting: Splits documents using Markdown headers or recursive character splitting.
Embedding Options: Supports embedding using HuggingFace or Ollama embeddings.
Vector Store Integration: Stores embeddings in a Chroma vector store for efficient retrieval and analysis.
Customizable Filters: Option to disable filters that remove specific content.
Logging: Generates logs for duplicates and removed files for better traceability.

Installation
Prerequisites

    Python 3.7 or higher
    pip
    Git (optional, for cloning the repository)

Clone the Repository

git clone https://github.com/GATERAGE/mdmbed.git
cd mdmbed

Install Required Packages

Install the required Python packages using pip:

pip install -r requirements.txt

Note: The requirements.txt file should list all the dependencies, such as tqdm, langchain, langchain-chroma, langchain-huggingface, and chromadb.
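For reference, a plausible requirements.txt matching the dependencies named above (exact version pins are not specified by this project and are omitted here):

```
tqdm
langchain
langchain-chroma
langchain-huggingface
langchain-community
chromadb
```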
Usage

Run the script using Python:

python md-embed.py [--filters-off]

Command-Line Arguments

--filters-off: Disable filters that remove lines containing '©' and skip files containing both '404' and 'page not found'.

Upon running the script, you will be prompted to choose an input method:

1. JSON Input File Containing URLs and Markdown Data
2. Folder of Markdown Files
3. Single Markdown File

JSON Input File

If you choose Option 1, you will be asked to provide:

Path of the JSON input file: The file should be a JSON array of objects, each containing url and markdown keys.
Path of the output folder: The folder where cleaned Markdown files and logs will be saved.

The script will:

Clean the data by removing duplicates.
Save the cleaned Markdown files to the specified output folder.
Generate a file_to_url.json mapping file.
Display a summary of the processing.
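For reference, a minimal input file in the expected shape. The url and markdown keys are the only ones the script is documented to read; the values here are purely illustrative:

```json
[
  {
    "url": "https://example.com/docs/intro",
    "markdown": "# Introduction\n\nSome content..."
  },
  {
    "url": "https://example.com/docs/setup",
    "markdown": "## Setup\n\nMore content..."
  }
]
```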

Folder of Markdown Files

If you choose Option 2, you will be asked to provide:

Path of the folder containing Markdown files.

The script will:

Load all .md files from the specified folder.
Optionally filter out unwanted content.
Proceed to document splitting.

Single Markdown File

If you choose Option 3, you will be asked to provide:

Path of the Markdown file.

The script will:

Load the specified Markdown file.
Optionally filter out unwanted content.
Proceed to document splitting.

Document Splitting

After loading the documents, you will be prompted to split them:

Split Method: Choose between markdown or recursive splitting.
Remove Links: Optionally remove links from the Markdown content.
Language: Specify the programming language or language of the content.
Additional Settings:
    For Markdown Splitting:
        Header Levels: Specify which header levels (#, ##, etc.) to split on.
    For Recursive Splitting:
        Chunk Size: Specify the maximum size of each chunk.
        Chunk Overlap: Specify the number of overlapping characters between chunks.

You will have the option to preview the split data before proceeding.
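The actual splitting is handled by LangChain's splitters, but the chunk-size/overlap mechanics can be illustrated with a stdlib-only sketch. The function name and defaults below are illustrative, not the script's own:

```python
def chunk_text(text: str, chunk_size: int = 20, chunk_overlap: int = 5) -> list[str]:
    """Split text into fixed-size chunks where the last `chunk_overlap`
    characters of one chunk repeat at the start of the next."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
    return chunks
```

This is what the Chunk Size and Chunk Overlap prompts control: the window width and how much adjacent chunks share.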
Embedding and Saving

After splitting, you will be prompted to embed and save the documents:

Embedding Method: Choose between huggingface or ollama.
    HuggingFace: Enter the embedding model name (default: all-MiniLM-L6-v2).
    Ollama: Enter the Ollama model name (default: nomic-embed-text).
Persist Directory: Specify the directory to save the vector store database.
Collection Name: Enter a name for the Chroma collection.

The script will:

Embed the documents using the chosen embedding method.
Save the embeddings to a Chroma vector store.
Display information about the saved collections.

Examples
Example 1: Process JSON Input File

python md-embed.py

Choose Input Method: 1

Enter the path of the JSON input file: ./data/input.json
Enter the path of the output folder: ./output

Proceed through the prompts to clean data, split documents, and embed them.
Example 2: Process Folder of Markdown Files with Filters Off

python md-embed.py --filters-off

Choose Input Method: 2

Enter the path of the folder containing markdown files: ./markdown_files

Proceed through the prompts to load, split, and embed the documents.
Contributing

Contributions are welcome! Please follow these steps:

1. Fork the repository.

2. Create a new branch:

   git checkout -b feature/your-feature-name

3. Make your changes and commit them:

   git commit -m "Add your message"

4. Push to the branch:

   git push origin feature/your-feature-name

5. Open a Pull Request.

Please make sure your code adheres to the existing style and that all tests pass.
License

This project is licensed under the MIT License.
Acknowledgments
web3dguy
LangChain for text splitting and document handling.
HuggingFace for embedding models.
Chroma for the vector store.
TQDM for progress bars.
The open-source community for continuous support and contributions.

Markdown Processor and Embedder

md-embed processes markdown files, cleans and prepares the data, splits the text into manageable chunks, and creates embeddings for use in vector databases (specifically ChromaDB). It supports multiple input methods and provides options for customizing the splitting and embedding process.

Features

  • Multiple Input Methods:
    • JSON file containing URLs and markdown data
    • Folder of markdown files
    • Single markdown file
  • Data Cleaning:
    • Removes duplicate entries based on URL section titles
    • Handles encoding issues
    • Sanitizes filenames for safe saving
    • Optionally filters out files containing "404" and "page not found" (can be disabled)
    • Removes lines containing the copyright symbol "©"
  • Text Splitting:
    • Markdown Header Splitting: Splits text based on specified markdown header levels (e.g., #, ##). Allows for custom header level selection. Preserves header hierarchy in metadata
    • Recursive Character Text Splitting: Splits text into chunks of specified size and overlap
    • Link Removal: Optionally removes markdown links, keeping only the link text
  • **Embedding Generation:*
    • Supports Hugging Face embeddings (using langchain_huggingface). Defaults to all-MiniLM-L6-v2
    • Supports Ollama embeddings (using langchain_community). Defaults to nomic-embed-text, requires a local Ollama server running at http://localhost:11434
  • Vector Database Integration:
    • Uses ChromaDB (langchain_chroma) to store embeddings and associated metadata
    • Allows specifying the collection name and persistence directory
    • Handles large datasets by processing in batches
  • Logging:
    • Comprehensive logging through the logging module
  • Duplicate Logs:
    • Writes URLs with duplicate sections to a log
  • Removed Files Logs:
    • Writes files that were removed by the filters to a log
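The link-removal step described above (keep the link text, drop the target) can be approximated with a single regular expression. This is a sketch of the idea, not necessarily the exact pattern the script uses:

```python
import re

# Replace [text](target) with just "text". The negative lookbehind
# leaves image syntax (![alt](src)) untouched.
LINK_RE = re.compile(r"(?<!!)\[([^\]]*)\]\([^)]*\)")

def strip_links(markdown: str) -> str:
    return LINK_RE.sub(r"\1", markdown)
```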

Requirements

  • Python 3.7+
  • langchain (various components - see import statements)
  • chromadb
  • tqdm
  • beautifulsoup4 and requests (not actually used by this script; only relevant if you add scraping)

To install the required packages, run:

pip install langchain langchain-chroma langchain-huggingface tqdm

If you plan to use Ollama, you also need to:

  • Install Ollama by following the instructions at Ollama's official website
  • Run an Ollama server locally on port 11434

Usage

md-embed can be run from the command line. It provides a command-line interface using argparse with the following option:

--filters-off: Disables the "404" and "©" filters

The script will then guide you through a series of interactive prompts to configure the processing:

Input Method Selection: Choose between JSON input, a folder of markdown files, or a single markdown file

Input File/Folder/URL: Provide the path to the input file or folder, as appropriate

Output Folder (for JSON input): Specify the directory where cleaned markdown files will be saved

Data Cleaning Summary: The script will show the total entries and total duplicates found

Language: Specify the primary language of the input files (e.g., "TypeScript", "Python")

Splitting Method: Choose between "markdown" (header-based splitting) and "recursive" (chunk size and overlap)

Markdown Splitting Options (if applicable):

Remove Links: Choose whether to remove markdown links

Header Levels: Specify which header levels to split on (e.g., "1,2,3" for #, ##, and ###). Enter "all" for all header levels
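A plausible way to turn that prompt answer into the (header, name) pairs a markdown header splitter expects; the parsing details here are illustrative, not taken from the script:

```python
def parse_header_levels(answer: str, max_level: int = 6) -> list[tuple[str, str]]:
    """Convert a prompt answer like "1,2,3" or "all" into
    [("#", "Header 1"), ("##", "Header 2"), ...] pairs."""
    if answer.strip().lower() == "all":
        levels = range(1, max_level + 1)
    else:
        levels = [int(part) for part in answer.split(",")]
    return [("#" * n, f"Header {n}") for n in levels]
```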

Recursive Splitting Options (if applicable):

Remove Links: Choose whether to remove markdown links

Chunk Size: Specify the desired chunk size (in characters)

Chunk Overlap: Specify the desired chunk overlap (in characters)

Preview Splits: Choose whether to preview the split data ("yes", "full", or "no")

Split Again: You'll be prompted to continue or modify the settings

Embedding Method: Choose between "huggingface" and "ollama"

Embedding Model (Hugging Face): Enter the Hugging Face model name (defaults to all-MiniLM-L6-v2)

Embedding Model (Ollama): Enter the Ollama model name (defaults to nomic-embed-text)

Persistence Directory: Specify the directory where the ChromaDB database will be stored

Collection Name: Choose a name for the ChromaDB collection

Example (JSON Input):

python md-embed.py

Follow the prompts, providing the necessary information (input file, output folder, embedding choices, etc.)

Example (Disabling Filters):

python md-embed.py --filters-off

Output

Cleaned Markdown Files (JSON Input): If using JSON input, the script will save cleaned markdown files to the specified output folder

ChromaDB Database: The script will create a ChromaDB database in the specified persistence directory, containing the embeddings and metadata

Logs: The logs directory will contain logs of removed files (if any) and duplicate entries (if using JSON input)

file_to_url.json: A JSON file that maps each saved document back to its original URL

Error Handling

The script includes error handling for various scenarios, such as:

Invalid input file/folder paths

File I/O errors

Exceptions during data cleaning, splitting, or embedding

Invalid user input for prompts

Errors are logged using the logging module

Notes

The script assumes that the input JSON data has "url" and "markdown" keys for each entry

The script uses uuid4 to generate unique IDs for each document in the vector database

The script processes in batches to deal with a large number of splits
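The batching and uuid4 notes above can be sketched as follows; the batch size and helper names are illustrative, not the script's own:

```python
import uuid
from typing import Iterator

def batched(items: list, batch_size: int = 500) -> Iterator[list]:
    # Yield successive slices so a very large number of splits is not
    # handed to the vector store in a single call.
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def with_ids(docs: list) -> list[tuple[str, object]]:
    # Pair each split document with a fresh uuid4 string ID, as the
    # vector store requires a unique ID per document.
    return [(str(uuid.uuid4()), doc) for doc in docs]
```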

Disclaimer: This tool is provided "as is" without warranty of any kind. Use it at your own risk. Open source or go away.
