md-embed (c) 2024 web3dguy
A Python script for processing Markdown files, generating embeddings, and storing them in a vector store. This tool allows you to clean, split, and embed Markdown documents using various methods and embedding models.
Features
Installation
Prerequisites
Python 3.7 or higher
pip
Git (optional, for cloning the repository)
Clone the Repository
git clone https://github.com/GATERAGE/mdmbed.git
cd mdmbed
Install Required Packages
Install the required Python packages using pip:
Note: The requirements.txt file should list all the dependencies, such as tqdm, langchain, chromadb, huggingface, etc.
Usage
Run the script using Python:
Command-Line Arguments
Upon running the script, you will be prompted to choose an input method:
JSON Input File
If you choose Option 1, you will be asked to provide:
The script will:
Folder of Markdown Files
If you choose Option 2, you will be asked to provide:
The script will:
Single Markdown File
If you choose Option 3, you will be asked to provide:
The script will:
Document Splitting
After loading the documents, you will be prompted to split them:
You will have the option to preview the split data before proceeding.
Embedding and Saving
After splitting, you will be prompted to embed and save the documents:
The script will:
Examples
Example 1: Process JSON Input File
Choose Input Method: 1
Proceed through the prompts to clean data, split documents, and embed them.
Example 2: Process Folder of Markdown Files with Filters Off
Choose Input Method: 2
Proceed through the prompts to load, split, and embed the documents.
Contributing
Contributions are welcome! Please follow these steps:
Make your changes and commit them:
git commit -m "Add your message"
Push to the branch:
Please make sure your code adheres to the existing style and that all tests pass.
License
This project is licensed under the MIT License.
Acknowledgments
web3dguy
LangChain for text splitting and document handling.
HuggingFace for embedding models.
Chroma for the vector store.
TQDM for progress bars.
The open-source community for continuous support and contributions.
Markdown Processor and Embedder
md-embed processes markdown files, cleans and prepares the data, splits the text into manageable chunks, and creates embeddings for use in vector databases (specifically ChromaDB). It supports multiple input methods and provides options for customizing the splitting and embedding process.
Features
Markdown header-based splitting (#, ##, etc.). Allows for custom header level selection. Preserves header hierarchy in metadata
Hugging Face embeddings (langchain_huggingface). Defaults to all-MiniLM-L6-v2
Ollama embeddings (langchain_community). Defaults to nomic-embed-text, requires a local Ollama server running at http://localhost:11434
ChromaDB vector store (langchain_chroma) to store embeddings and associated metadata
Logging via the logging module
Requirements
langchain (various components - see import statements)
chromadb
tqdm
beautifulsoup4 (if you were scraping, but this script doesn't actually use it)
requests (if you were scraping, but this script doesn't actually use it)
To install the required packages, run:
md-embed can be run from the command line. It provides a command-line interface using argparse with the following option:
--filters-off: Disables the "404" and "©" filters
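As a rough illustration of how a single boolean flag like this is usually wired up with argparse, here is a minimal sketch. The function names `build_parser` and `parse_args` and the description text are assumptions made for this example, not taken from the script itself.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one boolean flag (hypothetical helper, not the script's own)."""
    parser = argparse.ArgumentParser(
        description="Process Markdown files and embed them."
    )
    # store_true makes the flag default to False and flip to True when it is passed.
    parser.add_argument(
        "--filters-off",
        action="store_true",
        help='Disable the "404" and "©" filters',
    )
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv)
```

argparse converts the dash in `--filters-off` to an underscore, so the value is read as `args.filters_off`.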
The script will then guide you through a series of interactive prompts to configure the processing:
Input Method Selection: Choose between JSON input, a folder of markdown files, or a single markdown file
Input File/Folder/URL: Provide the path to the input file or folder, as appropriate
Output Folder (for JSON input): Specify the directory where cleaned markdown files will be saved
Data Cleaning Options: The script will show the total number of entries and the total number of duplicates
Language: Specify the primary language of the input files (e.g., "TypeScript", "Python")
Splitting Method: Choose between "markdown" (header-based splitting) and "recursive" (chunk size and overlap)
Markdown Splitting Options (if applicable):
Remove Links: Choose whether to remove markdown links
Header Levels: Specify which header levels to split on (e.g., "1,2,3" for #, ##, and ###). Enter "all" for all header levels
Recursive Splitting Options (if applicable):
Remove Links: Choose whether to remove markdown links
Chunk Size: Specify the desired chunk size (in characters)
Chunk Overlap: Specify the desired chunk overlap (in characters)
Preview Splits: Choose whether to preview the split data ("yes", "full", or "no")
Split Again: You'll be prompted to continue or modify the settings
Embedding Method: Choose between "huggingface" and "ollama"
Embedding Model (Hugging Face): Enter the Hugging Face model name (defaults to all-MiniLM-L6-v2)
Embedding Model (Ollama): Enter the Ollama model name (defaults to nomic-embed-text)
Persistence Directory: Specify the directory where the ChromaDB database will be stored
Collection Name: Choose a name for the ChromaDB collection
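To make the header-level and chunk size/overlap prompts above more concrete, here is a small standard-library sketch. `parse_header_levels` turns an answer such as "1,2,3" or "all" into (prefix, name) pairs of the kind a header-based splitter like LangChain's `MarkdownHeaderTextSplitter` accepts as `headers_to_split_on`. `chunk_text` is a simplified fixed-window stand-in that only shows how size and overlap interact; it is not LangChain's recursive splitter. Both function names, and the assumption that the script parses its prompts this way, are hypothetical.

```python
from typing import List, Tuple


def parse_header_levels(answer: str, max_level: int = 6) -> List[Tuple[str, str]]:
    """Turn '1,2,3' or 'all' into [('#', 'Header 1'), ...] (hypothetical helper)."""
    answer = answer.strip().lower()
    if answer == "all":
        levels = list(range(1, max_level + 1))
    else:
        # int() raises ValueError on non-numeric input, which callers can catch.
        levels = sorted({int(part) for part in answer.split(",") if part.strip()})
    if not levels:
        raise ValueError("no header levels given")
    for level in levels:
        if not 1 <= level <= max_level:
            raise ValueError(f"header level out of range: {level}")
    return [("#" * level, f"Header {level}") for level in levels]


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into fixed windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Requiring the overlap to be smaller than the chunk size keeps the window moving forward, so the loop always terminates.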
Example (JSON Input):
Follow the prompts, providing the necessary information (input file, output folder, embedding choices, etc.)
Example (Disabling Filters):
Cleaned Markdown Files (JSON Input): If using JSON input, the script will save cleaned markdown files to the specified output folder
ChromaDB Database: The script will create a ChromaDB database in the specified persistence directory, containing the embeddings and metadata
Logs: The logs directory will contain logs of removed files (if any) and duplicate entries (if using JSON input)
file_to_url.json: JSON file that contains the original URL of each document
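The JSON path can be pictured with the sketch below, which cleans a list of entries, writes one markdown file per kept entry, and records a file-to-URL mapping. Several details are assumptions for illustration only: that the "404" and "©" filters drop entries whose markdown contains those strings, that duplicates are detected by identical markdown text, the `doc_0000.md` naming scheme, and that `file_to_url.json` is written into the output folder. The function names are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List, Tuple


def clean_entries(entries: List[dict], filters_on: bool = True) -> Tuple[List[dict], int]:
    """Drop filtered and duplicate entries; return (kept entries, duplicate count)."""
    seen = set()
    kept: List[dict] = []
    duplicates = 0
    for entry in entries:
        url, markdown = entry["url"], entry["markdown"]
        # Assumed filter semantics: skip pages that contain "404" or "©".
        if filters_on and ("404" in markdown or "©" in markdown):
            continue
        # Assumed duplicate rule: identical markdown text.
        if markdown in seen:
            duplicates += 1
            continue
        seen.add(markdown)
        kept.append({"url": url, "markdown": markdown})
    return kept, duplicates


def save_cleaned(entries: List[dict], output_dir: str) -> Dict[str, str]:
    """Write one .md file per entry plus a file_to_url.json mapping."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    file_to_url: Dict[str, str] = {}
    for index, entry in enumerate(entries):
        name = f"doc_{index:04d}.md"  # hypothetical naming scheme
        (out / name).write_text(entry["markdown"], encoding="utf-8")
        file_to_url[name] = entry["url"]
    (out / "file_to_url.json").write_text(
        json.dumps(file_to_url, indent=2), encoding="utf-8"
    )
    return file_to_url
```

Passing `filters_on=False` corresponds to running with `--filters-off`: only the duplicate check remains.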
Error Handling
The script includes error handling for various scenarios, such as:
Invalid input file/folder paths
File I/O errors
Exceptions during data cleaning, splitting, or embedding
Invalid user input for prompts
Errors are logged using the logging module
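One way to handle the invalid-path case is to validate the answer to a prompt, log the problem, and return `None` so the caller can ask again. This is only a sketch: the logger name `md_embed_sketch` and the helper `resolve_input_path` are made up for this example and may not match how the script structures its checks.

```python
import logging
from pathlib import Path
from typing import Optional

logger = logging.getLogger("md_embed_sketch")  # hypothetical logger name


def resolve_input_path(raw: str, expect_dir: bool) -> Optional[Path]:
    """Return a validated Path, or None (after logging an error) if the input is unusable."""
    cleaned = raw.strip()
    if not cleaned:
        logger.error("No path was entered")
        return None
    path = Path(cleaned).expanduser()
    if expect_dir and not path.is_dir():
        logger.error("Input folder not found: %s", path)
        return None
    if not expect_dir and not path.is_file():
        logger.error("Input file not found: %s", path)
        return None
    return path
```

Returning `None` instead of raising keeps an interactive prompt loop simple: the caller can re-prompt until it gets a usable path.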
Notes
The script assumes that the input JSON data has "url" and "markdown" keys for each entry
The script uses uuid4 to generate unique IDs for each document in the vector database
The script processes in batches to deal with a large number of splits
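The last two notes can be combined in one small sketch: walk the splits in fixed-size batches and give every split a fresh uuid4-based ID. The generator name `batched_with_ids` and the use of plain strings for the splits are assumptions for illustration; the script's actual batch size and data types are not stated here.

```python
from typing import Iterator, List, Sequence, Tuple
from uuid import uuid4


def batched_with_ids(
    splits: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (ids, batch) pairs, with one random UUID string per split."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(splits), batch_size):
        batch = list(splits[start:start + batch_size])
        ids = [str(uuid4()) for _ in batch]  # uuid4 gives random, collision-resistant IDs
        yield ids, batch
```

Each `(ids, batch)` pair could then be handed to the vector store one batch at a time, for example through a call shaped like LangChain's `add_texts(texts=batch, ids=ids)`, so a large set of splits is never sent in a single request.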
Disclaimer: This tool is provided "as is" without warranty of any kind. Use it at your own risk. Open source or go away.