## NOTE: Read the documentation file for a complete understanding.
This Full-Text Search Engine is designed to efficiently index and search large sets of documents. It supports powerful features such as inverted indexing, auto-suggestions, and text tokenization with stemming and stop-word removal. By implementing multithreaded document processing, the system is optimized for speed and scalability.
- Inverted Indexing: Enables fast lookup of documents containing specific terms.
- Tokenization: Converts documents into a list of searchable tokens.
- Stop-Word Removal: Filters out common words (like "the", "and") to enhance search relevance.
- Multithreading Support: Speeds up processing by concurrently handling multiple documents.
- C++17 or higher: Make sure your environment supports C++17 for filesystem operations and threading.
- Boost Library: The project uses the Boost C++ library for string manipulation.
Install the necessary dependencies:

    sudo apt-get install libboost-all-dev

Make sure you have .txt files in the Documents/ folder for indexing. The engine will read all text files and generate the inverted index.
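
As a rough illustration of the document-loading step, the sketch below collects the `.txt` files in a folder with C++17 `std::filesystem` and reads each one into a string. The helper names `listTextFiles` and `readFile` are hypothetical and are not taken from the project source; the actual loader may be organized differently.

```cpp
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: collect the paths of all regular .txt files
// located directly inside `dir` (subfolders are not visited).
std::vector<fs::path> listTextFiles(const fs::path& dir) {
    std::vector<fs::path> files;
    if (!fs::is_directory(dir)) {
        return files;  // missing or invalid folder: nothing to index
    }
    for (const auto& entry : fs::directory_iterator(dir)) {
        if (entry.is_regular_file() && entry.path().extension() == ".txt") {
            files.push_back(entry.path());
        }
    }
    // Directory iteration order is unspecified, so sort the paths to keep
    // document numbering stable between runs.
    std::sort(files.begin(), files.end());
    return files;
}

// Hypothetical helper: read the whole file into a single string.
std::string readFile(const fs::path& path) {
    std::ifstream in(path);
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```

Calling `listTextFiles("Documents")` and passing each returned path to `readFile` would yield the raw text that the later processing steps consume.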
The system tokenizes each document, removes stop words, and stems tokens to their root forms. This step is essential to create a more compact and efficient index.
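
The three preprocessing stages can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's implementation: the stop-word list is a small made-up sample, `naiveStem` is a crude suffix stripper standing in for a real stemming algorithm, and the names `kStopWords`, `naiveStem`, and `tokenize` are all hypothetical.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Illustrative stop-word list (an assumption; the project's list may differ).
const std::unordered_set<std::string> kStopWords = {
    "the", "and", "a", "an", "of", "to", "is", "are", "in"};

// Deliberately naive stemmer: strips the first matching suffix, provided at
// least three characters remain. A real stemmer applies many more rules.
std::string naiveStem(const std::string& word) {
    static const std::vector<std::string> suffixes = {"ing", "ed", "es", "s"};
    for (const auto& suffix : suffixes) {
        if (word.size() > suffix.size() + 2 &&
            word.compare(word.size() - suffix.size(), suffix.size(), suffix) == 0) {
            return word.substr(0, word.size() - suffix.size());
        }
    }
    return word;
}

// Splits text on non-alphanumeric characters, lowercases each token,
// drops stop words, and stems whatever is left.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    auto flush = [&]() {
        if (!current.empty()) {
            if (kStopWords.count(current) == 0) {
                tokens.push_back(naiveStem(current));
            }
            current.clear();
        }
    };
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else {
            flush();  // any other character ends the current token
        }
    }
    flush();  // emit the final token, if any
    return tokens;
}
```

With this sketch, `tokenize("The cats and the dogs are running!")` produces the tokens `cat`, `dog`, and `runn`. The last one shows the limits of simple suffix stripping: a proper stemmer would reduce "running" to "run".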

    void preProcessTheData();

The inverted index structure allows for quick retrieval of documents containing specific terms.

    void buildInvertedIndex(const vector<string>& tokens, int docNumber);

Once the documents are indexed, users can search for keywords. The system will retrieve and rank documents based on relevance.
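
To make the index and search steps concrete, here is a minimal sketch of an inverted index with a simple ranked lookup. It assumes relevance is the summed term frequency of the query tokens, which is only one possible scoring rule and may not match the project's ranking. The `InvertedIndex` alias and the `addDocument` and `searchIndex` functions are hypothetical names; `addDocument` mirrors the parameters of `buildInvertedIndex` but takes the index explicitly.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// term -> (document number -> occurrences of the term in that document)
using InvertedIndex = std::unordered_map<std::string, std::map<int, int>>;

// Records every token of one document in the index.
void addDocument(InvertedIndex& index,
                 const std::vector<std::string>& tokens,
                 int docNumber) {
    for (const auto& token : tokens) {
        ++index[token][docNumber];
    }
}

// Returns the numbers of documents containing at least one query token,
// ordered by summed term frequency (highest first). Ties keep ascending
// document order.
std::vector<int> searchIndex(const InvertedIndex& index,
                             const std::vector<std::string>& queryTokens) {
    std::map<int, int> scores;  // document number -> score
    for (const auto& token : queryTokens) {
        auto it = index.find(token);
        if (it == index.end()) {
            continue;  // term does not occur in any document
        }
        for (const auto& [doc, count] : it->second) {
            scores[doc] += count;
        }
    }
    std::vector<std::pair<int, int>> ranked(scores.begin(), scores.end());
    std::stable_sort(ranked.begin(), ranked.end(),
                     [](const auto& a, const auto& b) { return a.second > b.second; });
    std::vector<int> result;
    for (const auto& [doc, score] : ranked) {
        result.push_back(doc);
    }
    return result;
}
```

Because each term maps directly to the documents that contain it, a lookup touches only the postings of the query terms rather than scanning every document. Query text would need to pass through the same tokenization and stemming as the documents so that the terms match.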
- Phrase Searching: Support for searching exact phrases instead of single keywords.
- Synonym Support: Adding synonyms to improve search results.
- Real-Time Updates: Dynamically update the index as documents are added or modified.
This project is licensed under the MIT License - see the LICENSE file for details.
- Sudhanshu Shekhar - Full-text search engine developer
Feel free to contribute, submit pull requests, or report issues!