This project implements a search engine using a compressed trie and an inverted index. The code is written in Python in a Jupyter Notebook.
Link: https://youtu.be/cXfTzs-F8Qw
- Input files: The HTML files are saved locally. Six Wikipedia pages related to bears have been used.
- Reading the files: The files are parsed with BeautifulSoup. The text of each webpage is stored in a dictionary with the webpages as keys.
- Tokenization: The data is tokenized using regex.
- Filtering: The stopwords are filtered using the stopword corpus from the NLTK library.
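The tokenization and filtering steps above can be sketched roughly as follows. The notebook uses NLTK's stopword corpus; a small inline stopword set stands in for it here, and the exact regex is an assumption:

```python
import re

# Stand-in for NLTK's English stopword corpus (illustrative subset only).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are"}

def tokenize(text):
    # Lowercase the text and extract alphanumeric runs with a regex.
    return re.findall(r"[a-z0-9]+", text.lower())

def filter_stopwords(tokens):
    # Drop tokens that appear in the stopword set.
    return [t for t in tokens if t not in STOPWORDS]

tokens = filter_stopwords(tokenize("The polar bear is a species of bear."))
# tokens == ['polar', 'bear', 'species', 'bear']
```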
- Creation of the inverted index: The inverted index is implemented as a dictionary mapping each word to the webpages it occurs in, along with the number of times the word appears in each page.
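A minimal sketch of building such an index, using a tiny hypothetical corpus keyed by page name (the page names and tokens below are made up for illustration):

```python
from collections import Counter

# Hypothetical tokenized pages; in the project these come from the
# tokenize/filter steps applied to the saved Wikipedia HTML files.
pages = {
    "Polar_bear": ["polar", "bear", "bear", "arctic"],
    "Brown_bear": ["brown", "bear", "forest"],
}

# word -> {page: number of occurrences of the word in that page}
inverted_index = {}
for page, tokens in pages.items():
    for word, count in Counter(tokens).items():
        inverted_index.setdefault(word, {})[page] = count

# inverted_index["bear"] == {"Polar_bear": 2, "Brown_bear": 1}
```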
- Creation of the Compressed Trie:
- Trie nodes: Each node contains a dictionary of child nodes and a boolean indicating whether the node marks the end of a word.
- Compressed trie methods:
i) init: initializes the root node.
ii) longest_common_prefix: finds and returns the longest common prefix of two strings.
iii) print_trie: prints the contents of the trie.
iv) insert: inserts words into the trie.
v) search: searches for a word in the trie and returns True if the word is present, otherwise False.
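The methods above can be sketched as a small radix-tree class. This is an illustrative implementation, not the project's exact code; class and method names are assumptions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (string) -> child node
        self.is_end = False  # True if a word ends at this node

class CompressedTrie:
    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def longest_common_prefix(a, b):
        # Return the longest shared prefix of two strings.
        i = 0
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        return a[:i]

    def insert(self, word):
        node = self.root
        while True:
            for label, child in node.children.items():
                prefix = self.longest_common_prefix(word, label)
                if not prefix:
                    continue
                if prefix == label:
                    # Edge fully matched: descend with the remainder.
                    word = word[len(prefix):]
                    node = child
                    if not word:
                        child.is_end = True
                        return
                    break
                # Partial match: split the edge at the common prefix.
                mid = TrieNode()
                node.children[prefix] = mid
                mid.children[label[len(prefix):]] = child
                del node.children[label]
                rest = word[len(prefix):]
                if rest:
                    leaf = TrieNode()
                    leaf.is_end = True
                    mid.children[rest] = leaf
                else:
                    mid.is_end = True
                return
            else:
                # No child shares a prefix: add a fresh edge.
                leaf = TrieNode()
                leaf.is_end = True
                node.children[word] = leaf
                return

    def search(self, word):
        node = self.root
        while word:
            for label, child in node.children.items():
                if word.startswith(label):
                    word = word[len(label):]
                    node = child
                    break
            else:
                return False
        return node.is_end

    def print_trie(self, node=None, indent=0):
        # Print each edge label, marking word ends with '*'.
        node = node or self.root
        for label, child in node.children.items():
            print(" " * indent + label + ("*" if child.is_end else ""))
            self.print_trie(child, indent + 2)
```

For example, inserting "bear" and then "bee" splits the "bear" edge into "be" with children "ar" and "e", which is what makes the trie compressed.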
- Search and ranking:
- Search:
i) Each word in the input string is searched in the trie.
ii) If found in the Trie, the word is searched in the inverted index.
iii) The inverted index provides the occurrence list and number of appearances.
- Ranking:
i) The pages are sorted by the number of appearances and printed out.
ii) If the query contains multiple words, the intersection of the sets of pages is printed out.
iii) If a word is not present in the trie, a message that the word is not in the vocabulary is printed out.
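The query flow above can be sketched as follows. To keep the example self-contained, a plain set stands in for the trie's membership test, and the index is a tiny made-up example; in the project, `trie.search(word)` and the real inverted index are used:

```python
# Hypothetical index for illustration: word -> {page: count}.
inverted_index = {
    "polar": {"Polar_bear": 3},
    "bear":  {"Polar_bear": 2, "Brown_bear": 1},
}
vocabulary = set(inverted_index)  # stand-in for trie.search(word)

def search_query(query):
    words = query.lower().split()
    result_sets = []
    for word in words:
        if word not in vocabulary:  # the project checks the trie here
            print(f"'{word}' is not in the vocabulary")
            return []
        result_sets.append(set(inverted_index[word]))
    # Multi-word queries: keep only pages containing every word.
    pages = set.intersection(*result_sets)
    # Rank pages by total number of appearances across the query words.
    return sorted(pages,
                  key=lambda p: sum(inverted_index[w][p] for w in words),
                  reverse=True)

# search_query("polar bear") -> ['Polar_bear']
```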