rishik18/Search_Engine

Search Engine

Introduction

This project implements a search engine using a compressed trie and an inverted index. The code is written in Python in a Jupyter Notebook.

Video Demo

Link: https://youtu.be/cXfTzs-F8Qw

Implementation

  1. Input files: The HTML files are saved locally. Six Wikipedia pages related to bears have been used.
  2. Reading the files: The files are parsed with BeautifulSoup. The text of each web page is stored in a dictionary keyed by web page.
  3. Tokenization: The data is tokenized using regex.
  4. Filtering: The stopwords are filtered using the stopword corpus from the NLTK library.
  5. Creation of the inverted index: The inverted index is a dictionary whose keys are words and whose values record the web pages each word occurs in, along with the number of times the word appears on each page.
  6. Creation of the Compressed Trie:
  • Trie nodes: Each node contains a dictionary of child nodes and a boolean indicating whether the node marks the end of a word.
  • Compressed trie methods:
    i) Init: initializes the root node.
    ii) Longest_common_prefix: finds and returns the longest common prefix given two strings.
    iii) Print_trie: prints out the contents of the trie.
    iv) Insert: inserts a word into the trie.
    v) Search: searches for a word in the trie and returns True if the word is present, False otherwise.
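
The trie described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the names `TrieNode`, `CompressedTrie`, `longest_common_prefix`, `insert` and `search` mirror the methods listed above, but their bodies are assumptions. Each edge is labelled with a substring, and an edge is split when a new word shares only part of its label.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (substring) -> TrieNode
        self.is_end = False  # True if a word ends at this node


def longest_common_prefix(a, b):
    """Return the longest prefix shared by strings a and b."""
    i = 0
    while i < len(a) and i < len(b) and a[i] == b[i]:
        i += 1
    return a[:i]


class CompressedTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        while word:
            for label in list(node.children):
                prefix = longest_common_prefix(label, word)
                if not prefix:
                    continue
                child = node.children[label]
                if prefix != label:
                    # Only part of the edge matches: split it at the prefix.
                    mid = TrieNode()
                    mid.children[label[len(prefix):]] = child
                    del node.children[label]
                    node.children[prefix] = mid
                    child = mid
                node = child
                word = word[len(prefix):]
                break
            else:
                # No edge shares a prefix: hang the rest of the word here.
                leaf = TrieNode()
                leaf.is_end = True
                node.children[word] = leaf
                return
        node.is_end = True

    def search(self, word):
        node = self.root
        while word:
            for label, child in node.children.items():
                if word.startswith(label):
                    node = child
                    word = word[len(label):]
                    break
            else:
                return False
        return node.is_end


trie = CompressedTrie()
for w in ["bear", "bears", "beach"]:
    trie.insert(w)
print(trie.search("bears"), trie.search("bea"))
```

Inserting "beach" after "bear" splits the edge "bear" into "bea" followed by "r", so the shared prefix is stored only once.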
  7. Search and ranking:
  • Search:
    i) Each word in the input string is searched in the trie.
    ii) If found in the Trie, the word is searched in the inverted index.
    iii) The inverted index provides the occurrence list and number of appearances.
  • Ranking:
    i) The pages are sorted by the number of appearances and printed out.
    ii) If the query contains multiple words, the intersection of their sets of pages is printed out.
    iii) If a word is not present in the trie, a message that the word is not in the vocabulary is printed out.
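
The indexing, search and ranking steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the notebook's code: `STOPWORDS` is a tiny stand-in for the NLTK stopword corpus, the regex in `tokenize` is a guess at the tokenization rule, the page names are made up, and vocabulary membership is checked against the index keys instead of the trie. Ranking multi-word results by the summed counts is also an assumption, since the steps above only state that the intersection is printed.

```python
import re
from collections import Counter, defaultdict

# Stand-in stopword list (assumption); the project uses NLTK's stopword corpus.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are", "to", "that"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to {page: number of occurrences on that page}."""
    index = defaultdict(dict)
    for page, text in pages.items():
        counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
        for word, n in counts.items():
            index[word][page] = n
    return dict(index)


def search(query, index):
    """Return (ranked pages with counts, words missing from the vocabulary)."""
    words = [w for w in tokenize(query) if w not in STOPWORDS]
    missing = [w for w in words if w not in index]
    found = [w for w in words if w in index]
    if not found:
        return [], missing
    # Keep only the pages that contain every query word that was found.
    pages = set(index[found[0]])
    for w in found[1:]:
        pages &= set(index[w])
    # Rank by total occurrences, highest first (ties broken by page name).
    totals = {p: sum(index[w][p] for w in found) for p in pages}
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked, missing


# Hypothetical page contents for illustration only.
pages = {
    "polar.html": "The polar bear hunts seals. The polar bear swims.",
    "panda.html": "The giant panda is a bear that eats bamboo.",
}
index = build_inverted_index(pages)
print(search("bear", index))
print(search("polar bear", index))
print(search("grizzly", index))
```

Storing the per-page counts inside the index means ranking needs no second pass over the page text.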

Sample outputs

Sample 1

image

Sample 2

image

Sample 3

image

Sample 4

image
