This project implements a search engine using a compressed trie and an inverted index. The code is written in Python in a Jupyter Notebook.
Link: https://youtu.be/cXfTzs-F8Qw
- Input files: The HTML files are saved locally. Six Wikipedia pages related to bears have been used.
- Reading the files: The files are parsed with BeautifulSoup. The text of each webpage is stored in a dictionary with the webpages as keys.
- Tokenization: The data is tokenized using regex.
- Filtering: The stopwords are filtered using the stopword corpus from the NLTK library.
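The tokenization and filtering steps above can be sketched roughly as follows. The notebook uses NLTK's stopword corpus; a small inline stopword set stands in for it here, and the exact regex is an assumption:

```python
import re

# Stand-in for NLTK's English stopword corpus (illustrative subset only).
STOPWORDS = {"the", "a", "an", "and", "of", "in", "is", "are"}

def tokenize(text):
    # Lowercase the text and extract alphanumeric runs with a regex.
    return re.findall(r"[a-z0-9]+", text.lower())

def filter_stopwords(tokens):
    # Drop tokens that appear in the stopword set.
    return [t for t in tokens if t not in STOPWORDS]

tokens = filter_stopwords(tokenize("The polar bear is a species of bear."))
# tokens == ['polar', 'bear', 'species', 'bear']
```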
- Creation of the inverted index: The inverted index is implemented as a dictionary mapping each word to the webpages it occurs in, along with the number of times the word appears in each page.
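A minimal sketch of building such an index, using a tiny hypothetical corpus keyed by page name (the page names and tokens below are made up for illustration):

```python
from collections import Counter

# Hypothetical tokenized pages; in the project these come from the
# tokenize/filter steps applied to the saved Wikipedia HTML files.
pages = {
    "Polar_bear": ["polar", "bear", "bear", "arctic"],
    "Brown_bear": ["brown", "bear", "forest"],
}

# word -> {page: number of occurrences of the word in that page}
inverted_index = {}
for page, tokens in pages.items():
    for word, count in Counter(tokens).items():
        inverted_index.setdefault(word, {})[page] = count

# inverted_index["bear"] == {"Polar_bear": 2, "Brown_bear": 1}
```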
- Creation of the Compressed Trie:
- Trie nodes: Each node contains a dictionary of child nodes and a boolean indicating whether the node marks the end of a word.
- Compressed trie methods:
i) init: initializes the root node.
ii) longest_common_prefix: finds and returns the longest common prefix of two strings.
iii) print_trie: prints the contents of the trie.
iv) insert: inserts words into the trie.
v) search: searches for a word in the trie and returns True if the word is present, otherwise False.
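The methods above can be sketched as a small radix-tree class. This is an illustrative implementation, not the project's exact code; class and method names are assumptions:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # edge label (string) -> child node
        self.is_end = False  # True if a word ends at this node

class CompressedTrie:
    def __init__(self):
        self.root = TrieNode()

    @staticmethod
    def longest_common_prefix(a, b):
        # Return the longest shared prefix of two strings.
        i = 0
        while i < min(len(a), len(b)) and a[i] == b[i]:
            i += 1
        return a[:i]

    def insert(self, word):
        node = self.root
        while True:
            for label, child in node.children.items():
                prefix = self.longest_common_prefix(word, label)
                if not prefix:
                    continue
                if prefix == label:
                    # Edge fully matched: descend with the remainder.
                    word = word[len(prefix):]
                    node = child
                    if not word:
                        child.is_end = True
                        return
                    break
                # Partial match: split the edge at the common prefix.
                mid = TrieNode()
                node.children[prefix] = mid
                mid.children[label[len(prefix):]] = child
                del node.children[label]
                rest = word[len(prefix):]
                if rest:
                    leaf = TrieNode()
                    leaf.is_end = True
                    mid.children[rest] = leaf
                else:
                    mid.is_end = True
                return
            else:
                # No child shares a prefix: add a fresh edge.
                leaf = TrieNode()
                leaf.is_end = True
                node.children[word] = leaf
                return

    def search(self, word):
        node = self.root
        while word:
            for label, child in node.children.items():
                if word.startswith(label):
                    word = word[len(label):]
                    node = child
                    break
            else:
                return False
        return node.is_end

    def print_trie(self, node=None, indent=0):
        # Print each edge label, marking word ends with '*'.
        node = node or self.root
        for label, child in node.children.items():
            print(" " * indent + label + ("*" if child.is_end else ""))
            self.print_trie(child, indent + 2)
```

For example, inserting "bear" and then "bee" splits the "bear" edge into "be" with children "ar" and "e", which is what makes the trie compressed.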
- Search and ranking:
- Search:
i) Each word in the input string is searched in the trie.
ii) If found in the Trie, the word is searched in the inverted index.
iii) The inverted index provides the occurrence list and number of appearances.
- Ranking:
i) The pages are sorted by the number of appearances and printed out.
ii) If the query contains multiple words, the intersection of the sets of pages is printed out.
iii) If a word is not present in the trie, a message that the word is not in the vocabulary is printed out.
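The query flow above can be sketched as follows. To keep the example self-contained, a plain set stands in for the trie's membership test, and the index is a tiny made-up example; in the project, `trie.search(word)` and the real inverted index are used:

```python
# Hypothetical index for illustration: word -> {page: count}.
inverted_index = {
    "polar": {"Polar_bear": 3},
    "bear":  {"Polar_bear": 2, "Brown_bear": 1},
}
vocabulary = set(inverted_index)  # stand-in for trie.search(word)

def search_query(query):
    words = query.lower().split()
    result_sets = []
    for word in words:
        if word not in vocabulary:  # the project checks the trie here
            print(f"'{word}' is not in the vocabulary")
            return []
        result_sets.append(set(inverted_index[word]))
    # Multi-word queries: keep only pages containing every word.
    pages = set.intersection(*result_sets)
    # Rank pages by total number of appearances across the query words.
    return sorted(pages,
                  key=lambda p: sum(inverted_index[w][p] for w in words),
                  reverse=True)

# search_query("polar bear") -> ['Polar_bear']
```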