Steps to follow:
- Install Python 2.7.14
- import requests
- from bs4 import BeautifulSoup
- import re
- import time
- import os
- from string import maketrans
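For reference, these dependencies appear at the top of the scripts roughly as follows (a minimal sketch for Python 2.7; the exact imports per file may differ):

    import os
    import re
    import time

    import requests                  # third-party: pip install requests
    from bs4 import BeautifulSoup    # third-party: pip install beautifulsoup4
    from string import maketrans     # Python 2 only (str.maketrans in Python 3)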
-> Make sure the Python file is present in the working directory. To run the program, type: python pagerank.py
MAKE SURE THE SOURCE AND DESTINATION PATHS ARE CHANGED TO REFLECT THE HOST SOURCE FILE LOCATION AND THE HOST DESTINATION FILE LOCATION (see the example below).
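For example, the paths could be set near the top of the scripts like this (the variable names below are hypothetical placeholders, not the actual names used in the code):

    # Hypothetical path configuration -- edit to match the host machine.
    SOURCE_PATH = "/path/to/html/corpus"          # folder containing the raw .html files
    DESTINATION_PATH = "/path/to/cleaned/corpus"  # folder where the cleaned .txt files go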
Source Code : Tokenise_doc.py and Inverted_index.py
Tokenise_doc.py --> contains the code to clean the documents, i.e. remove unwanted sections from the corpus (images and other sections that do not hold relevant information).
Inverted_index.py --> contains the code to generate the unigram, bigram and trigram inverted indexes, which are later used to create Unigram-TF, Unigram-DF, Bigram-TF, Bigram-DF, Trigram-TF and Trigram-DF.
NOTE : There is also code to store the filename and the number of tokens (unigram, bigram, trigram) each file has. More details are provided below.
TEXT FILE INFORMATION
Task 3 : Unigram-TF.txt, Unigram-DF.txt, Bigram-TF.txt, Bigram-DF.txt, Trigram-TF.txt, Trigram-DF.txt
Task 2 : Unigram_inverted_index.txt, Bigram_inverted_index.txt, Trigram_inverted_index.txt
(Optional files: Unigram-File-Token-Count.txt, Bigram-File-Token-Count.txt, Trigram-File-Token-Count.txt)
Task 1 : Corpus generated from Task 1 (not uploaded)
Citations
IMPORTANT NOTES BEFORE RUNNING THE PROGRAM (DATA STRUCTURE USAGE STATED BELOW):
- Task 1 consumes .html files, i.e. xyz.html files. I get the base name by splitting ".html" off the file name and use it to create the output file xyz, which is written in .txt format (see the sketch below).
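A minimal sketch of that naming step (the helper name below is illustrative, not taken from the source):

    import os

    def output_name(html_filename):
        # "xyz.html" -> "xyz.txt": strip the .html extension and reuse the base name.
        base = os.path.splitext(os.path.basename(html_filename))[0]
        return base + ".txt"

    # output_name("xyz.html") == "xyz.txt"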
Task 1 --> "Tokenise_doc.py" takes HTML files --> generates a cleaned .txt corpus that does not contain noise. Noise removed -> formulas, tables and other irrelevant information not pertaining to the main topic of the file. Other info:
- Option is provided for case folding: the global variable "Do_Case_Folding" is set to True by default. If set to False, the corpus created will not have case folding.
- Option is provided for removing punctuation: the global variable "Remove_Punctuation" is set to True by default. If set to False, the corpus created will not have punctuation removal. ("." "," and "'" are handled so that they are removed only when they appear between words and preserved when they appear between numbers, for example "123.34", "19,000miles" and "5'6inches".)
- Other factors: I am removing special characters like "!#~`+_^{}<>", since I believe they would not add much value to the index. They could be kept by commenting out the single line that calls the function which removes the special characters.
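A rough sketch of how these cleaning options could fit together. Tokenise_doc.py imports string.maketrans, so the real code may use translation tables instead; the regular expressions below only illustrate the behaviour described above and are not the actual implementation:

    import re

    Do_Case_Folding = True      # set to False to keep the original case
    Remove_Punctuation = True   # set to False to keep punctuation

    def clean_text(text):
        if Do_Case_Folding:
            text = text.lower()
        if Remove_Punctuation:
            # Remove "." "," "'" unless they sit between two digits
            # (so "123.34", "19,000miles" and "5'6inches" are preserved).
            text = re.sub(r"(?<!\d)[.,']|[.,'](?!\d)", " ", text)
            # Drop other special characters such as !#~`+_^{}<> .
            text = re.sub(r"[!#~`+_^{}<>]", " ", text)
        return re.sub(r"\s+", " ", text).strip()

    # clean_text("She ran 19,000miles at 5'6inches tall!") -> "she ran 19,000miles at 5'6inches tall"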
----TASK 2, 3.1 and 3.2 DATA STRUCTURES USED:
Dictionary for storing the unigram inverted index : format -> "word" -> (DocId, tf), (DocId, tf)
Dictionary for storing the bigram inverted index : format -> "word word" -> (DocId, tf), (DocId, tf)
Dictionary for storing the trigram inverted index : format -> "word word word" -> (DocId, tf), (DocId, tf)
These dictionaries are then consumed by the program to generate TF and DF:
Dictionary for storing unigram TF : format -> "word" -> tf
Dictionary for storing bigram TF : format -> "word word" -> tf
Dictionary for storing trigram TF : format -> "word word word" -> tf
Dictionary for storing unigram DF : format -> "word" -> docid, docid, df
Dictionary for storing bigram DF : format -> "word word" -> docid, docid, df
Dictionary for storing trigram DF : format -> "word word word" -> docid, docid, df
Data structure for holding the token count of each file for the n-grams:
Dictionary for storing unigram token count : format -> "docid" -> number of unigram tokens
Dictionary for storing bigram token count : format -> "docid" -> number of bigram tokens
Dictionary for storing trigram token count : format -> "docid" -> number of trigram tokens
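To illustrate how these dictionaries relate to each other, here is a sketch of the unigram case (this assumes whitespace-tokenised cleaned text and is not the exact code in Inverted_index.py):

    from collections import defaultdict

    def build_indexes(docs):
        # docs: dict mapping doc_id -> list of unigram tokens from the cleaned corpus.
        inverted = defaultdict(list)   # "word" -> [(doc_id, tf), (doc_id, tf), ...]
        tf = defaultdict(int)          # "word" -> total term frequency over the corpus
        df = defaultdict(list)         # "word" -> [doc_id, doc_id, ...]; len() gives the df
        token_count = {}               # doc_id -> number of tokens in that file

        for doc_id, tokens in docs.items():
            token_count[doc_id] = len(tokens)
            counts = defaultdict(int)
            for tok in tokens:
                counts[tok] += 1
            for word, freq in counts.items():
                inverted[word].append((doc_id, freq))
                tf[word] += freq
                df[word].append(doc_id)
        return inverted, tf, df, token_count

    # Bigrams and trigrams reuse the same logic, only the token list changes, e.g.
    # bigrams  = [" ".join(p) for p in zip(tokens, tokens[1:])]
    # trigrams = [" ".join(t) for t in zip(tokens, tokens[1:], tokens[2:])]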
NOTE: I HAVE USED DICTIONARIES SINCE THE LOOKUP TIME IS EFFICIENT, which results in a faster program.
- The Task 2 file "Inv_Index.py" consumes the CORPUS GENERATED FROM TASK 1 to generate the unigram inverted index, bigram inverted index and trigram inverted index.
- Make sure the documents fed to Task 2 are the files generated by Task 1, since the type of input file may affect the program and the operations it performs. As the cleaned corpus is read file by file, the dictionary is generated and then updated. The inverted-index dictionary is then generated and printed; it serves as the input for generating the TF and DF dictionaries (see the sketch below).
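The overall flow could look roughly like this (a sketch assuming the cleaned corpus is a folder of .txt files; the directory variable and function names are illustrative and reuse the build_indexes sketch above):

    import os

    def read_cleaned_corpus(corpus_dir):
        # doc_id -> list of tokens, built by reading the cleaned .txt files one by one.
        docs = {}
        for name in os.listdir(corpus_dir):
            if not name.endswith(".txt"):
                continue
            doc_id = os.path.splitext(name)[0]
            with open(os.path.join(corpus_dir, name)) as f:
                docs[doc_id] = f.read().split()
        return docs

    # docs = read_cleaned_corpus(DESTINATION_PATH)    # cleaned corpus produced by Task 1
    # inverted, tf, df, token_count = build_indexes(docs)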
NOTE : Task 2 asks us to implement a data structure to store the number of tokens in a separate data structure. The token counts (unigram, bigram and trigram) are stored in a dictionary since THEY CAN BE READ AND WRITTEN EFFICIENTLY WITHOUT much overhead or iteration.
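For example, the per-file counts can be written straight from that dictionary; the tab-separated format below is an assumption, not necessarily the actual layout of the *-File-Token-Count.txt files:

    def write_token_counts(token_count, out_path):
        # token_count: dict mapping doc_id -> number of n-gram tokens in that file.
        out = open(out_path, "w")
        for doc_id in sorted(token_count):
            out.write("{0}\t{1}\n".format(doc_id, token_count[doc_id]))
        out.close()

    # write_token_counts(unigram_token_count, "Unigram-File-Token-Count.txt")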