Q: Create a list of the 100 most frequently occurring words with the count of occurrences for each word found in the attached text for Herman Melville's novel, Moby Dick. Ensure this top-100 list does not include any words in the provided stop words list.
The programming language used is Python.
The data is in the data folder: mobydick.txt contains the text of the novel, and stop-words.txt contains the stop words.
The code is in main.py.
-
Create a virtualenv so that the libraries used by this code do not affect the libraries in the core Python installation.
-
Install the libraries using pip install -r requirements.txt
-
Run the code using python main.py
-
To run the unit tests, execute test.py using python test.py
Note: The output folder will be created once you run this file. It will contain output images such as the word cloud and the frequency distribution of words.
Step 1: Analysed the stop-words.txt file to see what it contains, then wrote the code to preprocess it into a usable array format.
Step 2: Analysed the mobydick.txt file to see what it contains.
Step 3: Wrote the code for calculating the unique words and their frequencies.
Step 4: Analysed the unique words. Created the preprocessing file to preprocess the sentences by removing symbols.
Step 5: Analysed the results again.
Step 6: Wrote the sorting code and printed the result in the terminal. Put the result in a word cloud for better representation.
Step 7: Wrote the code comments.
Step 8: Visualized the frequency distribution as a graph.
Step 9: Improved code readability and code comments.
Step 10: Wrote the test cases.
Step 11: Added exception handling to the code.
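
The counting part of the steps above (loading the stop words, removing symbols, counting, sorting, and handling missing files) can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the actual contents of main.py. The function names load_stop_words, tokenize, top_words and main are hypothetical, and the sketch assumes stop-words.txt holds one word per line, possibly with blank lines or '#' comment lines.

```python
import re
from collections import Counter
from pathlib import Path


def load_stop_words(path):
    """Read stop words into a set (assumed format: one word per line)."""
    words = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip().lower()
        # Assumption: blank lines and lines starting with '#' carry no stop word.
        if line and not line.startswith("#"):
            words.add(line)
    return words


def tokenize(text):
    """Lowercase the text and keep runs of letters, dropping symbols and digits.

    An ASCII apostrophe between letters is kept, so "whale's" stays one token.
    """
    return re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())


def top_words(text, stop_words, n=100):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    counts = Counter(word for word in tokenize(text) if word not in stop_words)
    return counts.most_common(n)


def main(text_path="data/mobydick.txt", stop_path="data/stop-words.txt"):
    """Print the ranked list; return an empty list if an input file is unreadable."""
    try:
        stop_words = load_stop_words(stop_path)
        text = Path(text_path).read_text(encoding="utf-8")
    except OSError as exc:
        print(f"Could not read input file: {exc}")
        return []
    result = top_words(text, stop_words)
    for rank, (word, count) in enumerate(result, start=1):
        print(f"{rank:>3}. {word:<15} {count}")
    return result


if __name__ == "__main__":
    main()
```

Keeping the stop words in a set makes each membership check cheap, and Counter.most_common handles the sorting by frequency. The word cloud and frequency distribution plots are left out of this sketch because they depend on third-party libraries.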