aLmktr/COMP3602-Project

COMP3602-Project

Project for the course Data Analysis and Visualization with Python.

Part 1:

  1. Dataset description
  2. Attribute descriptions
  3. Bar plots displaying the frequency distribution of every attribute
  4. Frequency distribution table for the continuous numerical attributes
  5. Research questions (2-3)
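
As an illustration of step 4, here is a minimal, dependency-free sketch of a frequency distribution table for one continuous attribute. The function name `frequency_table`, the equal-width binning, and the sample ages are assumptions made for this example; they are not taken from the project's notebooks.

```python
from collections import Counter


def frequency_table(values, n_bins):
    """Return a list of (lower, upper, count) rows using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so a single bin holds everything.
        return [(lo, hi, len(values))]
    counts = Counter()
    for v in values:
        # The maximum value is clamped into the last bin instead of opening a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(n_bins)]


# Hypothetical ages, used only to illustrate the table.
ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 60]
for lower, upper, count in frequency_table(ages, 4):
    # The last interval also includes its upper bound.
    print(f"{lower:.2f} to {upper:.2f}: {count}")
```

The same counts could then be passed to a plotting library to draw the bar plot from step 3.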

Part 2:

  1. Fill missing values with the mean of the corresponding column
  2. Use box plots to identify features with outliers, and replace the outliers with the mean of that feature
  3. Encode categorical features with LabelEncoder
  4. Scale the features with min-max normalization
  5. Determine whether the dataset is in long (tidy) or wide format. If it is already in long format, convert it to wide (by selecting two or more variables), and vice versa.
  6. Implement the ANOVA method to identify irrelevant features: compute the F-statistics, visualize them with a bar chart, and list the features identified as irrelevant
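
Several of the steps above reduce to short formulas. The sketch below writes out mean imputation, label encoding, min-max scaling, and the one-way ANOVA F-statistic in plain Python so the arithmetic is visible. All function names and sample values are hypothetical, and the project itself may rely on library implementations instead.

```python
from statistics import mean


def fill_missing_with_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [x for x in column if x is not None]
    col_mean = mean(observed)
    return [col_mean if x is None else x for x in column]


def label_encode(column):
    """Map each category to an integer, following the sorted order of the categories."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(column)))}
    return [mapping[x] for x in column]


def min_max_scale(column):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def anova_f(feature, labels):
    """One-way ANOVA F-statistic of a numeric feature grouped by class label.

    Assumes at least two classes and a non-zero within-group sum of squares.
    """
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(y, []).append(x)
    grand_mean = mean(feature)
    n, k = len(feature), len(groups)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    ss_within = sum((x - mean(g)) ** 2 for g in groups.values() for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical values, used only for illustration.
filled = fill_missing_with_mean([1.0, None, 3.0, 8.0])
scaled = min_max_scale(filled)
codes = label_encode(["red", "blue", "red", "green"])

labels = [0, 0, 0, 1, 1, 1]
informative = [1, 2, 3, 7, 8, 9]   # class means differ strongly
uninformative = [5, 1, 9, 5, 1, 9]  # class means are identical
print(filled, scaled, codes)
print(anova_f(informative, labels), anova_f(uninformative, labels))
```

A feature whose F-statistic is close to zero has nearly identical class means, which is the sense in which ANOVA flags it as irrelevant.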

Part 3:

  1. Download the acute-inflammations dataset
  2. Perform all necessary preprocessing steps to clean the dataset
  3. Determine the optimal size of training dataset using the K-NN algorithm
  4. Implement the K-NN algorithm using Euclidean distance
  5. Implement the K-NN algorithm using cosine similarity
  6. Implement the decision tree algorithm
  7. Provide a detailed analysis based on the box plots obtained for these three algorithms
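
Steps 4 and 5 differ only in the distance measure, which the following sketch makes explicit by passing the distance function as a parameter. It is a from-scratch illustration on made-up two-dimensional points, not the acute-inflammations data, and the names `knn_predict`, `euclidean_distance`, and `cosine_distance` are hypothetical.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity, so that smaller still means closer."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: a zero vector is maximally dissimilar.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def knn_predict(train_X, train_y, query, k, distance):
    """Predict the majority label among the k training points nearest to the query."""
    ranked = sorted(zip(train_X, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Made-up training points: class "a" lies along the x axis, class "b" along the y axis.
train_X = [(1.0, 0.1), (2.0, 0.2), (3.0, 0.2), (0.1, 1.0), (0.2, 2.0), (0.3, 3.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

for query in [(4.0, 0.5), (0.2, 2.5)]:
    by_euclid = knn_predict(train_X, train_y, query, 3, euclidean_distance)
    by_cosine = knn_predict(train_X, train_y, query, 3, cosine_distance)
    print(query, by_euclid, by_cosine)
```

Euclidean distance is sensitive to the magnitude of the feature vectors, while cosine distance compares only their direction, so the two variants can rank neighbours differently on unscaled data.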

Part 4:

  1. Collect text data from the internet
  2. Preprocess the web documents to prepare them for clustering
  3. Apply term frequency-inverse document frequency (TF-IDF) to extract important keywords
  4. Apply random projection to transform the n x d TF-IDF matrix into an n x p data matrix, with p = 500
  5. Cluster the transformed dataset using the k-means algorithm with different values of k
  6. Display each centroid of the best clustering result by plotting a graph with the features on the x axis and the centroid values on the y axis
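
To make steps 3 and 4 concrete, here is a small standard-library sketch of TF-IDF followed by a Gaussian random projection. It uses one common unsmoothed TF-IDF variant (library implementations may smooth or normalize differently), whitespace tokenization, three made-up documents, and p = 3 instead of 500 so the output stays readable. The function names are hypothetical.

```python
import math
import random
from collections import Counter


def tf_idf(documents):
    """Return (vocabulary, matrix); matrix[i][j] is the TF-IDF of term j in document i.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in tokenized for term in set(doc))
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        row = [(counts[t] / len(doc)) * math.log(n_docs / df[t]) for t in vocabulary]
        matrix.append(row)
    return vocabulary, matrix


def random_projection(matrix, p, seed=0):
    """Project an n x d matrix to n x p using Gaussian entries scaled by 1/sqrt(p)."""
    rng = random.Random(seed)
    d = len(matrix[0])
    projection = [[rng.gauss(0.0, 1.0) / math.sqrt(p) for _ in range(p)] for _ in range(d)]
    return [
        [sum(row[j] * projection[j][c] for j in range(d)) for c in range(p)]
        for row in matrix
    ]


# Made-up documents, used only for illustration.
docs = ["cats chase mice", "dogs chase cats", "stocks fell sharply"]
vocabulary, tfidf_matrix = tf_idf(docs)
projected = random_projection(tfidf_matrix, p=3)

print(vocabulary)
print(len(tfidf_matrix), "x", len(tfidf_matrix[0]), "->", len(projected), "x", len(projected[0]))
```

The rows of the projected matrix would then be the inputs to k-means in step 5.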
