Project for the course Data Analysis and Visualization with Python
- Dataset description
- Attributes Description
- Bar plots to display the frequency distribution of all attributes
- Frequency distribution table for continuous numerical attributes
- Research questions (2-3)
- Fill missing values with the mean of that column
- Use boxplots to identify features with outliers, and replace the outliers with the mean of that feature
- Use LabelEncoder for categorical features
- Scale the features with min-max normalization
- Determine whether the dataset is in long (tidy) or wide format. If it is in long format, convert it (by selecting 2 or more variables) to wide format, and vice versa.
- Implement the ANOVA method to identify irrelevant features: compute the F-statistics, visualize them with a bar chart, and provide a list of the identified irrelevant features
- Download the acute-inflammations dataset
- Perform all necessary preprocessing steps to clean the dataset
- Determine the optimal size of training dataset using the K-NN algorithm
- Implement the K-NN algorithm using Euclidean distance
- Implement the K-NN algorithm using cosine similarity
- Implement the decision tree algorithm
- Provide your detailed analysis based on the obtained box-plots of these three algorithms
- Collect text data from the internet
- Pre-process the web documents to prepare them for clustering
- Apply term frequency-inverse document frequency to extract important keywords
- Apply random projection to transform the n x d TF-IDF matrix into an n x p data matrix, set the value of p = 500
- Cluster the transformed dataset using the k-means algorithm with different values of k
- Display each centroid of the best clustering result by plotting a graph with the features on the x-axis and the centroid values on the y-axis
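
The preprocessing items in the list (mean imputation, boxplot-based outlier replacement, label encoding, min-max scaling) could be chained roughly as follows. This is only a sketch on a made-up toy table: the column names `age`, `income`, and `city` are hypothetical, the 1.5 * IQR rule is the usual boxplot whisker convention, and replacing outliers with the mean of the remaining values (rather than the mean of the full column) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

# Hypothetical toy data; the column names are placeholders, not the course dataset.
df = pd.DataFrame({
    "age": [22.0, 25.0, np.nan, 24.0, 23.0, 95.0],
    "income": [30.0, 32.0, 31.0, np.nan, 29.0, 33.0],
    "city": ["A", "B", "A", "C", "B", "A"],
})
num_cols = df.select_dtypes(include="number").columns

# 1. Fill missing values with the mean of each numeric column.
df[num_cols] = df[num_cols].fillna(df[num_cols].mean())

# 2. Flag outliers with the boxplot rule (beyond 1.5 * IQR from the quartiles)
#    and replace them with the mean of the non-outlying values (an assumption).
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
    df.loc[is_outlier, col] = df.loc[~is_outlier, col].mean()

# 3. Encode the categorical column as integer labels.
df["city"] = LabelEncoder().fit_transform(df["city"])

# 4. Rescale every column to the [0, 1] range.
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
```

On a real dataset the list of categorical columns would need to be chosen by hand, and it is worth checking the boxplots before replacing anything.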
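
For the ANOVA item, one possible approach is scikit-learn's `f_classif`, which computes a one-way ANOVA F-statistic per feature against a class label. The sketch below uses synthetic data with hypothetical feature names, and the 0.05 p-value cutoff for calling a feature irrelevant is an assumption, not something the task list specifies.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import f_classif

# Synthetic data (hypothetical): one feature depends on the label, two are pure noise.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
X = pd.DataFrame({
    "informative": y * 2.0 + rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})

# One-way ANOVA F-statistic and p-value for each feature.
f_stats, p_values = f_classif(X, y)
scores = pd.DataFrame({"F": f_stats, "p": p_values}, index=X.columns)
scores = scores.sort_values("F", ascending=False)

# Features whose p-value exceeds the (assumed) 0.05 threshold are treated as irrelevant.
irrelevant = scores.index[scores["p"] > 0.05].tolist()
```

With matplotlib installed, `scores["F"].plot.bar()` would give the requested bar chart of F-statistics.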
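
The classification items (training-set size, K-NN with Euclidean distance, K-NN with cosine similarity, decision tree) might be compared along these lines. This is a sketch under several assumptions: synthetic data from `make_classification` stands in for the acute-inflammations dataset, `n_neighbors=5`, the candidate training fractions, and the 10 repeated splits are arbitrary choices, and scikit-learn's `metric="cosine"` is cosine distance (1 minus cosine similarity).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the cleaned dataset in practice.
X, y = make_classification(n_samples=120, n_features=6, n_informative=4, random_state=0)

models = {
    "knn_euclidean": KNeighborsClassifier(n_neighbors=5, metric="euclidean"),
    "knn_cosine": KNeighborsClassifier(n_neighbors=5, metric="cosine", algorithm="brute"),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}
train_fracs = [0.5, 0.6, 0.7, 0.8]  # assumed candidate training-set sizes

# Repeat each split with different seeds so the accuracies form a distribution.
records = []
for frac in train_fracs:
    for seed in range(10):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, train_size=frac, random_state=seed, stratify=y
        )
        for name, model in models.items():
            acc = model.fit(X_tr, y_tr).score(X_te, y_te)
            records.append({"model": name, "train_frac": frac, "accuracy": acc})

results = pd.DataFrame(records)
# Mean test accuracy per training fraction and model.
mean_acc = results.pivot_table(
    index="train_frac", columns="model", values="accuracy", aggfunc="mean"
)
```

With matplotlib installed, `results.boxplot(column="accuracy", by="model")` would draw one box plot per algorithm for the final analysis.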
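
The text-clustering items (TF-IDF, random projection, k-means over several values of k) could be sketched as below. The eight short documents are invented placeholders for the collected web text, `p = 10` is used only because the toy vocabulary is tiny (the task asks for `p = 500`), and picking the best k by silhouette score is an assumption, since the list does not say how the best clustering is chosen.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score
from sklearn.random_projection import GaussianRandomProjection

# Placeholder corpus; in the project this would be the pre-processed web documents.
docs = [
    "the cat sat on the mat",
    "cats and dogs are pets",
    "my dog chased the cat",
    "pets like cats need food",
    "stocks fell as markets closed",
    "the market rallied on bank earnings",
    "investors sold bank stocks",
    "markets and stocks rose today",
]

# n x d TF-IDF matrix (dense here because the toy corpus is small).
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs).toarray()

# Random projection to an n x p matrix; p = 10 only for this toy example.
p = 10
Z = GaussianRandomProjection(n_components=p, random_state=0).fit_transform(tfidf)

# Try several k and keep the model with the highest silhouette score (assumed criterion).
best_k, best_score, best_model = None, -np.inf, None
for k in range(2, 5):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(Z)
    score = silhouette_score(Z, model.labels_)
    if score > best_score:
        best_k, best_score, best_model = k, score, model

centroids = best_model.cluster_centers_  # one row of p values per cluster
```

Each row of `centroids` can then be plotted against the component index `range(p)`. After random projection the x-axis positions are projected components rather than individual terms.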