A Comparative Analysis of Unsupervised Clustering Methods in R

📄 Project Goal

This project provides an in-depth exploration of unsupervised machine learning techniques, focusing on a comparative analysis of different clustering algorithms. Using the well-known "Hawks" dataset, the goal is not just to apply models, but to understand their strengths and weaknesses in a real-world scenario with noisy, variable data.

The analysis follows a logical progression:

- Intensive data preparation to create a clean, reliable dataset.
- Application of a centroid-based method (K-Means).
- Application of density-based methods (DBSCAN & OPTICS) to address the limitations found in K-Means.
- A final, critical comparison of the results using statistical metrics and visualizations.

🛠️ Part I: Data Preparation & Exploratory Data Analysis

Before any modeling, a rigorous data cleaning and preparation pipeline was executed to ensure the quality of the input data. This critical phase is often the most time-consuming in a data science project.

- Methodical NA Imputation: Missing values in key biometric variables (Wing, Weight, Culmen, Hallux) were imputed using a conditional mean based on the hawk's Species and Age, preserving the underlying data structure.
- Outlier Detection and Handling: The Interquartile Range (IQR) method was used to identify statistical outliers. Instead of removing them, a robust imputation strategy was applied, replacing extreme values with the conditional mean to avoid data loss while correcting anomalies.
- Data Integrity Checks: The final cleaned dataset was thoroughly reviewed using visualizations (histograms, boxplots) to ensure data quality before moving to the clustering phase.
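
The two cleaning steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the data frame is an invented toy stand-in that only borrows the Hawks column names, the helper names `impute_group_mean` and `replace_iqr_outliers` are hypothetical, the IQR fences are computed per Species/Age group, and an outlier is replaced by the mean of the in-range values of its group.

```r
# Toy stand-in for the Hawks data: column names follow the README, values are invented.
hawks <- data.frame(
  Species = rep(c("RT", "SS"), each = 5),
  Age     = rep(c("A", "I"), each = 5),
  Weight  = c(1000, 1100, 1050, 1080, 5000, 100, 110, NA, 105, 95)
)

# One grouping key per Species/Age combination.
grp <- paste(hawks$Species, hawks$Age)

# Step 1: replace each NA with the mean of the observed values in its group.
impute_group_mean <- function(v) {
  v[is.na(v)] <- mean(v, na.rm = TRUE)
  v
}
hawks$Weight <- ave(hawks$Weight, grp, FUN = impute_group_mean)

# Step 2: flag values outside [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR] within each group
# and replace them with the mean of that group's in-range values (an assumption).
replace_iqr_outliers <- function(v) {
  q <- quantile(v, c(0.25, 0.75), names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  out <- v < q[1] - fence | v > q[2] + fence
  v[out] <- mean(v[!out])
  v
}
hawks$Weight <- ave(hawks$Weight, grp, FUN = replace_iqr_outliers)

print(hawks)
```

In this toy example the missing weight in the second group becomes 102.5 and the extreme value 5000 in the first group becomes 1057.5. Running the imputation before the outlier step means an extreme value can still pull up an imputed mean in its own group, so the order of the two steps is worth checking against the real data.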

🤖 Part II: K-Means Clustering

K-Means was initially applied to segment the data. The analysis explored two hypotheses:

- k=3: To cluster the three known hawk species.
- k=6: To cluster by both species and age (adult/juvenile).

The clusplot visualization, which uses PCA to map the clusters, showed that K-Means could create a reasonable separation but struggled with significant overlap and was unable to capture the natural, non-spherical shapes of the data groups. This highlighted the limitations of centroid-based methods for this specific dataset.
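
A rough sketch of the two K-Means runs and the clusplot view is shown here. It assumes synthetic stand-in measurements (the group means and spreads are invented, and `make_group` is a hypothetical helper) rather than the cleaned Hawks data, and it assumes the variables are standardized before clustering, which the README does not state.

```r
library(cluster)  # provides clusplot(); a recommended package shipped with R

set.seed(123)

# Hypothetical helper: n rows drawn around the given per-variable means.
make_group <- function(n, means, sds) {
  sapply(seq_along(means), function(j) rnorm(n, means[j], sds[j]))
}

# Synthetic stand-in for the cleaned biometric variables (invented values).
x <- rbind(
  make_group(50, c(185, 150, 11, 15), c(15, 40, 1, 2)),
  make_group(50, c(245, 420, 17, 23), c(15, 80, 1.5, 3)),
  make_group(50, c(385, 1090, 27, 32), c(20, 180, 2, 4))
)
colnames(x) <- c("Wing", "Weight", "Culmen", "Hallux")

# Assumption: standardize so that Weight does not dominate the distances.
x_scaled <- scale(x)

# nstart = 25 reruns the algorithm from several random starts and keeps the best.
km3 <- kmeans(x_scaled, centers = 3, nstart = 25)
km6 <- kmeans(x_scaled, centers = 6, nstart = 25)

table(km3$cluster)
c(k3 = km3$tot.withinss, k6 = km6$tot.withinss)

# clusplot() projects the points onto the first two principal components.
clusplot(x_scaled, km3$cluster, color = TRUE, shade = TRUE,
         labels = 0, lines = 0, main = "K-Means, k = 3")
```

Comparing the cluster labels against the known Species and Age columns, for example with `table()`, is one simple way to check each hypothesis on the real data.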

🔬 Part III: Density-Based Clustering (DBSCAN & OPTICS)

To overcome the challenges faced by K-Means, more advanced density-based algorithms were employed.

- OPTICS (Ordering Points To Identify the Clustering Structure): Used to explore the data's density and identify a suitable number of clusters, revealing a potential for 3 to 5 distinct groups of varying densities.
- DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Applied with parameters informed by the OPTICS analysis. DBSCAN proved superior in:
  - Identifying non-spherical clusters.
  - Effectively isolating noise points that did not belong to any group.
  - Providing a more natural segmentation of the hawk species.
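
The OPTICS-then-DBSCAN workflow described above might look roughly like this with the dbscan package. The data are a synthetic stand-in (two dense blobs plus scattered points), and the values `eps = 0.5` and `minPts = 5` are illustrative choices for that toy data, not the settings used in the project.

```r
set.seed(123)

# Synthetic stand-in: two dense blobs plus scattered background points.
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2),
  matrix(runif(40, min = -2, max = 6), ncol = 2)
)

# OPTICS orders points by density reachability; valleys in the
# reachability plot correspond to candidate clusters.
opt <- dbscan::optics(x, eps = 2, minPts = 5)
plot(opt)

# The k-nearest-neighbour distance plot is another common aid for choosing eps.
dbscan::kNNdistplot(x, k = 5)

# Illustrative parameters for this toy data only.
db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)
table(db$cluster)  # label 0 collects the noise points

# A DBSCAN-style clustering can also be cut directly from the OPTICS ordering.
opt_cut <- dbscan::extractDBSCAN(opt, eps_cl = 0.5)
table(opt_cut$cluster)
```

The calls are namespace-qualified because the fpc package, also used in this project, exports its own `dbscan()` function; qualifying the name avoids picking up the wrong one when both packages are attached.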

📊 Part IV: Comparative Analysis & Conclusion

A statistical comparison using the cluster.stats function from the fpc package provided quantitative validation for the visual findings.

| Metric | K-Means (k=3) | DBSCAN | Interpretation |
| --- | --- | --- | --- |
| Avg. Silhouette Width | 0.71 | 0.73 | Both models group points well, with a slight edge to DBSCAN. |
| Dunn Index | 0.004 | 0.24 | DBSCAN creates much more compact and well-separated clusters. |
| Noise Handling | No | Yes | DBSCAN correctly identified and excluded outlier data points. |
| Shape Flexibility | Low (Spherical) | High (Arbitrary) | DBSCAN was better suited for the irregular shapes of the natural data. |
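
The silhouette and Dunn values in the table above come from `cluster.stats`; a sketch of how such a comparison can be set up follows. It runs on synthetic stand-in data, so it will not reproduce the reported numbers, and it assumes that DBSCAN noise points (label 0) are dropped before scoring, which may differ from how the project handled them.

```r
set.seed(123)

# Synthetic stand-in: two dense blobs plus scattered background points.
x <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2),
  matrix(runif(40, min = -2, max = 6), ncol = 2)
)

km <- kmeans(x, centers = 2, nstart = 25)
db <- dbscan::dbscan(x, eps = 0.5, minPts = 5)  # illustrative parameters

# cluster.stats() takes a distance matrix and integer labels 1..k.
km_stats <- fpc::cluster.stats(dist(x), km$cluster)

# Assumption: score DBSCAN on clustered points only, dropping noise (label 0).
keep <- db$cluster > 0
db_stats <- fpc::cluster.stats(dist(x[keep, ]), db$cluster[keep])

comparison <- data.frame(
  model          = c("K-Means", "DBSCAN (noise removed)"),
  avg_silhouette = c(km_stats$avg.silwidth, db_stats$avg.silwidth),
  dunn           = c(km_stats$dunn, db_stats$dunn)
)
print(comparison)
```

Because K-Means must assign every point to a cluster while DBSCAN is scored here without its noise points, the two rows are not computed on the same observations. That difference tends to favor DBSCAN on both metrics and is worth keeping in mind when reading the comparison.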

Conclusion: For this dataset, characterized by variable density and noise, DBSCAN provided a superior and more realistic clustering solution compared to K-Means.

💻 Technologies Used

Language: R
Core Libraries:
- Stat2Data (for the dataset)
- tidyverse (dplyr, ggplot2)
- dbscan (for DBSCAN and OPTICS)
- fpc (for cluster statistics)
- ggbiplot (for PCA visualization)

🚀 Getting Started

To reproduce this analysis:

- Clone this repository.
- Make sure R and RStudio are installed.
- Open the R Markdown file (Unsupervised_met_ml.Rmd); it will prompt you to install any required packages that are missing.
- Run the notebook chunks sequentially to follow the entire workflow from data cleaning to the final model comparison.
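
If the install prompt does not appear, a quick manual check along these lines can show which of the packages listed under Technologies Used are missing. This is only a convenience sketch: the install line is left commented out, and it assumes the packages are available from your configured repository, which may not hold for every package (some may need to be installed from another source).

```r
# Packages named in the "Technologies Used" section.
pkgs <- c("Stat2Data", "tidyverse", "dbscan", "fpc", "ggbiplot")

# requireNamespace() returns FALSE instead of raising an error when a package is absent.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install from your configured repository
} else {
  message("All required packages are available.")
}
```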

👤 Author

Antonio Barrera Mora

LinkedIn: https://www.linkedin.com/in/anbamo/
GitHub: @Kamaranis
