K-medoids Coding Exercise #2

gla7 · 2019-04-11T22:04:10Z

Description:
K-medoids is a clustering algorithm similar to k-means, but instead of defining each cluster by a center coordinate, it uses one of the data points, which we call a medoid. K-means and k-medoids are analogous to the mean and the median, respectively: k-medoids is more robust to noise and outliers than k-means, just as the median is more robust than the mean.
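To make the mean/median analogy concrete, here is a minimal sketch (not taken from this pull request) that compares the centroid and the medoid of a small point set containing one outlier. The function names `centroid` and `medoid` and the sample data are illustrative assumptions.

```python
import numpy as np


def centroid(points):
    """Coordinate-wise mean, as used by k-means (need not be a data point)."""
    return points.mean(axis=0)


def medoid(points):
    """The data point with the smallest total Euclidean distance to all others."""
    # pairwise[i, j] = distance between point i and point j
    pairwise = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[pairwise.sum(axis=1).argmin()]


# Four points on a unit square plus one far-away outlier (hypothetical data).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]])

print(centroid(points))  # pulled toward the outlier: [20.4 20.4]
print(medoid(points))    # stays on the square: [1. 1.]
```

The centroid is dragged far from the bulk of the data by the single outlier, while the medoid remains one of the original points near the bulk.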

The basic steps of this algorithm are:

1. Choose a number of clusters that is less than or equal to the number of data points.
2. Randomly initialize the medoids, each on one data point.
3. Expectation step: for each non-medoid data point, calculate the distance to each medoid and assign the point to the cluster of the closest medoid.
4. Sum the Euclidean distances from each data point to its corresponding medoid. This sum is the objective function to minimize, and we will refer to it as the cost.
5. Repeat the following two steps until no change in medoids is observed or a set maximum number of iterations is reached:
   (a) Maximization step: for each cluster, choose non-medoid data points at random and calculate the cost within the cluster using each one as a candidate medoid. If a candidate's cost is lower than the cost achieved by the current medoid within the cluster, re-assign that data point as the new cluster medoid.
   (b) Perform an expectation step in order to redraw the cluster boundaries.
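The steps above can be sketched roughly as follows. This is an illustrative NumPy version written for this description, not the `KMedoidsModel` class from this pull request; the function names and the parameters `max_iter`, `n_candidates`, and `seed` are assumptions.

```python
import numpy as np


def assign_clusters(points, medoid_idx):
    """Expectation step: label each point with its nearest medoid and total the cost."""
    # distances[i, j] = Euclidean distance from point i to medoid j
    distances = np.linalg.norm(points[:, None, :] - points[medoid_idx][None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    cost = distances[np.arange(len(points)), labels].sum()
    return labels, cost


def within_cluster_cost(points, members, candidate):
    """Sum of distances from a cluster's members to a candidate medoid."""
    return np.linalg.norm(points[members] - points[candidate], axis=1).sum()


def k_medoids(points, k, max_iter=100, n_candidates=10, seed=0):
    points = np.asarray(points, dtype=float)
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of data points")
    rng = np.random.default_rng(seed)
    # Random initialization: k distinct data points act as medoids.
    medoid_idx = rng.choice(n, size=k, replace=False)
    labels, cost = assign_clusters(points, medoid_idx)
    for _ in range(max_iter):
        changed = False
        for c in range(k):
            members = np.flatnonzero(labels == c)
            candidates = members[members != medoid_idx[c]]
            if candidates.size == 0:
                continue
            # Maximization step: try a random sample of non-medoid members.
            sample = rng.choice(candidates, size=min(n_candidates, candidates.size), replace=False)
            best = within_cluster_cost(points, members, medoid_idx[c])
            for candidate in sample:
                candidate_cost = within_cluster_cost(points, members, candidate)
                if candidate_cost < best:
                    medoid_idx[c], best, changed = candidate, candidate_cost, True
        # Expectation step: redraw cluster boundaries around the updated medoids.
        labels, cost = assign_clusters(points, medoid_idx)
        if not changed:
            break
    return medoid_idx, labels, cost
```

Calling `k_medoids(points, k=2)` on an `(n, d)` array returns the medoid indices, the cluster label of each point, and the final cost. Because candidates are sampled at random and the loop stops as soon as one pass changes nothing, a single run can settle in a local minimum.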

In the present implementation, the user has the option of repeating steps 2-5 multiple times and keeping the model with the lowest cost, as this algorithm can be sensitive to initialization. We have also added the option to predict the cluster of a new user-provided data point.
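As a rough illustration of these two options, the sketch below keeps the lowest-cost result over several runs and assigns a new point to its nearest medoid. The helper names `pick_best_run` and `predict_cluster` are hypothetical and are not the methods of the class in this pull request; `fit_once` stands for any callable that runs one clustering from a given seed and returns `(medoids, cost)`.

```python
import numpy as np


def pick_best_run(fit_once, n_init=10):
    """Call fit_once(seed) n_init times and keep the lowest-cost (medoids, cost) pair."""
    best = None
    for seed in range(n_init):
        medoids, cost = fit_once(seed)
        if best is None or cost < best[1]:
            best = (medoids, cost)
    return best


def predict_cluster(medoids, new_point):
    """Return the index of the medoid closest (Euclidean) to new_point."""
    medoids = np.asarray(medoids, dtype=float)
    distances = np.linalg.norm(medoids - np.asarray(new_point, dtype=float), axis=1)
    return int(distances.argmin())


# Medoids quoted for model_1 in the example results; the query point is made up.
medoids = [(-1, -1), (-1, 1), (0, 0), (1, 1), (1, -1)]
print(predict_cluster(medoids, (0.9, 1.2)))  # index 3, i.e. the medoid (1, 1)
```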

The Screenshots section below shows some results obtained with this implementation, along with references for finding the corresponding datasets.

Testing:
A basic set of unit tests is included in this pull request: validation tests, a distance calculation test, class method tests, k-medoids algorithm tests, a prediction test, and (optional, commented-out) visual inspection tests. To run these tests, use the following command:

python -m unittest discover -s test -p "test_k_medoids_model.py" -v

Example usages of the KMedoidsModel class can be seen in test/test_k_medoids_model.py as well.

Screenshots and Example Results
The following example uses the data from model_1 in test/test_k_medoids_model.py. Default parameters were used, with the number of clusters set to 5. The intended medoids ((-1, -1), (-1, 1), (0, 0), (1, 1), (1, -1)) and clusters are obtained.

The following example uses the data from model_2 in test/test_k_medoids_model.py. Default parameters were used, with the number of clusters set to 2. This example was taken from http://eacharya.inflibnet.ac.in/data-server/eacharya-documents/53e0c6cbe413016f23443704_INFIEP_33/93/LM/33-93-LM-V1-S1__kmedoids.pdf and the clusters and medoids obtained match the results in that reference.

The following example uses the A-set A1 from http://cs.joensuu.fi/sipu/datasets/ , with 50 initializations, 100 iterations, and 20 clusters. The ground truth is not quite recovered, though this result was obtained with little tuning.

Gabriel Leon added 2 commits April 11, 2019 16:02

Creates class to facilitate k-medoids clustering and plotting

5494c25

Creates tests for k-medoids clustering class

9eea127
