K-medoids Coding Exercise #2

Open: wants to merge 2 commits into base: master

@gla7 gla7 commented Apr 11, 2019

Description:
K-medoids is a clustering algorithm similar to k-means, but instead of defining each cluster by a center coordinate, it uses one of the data points, which we call a medoid. K-means and k-medoids are roughly analogous to the mean and the median, respectively: k-medoids is more robust to noise and outliers than k-means, just as the median is more robust than the mean.
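
The robustness claim can be illustrated with a small sketch. It is not taken from this pull request; the helper names `centroid` and `medoid` and the toy data are assumptions made for illustration. A single far-away outlier drags the centroid away from the cluster, while the medoid stays on the same data point.

```python
import math

def centroid(points):
    # Coordinate-wise mean, as used by k-means.
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def medoid(points):
    # The data point with the smallest total Euclidean distance to all points.
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

# Hypothetical toy cluster: four corners of a unit square plus its center.
cluster = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
with_outlier = cluster + [(100.0, 100.0)]

clean_centroid = centroid(cluster)
noisy_centroid = centroid(with_outlier)   # pulled strongly toward the outlier
clean_medoid = medoid(cluster)
noisy_medoid = medoid(with_outlier)       # still the central data point

print("centroid:", clean_centroid, "->", noisy_centroid)
print("medoid:  ", clean_medoid, "->", noisy_medoid)
```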

The basic steps of this algorithm are:

  1. Choose a number of clusters that is less than or equal to the number of data points.
  2. Randomly initialize the medoids, placing each one on a data point.
  3. Expectation step: for each non-medoid data point, calculate the distance to each medoid and assign the point to the cluster of the closest medoid.
  4. Sum the Euclidean distances from each data point to its corresponding medoid. This sum is the objective function to minimize, and we will refer to it as the cost.
  5. Repeat the following two steps until no change in medoids is observed or a set threshold of iterations is reached:
    (a) Maximization step: for each cluster, choose non-medoid data points at random and calculate the cost within the cluster that each would give as the medoid. If this cost is lower than the cost achieved by the current medoid within the cluster, re-assign this data point as the new cluster medoid.
    (b) Perform an expectation step in order to re-draw the cluster boundaries.
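
The steps above can be sketched in a few lines of plain Python. This is a minimal sketch, not the KMedoidsModel class from this pull request: the function names, the `n_candidates` parameter (how many random candidates are tried per cluster), and the assumption that the data points are distinct tuples are all illustrative choices.

```python
import math
import random

def assign(points, medoids):
    # Expectation step: index of the closest medoid for every point.
    return [min(range(len(medoids)), key=lambda j: math.dist(p, medoids[j]))
            for p in points]

def total_cost(points, medoids, labels):
    # Step 4: sum of Euclidean distances from each point to its medoid.
    return sum(math.dist(p, medoids[l]) for p, l in zip(points, labels))

def k_medoids(points, k, max_iter=100, n_candidates=10, seed=None):
    if not 1 <= k <= len(points):                 # step 1
        raise ValueError("k must be between 1 and the number of data points")
    rng = random.Random(seed)
    medoids = rng.sample(points, k)               # step 2 (assumes distinct points)
    labels = assign(points, medoids)              # step 3
    for _ in range(max_iter):                     # step 5
        changed = False
        for j in range(k):                        # step 5(a), cluster by cluster
            members = [p for p, l in zip(points, labels) if l == j]
            best = medoids[j]
            best_cost = sum(math.dist(best, p) for p in members)
            others = [p for p in members if p != best]
            for cand in rng.sample(others, min(n_candidates, len(others))):
                cost = sum(math.dist(cand, p) for p in members)
                if cost < best_cost:              # swap only on strict improvement
                    best, best_cost = cand, cost
            if best != medoids[j]:
                medoids[j] = best
                changed = True
        labels = assign(points, medoids)          # step 5(b)
        if not changed:
            break
    return medoids, labels, total_cost(points, medoids, labels)

# Hypothetical toy data: two loose groups of three points each.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, labels, cost = k_medoids(data, k=2, seed=0)
print(medoids, labels, cost)
```

Because the initialization and the candidate selection are random, different seeds can end in different local minima, so no particular output is shown here.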

In the present implementation, the user has the option of repeating steps 2 - 5 multiple times and keeping the model with the lowest cost, as this algorithm can be sensitive to initialization. There is also an option to predict the cluster of a new user-provided data point.

The Screenshots section below shows some results obtained with this implementation, along with references for finding the corresponding datasets.
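
The restart and prediction options might look roughly like the sketch below. The names `fit_once`, `fit_best`, `predict`, and `n_init` are hypothetical and do not come from this pull request, and `fit_once` is a deliberately simplified stand-in for one full run of steps 2 - 5 (it only initializes at random, without refinement) so that the example stays short.

```python
import math
import random

def cost_of(points, medoids):
    # Cost of a model: each point contributes its distance to the nearest medoid.
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def fit_once(points, k, rng):
    # Stand-in for one run of steps 2 - 5: random initialization only.
    medoids = rng.sample(points, k)
    return medoids, cost_of(points, medoids)

def fit_best(points, k, n_init=10, seed=None):
    # Repeat the run n_init times and keep the model with the lowest cost.
    rng = random.Random(seed)
    runs = [fit_once(points, k, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[1])

def predict(medoids, point):
    # A new point belongs to the cluster of its closest medoid.
    return min(range(len(medoids)), key=lambda j: math.dist(point, medoids[j]))

# Hypothetical toy data.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
best_medoids, best_cost = fit_best(data, k=2, n_init=20, seed=1)
print(best_medoids, best_cost)
print(predict([(0, 0), (10, 10)], (9, 9)))
```

Keeping the lowest-cost run never does worse than a single run with the same random stream, since the first run is one of the candidates.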

Testing:
This pull request includes a basic set of unit tests: validation tests, a distance calculation test, class method tests, k-medoids algorithm tests, a prediction test, and (optional, commented out) visual inspection tests. The tests can be run with the following command:

python -m unittest discover -s test -p "test_k_medoids_model.py" -v

Example usage of the KMedoidsModel class can also be found in test/test_k_medoids_model.py.

Screenshots and Example Results
The following example uses the data from model_1 in test/test_k_medoids_model.py. Default parameters were used, with the number of clusters set to 5. The intended medoids ((-1, -1), (-1, 1), (0, 0), (1, 1), (1, -1)) and clusters are obtained.
[Image: Screen Shot 2019-04-10 at 10 45 48 PM]

The following example uses the data from model_2 in test/test_k_medoids_model.py. Default parameters were used, with the number of clusters set to 2. This example was taken from http://eacharya.inflibnet.ac.in/data-server/eacharya-documents/53e0c6cbe413016f23443704_INFIEP_33/93/LM/33-93-LM-V1-S1__kmedoids.pdf and the clusters and medoids obtained match the results in that reference.
[Image: Screen Shot 2019-04-10 at 10 46 12 PM]

The following example uses the A-set A1 from http://cs.joensuu.fi/sipu/datasets/ , with 50 initializations, 100 iterations, and 20 clusters. The ground truth is not fully recovered, though this result was obtained with little tuning.
[Image: Screen Shot 2019-04-10 at 11 58 53 PM]
