Description:
K-medoids is a clustering algorithm similar to k-means, but instead of defining each cluster by a center coordinate, it uses one of the data points, which we call a medoid. K-means and k-medoids are roughly analogous to the mean and the median, respectively: k-medoids is more robust to noise and outliers than k-means, just as the median is more robust than the mean.
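To make the robustness point concrete, here is a minimal sketch that compares the mean of a small point set with its medoid when a single outlier is present. The `medoid` helper and the sample data are illustrative assumptions and are not part of this pull request; Euclidean distance is assumed.

```python
import numpy as np


def medoid(points: np.ndarray) -> np.ndarray:
    """Return the data point with the smallest total Euclidean distance to all points.

    Hypothetical helper for illustration only; not part of KMedoidsModel.
    """
    # Pairwise distance matrix of shape (n, n).
    diffs = points[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # The medoid is the row whose distances sum to the smallest value.
    return points[dists.sum(axis=1).argmin()]


# Four points near the origin plus one far outlier (made-up data).
points = np.array(
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
)

centroid = points.mean(axis=0)  # dragged toward the outlier
center = medoid(points)         # stays on one of the actual data points

print("mean:  ", centroid)
print("medoid:", center)
```

The mean lands far from every real observation, while the medoid remains one of the points in the dense group.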
The basic steps of this algorithm are:
(a) Maximization step: for each cluster, choose non-medoid data points at random and calculate the within-cluster cost of using each one as the medoid. If a candidate's cost is lower than the cost achieved by the current medoid, make that data point the new cluster medoid.
(b) Expectation step: re-draw the cluster boundaries around the updated medoids.
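The two steps above can be sketched roughly as follows. This is not the code from this pull request: the function names are hypothetical, the cost is assumed to be the sum of Euclidean distances to the medoid, and only one random candidate per cluster is tried in each maximization step.

```python
import numpy as np


def expectation_step(points, medoid_idx):
    """Assign every point to its nearest medoid (re-draws the cluster boundaries)."""
    dists = np.linalg.norm(
        points[:, None, :] - points[medoid_idx][None, :, :], axis=-1
    )
    return dists.argmin(axis=1)


def cluster_cost(points, member_idx, candidate_idx):
    """Sum of distances from each cluster member to the candidate medoid."""
    return np.linalg.norm(points[member_idx] - points[candidate_idx], axis=1).sum()


def maximization_step(points, medoid_idx, labels, rng):
    """Try one random non-medoid point per cluster; keep it if it lowers the cost."""
    new_medoids = medoid_idx.copy()
    for k, current in enumerate(medoid_idx):
        members = np.flatnonzero(labels == k)
        candidates = members[members != current]
        if candidates.size == 0:
            continue  # nothing to swap in a single-point cluster
        candidate = rng.choice(candidates)
        if cluster_cost(points, members, candidate) < cluster_cost(points, members, current):
            new_medoids[k] = candidate
    return new_medoids


def total_cost(points, medoid_idx):
    """Sum over all points of the distance to the nearest medoid."""
    dists = np.linalg.norm(
        points[:, None, :] - points[medoid_idx][None, :, :], axis=-1
    )
    return dists.min(axis=1).sum()


# Made-up data: two well-separated groups of three points each.
rng = np.random.default_rng(0)
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
medoid_idx = np.array([1, 4])  # arbitrary starting medoids, one per group
initial_cost = total_cost(points, medoid_idx)

for _ in range(20):
    labels = expectation_step(points, medoid_idx)
    medoid_idx = maximization_step(points, medoid_idx, labels, rng)

final_cost = total_cost(points, medoid_idx)
print(initial_cost, final_cost)
```

Because a candidate is accepted only when it lowers the cost within its cluster, and re-assignment can only move a point to a closer medoid, the total cost never increases from one iteration to the next.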
In the present implementation, we have given the user the option of repeating steps 2 - 5 multiple times and choosing the model with the lowest cost, as this algorithm can be sensitive to initializations. We have also added the option to predict the cluster of a new user-provided data point.
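As a rough illustration of the restart and prediction options, the sketch below runs several random initializations, keeps the run with the lowest cost, and assigns a new point to its nearest medoid. All names (`fit_once`, `fit_best_of`, `predict`, `n_initializations`, `n_iterations`) are hypothetical and do not reflect the actual `KMedoidsModel` interface. The single run uses a simplified random-swap search as a stand-in for the per-cluster update described above, and Euclidean distance is assumed.

```python
import numpy as np


def nearest_medoid(points, medoids):
    """Return (labels, total cost) when each point joins its closest medoid."""
    dists = np.linalg.norm(points[:, None, :] - medoids[None, :, :], axis=-1)
    return dists.argmin(axis=1), dists.min(axis=1).sum()


def fit_once(points, n_clusters, n_iterations, rng):
    """One run from a random initialization; returns (medoid indices, final cost)."""
    medoid_idx = rng.choice(len(points), size=n_clusters, replace=False)
    _, cost = nearest_medoid(points, points[medoid_idx])
    for _ in range(n_iterations):
        k = rng.integers(n_clusters)           # which medoid to try replacing
        candidate = rng.integers(len(points))  # random replacement point
        if candidate in medoid_idx:
            continue
        trial = medoid_idx.copy()
        trial[k] = candidate
        _, trial_cost = nearest_medoid(points, points[trial])
        if trial_cost < cost:                  # accept only improving swaps
            medoid_idx, cost = trial, trial_cost
    return medoid_idx, cost


def fit_best_of(points, n_clusters, n_initializations=10, n_iterations=100, seed=0):
    """Repeat fit_once from several initializations and keep the lowest-cost run."""
    rng = np.random.default_rng(seed)
    runs = [
        fit_once(points, n_clusters, n_iterations, rng)
        for _ in range(n_initializations)
    ]
    best_idx, best_cost = min(runs, key=lambda run: run[1])
    return points[best_idx], best_cost


def predict(medoids, new_point):
    """Index of the medoid closest to a new data point."""
    new_point = np.asarray(new_point, dtype=float)
    return int(np.linalg.norm(medoids - new_point, axis=1).argmin())


# Made-up data: two well-separated groups of three points each.
points = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float
)
medoids, cost = fit_best_of(points, n_clusters=2)
label_near_origin = predict(medoids, [0.5, 0.5])
label_far = predict(medoids, [10.5, 10.5])
print(medoids, cost, label_near_origin, label_far)
```

Selecting the minimum over runs is what reduces the sensitivity to initialization: a single unlucky start can stall at a poor configuration, but it is discarded as long as any other run reaches a lower cost.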
The Screenshots section below shows some results obtained with this implementation and includes references for finding the corresponding datasets.
Testing:
A basic set of unit tests is included in this pull request: validation tests, a distance calculation test, class method tests, k-medoids algorithm tests, a prediction test, and (optional, commented-out) visual inspection tests. In order to run these tests, the following command can be used:
Example usages of the `KMedoidsModel` class can be seen in `test/test_k_medoids_model.py` as well.

Screenshots and Example Results

The following example comes from the data from `model_1` in `test/test_k_medoids_model.py`. Default parameters were used with number of clusters being 5. The intended medoids ((-1, -1), (-1, 1), (0, 0), (1, 1), (1, -1)) and clusters are obtained.

The following example comes from the data from `model_2` in `test/test_k_medoids_model.py`. Default parameters were used with number of clusters being 2. This example was taken from http://eacharya.inflibnet.ac.in/data-server/eacharya-documents/53e0c6cbe413016f23443704_INFIEP_33/93/LM/33-93-LM-V1-S1__kmedoids.pdf and clusters and medoids obtained match the results in that reference.

The following example comes from the A-set A1 in http://cs.joensuu.fi/sipu/datasets/ , where 50 initializations were chosen with 100 iterations and number of clusters is 20. The ground truths are not quite achieved, though the following was achieved with little tuning.