richCluster is a fast C++ agglomerative hierarchical clustering algorithm packaged into easily callable R functions, designed to help cluster biological 'terms' based on how similar of genes are expressed in their activation.
Terms are clustered together based on how many genes are shared between them. We support two different types of similarity scores:
- Kappa score
- Jaccard index
As well as different linkage criteria for iteratively merging clusters together.
Multiple linkage (from DAVID implementation)(not yet implemented)- Single
- Complete
- Average
Ward(not yet implemented)
The package is currently under review in submission to CRAN, but for now users can install the package and try out the clustering and visualization features by installing through GitHub.
install.packages("devtools")
require(devtools)
install_github("hyuncat/richCluster")
The basic flow is as follows. A demo can be followed along in the tests.R
script in the /scripts
subfolder. All demo data can be found in /inst/extdata
.

The user can supply a list of enrichment results, and simply run cluster()
with certain options to return a ClusterResult
named list object which is all you need to pass in order to the various visualization functions.
The goal of the richCluster package is to simplify the workflow of clustering biological terms across multiple enrichment results.
- For instance, if a user has 3 different enrichment data corresponding to the same mouse at different stages of a disease and wants to directly compare those results side by side.
Our cluster()
function accepts any arbitrary number of enrichment results in a list, and then will cluster different $Term
rows together based on how many shared genes they have in the $GeneID
column.
The absolute minimum a user must do is to provide a list of dataframes corresponding to the enrichment results they want to cluster.
enrichment_results
- A list of dataframes, each containing enrichment results. Each dataframe should include at least the columns 'Term', 'GeneID', and 'Padj'.
Some basic options to customize clustering outputs include:
df_names
- An optional character vector of names for the enrichment result dataframes. Must match the length ofenrichment_results
. Default is NULL.min_terms
- Minimum number of terms each final cluster must includemin_value
- Minimum 'Pvalue' a term must have in order to be counted in final clustering
We also allow user to specify which distance metric / cutoff score they want to use to cluster terms together.
distance_metric
- A string specifying the distance metric to use (e.g., "kappa").distance_cutoff
- A numeric value for the distance cutoff (0 < cutoff <= 1).
Note that we technically are using a 'similarity' metric, so the cutoff is the minimum kappa score (for instance) that two terms must share in order to be clustered together. Hence a higher cutoff would lead to stricter clustering / smaller clusters.
After an initial grouping of terms based on having a distance score above a certain cutoff, we merge our initial "seed groups" (pre-clusters) together based on the provided linkage method.
linkage_method
- A string specifying the linkage method to use (e.g., "average").linkage_cutoff
- A numeric value between 0 and 1 for the minimum linkage cutoff needed to merge two groups together.
The default recommended should be average
, which takes the average distance between all terms in the two clusters and uses that as the metric to merge.
The total list of supported metrics includes:
(not yet implemented)"david"
- DAVID multiple linkage membership"single"
"complete"
"average"
(not yet implemented)"ward"
(recommended)
Again, a higher linkage_cutoff leads to stricter (smaller) clusters.
The output of the cluster()
function is a ClusterResult
which can be directly inputted into the visualizations or exported as a csv file with some additional options.
The name of each cluster is determined as the term in the cluster with the highest gene count.
We support two broad categories of cluster visualization:
- Cluster-level: Aggregates pvalue across all terms in the cluster, to compare different clusters to each other
- Term-level: Displays individual term pvalues within a cluster
Where comparisons are also done across the different enrichment results which originally went into clustering.
Using these two categories, we have three different types of plots which are currently supported:
- Heatmaps
- Bar plots
- Dot plots
As well as an option to export as a dataframe (CSV).
cluster_hmap
displays the -log10(pvalue) of all the different clusters across the user's supplied enrichment results.

term_hmap
displays the -log10(pvalue) of the union of all terms in the specified clusters, as well as any additional terms specified.

cluster_bar
displays the -log10(pvalue) of all the different clusters across the user's supplied enrichment results.

term_bar
displays the pvalue of all terms within a single cluster across the different input enrichment results.

cluster_dot
shows the enrichment data with the size of the cluster being associated with dot radius and the x-axis being the -log10(pvalue).

term_dot
shows the enrichment data with the number of genes in each term being associated with dot radius and the x-axis being the -log10(pvalue).

To view the data in excel / as a dataframe, users can export the final clustered data as a dataframe with export_df
and save their results to a csv as follows.
cluster_df <- richCluster::export_df(cluster_result)
write.csv(cluster_df, "~/Downloads/cluster_df.csv", row.names=FALSE)
Otherwise, to retain access to the entire suite of visualization functions users can also download the entire cluster_result as a list object as follows:
saveRDS(cluster_result, file = "~/Downloads/cluster_result.rds")
Then read it back in as an R list as such:
cluster_result <- readRDS('~/Downloads/cluster_result.rds')
to avoid re-clustering large datasets across multiple work sessions.
Note: If you are receiving errors, please try renaming your columns of interest to 'Term' and 'GeneID' across all genesets of interest, and make sure the formatting of the columns is consistent (eg, same Term/GeneID name spellings) across all datasets.
If that doesn't work, feel free to raise a GitHub issue or email @[email protected] and I'll try to help you with your problem.