The Generalised [1], Projected [2], weighted generalised [3], and kernel
generalised [4] Covariance Measure tests (GCM, PCM, wGCM, kGCM tests) can be
used to test conditional independence between a real-valued response Y and
features X given additional features Z. The comets R package implements these
covariance measure tests (COMETs) with a user-friendly interface which allows
the user to use any sufficiently predictive supervised learning algorithm of
their choosing. The default is to use random forests implemented in ranger for
all regressions. A Python version of this package is also available.
Here, we showcase how to use comets with a simple example in which Y is not
conditionally independent of X given Z.
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
colnames(X) <- c("X1", "X2")
Z <- matrix(rnorm(2 * n), ncol = 2)
colnames(Z) <- c("Z1", "Z2")
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)
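GCM <- gcm(Y, X, Z) # plot(GCM)
The output for the GCM test, which fails to reject the null hypothesis of
conditional independence in this example, is shown below. The test has no power
here because Y depends on X only through X1^2, which is uncorrelated with X1,
so the expected conditional covariance between Y and X given Z is zero. The
residuals for the GCM test can be plotted using plot(GCM) (not shown here).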
##
## Generalized covariance measure test
##
## data: gcm(Y = Y, X = X, Z = Z)
## X-squared = 2.8211, df = 2, p-value = 0.244
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
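To make the test statistic above more concrete, the following is a minimal sketch of the chi-squared form of the GCM statistic. It assumes plain linear regressions via lm() instead of the random forests that gcm() uses by default, so its numbers will not match the output above, and gcm_sketch() is a hypothetical helper, not part of comets.

```r
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
Z <- matrix(rnorm(2 * n), ncol = 2)
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)

# Hypothetical helper (not the comets implementation): GCM statistic
# computed from residuals of linear regressions of Y on Z and of X on Z.
gcm_sketch <- function(Y, X, Z) {
  X <- as.matrix(X)
  rY <- residuals(lm(Y ~ Z))             # residuals of Y given Z
  rX <- as.matrix(residuals(lm(X ~ Z)))  # residuals of each column of X given Z
  R <- rX * rY                           # residual products, one column per X
  n <- NROW(R)
  m <- colMeans(R)                       # estimated expected conditional covariance
  S <- crossprod(R) / n - tcrossprod(m)  # empirical covariance of the products
  stat <- n * drop(crossprod(m, solve(S, m)))
  df <- NCOL(R)
  list(statistic = stat, df = df,
       p.value = pchisq(stat, df = df, lower.tail = FALSE))
}

res <- gcm_sketch(Y, X, Z)
res$p.value
```

The statistic is compared against a chi-squared distribution with as many degrees of freedom as there are columns in X, which matches the df = 2 reported above.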
The PCM test can be run likewise.
PCM <- pcm(Y, X, Z) # plot(PCM)
The output is shown below. The PCM test correctly rejects the null hypothesis of
conditional independence in this example.
##
## Projected covariance measure test
##
## data: pcm(Y = Y, X = X, Z = Z)
## Z = 4.8589, p-value = 5.901e-07
## alternative hypothesis: true E[Y | X, Z] is not equal to E[Y | Z]
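The comets package contains an alternative formula-based interface, in which
the null hypothesis that Y is conditionally independent of X given Z is
supplied as a formula of the form Y ~ X | Z with a
corresponding data argument. This interface is implemented in comets() and
shown below.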
dat <- data.frame(Y = Y, X, Z)
comets(Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
##
## Generalized covariance measure test
##
## data: comets(formula = Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
## X-squared = 3.2184, df = 2, p-value = 0.2
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
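As an illustration of how a formula of the form Y ~ X | Z maps onto the Y, X and Z arguments used earlier, here is a small base R sketch. It is not how comets() parses formulas internally (that is not described here), the toy data frame is made up, and the sketch only handles plain variable names.

```r
# Illustrative only: split a formula of the form Y ~ X | Z using base R.
f <- Y ~ X1 + X2 | Z1 + Z2
rhs <- f[[3]]                                  # the call `|`(X1 + X2, Z1 + Z2)
stopifnot(identical(rhs[[1]], as.name("|")))   # right-hand side must contain "|"

y_var  <- all.vars(f[[2]])    # response
x_vars <- all.vars(rhs[[2]])  # variables left of "|"
z_vars <- all.vars(rhs[[3]])  # conditioning variables right of "|"

# Toy data (hypothetical) to show the resulting inputs
set.seed(1)
dat <- data.frame(Y = rnorm(5), X1 = rnorm(5), X2 = rnorm(5),
                  Z1 = rnorm(5), Z2 = rnorm(5))
Yvec <- dat[[y_var]]
Xmat <- as.matrix(dat[, x_vars, drop = FALSE])
Zmat <- as.matrix(dat[, z_vars, drop = FALSE])
```

This works because `|` binds more tightly than `~` but less tightly than `+` in R, so the right-hand side of the formula is a single call to `|`.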
Different regression methods can be supplied for both GCM and PCM tests using the
reg_* arguments (for instance, reg_YonZ in gcm() for the regression of Y on Z).
Pre-implemented regressions include "rf" for random forests and "lasso"
for cross-validated lasso regression. Custom regressions can also be supplied,
provided they come with a residuals() (GCM and
PCM) or predict() (PCM only) method and the following structure:
my_regression <- function(y, x, ...) {
ret <- <run the regression>
class(ret) <- "my_regression"
ret
}
predict.my_regression <- function(object, data, ...) {
<run the prediction routine>
}
residuals.my_regression <- function(object, response, data, ...) {
<run the routine for computing residuals>
}
The input y is a vector, while x and data are matrices. The output of
predict.my_regression() should be a vector of length NROW(data).
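As a concrete instance of the template above, the following sketch fills in the placeholders with an ordinary least squares fit. The name my_lm and the choice of lm.fit() are assumptions made for illustration. The example only checks the methods on simulated data; how the regression is then passed to gcm() or pcm() through the reg_* arguments is not shown and should be checked against the package documentation.

```r
# Hypothetical custom regression "my_lm" following the structure above
my_lm <- function(y, x, ...) {
  x <- as.matrix(x)
  fit <- lm.fit(cbind(1, x), y)         # least squares with an intercept
  ret <- list(coef = fit$coefficients)
  class(ret) <- "my_lm"
  ret
}

predict.my_lm <- function(object, data, ...) {
  # Returns a vector of length NROW(data)
  drop(cbind(1, as.matrix(data)) %*% object$coef)
}

residuals.my_lm <- function(object, response, data, ...) {
  response - predict(object, data = data)
}

# Quick check on simulated data
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)
y <- 1 + 2 * x[, 1] - x[, 2] + rnorm(100)
fit <- my_lm(y, x)
pred <- predict(fit, data = x)
res <- residuals(fit, response = y, data = x)
```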
For survival responses, comets offers the TRAM-GCM test [5] and supports
parametric and semiparametric survival models from the survival package, as
well as random survival forests from ranger. As an example, we test whether
survival is independent of sex given age in the cancer dataset, once using a
Cox model and once using a random survival forest. Both tests reject
the null hypothesis at conventional significance levels.
library("survival")
data("cancer", package = "survival")
cancer$surv <- with(cancer, Surv(time, status == 2))
comets(surv ~ sex | age, data = cancer, reg_YonZ = "cox")
##
## Generalized covariance measure test
##
## data: comets(formula = surv ~ sex | age, data = cancer, reg_YonZ = "cox")
## X-squared = 7.7608, df = 1, p-value = 0.005339
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
comets(surv ~ sex | age, data = cancer, reg_YonZ = "survforest")
##
## Generalized covariance measure test
##
## data: comets(formula = surv ~ sex | age, data = cancer, reg_YonZ = "survforest")
## X-squared = 14.827, df = 1, p-value = 0.0001178
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
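The idea behind a residual-based test for a survival response can be sketched by hand. The example below loosely mirrors the Cox-based test by using martingale residuals from coxph() as the residuals of the survival response given age. It assumes a linear regression of sex on age, which is not necessarily what comets() uses, so the statistic will differ from the output above. It is an illustration, not the TRAM-GCM implementation.

```r
library("survival")
data("cancer", package = "survival")

# Residuals of the survival response given age: martingale residuals
# from a Cox model (assumption made for this sketch)
cox_fit <- coxph(Surv(time, status == 2) ~ age, data = cancer)
rY <- residuals(cox_fit, type = "martingale")

# Residuals of sex given age: linear regression (assumption; comets()
# may use a different default regression here)
rX <- residuals(lm(sex ~ age, data = cancer))

# Univariate GCM-type statistic built from the residual products
R <- rY * rX
n <- length(R)
stat <- n * mean(R)^2 / (mean(R^2) - mean(R)^2)
pval <- pchisq(stat, df = 1, lower.tail = FALSE)
c(statistic = stat, p.value = pval)
```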
The GCM test also supports multivariate responses. Continuing the example from
above, we generate a bivariate response and run the GCM test. The
multivariate argument also allows the specification of
multivariate regression models (this option is experimental).
bivY <- cbind(Y, 0.5 * X[, 1] + Z[, 1] + rnorm(n))
gcm(bivY, X, Z)
##
## Generalized covariance measure test
##
## data: gcm(Y = bivY, X = X, Z = Z)
## X-squared = 42.177, df = 4, p-value = 1.533e-08
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
The development version of comets can be installed using:
# install.packages("remotes")
remotes::install_github("LucasKook/comets")
A stable version of comets can be installed from CRAN via:
install.packages("comets")
All results in [6] can be reproduced by running make all in ./inst after
downloading all required data from the Zenodo repository.
The scripts for reproducing the results manually can be found in ./inst/code/
for the CCLE data (ccle.R), TCGA data (multiomics.R) and MIMIC data
(mimic.R).
Please cite the comets package as
@article{10.1093/bib/bbae475,
title={{Algorithm-Agnostic Significance Testing in Supervised Learning With Multimodal Data}},
author={Lucas Kook and Anton Rask Lundborg},
year={2024},
journal={Briefings in Bioinformatics},
volume={25},
number={6},
doi={10.1093/bib/bbae475},
}
[1] Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3), 1514-1538. doi:10.1214/19-AOS1857
[2] Lundborg, A. R., Kim, I., Shah, R. D., & Samworth, R. J. (2024). The Projected Covariance Measure for assumption-lean variable significance testing. The Annals of Statistics, 52(6), 2851-2878. doi:10.1214/24-AOS2447
[3] Scheidegger, C., Hörrmann, J., & Bühlmann, P. (2022). The weighted generalised covariance measure. Journal of Machine Learning Research, 23(273), 1-68.
[4] Fernández, T., & Rivera, N. (2024). A general framework for the analysis of kernel-based tests. Journal of Machine Learning Research, 25(95), 1-40.
[5] Kook, L., Saengkyongam, S., Lundborg, A. R., Hothorn, T., & Peters, J. (2025). Model-based causal feature selection for general response types. Journal of the American Statistical Association, 120(550), 1090-1101. doi:10.1080/01621459.2024.2395588
[6] Kook, L., & Lundborg, A. R. (2024). Algorithm-agnostic significance testing in supervised learning with multimodal data. Briefings in Bioinformatics, 25(6). doi:10.1093/bib/bbae475
