Covariance Measure Tests (COMETs) in R

The Generalised [1], Projected [2], weighted generalised [3], and kernel generalised [4] Covariance Measure tests (GCM, PCM, wGCM, and kGCM tests) can be used to test conditional independence between a real-valued response $Y$ and features/modalities $X$ given additional features/modalities $Z$ using any sufficiently predictive supervised learning algorithm. An extension of the GCM to censored responses was proposed in [5] and is implemented with survival regression methods. The comets R package implements these covariance measure tests (COMETs) with a user-friendly interface that lets users choose the supervised learning algorithm for each regression. By default, all regressions use random forests as implemented in ranger. A Python version of this package is also available.

Here, we showcase how to use comets with a simple example in which $Y$ is not independent of $X$ given $Z$. More elaborate examples including conditional variable significance testing and modality selection on real-world data can be found in [6].

library("comets")
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
colnames(X) <- c("X1", "X2")
Z <- matrix(rnorm(2 * n), ncol = 2)
colnames(Z) <- c("Z1", "Z2")
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)
GCM <- gcm(Y, X, Z) # plot(GCM)

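The output for the GCM test is shown below. In this example, the test fails to reject the null hypothesis of conditional independence even though it is false: $Y$ depends on $X$ only through $X_1^2$, which is uncorrelated with $X_1$ and $X_2$ given $Z$, so the covariance targeted by the GCM is zero. The residuals for the $Y$ on $Z$ and $X$ on $Z$ regressions can be investigated by calling plot(GCM) (not shown here).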

## 
## 	Generalized covariance measure test
## 
## data:  gcm(Y = Y, X = X, Z = Z)
## X-squared = 2.8211, df = 2, p-value = 0.244
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
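
To make the statistic behind this output more concrete, the following is a minimal sketch of a GCM-type computation in base R. It assumes linear regressions via lm() in place of the random forests that comets uses by default, so the numbers will differ from the output above; the variable names rY, rX, R, stat and pval are chosen for illustration and are not part of the package.

```r
# Sketch of a GCM-type statistic (assumption: lm() replaces the default
# random forest regressions, so results differ from gcm()).
set.seed(1)
n <- 300
X <- matrix(rnorm(2 * n), ncol = 2)
Z <- matrix(rnorm(2 * n), ncol = 2)
Y <- X[, 1]^2 + Z[, 2] + rnorm(n)

# Residuals of Y on Z (vector) and of each column of X on Z (n x 2 matrix)
rY <- residuals(lm(Y ~ Z))
rX <- residuals(lm(X ~ Z))

# Products of residuals: column j holds rY * rX[, j]
R <- rY * rX

# Quadratic form in the column means, compared to a chi-squared distribution
# with as many degrees of freedom as there are columns in R
m <- colMeans(R)
stat <- drop(n * t(m) %*% solve(cov(R)) %*% m)
pval <- pchisq(stat, df = ncol(R), lower.tail = FALSE)
c(statistic = stat, p.value = pval)
```

Because the statistic is built from products of residuals, it only picks up dependence that shows up as a nonzero covariance between $Y$ and $X$ after adjusting for $Z$.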

The PCM test can be run likewise.

PCM <- pcm(Y, X, Z) # plot(PCM)

The output is shown below. The PCM test correctly rejects the null hypothesis of conditional independence in this example.

## 
## 	Projected covariance measure test
## 
## data:  pcm(Y = Y, X = X, Z = Z)
## Z = 4.8589, p-value = 5.901e-07
## alternative hypothesis: true E[Y | X, Z] is not equal to E[Y | Z]

The comets package contains an alternative formula-based interface, in which $H_0 : Y \perp\hspace{-5pt}\perp X \mid Z$ can be supplied as Y ~ X | Z with a corresponding data argument. This interface is implemented in comets() and shown below.

dat <- data.frame(Y = Y, X, Z)
comets(Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
## 
## 	Generalized covariance measure test
## 
## data:  comets(formula = Y ~ X1 + X2 | Z1 + Z2, data = dat, test = "gcm")
## X-squared = 3.2184, df = 2, p-value = 0.2
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0

Specifying regression methods

Different regression methods can be supplied for both GCM and PCM tests using the reg_* arguments (for instance, reg_YonZ in gcm() for the regression of $Y$ on $Z$). Pre-implemented regressions are "rf" for random forests and "lasso" for cross-validated $L_1$-penalized regression. Custom regression functions can be supplied as character strings or functions. They require a residuals() method (GCM and PCM) or a predict() method (PCM only) and the following structure:

my_regression <- function(y, x, ...) {
  ret <- <run the regression>
  class(ret) <- "my_regression"
  ret
}

predict.my_regression <- function(object, data, ...) {
  <run the prediction routine>
}

residuals.my_regression <- function(object, response, data, ...) {
  <run the routine for computing residuals>
}

The inputs y, x, and data are vector- or matrix-valued. The output of predict.my_regression() should be a vector of length NROW(data).
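
As a concrete instance of the template above, here is a minimal sketch of a custom regression based on ordinary least squares. The name my_lm and its internals are hypothetical and only illustrate the required structure; the sketch uses base R only and assumes a univariate response.

```r
# Hypothetical custom regression "my_lm": least squares via lm.fit()
my_lm <- function(y, x, ...) {
  x <- as.matrix(x)
  fit <- stats::lm.fit(cbind(1, x), y) # prepend an intercept column
  ret <- list(coefficients = fit$coefficients)
  class(ret) <- "my_lm"
  ret
}

# Predictions: a vector of length NROW(data)
predict.my_lm <- function(object, data, ...) {
  drop(cbind(1, as.matrix(data)) %*% object$coefficients)
}

# Residuals: observed response minus prediction on the supplied data
residuals.my_lm <- function(object, response, data, ...) {
  response - predict(object, data = data)
}

# Quick check on simulated data
set.seed(1)
z <- matrix(rnorm(200), ncol = 2)
y <- 1 + z[, 1] - 2 * z[, 2] + rnorm(100)
fit <- my_lm(y, z)
res <- residuals(fit, response = y, data = z)
length(predict(fit, data = z)) == NROW(z)
```

With comets installed, such a regression could then be passed by name, for example gcm(Y, X, Z, reg_YonZ = "my_lm", reg_XonZ = "my_lm"); the argument reg_XonZ is assumed here by analogy with reg_YonZ.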

Usage example: Survival response

For survival responses, comets offers the TRAM-GCM test [5] and supports parametric and semiparametric survival models from the survival package, as well as random survival forests from ranger. As an example, we test whether survival is independent of sex given age in the cancer dataset, once using a Cox model and once using a random survival forest. Both tests reject the null hypothesis at conventional significance levels.

library("survival")
data("cancer", package = "survival")
cancer$surv <- with(cancer, Surv(time, status == 2))
comets(surv ~ sex | age, data = cancer, reg_YonZ = "cox")
## 
## 	Generalized covariance measure test
## 
## data:  comets(formula = surv ~ sex | age, data = cancer, reg_YonZ = "cox")
## X-squared = 7.7608, df = 1, p-value = 0.005339
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0
comets(surv ~ sex | age, data = cancer, reg_YonZ = "survforest")
## 
## 	Generalized covariance measure test
## 
## data:  comets(formula = surv ~ sex | age, data = cancer, reg_YonZ = "survforest")
## X-squared = 14.827, df = 1, p-value = 0.0001178
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0

Usage example: Multivariate response

The GCM test also supports multivariate responses. Continuing the example from above, we generate a bivariate response $Y$. Internally, since $Y$ and $X$ are both two-dimensional, four random forest regressions are performed (one for each column of $Y$ and of $X$ on $Z$). Advanced usage with the multivariate argument also allows the specification of multivariate regression models (this option is experimental).

bivY <- cbind(Y, 0.5 * X[, 1] + Z[, 1] + rnorm(n))
gcm(bivY, X, Z)
## 
## 	Generalized covariance measure test
## 
## data:  gcm(Y = bivY, X = X, Z = Z)
## X-squared = 42.177, df = 4, p-value = 1.533e-08
## alternative hypothesis: true E[cov(Y, X | Z)] is not equal to 0

Installation

The development version of comets can be installed using:

# install.packages("remotes")
remotes::install_github("LucasKook/comets")

A stable version of comets can be installed from CRAN via:

install.packages("comets")

Replication materials

All results in [6] can be reproduced by running make all in ./inst after downloading all required data from the Zenodo repository. The scripts for reproducing the results manually can be found in ./inst/code/ for the CCLE data (ccle.R), TCGA data (multiomics.R) and MIMIC data (mimic.R).

Citation

Please cite the comets package as

@article{10.1093/bib/bbae475,
title={{Algorithm-Agnostic Significance Testing in Supervised Learning With Multimodal Data}},
author={Lucas Kook and Anton Rask Lundborg},
year={2024},
journal={Briefings in Bioinformatics},
volume={25},
number={6},
doi={10.1093/bib/bbae475},
}

References

[1] Shah, R. D., & Peters, J. (2020). The hardness of conditional independence testing and the generalised covariance measure. The Annals of Statistics, 48(3), 1514-1538. doi:10.1214/19-AOS1857

[2] Lundborg, A. R., Kim, I., Shah, R. D., & Samworth, R. J. (2024). The Projected Covariance Measure for assumption-lean variable significance testing. The Annals of Statistics, 52(6), 2851-2878. doi:10.1214/24-AOS2447

[3] Scheidegger, C., Hörrmann, J., & Bühlmann, P. (2022). The weighted generalised covariance measure. Journal of Machine Learning Research, 23(273), 1-68.

[4] Fernández, T., & Rivera, N. (2024). A general framework for the analysis of kernel-based tests. Journal of Machine Learning Research, 25(95), 1-40.

[5] Kook, L., Saengkyongam, S., Lundborg, A. R., Hothorn, T., & Peters, J. (2025). Model-based causal feature selection for general response types. Journal of the American Statistical Association, 120(550), 1090-1101. doi:10.1080/01621459.2024.2395588

[6] Kook, L., & Lundborg, A. R. (2024). Algorithm-agnostic significance testing in supervised learning with multimodal data. Briefings in Bioinformatics, 25(6), bbae475. doi:10.1093/bib/bbae475
