This pipeline analyzes GTEx (Genotype-Tissue Expression) whole-slide histological image features to predict tissue quality metrics using both tissue-specific and pan-tissue modeling approaches. The primary goal is to build predictive models that estimate RNA integrity (RIN scores) and tissue degradation levels (autolysis scores) from extracted image features across different human tissue types.
- RIN Scores (RNA Integrity Number): Predicts RNA quality on a scale of 1-10, where higher scores indicate better preserved RNA
- Autolysis Scores: Predicts tissue degradation levels (0=none, 1=mild, 2=moderate, 3=severe)
- Quality Control: Automated assessment of tissue sample quality before expensive downstream analysis
- Resource Optimization: Prioritize high-quality samples for RNA sequencing and other molecular assays
- Tissue Banking: Improve sample selection and storage protocols
- Research Impact: Enable better quality-aware analysis of GTEx data
- Combines H5-formatted image features with GTEx metadata
- Integrates quality metrics (RIN, autolysis) with demographic information
- Processes 29+ tissue types and about 25 thousand samples
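
The integration step above can be sketched with base R alone. This is a minimal illustration, not the pipeline's actual code: the sample IDs, column names (`sample_id`, `tissue`, `rin`, `autolysis`), and feature names are hypothetical, and the toy matrix stands in for whatever is read from the H5 file (for example with `rhdf5::h5read`, whose dataset names inside the file are not documented here).

```r
# Toy stand-in for the image-feature matrix (rows = samples, cols = features).
set.seed(1)
features <- matrix(rnorm(5 * 4), nrow = 5,
                   dimnames = list(paste0("S", 1:5), paste0("feat_", 1:4)))

# Hypothetical metadata table; real GTEx annotation columns are named differently.
metadata <- data.frame(
  sample_id = c("S1", "S2", "S3", "S5", "S6"),
  tissue    = c("Lung", "Lung", "Liver", "Liver", "Skin"),
  rin       = c(7.2, 6.5, 8.1, 5.9, 7.7),
  autolysis = c(0, 1, 0, 2, 1),
  stringsAsFactors = FALSE
)

# Move sample IDs from row names into a column so both tables share a key.
feature_df <- data.frame(sample_id = rownames(features), features,
                         row.names = NULL, stringsAsFactors = FALSE)

# Inner join: keep only samples present in both sources.
merged <- merge(metadata, feature_df, by = "sample_id")

# Report unmatched samples so that silent sample loss is visible.
only_features <- setdiff(feature_df$sample_id, metadata$sample_id)
only_metadata <- setdiff(metadata$sample_id, feature_df$sample_id)
cat("Merged samples:", nrow(merged), "\n")
cat("Features without metadata:", only_features, "\n")
cat("Metadata without features:", only_metadata, "\n")
```

An inner join is used here so every retained row has both features and quality scores; whether the pipeline keeps unmatched samples is not stated in this README.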
- Feature Processing: Standardization and quality filtering of image-derived features
- Exploratory Analysis: Tissue distribution, quality score patterns, demographic effects
- Dimensionality Reduction: PCA and UMAP for tissue clustering visualization
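
As a rough sketch of the feature processing and PCA steps listed above, the following uses only base R on simulated data. The filtering rule (drop features with missing values or zero variance) is an assumption made for illustration; the pipeline's actual quality filters are not specified here.

```r
# Simulated feature matrix: 60 samples x 20 features.
set.seed(42)
n <- 60; p <- 20
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("feat_", seq_len(p))))
X[, 5]  <- 1    # a constant (zero-variance) feature
X[3, 7] <- NA   # a feature with a missing value

# Quality filter (assumed rule): drop features with NAs or zero variance.
has_na   <- colSums(is.na(X)) > 0
variance <- apply(X, 2, var, na.rm = TRUE)
keep     <- !has_na & variance > 0
X_filt   <- X[, keep, drop = FALSE]

# Standardize each remaining feature to mean 0 and standard deviation 1.
X_std <- scale(X_filt, center = TRUE, scale = TRUE)

# PCA on the already standardized matrix.
pca <- prcomp(X_std, center = FALSE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pc_scores <- pca$x[, 1:2]   # coordinates for a 2-D tissue clustering plot

cat("Features kept:", ncol(X_std), "of", p, "\n")
cat("Variance explained by PC1 and PC2:", round(sum(var_explained[1:2]), 3), "\n")
```

If the `umap` package from the installation list is available, `umap::umap(X_std)$layout` should give the analogous two-dimensional embedding for the same standardized matrix.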
- Tissue-specific models: Individual predictive models for each tissue type
- Feature selection: Correlation-based selection of top 5% most predictive features
- Cross-validation: 5-fold CV ensures robust performance estimates
- Comprehensive evaluation: Multiple metrics (R, RMSE, MAE) and stability assessment
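
The tissue-specific modeling steps above could look roughly like the sketch below for a single tissue. It runs on simulated data and assumes the `glmnet` package is installed. Two choices here are assumptions rather than documented behavior: feature selection is repeated inside each training fold so the held-out fold is never used for selection, and folds are assigned at random rather than stratified.

```r
library(glmnet)

# Simulated data for one tissue: 200 samples x 400 features.
set.seed(123)
n <- 200; p <- 400
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("feat_", seq_len(p))))
# Simulated RIN-like target driven by the first five features.
y <- 7 + as.vector(X[, 1:5] %*% c(0.8, -0.6, 0.5, 0.4, -0.3)) + rnorm(n, sd = 0.5)

k_folds <- 5
fold_id <- sample(rep(seq_len(k_folds), length.out = n))
fold_metrics <- data.frame(fold = seq_len(k_folds),
                           R = NA_real_, RMSE = NA_real_, MAE = NA_real_)

for (k in seq_len(k_folds)) {
  train <- fold_id != k
  test  <- !train

  # Top 5% of features by absolute Pearson correlation with the target,
  # computed on training rows only.
  cors  <- abs(cor(X[train, ], y[train]))[, 1]
  n_top <- ceiling(0.05 * p)
  top   <- order(cors, decreasing = TRUE)[seq_len(n_top)]

  # Lasso (alpha = 1); lambda is chosen by an inner cross-validation.
  fit  <- cv.glmnet(X[train, top], y[train], alpha = 1, nfolds = 5)
  pred <- as.vector(predict(fit, newx = X[test, top], s = "lambda.min"))

  # Held-out metrics for this fold: Pearson R, RMSE, MAE.
  err <- pred - y[test]
  fold_metrics[k, c("R", "RMSE", "MAE")] <-
    c(cor(pred, y[test]), sqrt(mean(err^2)), mean(abs(err)))
}

print(round(fold_metrics, 3))
```

Running the same loop once per tissue, with `y` set to either the RIN or the autolysis score, would give one row of fold-level metrics per tissue and target.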
- Data Preparation: Stratified train-test split with tissue and quality quartiles
- Feature Selection: ANOVA F-score ranking for tissue discrimination (top 500 features)
- Stage 1 - Tissue Classification:
  - Multi-class GLMNET with stratified cross-validation
  - Confidence-based predictions with threshold filtering
- Stage 2 - Quality Prediction:
  - Tissue-specific Lasso models for RIN and autolysis prediction
  - Uses all available features with built-in Lasso selection
- Batch Prediction: Integrated workflow for tissue identification → quality prediction
- Performance Evaluation: Multi-metric assessment including accuracy, correlation, RMSE
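
To make the pan-tissue workflow more concrete, here is a minimal sketch of the tissue classification stage on simulated data, assuming `glmnet` is installed. The tissue names, the 0.7 confidence threshold, the 25% test fraction, and the use of 10 top features (instead of 500) are illustrative assumptions. The split here is stratified by tissue only, not by quality quartiles.

```r
library(glmnet)

# Simulated data: 3 tissues x 60 samples, 50 features.
set.seed(7)
tissues <- c("Lung", "Liver", "Skin")
tissue  <- factor(rep(tissues, each = 60), levels = tissues)
p <- 50
X <- matrix(rnorm(length(tissue) * p), ncol = p,
            dimnames = list(NULL, paste0("feat_", seq_len(p))))
# Shift a few features per tissue so the classes are separable.
X[tissue == "Lung", 1:3]  <- X[tissue == "Lung", 1:3] + 2
X[tissue == "Liver", 4:6] <- X[tissue == "Liver", 4:6] + 2

# Stratified split: hold out 25% of each tissue.
test_idx <- unlist(lapply(split(seq_along(tissue), tissue),
                          function(i) sample(i, size = round(0.25 * length(i)))))
train_idx <- setdiff(seq_along(tissue), test_idx)
tr_tissue <- tissue[train_idx]

# ANOVA F-score per feature (training rows only); keep the top-ranked ones.
f_score <- apply(X[train_idx, ], 2,
                 function(x) anova(lm(x ~ tr_tissue))[1, "F value"])
top_k <- 10
top   <- order(f_score, decreasing = TRUE)[seq_len(top_k)]

# Stratified fold assignment for the multinomial glmnet cross-validation.
fold_id <- integer(length(train_idx))
for (lev in levels(tr_tissue)) {
  idx <- which(tr_tissue == lev)
  fold_id[idx] <- sample(rep(1:5, length.out = length(idx)))
}
fit <- cv.glmnet(X[train_idx, top], tr_tissue, family = "multinomial",
                 type.measure = "class", foldid = fold_id)

# Class probabilities, predicted tissue, and confidence (max probability).
prob <- predict(fit, newx = X[test_idx, top],
                s = "lambda.min", type = "response")[, , 1]
pred_class <- colnames(prob)[max.col(prob, ties.method = "first")]
confidence <- apply(prob, 1, max)

# Threshold filtering: only confident calls go on to quality prediction.
threshold <- 0.7   # hypothetical value
confident <- confidence >= threshold
routed    <- split(which(confident), pred_class[confident])

accuracy_all <- mean(pred_class == tissue[test_idx])
cat("Test accuracy:", round(accuracy_all, 3), "\n")
cat("Confident calls:", sum(confident), "of", length(confident), "\n")
```

In a second stage, each group in `routed` would be passed to the Lasso model trained for that predicted tissue, while low-confidence samples could be flagged for manual review instead of being scored.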
```r
install.packages(c(
  "tidyverse", "data.table", "ggplot2", "readxl",
  "umap", "glmnet", "caret", "corrplot", "pROC",
  "doParallel", "cowplot", "magick"
))
# Bioconductor for HDF5 support
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("rhdf5")
```

```
PathQC/
├── data/raw/
│   ├── GTEX_AGGREGATED/concatnated_features_pooled.h5
│   └── metadata/ (GTEx annotation files, publicly available on the GTEx website)
└── output/ (generated figures and results)
```
```r
# Set base directory
base_dir <- "/path/to/PathQC"
# Run main analysis
source("gtex_analysis_pipeline.R")
# Run predictive modeling
source("gtex_tissue_modeling.R")
```

- R: Version 4.0 or higher
- Runtime: 1-2 hours for complete analysis
- Dependencies: See installation section
- Algorithm: Lasso regression with L1 regularization
- Feature selection: Top 5% by correlation with target variable
- Cross-validation: 5-fold stratified CV
- Performance metrics: Pearson correlation (R), RMSE, MAE
- Stability assessment: Coefficient of variation across folds
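
The stability assessment above reduces to a one-line formula: the coefficient of variation is the standard deviation of a metric across folds divided by its mean. A small sketch in base R, using made-up per-fold RMSE values purely for illustration:

```r
# Hypothetical per-fold RMSE values for one tissue model (made-up numbers).
fold_rmse <- c(0.82, 0.79, 0.91, 0.85, 0.80)

# Coefficient of variation: spread across folds relative to the mean.
coef_var <- function(x) sd(x) / abs(mean(x))

cv_rmse <- coef_var(fold_rmse)
cat("Mean RMSE:", round(mean(fold_rmse), 3), "\n")
cat("CV of RMSE across folds:", round(cv_rmse, 3), "\n")
```

A lower value means the model performs similarly on every fold. What counts as acceptably stable is a judgment call that this README does not specify.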
- Quality-aware analysis of GTEx expression data
- Tissue-specific quality control protocols
- Histological image analysis method development
- Automated tissue quality assessment
- Sample prioritization for expensive assays
- Quality control in biobanking operations
We welcome contributions!
Note: This pipeline is designed for research purposes. Ensure compliance with GTEx data use agreements and institutional policies.