Cranjit9/PathQC

GTEx Whole-Slide Image Analysis Pipeline

Overview

This pipeline analyzes image features extracted from GTEx (Genotype-Tissue Expression) whole-slide histology images to predict tissue quality metrics, using both tissue-specific and pan-tissue modeling approaches. The primary goal is to build predictive models that estimate RNA integrity (RIN scores) and tissue degradation (autolysis scores) from these features across different human tissue types.

What We're Predicting

Primary Targets

  • RIN Scores (RNA Integrity Number): Predicts RNA quality on a scale of 1-10, where higher scores indicate better preserved RNA
  • Autolysis Scores: Predicts tissue degradation levels (0=none, 1=mild, 2=moderate, 3=severe)

Why This Matters

  • Quality Control: Automated assessment of tissue sample quality before expensive downstream analysis
  • Resource Optimization: Prioritize high-quality samples for RNA sequencing and other molecular assays
  • Tissue Banking: Improve sample selection and storage protocols
  • Research Impact: Enable better quality-aware analysis of GTEx data

Methodology

Data Integration

  • Combines H5-formatted image features with GTEx metadata
  • Integrates quality metrics (RIN, autolysis) with demographic information
  • Processes 29+ tissue types and about 25 thousand samples
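
The integration step can be pictured as a keyed join between the feature matrix and the metadata table. The sketch below uses base R and simulated data only; the column names (`sample_id`, `tissue`, `rin`, `autolysis`) and all values are assumptions for illustration, not the actual GTEx field names or the repository's code.

```r
# Illustrative only: simulated stand-ins for the H5 feature matrix and GTEx metadata.
set.seed(1)
n_samples <- 6
features <- matrix(rnorm(n_samples * 4), nrow = n_samples,
                   dimnames = list(paste0("S", 1:n_samples), paste0("feat_", 1:4)))

# Hypothetical metadata: S6 has no metadata row, S7 has no image features.
metadata <- data.frame(
  sample_id = paste0("S", c(1:5, 7)),
  tissue    = c("Lung", "Lung", "Liver", "Liver", "Heart", "Heart"),
  rin       = c(7.2, 6.8, 8.1, 5.9, 7.5, 6.4),
  autolysis = c(0, 1, 0, 2, 1, 1)
)

# Move the row names into a key column, then inner-join on the sample ID.
feature_df <- data.frame(sample_id = rownames(features), features, row.names = NULL)
merged <- merge(metadata, feature_df, by = "sample_id")

# Samples present in only one of the two sources are dropped by the inner join.
dropped <- setdiff(union(metadata$sample_id, feature_df$sample_id), merged$sample_id)
```

Inspecting `dropped` before modeling is a cheap way to notice mismatched identifiers between the feature file and the annotation files.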

Analysis Pipeline

  1. Feature Processing: Standardization and quality filtering of image-derived features
  2. Exploratory Analysis: Tissue distribution, quality score patterns, demographic effects
  3. Dimensionality Reduction: PCA and UMAP for tissue clustering visualization
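
The three steps above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the repository's implementation: the tissue labels, matrix dimensions, and the near-zero-variance cut-off are all assumptions.

```r
set.seed(42)
# Simulated feature matrix: 60 samples x 20 features, three hypothetical tissues.
tissue <- rep(c("Lung", "Liver", "Heart"), each = 20)
x <- matrix(rnorm(60 * 20), nrow = 60)
x[tissue == "Liver", 1:5] <- x[tissue == "Liver", 1:5] + 3  # tissue-specific shift
x[, 20] <- 1                                                 # constant, uninformative feature

# Quality filtering: drop features with missing values or (near-)zero variance.
keep <- apply(x, 2, function(col) !anyNA(col) && sd(col) > 1e-8)
x_filtered <- x[, keep, drop = FALSE]

# Standardization: centre each feature to mean 0 and scale to unit variance.
x_scaled <- scale(x_filtered)

# PCA on the already standardized features; scores can then be plotted by tissue.
pca <- prcomp(x_scaled, center = FALSE, scale. = FALSE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
pc_scores <- data.frame(tissue = tissue, PC1 = pca$x[, 1], PC2 = pca$x[, 2])
```

The same standardized matrix could be passed to a UMAP implementation, such as the umap package listed under Installation, to obtain a nonlinear embedding for the clustering plots.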

Two Modeling Approaches

Tissue-Specific Pipeline

  • Tissue-specific models: Individual predictive models for each tissue type
  • Feature selection: Correlation-based selection of top 5% most predictive features
  • Cross-validation: 5-fold CV ensures robust performance estimates
  • Comprehensive evaluation: Multiple metrics (R, RMSE, MAE) and stability assessment
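
As a rough sketch of how these pieces fit together for a single tissue, the example below combines correlation-based selection of the top 5% of features with a Lasso fit and 5-fold cross-validation, using glmnet on simulated data. The data, the variable names, and the choice to repeat feature selection inside each training fold (which avoids leaking test information into the selection) are assumptions; the README does not state how the repository's scripts handle this.

```r
library(glmnet)

set.seed(7)
# Simulated data for one hypothetical tissue: 120 samples, 200 image features.
n <- 120; p <- 200
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("feat_", seq_len(p))))
rin <- 7 + 0.8 * x[, 1] - 0.6 * x[, 2] + rnorm(n, sd = 0.5)  # simulated RIN-like target

folds <- sample(rep(1:5, length.out = n))  # random 5-fold assignment
top_fraction <- 0.05                       # keep the top 5% of features

fold_metrics <- t(sapply(1:5, function(k) {
  train <- folds != k
  # Rank features by absolute Pearson correlation with the target (training rows only).
  cors <- abs(cor(x[train, ], rin[train]))[, 1]
  n_keep <- max(1, ceiling(top_fraction * p))
  selected <- order(cors, decreasing = TRUE)[seq_len(n_keep)]

  # Lasso (alpha = 1); lambda is chosen by an inner cross-validation on the training fold.
  fit <- cv.glmnet(x[train, selected], rin[train], alpha = 1, nfolds = 5)
  pred <- as.numeric(predict(fit, newx = x[!train, selected], s = "lambda.min"))

  obs <- rin[!train]
  c(R = cor(obs, pred), RMSE = sqrt(mean((obs - pred)^2)), MAE = mean(abs(obs - pred)))
}))

cv_summary <- colMeans(fold_metrics)  # mean R, RMSE, and MAE across the five folds
```

Running the same loop once per tissue, with that tissue's samples only, gives one row of summary metrics per tissue-specific model.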

Pan-Tissue Pipeline

  • Data Preparation: Stratified train-test split with tissue and quality quartiles
  • Feature Selection: ANOVA F-score ranking for tissue discrimination (top 500 features)
  • Stage 1 - Tissue Classification:
      • Multi-class GLMNET with stratified cross-validation
      • Confidence-based predictions with threshold filtering
  • Stage 2 - Quality Prediction:
      • Tissue-specific Lasso models for RIN and autolysis prediction
      • Uses all available features with built-in Lasso selection
  • Batch Prediction: Integrated workflow for tissue identification → quality prediction
  • Performance Evaluation: Multi-metric assessment including accuracy, correlation, RMSE
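
The two-stage flow described above might look roughly like the sketch below, written with glmnet on simulated data. Everything specific here is an assumption made for illustration: the three tissues, the 0.6 confidence threshold (the README does not state the value used), keeping 20 features instead of 500 because the simulation is small, and stratifying the split by tissue only.

```r
library(glmnet)

set.seed(11)
# Simulated data: 3 hypothetical tissues, 50 samples each, 100 features.
tissue <- factor(rep(c("Heart", "Liver", "Lung"), each = 50))
n <- length(tissue); p <- 100
x <- matrix(rnorm(n * p), nrow = n, dimnames = list(NULL, paste0("feat_", seq_len(p))))
x[tissue == "Liver", 1:5] <- x[tissue == "Liver", 1:5] + 2
x[tissue == "Lung", 6:10] <- x[tissue == "Lung", 6:10] + 2
rin <- 7 + 0.7 * x[, 11] + rnorm(n, sd = 0.5)

# Train-test split stratified by tissue: hold out 10 samples per tissue.
test_idx <- unname(unlist(lapply(split(seq_len(n), tissue), function(i) sample(i, 10))))
train_idx <- setdiff(seq_len(n), test_idx)

# Feature selection: one-way ANOVA F statistic per feature, on training rows only.
anova_f <- function(v, g) {
  between <- sum(tapply(v, g, length) * (tapply(v, g, mean) - mean(v))^2) / (nlevels(g) - 1)
  within <- sum((v - ave(v, g))^2) / (length(v) - nlevels(g))
  between / within
}
f_scores <- apply(x[train_idx, ], 2, anova_f, g = tissue[train_idx])
top_k <- 20
selected <- order(f_scores, decreasing = TRUE)[seq_len(top_k)]

# Stage 1: multinomial glmnet classifier, then filter predictions by confidence.
clf <- cv.glmnet(x[train_idx, selected], tissue[train_idx], family = "multinomial", nfolds = 5)
prob <- predict(clf, newx = x[test_idx, selected], s = "lambda.min", type = "response")[, , 1]
pred_tissue <- as.character(predict(clf, newx = x[test_idx, selected],
                                    s = "lambda.min", type = "class"))
confidence <- apply(prob, 1, max)  # highest class probability per sample
threshold <- 0.6                   # hypothetical cut-off
confident <- confidence >= threshold

# Stage 2: one Lasso model per tissue on all features, routed by the predicted tissue.
rin_models <- lapply(split(train_idx, tissue[train_idx]), function(i) {
  cv.glmnet(x[i, ], rin[i], alpha = 1, nfolds = 5)
})
rin_pred <- rep(NA_real_, length(test_idx))
for (j in which(confident)) {
  model <- rin_models[[pred_tissue[j]]]
  rin_pred[j] <- as.numeric(predict(model, newx = x[test_idx[j], , drop = FALSE],
                                    s = "lambda.min"))
}

# Classification accuracy among the confidently classified test samples.
accuracy <- mean(pred_tissue[confident] == as.character(tissue[test_idx])[confident])
```

Samples below the threshold keep `NA` as their quality prediction, which makes it explicit that no tissue-specific model was applied to them.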

Quick Start

Installation

install.packages(c(
  "tidyverse", "data.table", "ggplot2", "readxl",
  "umap", "glmnet", "caret", "corrplot", "pROC",
  "doParallel", "cowplot", "magick"
))

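# Bioconductor for HDF5 support (install BiocManager first if it is missing)
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("rhdf5")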

Data Structure

PathQC/
├── data/raw/
│   ├── GTEX_AGGREGATED/concatnated_features_pooled.h5
│   └── metadata/ (GTEx annotation files, publicly available on the GTEx website)
└── output/ (generated figures and results)
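
Before running the scripts, it may help to confirm that this layout is in place. The helper below is a hypothetical convenience function, not part of the repository; it only checks the paths shown in the tree above.

```r
# Hypothetical helper: report which expected inputs exist under a base directory.
check_inputs <- function(base_dir) {
  paths <- c(
    features = file.path(base_dir, "data", "raw", "GTEX_AGGREGATED",
                         "concatnated_features_pooled.h5"),
    metadata = file.path(base_dir, "data", "raw", "metadata"),
    output   = file.path(base_dir, "output")
  )
  found <- file.exists(paths)
  names(found) <- names(paths)
  found
}

# Example with a throwaway directory that contains only the metadata folder.
demo_dir <- file.path(tempdir(), "PathQC_demo")
dir.create(file.path(demo_dir, "data", "raw", "metadata"),
           recursive = TRUE, showWarnings = FALSE)
status <- check_inputs(demo_dir)
```

The dataset names inside the H5 file are not documented here, so once the file is present, listing its contents with `rhdf5::h5ls()` is a reasonable first step.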

Usage

# Set base directory
base_dir <- "/path/to/PathQC"

# Run main analysis
source("gtex_analysis_pipeline.R")

# Run predictive modeling
source("gtex_tissue_modeling.R")

Technical Details

System Requirements

  • R: Version 4.0 or higher
  • Runtime: 1-2 hours for complete analysis
  • Dependencies: See installation section

Model Specifications

  • Algorithm: Lasso regression (L1-regularized linear model)
  • Feature selection: Top 5% by correlation with target variable
  • Cross-validation: 5-fold stratified CV
  • Performance metrics: Pearson correlation (R), RMSE, MAE
  • Stability assessment: Coefficient of variation across folds
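
Two of these specifications, stratified folds and the coefficient of variation, can be illustrated in a few lines of base R. The README does not say which variable the folds are stratified on; the sketch assumes quartiles of the target, and the per-fold RMSE values are made up for illustration.

```r
set.seed(3)
rin <- round(runif(100, min = 3, max = 10), 1)  # simulated RIN-like values

# Stratify on quartiles of the target so each fold spans the full quality range.
quartile <- cut(rin, breaks = quantile(rin, probs = seq(0, 1, 0.25)),
                include.lowest = TRUE, labels = FALSE)

# Within each quartile, deal samples out to folds 1..5 in random order.
folds <- integer(length(rin))
for (q in unique(quartile)) {
  idx <- which(quartile == q)
  folds[idx] <- sample(rep(1:5, length.out = length(idx)))
}

# Stability: coefficient of variation (sd / mean) of a per-fold metric.
coef_var <- function(v) sd(v) / abs(mean(v))
fold_rmse <- c(0.92, 0.88, 1.01, 0.95, 0.90)  # made-up values for illustration
stability <- coef_var(fold_rmse)
```

A small coefficient of variation indicates that the metric changes little from fold to fold, which is the sense in which a model is called stable here.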

Applications

Research Applications

  • Quality-aware analysis of GTEx expression data
  • Tissue-specific quality control protocols
  • Histological image analysis method development

Clinical Potential

  • Automated tissue quality assessment
  • Sample prioritization for expensive assays
  • Quality control in biobanking operations

Contributing

We welcome contributions!

Note: This pipeline is designed for research purposes. Ensure compliance with GTEx data use agreements and institutional policies.
