This repository contains an R notebook for analyzing acute myeloid leukemia (AML) RNA-sequencing data and generating annotated heatmaps for gene expression clustering analysis.
This analysis uses RNA-sequencing data from 19 samples from AML mouse models to create clustered heatmaps that visualize gene expression patterns. The data come from Shih et al., 2017 and were pre-processed by refine.bio.
The analysis focuses on:
- Gene expression clustering
- Sample clustering
- Treatment and mutation annotation
- High-variance gene selection
- Source: refine.bio experiment SRP070849
- Samples: 19 (AML mouse models)
- Data type: RNA-sequencing (quantile normalized)
- Mutations studied: IDH2, TET2, and wild-type (WT)
- Treatments:
  - IDH2 mutant AML: Vehicle or AG-221
  - TET2 mutant AML: Vehicle or 5-Azacytidine (Decitabine)
- R >= 3.6.0 (recommended)
# Core packages (auto-installed by script)
pheatmap
magrittr
readr
dplyr
tibble
# Optional for session info
sessioninfo
- Clone this repository:
git clone <repository-url>
cd aml-heatmap-analysis
- Install required R packages (if not already installed):
# Run in R console
# Install only the packages that are not already present
pkgs <- c("pheatmap", "magrittr", "readr", "dplyr", "tibble", "sessioninfo")
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))
if (length(missing_pkgs) > 0) {
  install.packages(missing_pkgs)
}
Ensure your data files are organized as follows:
project/
├── data/
│   └── SRP070849/
│       ├── SRP070849.tsv              # Gene expression matrix
│       └── metadata_SRP070849.tsv     # Sample metadata
├── plots/                             # Generated plots (auto-created)
├── results/                           # Analysis results (auto-created)
└── analysis.Rmd                       # Main analysis notebook
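The `plots/` and `results/` directories are described as auto-created. A minimal sketch of how that could be done in base R is shown here; `ensure_dir` is a hypothetical helper name, not a function from the notebook, and the example runs under a temporary directory so it does not touch a real project.

```r
# Hypothetical helper (not part of the notebook): create a directory
# only when it does not already exist, and return its path.
ensure_dir <- function(path) {
  if (!dir.exists(path)) {
    dir.create(path, recursive = TRUE)
  }
  invisible(path)
}

# Demonstrated under a temporary root so nothing is written to the project.
project_root <- file.path(tempdir(), "project")
plots_dir <- ensure_dir(file.path(project_root, "plots"))
results_dir <- ensure_dir(file.path(project_root, "results"))
```

In a real run, the same helper would presumably be called with `"plots"` and `"results"` relative to the project root.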
# In RStudio
# Open the .Rmd file and click "Run All"
# Or knit to HTML: Ctrl+Shift+K (Windows) or Cmd+Shift+K (Mac)
# Load libraries
library(pheatmap)
library(magrittr)
set.seed(12345)
# Read data
metadata <- readr::read_tsv("data/SRP070849/metadata_SRP070849.tsv")
expression_df <- readr::read_tsv("data/SRP070849/SRP070849.tsv") %>%
  tibble::column_to_rownames("Gene")
# Select genes with variance above the 75th percentile
variances <- apply(expression_df, 1, var)
upper_var <- quantile(variances, 0.75)
df_by_var <- data.frame(expression_df) %>%
  dplyr::filter(variances > upper_var)
# Create annotation
annotation_df <- metadata %>%
  dplyr::mutate(
    mutation = dplyr::case_when(
      startsWith(refinebio_title, "TET2") ~ "TET2",
      startsWith(refinebio_title, "IDH2") ~ "IDH2",
      startsWith(refinebio_title, "WT") ~ "WT",
      TRUE ~ "unknown"
    )
  ) %>%
  dplyr::select(refinebio_accession_code, mutation, refinebio_treatment) %>%
  tibble::column_to_rownames("refinebio_accession_code")
# Generate heatmap
heatmap_annotated <- pheatmap(
  df_by_var,
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  show_rownames = FALSE,
  annotation_col = annotation_df,
  main = "Annotated Heatmap",
  color = colorRampPalette(c("deepskyblue", "black", "yellow"))(25),
  scale = "row" # Scale values within each gene (row)
)
# Create simple heatmap without annotation
basic_heatmap <- pheatmap(df_by_var, scale = "row")
# Heatmap with custom colors and clustering
custom_heatmap <- pheatmap(
  df_by_var,
  cluster_rows = TRUE,
  cluster_cols = TRUE,
  clustering_distance_rows = "euclidean",
  clustering_method = "complete",
  color = colorRampPalette(c("blue", "white", "red"))(50),
  scale = "row"
)
# Save as PNG
png("plots/my_heatmap.png", width = 800, height = 600)
print(heatmap_annotated)
dev.off()
# Save as PDF
pdf("plots/my_heatmap.pdf", width = 10, height = 8)
print(heatmap_annotated)
dev.off()
The analysis generates the following files:
- top_90_var_genes.tsv: High-variance genes used for clustering (selected with the 75th-percentile variance cutoff, despite the file name)
- aml_heatmap.png: Annotated heatmap visualization
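For readers who want to produce a similar TSV themselves, the sketch below shows one way to write a gene-by-sample table with gene IDs as an explicit column. It uses base R and invented toy values. The output file name and the `Gene` column name are assumptions for illustration, and the notebook may write its results differently.

```r
# Toy stand-in for a filtered expression table (values are made up).
toy_high_var <- data.frame(
  sample1 = c(1.2, 3.4),
  sample2 = c(5.6, 7.8),
  row.names = c("geneA", "geneB")
)

# Move row names into a "Gene" column so they survive the round trip.
out <- data.frame(
  Gene = rownames(toy_high_var),
  toy_high_var,
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Written to a temporary file here; a real run would target results/.
out_file <- file.path(tempdir(), "top_var_genes_example.tsv")
write.table(out, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout.
round_trip <- read.delim(out_file)
```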
- Gene Filtering: Selects genes whose variance is above the 75th percentile (the upper quartile)
- Sample Annotation: Automatically annotates samples by mutation type and treatment
- Clustering: Performs hierarchical clustering on both genes and samples
- Visualization: Creates publication-ready heatmaps with color-coded annotations
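The gene filtering step can be tried without the real data. The following is a minimal sketch on a small simulated matrix. The gene and sample names are invented, and a plain matrix is used in place of the data frame read by the notebook.

```r
set.seed(12345)

# Toy matrix: 8 hypothetical genes (rows) by 5 samples (columns).
toy_expr <- matrix(
  rnorm(40),
  nrow = 8,
  dimnames = list(paste0("gene", 1:8), paste0("sample", 1:5))
)

# Per-gene variance across samples.
variances <- apply(toy_expr, 1, var)

# Cutoff at the 75th percentile of the per-gene variances.
upper_var <- quantile(variances, 0.75)

# Keep only genes strictly above the cutoff.
toy_high_var <- toy_expr[variances > upper_var, , drop = FALSE]
```

With 8 genes, the strict `>` comparison keeps the 2 genes whose variance lies above the cutoff. The same logic applied to the full expression table keeps roughly the top quarter of genes.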
# Select top 100 most variable genes
top_genes <- head(order(variances, decreasing = TRUE), 100)
df_top_genes <- expression_df[top_genes, ]
# Select genes with specific fold change
# (requires additional differential expression analysis)
# Alternative color palettes (viridis and RColorBrewer must be installed separately)
colors_viridis <- viridis::viridis(25)
colors_rcolorbrewer <- RColorBrewer::brewer.pal(11, "RdYlBu")
# Interpolate between the three anchor colors to get a smooth gradient
colors_custom <- colorRampPalette(c("navy", "white", "firebrick"))(25)
# Different clustering options
pheatmap(df_by_var,
  clustering_distance_rows = "correlation", # or "euclidean", "maximum", etc.
  clustering_method = "ward.D2" # or "complete", "average", etc.
)
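To make the `"correlation"` and `"ward.D2"` options more concrete, here is a rough base R sketch of correlation-based hierarchical clustering on toy data. It assumes the distance is one minus the Pearson correlation between rows, which is a common convention. It is an illustration of the idea, not a reproduction of pheatmap's internals.

```r
set.seed(12345)

# Toy matrix: 6 hypothetical genes (rows) by 5 samples (columns).
toy_expr <- matrix(
  rnorm(30),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:5))
)

# cor() correlates columns, so transpose to correlate genes (rows).
# Assumed convention: distance = 1 - Pearson correlation.
row_dist <- as.dist(1 - cor(t(toy_expr)))

# Hierarchical clustering with Ward's method on that distance.
row_tree <- hclust(row_dist, method = "ward.D2")

# Order in which genes would appear along the dendrogram.
row_tree$labels[row_tree$order]
```

Swapping `method` for `"complete"` or `"average"` changes only the linkage rule, while replacing `row_dist` with `dist(toy_expr)` switches to Euclidean distance.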
- File not found errors
# Check if files exist
file.exists("data/SRP070849/SRP070849.tsv")
file.exists("data/SRP070849/metadata_SRP070849.tsv")
- Memory issues with large datasets
# Increase memory limit (Windows only; memory.limit() is no longer supported as of R 4.2)
memory.limit(size = 8000) # size is in MB, so about 8 GB
# Use data.table for large files
library(data.table)
expression_df <- fread("data/SRP070849/SRP070849.tsv")
- Package installation issues
# Install from Bioconductor if needed
if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("package_name")
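For the file-not-found case, checking both inputs at once and reporting only the missing ones can be more convenient. A small sketch follows; `missing_inputs` is a hypothetical helper, not part of the notebook, and the paths are the ones from the expected project layout.

```r
# Hypothetical helper: return the subset of paths that do not exist.
missing_inputs <- function(paths) {
  paths[!file.exists(paths)]
}

# Input files expected by the analysis, relative to the project root.
required <- c(
  file.path("data", "SRP070849", "SRP070849.tsv"),
  file.path("data", "SRP070849", "metadata_SRP070849.tsv")
)

missing <- missing_inputs(required)
if (length(missing) > 0) {
  message("Missing input files:\n", paste(" -", missing, collapse = "\n"))
}
```

If a path is reported as missing, checking the working directory with `getwd()` is usually a good first step, since the paths are relative.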
If you use this analysis in your research, please cite:
- Original paper: Shih et al., 2017. PMID: 28193779
- refine.bio: https://www.refine.bio/
- pheatmap package: Kolde R (2019). pheatmap: Pretty Heatmaps. R package version 1.0.12.
This analysis is adapted from the refine.bio-examples repository by CCDL for ALSF and modified by Candace Savonen.
For questions about the analysis or issues with the code, please:
- Check the troubleshooting section above
- Review the original refine.bio examples
- Open an issue in this repository