GWASStudio is a command-line tool for storing, retrieving, and querying genomic summary statistics. It provides high-performance infrastructure for handling and analyzing large-scale GWAS and QTL datasets, and it supports exploration across datasets.
It offers a unified interface across the SCDH infrastructure for ingesting, storing, querying, and exporting genomic data.
GWASStudio provides the following key features:
- Data Ingestion: Imports summary statistics and their associated metadata.
- Support for Multiple Storage Options: Works with both local filesystems and cloud storage (S3).
- Flexible Search: Enables searching metadata using template files.
- Selective Export: Extracts data and its associated metadata for selected genomic regions, for selected SNPs, or for the entire dataset.
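
To make the idea of selective export concrete, here is a minimal Python sketch of filtering summary statistics by genomic region or by SNP identifier. It is not GWASStudio's implementation or API: the `Variant` record, the `select_region` and `select_snps` helpers, and the toy values are all hypothetical and only illustrate the kind of selection described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One row of summary statistics (hypothetical, simplified layout)."""
    chrom: str
    pos: int
    snp_id: str
    beta: float
    pvalue: float


def select_region(variants, chrom, start, end):
    """Keep variants on `chrom` whose position lies in [start, end]."""
    return [v for v in variants if v.chrom == chrom and start <= v.pos <= end]


def select_snps(variants, snp_ids):
    """Keep variants whose identifier is in `snp_ids`, preserving input order."""
    wanted = set(snp_ids)
    return [v for v in variants if v.snp_id in wanted]


# Toy records, made up for illustration only.
variants = [
    Variant("1", 10_500, "rs1", 0.12, 3e-4),
    Variant("1", 250_000, "rs2", -0.05, 0.2),
    Variant("2", 10_500, "rs3", 0.30, 1e-9),
]

in_region = select_region(variants, chrom="1", start=1, end=100_000)
by_snp = select_snps(variants, ["rs3", "rs2"])
```

A real export works against the storage backend rather than an in-memory list, but the selection criteria (chromosome plus position range, or a set of SNP identifiers) are the same.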
GWASStudio builds on the following technologies:
- TileDB Embedded: A high-performance array storage engine that enables efficient storage and retrieval of genomic data.
- MongoDB: A flexible, scalable NoSQL database used for storing and querying metadata associated with genomic datasets.
- Dask: Provides distributed computing capabilities for processing large datasets.
- Python Ecosystem: Built on Python, using Click/Cloup for the command-line interface, Pandas for data manipulation, and various genomics-specific tools.
To get started with GWASStudio, follow these installation steps:

    # Clone the repository
    git clone https://github.com/ht-diva/gwasstudio
    cd gwasstudio
    # Create a virtual environment (recommended)
    conda env create --file base_environment.yml
    conda activate gwasstudio
    # Install the package
    make install
    # Verify installation
    gwasstudio --version
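
If you prefer to run the verification step from a script (for example in a setup check), the following Python sketch wraps the `gwasstudio --version` call shown above. The `cli_version` helper is a hypothetical name introduced here, not part of GWASStudio; it only assumes that the executable is on your `PATH` once the environment is activated.

```python
import shutil
import subprocess


def cli_version(executable="gwasstudio"):
    """Return the text printed by `<executable> --version`, or None if not found."""
    path = shutil.which(executable)
    if path is None:
        # The executable is not on PATH (environment not activated or not installed).
        return None
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some CLIs print the version to stderr, so fall back to it if stdout is empty.
    return result.stdout.strip() or result.stderr.strip()


if __name__ == "__main__":
    version = cli_version()
    print(version if version is not None else "gwasstudio not found on PATH")
```

A return value of `None` usually means the conda environment is not active or the install step did not complete.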
For detailed usage instructions, refer to the documentation; the cli_test script provides a practical, example-driven guide.
Example files are derived from:
The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, Ben Elsworth, Tom R Gaunt, Gibran Hemani, Edoardo Marcora. bioRxiv 2020.05.29.115824; doi: https://doi.org/10.1101/2020.05.29.115824