ucl-cssb/genome-minimizer-2

🧬 GENOME MINIMIZER 2

A pipeline that uses variational autoencoders (VAEs) to generate minimal E. coli genomes.

Pipeline Overview

Data Files → [Preprocess] → [Explore] → [Training] → [Sample] → [Minimize]
  1. Preprocess: Extract the positions of essential genes reported in the literature
  2. Explore: Analyze dataset distributions and generate visualizations
  3. Training: Train VAE models with different configurations
  4. Sample: Generate synthetic genomes from trained models
  5. Minimize: Build minimized genome sequences as FASTA files (run with --mode minimizer)

Setup

Prerequisites

  • Python 3.9 or 3.10 (>=3.9,<3.11)
  • pip (usually bundled with Python)
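
The version constraint above can be checked before installing. The following is a minimal sketch, not part of the repository; the helper name `is_supported` is hypothetical and the bounds are copied from the prerequisite list.

```python
import sys

# Supported range taken from the Prerequisites list: >=3.9,<3.11.
MIN_VERSION = (3, 9)
MAX_VERSION_EXCLUSIVE = (3, 11)


def is_supported(version=None):
    """Return True if (major, minor) falls inside the supported range.

    `version` is any tuple starting with (major, minor); it defaults to the
    running interpreter. Hypothetical helper, not provided by the project.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if not is_supported():
    print(
        f"Warning: Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the supported range >=3.9,<3.11"
    )
```

Running this with the interpreter you intend to use for the virtual environment shows whether `pip install -e .` is likely to reject it.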

Installation

  1. Clone the repository:

    git clone <repository-url>
    cd genome-minimizer-2
  2. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install the project in editable mode:

    pip install -e .

Quick Start

python main.py --mode preprocess
python main.py --mode training --preset v0 --epochs 1
python main.py --mode sample --model-path models/trained_models/v0_model/saved_VAE_v0.pt --genes-path data/essential_genes/essential_gene_positions.pkl --num-samples 100
python main.py --mode minimizer --genes-path models/v0_model/sampling_results/binary_samples_default.npy --single-file --output-file results.fasta
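
The four commands above can also be driven from a short Python script that stops at the first failing stage. This is a sketch under the assumption that it is run from the repository root; the function names are hypothetical, and the paths and flags are copied verbatim from the Quick Start commands.

```python
import subprocess
import sys

# Paths copied from the Quick Start commands above.
MODEL_PATH = "models/trained_models/v0_model/saved_VAE_v0.pt"
ESSENTIAL_GENES = "data/essential_genes/essential_gene_positions.pkl"
SAMPLES = "models/v0_model/sampling_results/binary_samples_default.npy"


def quick_start_commands(python=sys.executable):
    """Return the four Quick Start commands as argument lists, in order."""
    base = [python, "main.py", "--mode"]
    return [
        base + ["preprocess"],
        base + ["training", "--preset", "v0", "--epochs", "1"],
        base + ["sample", "--model-path", MODEL_PATH,
                "--genes-path", ESSENTIAL_GENES, "--num-samples", "100"],
        base + ["minimizer", "--genes-path", SAMPLES,
                "--single-file", "--output-file", "results.fasta"],
    ]


def run_quick_start():
    """Run each stage in order; check=True raises on the first failure."""
    for cmd in quick_start_commands():
        print("Running:", " ".join(cmd))
        subprocess.run(cmd, check=True)
```

Call `run_quick_start()` from the repository root with the virtual environment active. Passing argument lists rather than a shell string avoids quoting problems with paths that contain spaces.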

Data Setup

Place these files in data/:

data/
├── F4_complete_presence_absence.csv    # Gene presence/absence matrix
├── accessionID_phylogroup_BD.csv       # Phylogroup classifications
├── essential_genes.csv                 # Essential genes from literature
└── wild_type_sequence.gb               # E. coli reference genome
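
Before running the pipeline, the layout above can be verified with a few lines of Python. This is an independent sketch (the pipeline has its own check, per the Troubleshooting section); the function name `missing_data_files` is hypothetical and the file names are taken from the tree above.

```python
from pathlib import Path

# File names taken from the data/ listing above.
REQUIRED_FILES = [
    "F4_complete_presence_absence.csv",
    "accessionID_phylogroup_BD.csv",
    "essential_genes.csv",
    "wild_type_sequence.gb",
]


def missing_data_files(data_dir="data"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


missing = missing_data_files()
if missing:
    print("Missing from data/:", ", ".join(missing))
```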

Commands

Mode        Purpose                            Example
preprocess  Extract essential gene positions   python main.py --mode preprocess
explore     Generate data analysis plots       python main.py --mode explore
training    Train VAE models                   python main.py --mode training --preset v0
sample      Generate synthetic genomes         python main.py --mode sample --model-path MODEL.pt --genes-path GENES.pkl
minimizer   Create FASTA sequences             python main.py --mode minimizer --genes-path SAMPLES.npy --single-file

Parameters by Mode

Preprocess

python main.py --mode preprocess [--force-reprocess]
  • --force-reprocess: Regenerate the output files even if they already exist

Training

python main.py --mode training --preset PRESET [--epochs N]
  • --preset v0/v1/v2/v3: Model architecture (required)
  • --epochs N: Training epochs (default: 10000)

Experiment

python main.py --mode experiment [--interactive]
  • --interactive: Prompt for custom parameters

Sample

python main.py --mode sample --model-path PATH --genes-path PATH [OPTIONS]

Required:

  • --model-path: Trained model (.pt file)
  • --genes-path: Essential gene positions (.pkl file)

Optional:

  • --num-samples N: Number of genomes (default: 1)
  • --sampling-mode default/focused: Strategy (default: default)
  • --noise-level N: Noise for focused sampling (default: 0.1)
  • --genome-path: Reference genome (.gb file)

Minimizer

python main.py --mode minimizer --genes-path PATH [OPTIONS]

Required:

  • --genes-path: Binary samples (.npy file)

Optional:

  • --genome-path: Reference genome (.gb file)
  • --single-file: Write a single FASTA file instead of multiple files
  • --output-file: Name of the output file
  • --output-dir: Directory for outputs (default: ./minimized_genomes)
  • --model-name: Label for file naming (default: "default")
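
To inspect what the minimizer wrote, a FASTA file can be read with the standard library alone. The sketch below assumes the output follows the usual FASTA layout (a `>` header line followed by one or more sequence lines); the function names are hypothetical and `results.fasta` is the file name used in the Quick Start.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def sequence_lengths(path):
    """Map each record header in a FASTA file to its sequence length."""
    with open(path) as handle:
        return {header: len(seq) for header, seq in read_fasta(handle)}
```

For example, `sequence_lengths("results.fasta")` gives one entry per generated genome, which makes it easy to compare minimized lengths against the reference genome.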

Model Architectures

Preset  Architecture  Features
v0      1024→64       Linear KL annealing
v1      512→32        + Gene abundance + L1 regularization
v2      512→32        + Cosine annealing
v3      512→32        + Weighted abundance
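
The table names two annealing schedules without giving formulas. As a generic illustration only, the sketch below shows the common textbook forms of a linear KL-weight ramp and a half-cosine schedule. The function names, parameters, and defaults are assumptions, and the README does not say which quantity the cosine schedule in v2 is applied to, so this should not be read as the repository's implementation.

```python
import math


def linear_kl_weight(epoch, warmup_epochs, max_weight=1.0):
    """Ramp the KL weight linearly from 0 to max_weight over warmup_epochs.

    Assumed form of "linear KL annealing"; the actual schedule may differ.
    """
    if warmup_epochs <= 0:
        return max_weight
    return max_weight * min(1.0, epoch / warmup_epochs)


def cosine_annealing(epoch, total_epochs, start=1.0, end=0.0):
    """Move from start to end along a half cosine over total_epochs.

    Generic cosine schedule; whether it drives the KL weight or the
    learning rate in preset v2 is not stated in the README.
    """
    progress = min(1.0, max(0.0, epoch / total_epochs))
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


# In a VAE training loop the weight typically scales the KL term:
#   loss = reconstruction_loss + linear_kl_weight(epoch, warmup) * kl_divergence
```

Starting with a small KL weight lets the decoder learn to reconstruct before the latent space is pulled toward the prior, which is the usual motivation for annealing.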

Output Structure

├── data/essential_genes/           # Preprocessing results
├── models/
│   ├── trained_models/v0_model/    # Saved model weights
│   └── v0_model/
│       ├── figures/                # Training plots
│       └── sampling_results/       # Generated samples
└── [output-dir]/                   # Final FASTA files
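
The locations above can be assembled programmatically when chaining stages. This sketch extrapolates the naming pattern from the v0 paths in the Quick Start; that other presets and sampling modes follow the same pattern is an assumption, and `expected_outputs` is a hypothetical helper.

```python
from pathlib import Path


def expected_outputs(preset="v0", sampling_mode="default", root="."):
    """Return the expected output locations for one preset.

    File names are extrapolated from the v0 Quick Start example
    (saved_VAE_v0.pt, binary_samples_default.npy); treat them as guesses
    for any other preset or sampling mode.
    """
    root = Path(root)
    run_dir = root / "models" / f"{preset}_model"
    return {
        "weights": root / "models" / "trained_models" / f"{preset}_model"
        / f"saved_VAE_{preset}.pt",
        "figures": run_dir / "figures",
        "samples": run_dir / "sampling_results"
        / f"binary_samples_{sampling_mode}.npy",
    }


for label, path in expected_outputs().items():
    print(f"{label}: {path}")
```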

Troubleshooting

  • Missing files: The pipeline checks for required files and reports any that are missing
  • Import errors: Make sure the virtual environment is activated
  • GPU issues: GPU/CPU availability is detected automatically
  • Path errors: Use absolute paths for model and data files
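
For the path advice above, relative paths can be converted to absolute ones before they are passed to main.py. This is a small sketch using pathlib; the helper names are hypothetical and the flag names are the path-taking options listed under Parameters by Mode.

```python
from pathlib import Path

# Path-taking flags from the Sample and Minimizer parameter lists.
PATH_FLAGS = ("--model-path", "--genes-path", "--genome-path", "--output-dir")


def absolute(path):
    """Expand ~ and resolve a path against the current working directory."""
    return Path(path).expanduser().resolve()


def absolute_args(args, path_flags=PATH_FLAGS):
    """Return a copy of args with the value after each path flag made absolute."""
    result = list(args)
    for i, token in enumerate(result[:-1]):
        if token in path_flags:
            result[i + 1] = str(absolute(result[i + 1]))
    return result
```

For example, `absolute_args(["--mode", "sample", "--model-path", "MODEL.pt"])` leaves the mode untouched and rewrites only the model path.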

Pipeline flow: preprocess → training → sample → minimizer
