VAE-powered minimal genome generation pipeline for E. coli.
Data Files → [Preprocess] → [Explore] → [Training] → [Sample] → [Minimize]
- Preprocess: Extract essential gene positions from literature
- Explore: Analyze dataset distributions and generate visualizations
- Training: Train VAE models with different configurations
- Sample: Generate synthetic genomes from trained models
- Minimize: Create actual minimized genome sequences
- Python >=3.9,<3.11
- pip (usually comes with Python)
- Clone the repository:
  git clone <repository-url>
  cd genome-minimizer-2
- Create a virtual environment:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
- Install the project:
  pip install -e .
python main.py --mode preprocess
python main.py --mode training --preset v0 --epochs 1
python main.py --mode sample --model-path models/trained_models/v0_model/saved_VAE_v0.pt --genes-path data/essential_genes/essential_gene_positions.pkl --num-samples 100
python main.py --mode minimizer --genes-path models/v0_model/sampling_results/binary_samples_default.npy --single-file --output-file results.fasta
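The four commands above can also be chained from a small driver script. The following is a minimal sketch, not part of the project: the helper names `build_pipeline_commands` and `run_pipeline` are hypothetical, and it assumes the script is run from the repository root so that `main.py` resolves. Only flags shown in this README are used.

```python
import subprocess
import sys


def build_pipeline_commands(model_path, essential_genes_path, samples_path,
                            preset="v0", epochs=1, num_samples=100,
                            output_file="results.fasta"):
    """Return the argv lists for preprocess -> training -> sample -> minimizer.

    Hypothetical helper: all paths are passed in by the caller rather than
    derived from the preset, because the output naming scheme is only
    documented for v0.
    """
    base = [sys.executable, "main.py", "--mode"]
    return [
        base + ["preprocess"],
        base + ["training", "--preset", preset, "--epochs", str(epochs)],
        base + ["sample",
                "--model-path", model_path,
                "--genes-path", essential_genes_path,
                "--num-samples", str(num_samples)],
        base + ["minimizer",
                "--genes-path", samples_path,
                "--single-file",
                "--output-file", output_file],
    ]


def run_pipeline(commands):
    """Run each stage in order and stop at the first failing stage."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline_commands(...))` with the three paths from the commands above would reproduce the quick start; `check=True` makes a failed stage raise instead of letting later stages run on missing outputs.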
Place these files in data/:
data/
├── F4_complete_presence_absence.csv # Gene presence/absence matrix
├── accessionID_phylogroup_BD.csv # Phylogroup classifications
├── essential_genes.csv # Essential genes from literature
└── wild_type_sequence.gb # E. coli reference genome
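Before running the pipeline, it can help to confirm that all four input files are in place. This is a minimal sketch using only the file names listed above; the helper `find_missing_inputs` is hypothetical and is not the check the pipeline itself performs.

```python
from pathlib import Path

# File names taken from the data/ listing in this README.
REQUIRED_INPUTS = (
    "F4_complete_presence_absence.csv",
    "accessionID_phylogroup_BD.csv",
    "essential_genes.csv",
    "wild_type_sequence.gb",
)


def find_missing_inputs(data_dir="data"):
    """Return the required file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_INPUTS if not (root / name).is_file()]
```

An empty list means every expected input was found; otherwise the list names the files still to be copied into data/.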
Mode | Purpose | Example |
---|---|---|
preprocess | Extract essential gene positions | python main.py --mode preprocess |
explore | Generate data analysis plots | python main.py --mode explore |
training | Train VAE models | python main.py --mode training --preset v0 |
sample | Generate synthetic genomes | python main.py --mode sample --model-path MODEL.pt --genes-path GENES.pkl |
minimizer | Create FASTA sequences | python main.py --mode minimizer --genes-path SAMPLES.npy --single-file |
python main.py --mode preprocess [--force-reprocess]
- --force-reprocess: Regenerate even if files exist
python main.py --mode training --preset PRESET [--epochs N]
- --preset v0/v1/v2/v3: Model architecture (required)
- --epochs N: Training epochs (default: 10000)
python main.py --mode experiment [--interactive]
- --interactive: Prompt for custom parameters
python main.py --mode sample --model-path PATH --genes-path PATH [OPTIONS]
Required:
- --model-path: Trained model (.pt file)
- --genes-path: Essential gene positions (.pkl file)
Optional:
- --num-samples N: Number of genomes (default: 1)
- --sampling-mode default/focused: Strategy (default: default)
- --noise-level N: Noise for focused sampling (default: 0.1)
- --genome-path: Reference genome (.gb file)
python main.py --mode minimizer --genes-path PATH [OPTIONS]
Required:
- --genes-path: Binary samples (.npy file)
Optional:
- --genome-path: Reference genome (.gb file)
- --single-file: Write a single FASTA file instead of multiple files
- --output-file: Specific output filename
- --output-dir: Directory for outputs (default: ./minimized_genomes)
- --model-name: Label for file naming (default: "default")
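To illustrate what turning binary samples into FASTA sequences can involve, here is a minimal sketch under stated assumptions. It is not the project's implementation: the functions `minimize_sequence` and `to_fasta`, the 0-based half-open `(start, end)` coordinates, and the rule that a gene marked 0 is deleted from the reference are all assumptions made for this example.

```python
def minimize_sequence(reference, gene_positions, present):
    """Delete the spans of absent genes from a reference sequence.

    reference:      full reference sequence as a string
    gene_positions: dict mapping gene name -> (start, end), 0-based, half-open
    present:        dict mapping gene name -> 1/0 (or True/False);
                    genes not listed are kept (assumption)
    """
    # Spans to delete, sorted by start so overlaps can be merged in one pass.
    absent = sorted(span for gene, span in gene_positions.items()
                    if not present.get(gene, 1))
    pieces = []
    cursor = 0  # first position not yet copied or deleted
    for start, end in absent:
        if start > cursor:
            pieces.append(reference[cursor:start])
        cursor = max(cursor, end)
    pieces.append(reference[cursor:])
    return "".join(pieces)


def to_fasta(name, sequence, width=60):
    """Format one record as FASTA text with fixed-width sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"
```

For example, with reference `"AAACCCGGGTTT"`, genes at `(3, 6)` and `(9, 12)`, and only the second gene present, `minimize_sequence` returns `"AAAGGGTTT"`. Real genomes also need strand, circularity, and intergenic regions handled, which this sketch ignores.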
Preset | Architecture | Features |
---|---|---|
v0 | 1024→64 | Linear KL annealing |
v1 | 512→32 | + Gene abundance + L1 regularization |
v2 | 512→32 | + Cosine annealing |
v3 | 512→32 | + Weighted abundance |
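The presets refer to linear KL annealing and cosine annealing. The sketch below shows what such schedules could look like when applied to the KL weight of a VAE loss. The function names and the `max_weight` parameter are hypothetical, and applying the cosine curve to the KL weight (rather than, for instance, the learning rate) is an assumption; the project's code may do this differently.

```python
import math


def _progress(step, total_steps):
    """Fraction of the annealing period completed, clamped to [0, 1]."""
    return min(max(step / total_steps, 0.0), 1.0)


def linear_kl_weight(step, total_steps, max_weight=1.0):
    """KL weight that rises in a straight line from 0 to max_weight."""
    return max_weight * _progress(step, total_steps)


def cosine_kl_weight(step, total_steps, max_weight=1.0):
    """KL weight that follows a half-cosine ramp from 0 to max_weight.

    It starts and ends with zero slope, so the KL term is introduced
    more gently than with the linear schedule.
    """
    return max_weight * 0.5 * (1.0 - math.cos(math.pi * _progress(step, total_steps)))
```

In a training loop, the returned weight would multiply the KL divergence term before it is added to the reconstruction loss, so early epochs focus on reconstruction and the latent regularization is phased in over `total_steps`.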
├── data/essential_genes/ # Preprocessing results
├── models/
│ ├── trained_models/v0_model/ # Saved model weights
│ └── v0_model/
│ ├── figures/ # Training plots
│ └── sampling_results/ # Generated samples
└── [output-dir]/ # Final FASTA files
- Missing files: The pipeline automatically checks for required files and reports any that are missing
- Import errors: Ensure the virtual environment is activated
- GPU issues: The pipeline auto-detects GPU/CPU availability
- Path errors: Use absolute paths for model and data files
Pipeline flow: preprocess → training → sample → minimizer