This repository contains the transformer model and relevant training routines. It is a greatly distilled version of Harris Hardiman-Mostow's research repository, with optimizations and improvements tailored specifically to the DIST-S1 product, written by Diego Martinez. Additional notebooks are included to inspect the input dataset and visualize the model applied to existing OPERA RTC data.
- Install the environment using mamba:

  ```bash
  mamba env create -f environment_gpu.yml
  ```

- Activate the environment:

  ```bash
  conda activate dist-s1-model
  ```
- Training data (~53 GB): <url>
- Test data (~13 GB): <url>
Update the data paths in your configuration file (see Configuration section below).
Note:

We currently support two different datasets:

- a sequential time series to establish baselines, and
- one that uses windows around the anniversary dates of the target/post-image acquisition to establish a baseline.

The former is the original work done to prototype the algorithm; the latter is what the OPERA project aims to support, in line with the OPERA DIST suite. Currently, everything labeled *-redux or Redux refers to the latter, more recent dataset built on windows around anniversary dates (i.e., dataset 2). We will support both for provenance, though our current focus is the newer dataset, in keeping with the project's goal.
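To illustrate the anniversary-window idea, the sketch below computes candidate baseline windows around the anniversaries of a post-image acquisition date. The two-year lookback and the ±30-day half-window are illustrative assumptions, not values taken from the dataset code:

```python
from datetime import date, timedelta

def anniversary_windows(post_date: date, years_back: int = 2, half_window_days: int = 30):
    """For each prior year, return a (start, end) window centered on the
    anniversary of the post-image acquisition date. The lookback depth and
    half-window size here are illustrative assumptions."""
    windows = []
    for k in range(1, years_back + 1):
        try:
            anniversary = post_date.replace(year=post_date.year - k)
        except ValueError:
            # Feb 29 on a non-leap year: fall back to Feb 28.
            anniversary = post_date.replace(year=post_date.year - k, day=28)
        delta = timedelta(days=half_window_days)
        windows.append((anniversary - delta, anniversary + delta))
    return windows

# Baseline windows for a post-image acquired on 2024-06-15:
print(anniversary_windows(date(2024, 6, 15)))
```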
Create a configuration file (e.g., config.yml) with the following structure:

```yaml
# Data configuration
data:
  train_path: "/path/to/your/train_data.pt"
  test_path: "/path/to/your/test_data.pt"

# Model configuration
model_config:
  type: "SpatioTemporalTransformer"
  # Add your model-specific parameters here

# Training configuration
train_config:
  batch_size: 8
  learning_rate: 0.001
  num_epochs: 100
  seed: 42
  step_size: 30
  gamma: 0.1
  checkpoint_freq: 10
  input_size: 16  # Patch size for processing

# Save directories
save_dir:
  models: "./saved_models"
  checkpoints: "./checkpoints"
  visualizations: "./visualizations"

# Validation configuration (optional)
validation:
  enable_visual_validation: true
  enable_intermediate_validation: true
  intermediate_validation_freq: 10
  apply_smoothing: true
  smooth_sigma: 0.5
  blend_mode: "gaussian"

# Weights & Biases logging (optional)
use_wandb: true
wandb_project: "dist-s1-training"
wandb_entity: "your-entity"

# Resume training (optional)
# resume_checkpoint: "/path/to/checkpoint.pth"
```
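Before launching a long run, it can be useful to check that the config parses and has the expected top-level sections. This helper is a convenience sketch, not code from the repository; the section names match the example above, but trainer.py's own parsing and validation may differ:

```python
import yaml  # PyYAML

REQUIRED_SECTIONS = ["data", "model_config", "train_config", "save_dir"]

def load_config(path: str) -> dict:
    """Load the training config and check that the top-level sections from
    the example above are present. A sketch only; the trainer's own parsing
    may differ."""
    with open(path) as f:
        cfg = yaml.safe_load(f)
    missing = [k for k in REQUIRED_SECTIONS if k not in cfg]
    if missing:
        raise KeyError(f"config is missing sections: {missing}")
    return cfg
```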
Set up Accelerate configuration interactively:

```bash
accelerate config
```
Follow the prompts to configure:
- Compute environment (local machine or cluster)
- Machine type (multi-GPU, multi-node, etc.)
- Number of processes/GPUs
- Mixed precision settings
Create an Accelerate config file (accelerate_config.yml):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU  # or NO for single GPU
gpu_ids: all  # or specify specific GPUs like "0,1"
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 2  # Number of GPUs to use
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
Run training as a single process:

```bash
python trainer.py config.yml
```

or, for the Redux (anniversary-window) dataset:

```bash
python trainer_redux.py config_redux.yml
```

To launch with Accelerate (e.g., for multi-GPU training):

```bash
accelerate launch trainer.py config.yml
accelerate launch --config_file accelerate_config.yml train.py config.yml
accelerate launch --num_processes 2 train.py config.yml
```
If you encounter issues with PyTorch's dynamo compilation, you can disable it by setting an environment variable:

```bash
export TORCH_COMPILE_DISABLE=1
accelerate launch train.py config.yml
```
Add the checkpoint path to your config:

```yaml
resume_checkpoint: "/path/to/checkpoint_epoch_X.pth"
```
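If several checkpoints have accumulated, a small helper can pick the latest one to paste into `resume_checkpoint`. This is a convenience sketch, not part of the repository; it assumes the `checkpoint_epoch_X_MM-DD-YYYY_HH-MM.pth` naming listed in the checkpoints section below:

```python
import re
from pathlib import Path
from typing import Optional

def latest_checkpoint(checkpoint_dir: str) -> Optional[str]:
    """Return the checkpoint with the highest epoch number from filenames like
    checkpoint_epoch_12_06-15-2024_09-30.pth, or None if none are found.
    Adapt the regex if your filenames differ."""
    pattern = re.compile(r"checkpoint_epoch_(\d+)_")
    best_epoch, best_path = -1, None
    for p in Path(checkpoint_dir).glob("checkpoint_epoch_*.pth"):
        m = pattern.search(p.name)
        if m and int(m.group(1)) > best_epoch:
            best_epoch, best_path = int(m.group(1)), str(p)
    return best_path
```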
To capture training logs:

```bash
accelerate launch train.py config.yml > training.log 2> training.err
```
The training script supports Weights & Biases logging. Configure it in your YAML:

```yaml
use_wandb: true
wandb_project: "your-project-name"
wandb_entity: "your-entity"
```
Before using wandb for the first time, open a terminal session, activate the dist-s1-model environment, and run `wandb login`. The command line will prompt you for an API key, which can be found at https://wandb.ai/home.
Enable visual validation to monitor training progress:

```yaml
validation:
  enable_visual_validation: true
  enable_intermediate_validation: true
  intermediate_validation_freq: 10
```
Checkpoints are automatically saved based on the `checkpoint_freq` setting. The training script creates:

- Regular checkpoints: `checkpoint_epoch_X_MM-DD-YYYY_HH-MM.pth`
- Model weights: `ModelType_MM-DD-YYYY_HH-MM_epoch_X.pth`
- Final checkpoint: `final_checkpoint_MM-DD-YYYY_HH-MM.pth`
- Emergency checkpoints: saved automatically on interruption
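The `MM-DD-YYYY_HH-MM` timestamps in these filenames map directly onto Python's `strftime` codes. The helper below is a sketch for reproducing the naming (e.g., when scripting cleanup), not code from the repository:

```python
from datetime import datetime
from typing import Optional

def checkpoint_name(epoch: int, when: Optional[datetime] = None) -> str:
    """Build a filename matching the checkpoint pattern above
    (checkpoint_epoch_X_MM-DD-YYYY_HH-MM.pth); illustrative only."""
    when = when or datetime.now()
    return f"checkpoint_epoch_{epoch}_{when.strftime('%m-%d-%Y_%H-%M')}.pth"

# e.g. checkpoint_name(5) -> "checkpoint_epoch_5_<current timestamp>.pth"
```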
- CUDA Out of Memory: reduce `batch_size` in your configuration
- Compilation Errors: set the environment variable `TORCH_COMPILE_DISABLE=1`
- Multi-GPU Issues: ensure proper Accelerate configuration
- Data Loading Errors: verify the data paths in the configuration file
- Adjust `input_size` based on available GPU memory
- Enable gradient accumulation in the Accelerate config for larger effective batch sizes
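Gradient accumulation raises the effective batch size without increasing per-device memory; in Accelerate it is typically enabled via `Accelerator(gradient_accumulation_steps=...)` together with the `accelerator.accumulate(model)` context manager. The arithmetic behind "effective batch size" is simply:

```python
def effective_batch_size(per_device_batch: int, num_processes: int, accumulation_steps: int) -> int:
    """Effective global batch size when combining per-device batches,
    data-parallel processes, and gradient accumulation steps."""
    return per_device_batch * num_processes * accumulation_steps

# With batch_size 8 from the example config, 2 GPUs, and 4 accumulation steps:
print(effective_batch_size(8, 2, 4))  # 64
```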
The training script supports graceful interruption (Ctrl+C). It will:
- Save an emergency checkpoint
- Preserve training metrics
- Clean up resources properly
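The pattern behind this behavior is a try/except around the epoch loop: Ctrl+C raises `KeyboardInterrupt`, which is caught, the emergency checkpoint is written, and the exception is re-raised. The sketch below uses hypothetical stand-in callables; the actual trainer's hooks differ:

```python
def run_training(train_one_epoch, save_emergency_checkpoint, num_epochs: int):
    """Shape of the graceful-interruption handling described above.
    Both callables are hypothetical stand-ins for the trainer's real hooks."""
    try:
        for epoch in range(num_epochs):
            train_one_epoch(epoch)
    except KeyboardInterrupt:
        # Ctrl+C raises KeyboardInterrupt; save state before re-raising.
        save_emergency_checkpoint()
        raise
```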
See the included notebooks for model application examples. This section is currently under development.
A separate repository for SAR data curation is planned. This is currently a work in progress.
- OPERA Disturbance Suite: https://www.jpl.nasa.gov/go/opera/products/dist-product-suite/
- Hardiman-Mostow, Harris, Charles Marshak, and Alexander L. Handwerger. "Deep Self-Supervised Disturbance Mapping with the OPERA Sentinel-1 Radiometric Terrain Corrected SAR Backscatter Product." IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2025). arXiv
[Add your license information here]
[Add contributing guidelines here]
For issues and questions, please create an issue in this repository or contact the maintainers.