@naquib314 naquib314 commented Oct 24, 2025

This PR works in tandem with the following draft PR in the depmap-deploy repo:

run_pipelines_temp.sh is a temporary file showing what the Jenkins preprocessing shell script should look like.

The pipeline directory has been reorganized in the following way:

Before (master):

pipeline/
├── _run_common.conseq
├── cell_lines.conseq
├── celligner/
├── cn_gene/
├── context_explorer/
├── cor_analysis/
├── jenkins-run-pipeline.sh
├── jenkins-run-nonquarterly.sh
├── predictability/
├── scripts/
└── (etc.)

After (depmap-pipeline-reorg-25q3):

pipeline/
├── base_pipeline_runner.py           # Base class for all runners
├── pipeline_config.yaml               # Centralized configuration
├── image-name                         # Docker image reference
├── run_pipelines_temp.sh              # Temporary script showing what the Jenkins shell script should look like
├── preprocessing-pipeline/            # All preprocessing logic
│   ├── preprocessing_pipeline_runner.py  # Runner implementation
│   ├── celligner/
│   ├── context_explorer/
│   ├── cor_analysis/
│   ├── predictability/
│   ├── scripts/
│   └── (all preprocessing conseq files)
├── data-prep-pipeline/                # Data preparation pipeline
│   ├── data_prep_pipeline_runner.py   # Runner implementation
│   ├── README.md                      # Documentation
│   ├── data_prep_pipeline/            # Conseq files
│   ├── scripts/                       # Data prep scripts
│   ├── poetry.lock
│   └── pyproject.toml
└── analysis-pipeline/                 
    ├── analysis_pipeline_runner.py    
    ├── predictability/
    └── publish.conseq
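The reorganization above centers on a shared base class (base_pipeline_runner.py) that each pipeline's runner extends. A minimal sketch of that pattern, with invented method and key names (the real runners read their settings from pipeline_config.yaml rather than taking a dict):

```python
from abc import ABC, abstractmethod

class PipelineRunner(ABC):
    """Shared behavior for all runners; config would normally be loaded
    from pipeline_config.yaml rather than passed as a dict."""

    def __init__(self, config: dict):
        self.config = config

    @abstractmethod
    def run(self) -> str:
        """Each concrete runner implements its own pipeline execution."""

class PreprocessingPipelineRunner(PipelineRunner):
    def run(self) -> str:
        # illustrative only; the real runner invokes conseq with this image
        return f"preprocessing with image {self.config['docker_image']}"
```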

Introduces a --rows-per-model flag to make_fusions_matrix.py to optionally transpose the output matrix. Updates predictability.conseq to use this flag. Also updates publish.conseq messages for consistency and enables cngene_log_2_transformation in run_common.conseq. Adds a local_run.sh script for easier local execution.
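A hypothetical sketch of how such a transpose flag could work; the real make_fusions_matrix.py almost certainly structures this differently, and the matrix contents here are toy values:

```python
import argparse

def build_matrix(rows_per_model: bool):
    # toy matrix: fusion features as rows, models as columns by default
    matrix = [[1, 2, 3], [4, 5, 6]]
    if rows_per_model:
        # transpose so each row corresponds to one model
        matrix = [list(col) for col in zip(*matrix)]
    return matrix

def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--rows-per-model", action="store_true",
                        help="transpose the output so rows are models")
    return parser.parse_args(argv)
```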
Switched Docker image references to the depmap-consortium registry across multiple pipeline files for consistency and updated GCP project settings. Removed unused lineage rules and publish logic from data-prep-pipeline, deleted the obsolete jenkins-run-pipeline.sh, and added sparkles-config for Sparkles job configuration. Improved dstat_wrapper to handle additional terminal job states and made minor code cleanups.
README updated with revised setup and run instructions, including new dependency on depmap-deploy and changes to local execution steps. local_run.sh now supports 'internal' and 'external' environments with input validation. Removed unused transform_fusion.py script from predictability directory.
Refactored pipeline runners to load all hardcoded paths, credentials, Docker, and pipeline-specific settings from a new pipeline_config.yaml file. Added README_CONFIG.md to document the configuration structure and usage. Updated base, data prep, and preprocessing pipeline runners to use config values instead of hardcoded strings, improving maintainability and consistency.
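A minimal sketch of centralized config loading, assuming PyYAML; the section and key names are invented for illustration and may not match the real pipeline_config.yaml:

```python
import yaml  # assumes PyYAML, since the pipeline reads a YAML config

EXAMPLE_CONFIG = """\
docker:
  image: us.gcr.io/example-project/pipeline-run
paths:
  publish_dest: gs://example-bucket/results
"""

def load_config(text: str) -> dict:
    config = yaml.safe_load(text)
    # fail fast if a required section is missing
    for section in ("docker", "paths"):
        assert section in config, f"config missing section: {section}"
    return config
```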

Improve Docker image name handling and fix symlink

Refactored read_docker_image_name to handle missing or malformed image-name files more robustly, and improved handling of conseq_args in several methods. Replaced pipeline/image-name file with a symlink to ensure correct referencing.
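A hedged sketch of what "more robust" handling of the image-name file could look like; the actual read_docker_image_name may differ in both checks and error reporting:

```python
from pathlib import Path

def read_docker_image_name(path: str = "image-name") -> str:
    # surface missing or malformed files immediately instead of
    # failing later with a confusing Docker error
    p = Path(path)
    assert p.exists(), f"missing image-name file: {path}"
    name = p.read_text().strip()
    # a valid reference is a single non-empty token such as repo/image:tag
    assert name and len(name.split()) == 1, f"malformed image-name file: {path}"
    return name
```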

Update data_prep_pipeline_runner.py

Update _run_common.conseq

Move xrefs-external.template logic to xrefs-public.template

Migrated all dataset artifact definitions from xrefs-external.template to xrefs-public.template, replacing 'virtual_dataset_id' with 'virtual_permaname' for consistency. Updated xrefs-external.template to include xrefs-public.template and removed redundant logic, streamlining the preprocessing pipeline configuration.

Refactor pipeline runners for improved error handling and clarity

Replaces try/except blocks with assertions for file existence and content checks, and simplifies error handling in base_pipeline_runner.py. Refactors dataset usage tracking in data_prep_pipeline_runner.py and preprocessing_pipeline_runner.py to remove redundant exception handling and improve clarity. Also ensures that log backup and post-run tasks are consistently handled, and adds additional assertions for robustness.
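The assertion style described here can be sketched as follows; the function name and messages are illustrative, not taken from base_pipeline_runner.py:

```python
from pathlib import Path

def load_required_file(path: str) -> str:
    # Instead of wrapping reads in try/except and re-raising a custom
    # error, assert the preconditions up front and let any failure
    # surface immediately with a specific message.
    p = Path(path)
    assert p.is_file(), f"expected file does not exist: {path}"
    text = p.read_text()
    assert text.strip(), f"expected file is empty: {path}"
    return text
```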

Improve log formatting and add pipeline run script

Replaces log separators with clearer lines in base and preprocessing pipeline runners for better readability. Adds run_pipelines_temp.sh script to automate cleanup, setup, and execution of data prep and preprocessing pipelines, including error handling and optional DB rebuild trigger.
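A hypothetical outline of run_pipelines_temp.sh based on this description; the step names, runner invocation, and the REBUILD_DB variable are illustrative, not the script's actual contents:

```shell
#!/bin/bash
set -euo pipefail  # abort on the first failing step

cleanup() { echo "cleaning previous pipeline state"; }

run_pipeline() {
    echo "=== running $1 ==="
    # placeholder for: python "$1"/*_pipeline_runner.py ...
}

main() {
    cleanup
    run_pipeline "data-prep-pipeline"
    run_pipeline "preprocessing-pipeline"
    # optional DB rebuild trigger, gated by an (assumed) env var
    if [ "${REBUILD_DB:-false}" = "true" ]; then
        echo "triggering DB rebuild"
    fi
}

main "$@"
```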

Rename depmap_data_taiga_id to release_taiga_id

Replaces all references to 'depmap_data_taiga_id' with 'release_taiga_id' across pipeline and sample data scripts, including argument names, artifact types, and variable names. This change standardizes naming for clarity and consistency throughout the codebase.
@naquib314 naquib314 requested a review from pgm October 24, 2025 12:21
Updated local_run.sh to require an explicit 'internal' or 'external' parameter instead of defaulting to 'internal'. Improved error handling and updated README to reflect the new usage. Also added error raising in data_prep_pipeline_runner.py if release taiga ID is not found.
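The explicit-environment requirement can be sketched like this; the real local_run.sh may structure its validation differently:

```shell
#!/bin/bash
set -euo pipefail

validate_env() {
    # require an explicit 'internal' or 'external' argument -- no default
    case "${1:-}" in
        internal|external) echo "$1" ;;
        *) echo "usage: local_run.sh {internal|external}" >&2; return 1 ;;
    esac
}
```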
Updated the Celligner docker image SHA in celligner.conseq to use a newer version. Removed the unused dstat_wrapper.py script from the celligner directory.
Restrict supported Python version to >=3.9,<3.10 and update taigapy to version 4.1.0 in pyproject.toml. This ensures compatibility with the new taigapy release and clarifies the supported Python versions.
Replaces usage of release_taiga_id with RELEASE_PERMANAME from config throughout publish rules and upload_to_taiga.py. Simplifies upload_to_taiga.py by removing SHA256 checks and always updating the dataset. Updates preprocess_taiga_ids.py to export RELEASE_PERMANAME for downstream use and enables relevant includes in run_common.conseq.
@naquib314 naquib314 requested a review from pgm October 28, 2025 21:16
Moved common argument parsing and config building logic to the base PipelineRunner class. Updated data prep and preprocessing pipeline runners to use these shared methods, reducing code duplication. Centralized dataset usage tracking logic in the base class and updated runners to call the new method with the appropriate pipeline directory.
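The shared-parsing refactor might look roughly like this; the flag names and class shapes are assumptions for illustration, not the actual implementation:

```python
import argparse

class PipelineRunner:
    # shared flags defined once in the base class
    @staticmethod
    def build_arg_parser() -> argparse.ArgumentParser:
        parser = argparse.ArgumentParser()
        parser.add_argument("--env", choices=["internal", "external"], required=True)
        parser.add_argument("--publish-dest", default=None)
        return parser

class DataPrepPipelineRunner(PipelineRunner):
    def parse(self, argv):
        # reuse the shared parser instead of redefining flags per runner
        return self.build_arg_parser().parse_args(argv)
```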