Hi everyone,

I'm working on a Nextflow implementation of MassiveFold (a derivative of AlphaFold), which requires aligning sequences against a large database (~1.2 TB) stored in a directory. I'd like to use Kubernetes (K8s) as the executor, but I'm facing an issue: since I obviously can't copy ~1.2 TB of data into a pod every time the pipeline runs, how can I give the tasks efficient access to this data?

I wasn't able to use a Persistent Volume Claim (PVC) through Wave and Fusion, but maybe there's a workaround or a better approach? Through an S3 bucket?
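For context, the kind of setup I've been trying looks roughly like this (the namespace, claim names, and mount paths below are placeholders, not my actual values):

```groovy
// Rough sketch only; all names and paths are placeholders.
profiles {
    k8s_pvc {
        process.executor = 'k8s'

        k8s {
            namespace        = 'massivefold'   // placeholder namespace
            storageClaimName = 'nf-work-pvc'   // PVC used as the shared work dir
            storageMountPath = '/workspace'
        }

        process {
            // Mount the PVC holding the ~1.2 TB ColabFold database
            // into every pod that runs the alignment step.
            withLabel: 'colabfold' {
                pod = [ [volumeClaim: 'colabfold-db-pvc', mountPath: '/db'] ]
            }
        }
    }
}
```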
My process:
```groovy
process Alignement_with_colabfold {
    tag "$seqFile.baseName"
    publishDir "result/${seqFile.baseName}/alignment"
    label 'colabfold'

    input:
    path(seqFile)
    path(data_dir)
    val(pair_strategy)

    output:
    tuple val(seqFile.baseName), path("${seqFile.baseName}_msa*")

    script:
    """
    echo "=== Starting ColabFold alignment ==="
    echo "Sequence file: $seqFile"
    echo "Data directory: $data_dir"
    echo "Pair strategy: ${pair_strategy}"

    ls -a > ls.txt

    # Check that the database directory exists
    if [ ! -d "$data_dir" ]; then
        echo "ERROR: Database directory does not exist: $data_dir"
        exit 1
    fi

    # Set pairing strategy
    if [[ ${pair_strategy} == "greedy" ]]; then
        pairing_strategy=0
    elif [[ ${pair_strategy} == "complete" ]]; then
        pairing_strategy=1
    else
        echo "ValueError: --pair_strategy '${pair_strategy}' is not valid. Use 'greedy' or 'complete'"
        exit 1
    fi
    echo "Using pairing strategy: \$pairing_strategy"

    # Run ColabFold search
    colabfold_search $seqFile $data_dir ${seqFile.baseName}_msa --pairing_strategy \${pairing_strategy}

    echo "=== ColabFold alignment completed ==="
    """
}
```
and my config for this profile
Thanks in advance for your help!