Why new publish directive via workflow output only in entry workflow? #6325

xsvato01 · 2025-08-04T08:40:48Z

xsvato01
Aug 4, 2025

Hi, I was wondering if there’s a specific reason the new publish directive is only available in the entry workflow?
https://www.nextflow.io/docs/latest/workflow.html#workflow-outputs

A script can define an output block which declares the top-level outputs of the workflow. Each output should be assigned in the publish section of the entry workflow

It feels like a missed opportunity that this can't be used within named subworkflows. This limitation means I have to bubble up all my channels to the entry workflow just to publish files, which adds unnecessary overhead and reduces modularity.

Otherwise I would need to use publish-only processes:

 process PublishFastqs{
	tag "${task.cpus} CPUs ${task.memory}"
	publishDir "${params.outDir}/fastqs/", mode: 'copy'
	label "s_cpu"
	label "xs_mem"

	input:
	tuple val(meta), path(fastqs)

	output:
	path fastqs

	script:
	"""
	"""
}

bentsherman · 2025-08-04T14:23:22Z

bentsherman
Aug 4, 2025
Maintainer

To the contrary, allowing subworkflows to publish channels would hurt modularity. If a subworkflow is to be re-used across different pipelines, it should not impose publishing behavior on the calling workflow.

For example, I may call a subworkflow and only want to publish some of its channels. If the subworkflow inherently publishes all of its output channels instead of emitting them, that choice is taken from me as the caller.

This is also a best practice in software engineering more generally, to keep I/O at the "boundaries" (beginning and end) of your code. Makes components easier to test and re-use in different contexts.
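
To make the argument concrete, here is a minimal sketch of a subworkflow that only emits channels, while the entry workflow decides what gets published. The names MAKE_REPORT, REPORTING, and the `reports` target directory are hypothetical and chosen only for illustration; the `publish:` / `output` syntax and the preview flag follow the examples used elsewhere in this thread, and the flag may be unnecessary depending on your Nextflow version.

```nextflow
// Assumption: preview flag as used in the examples in this thread
nextflow.preview.output = true

// Hypothetical process: writes a report and a log for each sample id
process MAKE_REPORT {
    input:
    val id

    output:
    path "${id}.report.txt", emit: report
    path "${id}.log", emit: log

    script:
    """
    echo "report for ${id}" > ${id}.report.txt
    echo "log for ${id}" > ${id}.log
    """
}

// Reusable subworkflow: emits its channels and publishes nothing itself
workflow REPORTING {
    take:
    ids

    main:
    MAKE_REPORT(ids)

    emit:
    reports = MAKE_REPORT.out.report
    logs    = MAKE_REPORT.out.log
}

// Entry workflow: the caller chooses which emitted channels to publish
workflow {
    main:
    REPORTING(Channel.of('a', 'b'))

    publish:
    reports = REPORTING.out.reports   // the logs channel is deliberately not published
}

output {
    reports {
        path 'reports'
    }
}
```

Another pipeline could call the same REPORTING subworkflow and publish the logs instead, or nothing at all, without touching the subworkflow.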

6 replies

adamrtalbot Aug 5, 2025
Collaborator

With all due respect, you have some odd things about your workflow:

- Conditional logic based on if statements, which, while not unusual, isn't the best practice
- There are no inputs for 2 of the subworkflows and no outputs for any subworkflows
- Accessing the profile as if it were a parameter (I think this one is problematic)
- Using a "test" subworkflow separate from the main workflow (I'm not sure I understand what this tests, I assume it's infrastructure rather than pipeline logic?)

I think you should modify your workflow to:

- Have defined inputs and outputs for every component of your pipeline (function, process, workflow)
  - Each unit will become a composable, testable unit which can be reused.
- Control the data flow with channel logic rather than a switch
  - The workflow becomes more predictable and structured, making it easier to read and write.
- Use parameters to control the workflow rather than a profile
  - Parameters can be more easily controlled and validated than profiles, which are primarily designed for infrastructure management.

workflow {
    ch_datatype  = Channel.of(params.datatype)

    ch_inputs    = params.input ? 
                    Channel.fromPath(params.input, checkIfExists: true) : 
                    Channel.empty()

    ch_reanalyze = params.resultFolderToReanalyze ? 
                    Channel.fromPath(params.resultFolderToReanalyze, type: 'dir', checkIfExists: true) :
                    Channel.empty()
    
    ch_analysed   = Analyze(ch_inputs)
    ch_reanalyzed = Reanalyze(ch_reanalyze)
    // I would control test with parameters, but if you wanted:
    ch_test       = Test()
}

With these changes, the empty channels are generated automatically. The workflows are managed automatically by parameters without needing to check the profile, and you can run multiple subworkflows simultaneously if desired.

In this way, I believe the new outputs syntax encourages users to act correctly and make their output channels visible instead of buried deep within the workflow.

xsvato01 Aug 5, 2025
Author

Thank you, Adam. The reason I use profiles is the number of parameters that are tied to each profile:

profiles {
    analyze {
        params {
            sequencerOutputs                 = ['/srv/nextseq2', '/srv/nextseq3', '/srv/nextseq4/']
            runFolder                        = false
            notify                           = true
            saveAnalysisName                 = true
        }
    }

    test {
        params {
            dataFolder                       = "/mnt/data_for_tests/"
            saveAnalysisName                 = false
            notify                           = false
        }
    }

    reanalyze {
        params {
            resultFolderToReanalyze          = "/mnt/inputs/bams/"
            saveAnalysisName                 = false
            notify                           = false
        }
    }
}

Also, the input logic is much more complicated for the analyze profile. But perhaps, to linearize the logic, I could do something like:

fastqs_ch = workflow.profile  == "analyze" ? PrimaryAnalysis() : RevertFastqs()

And then continue the same way.

I do not need any inputs for these wrapper subworkflows since they are already defined in the profile parameters.

The purpose of the testing subworkflow was to support automated GitHub Action testing. The idea was to provide a small set of test data alongside expected outputs. This testing mode acts as a wrapper around the existing reanalyze mode, with added verification to ensure that newly generated results match the predefined "ground truth" data. Basically, the idea was to join the output data with predefined reference data and run md5sum in a process that would return a non-zero exit status if the files don't match.

I checked https://www.nf-test.com/docs/testcases/nextflow_workflow/
but it did not seem to be the kind of testing I have in mind.

Any of your further insights would be valuable.

bentsherman Aug 5, 2025
Maintainer

It seems like you are trying to wrap multiple standalone Nextflow pipelines with another Nextflow script. This is totally possible, but I would consider maintaining each workflow as a separate pipeline in its own subdirectory:

analyze/
  main.nf
  nextflow.config
reanalyze/
  main.nf
  nextflow.config
test/
  main.nf
  nextflow.config

You can launch these workflows directly from the Nextflow CLI:

nextflow run <repo>/analyze/main.nf

And you can write a Bash script to route these workflows if you need to. It just seems like that would save you some boilerplate, and you wouldn't need to use config profiles to route pipeline code.

adamrtalbot Aug 5, 2025
Collaborator

Since your profiles are 'just' changing parameters, it seems easier to modify your workflow based on the params and not on the profile set. Profiles are brittle and not intended for modifying the workflow logic.

This testing mode acts as a wrapper around the existing reanalyze mode, with added verification to ensure that newly generated results match the predefined "ground truth" data. Basically, the idea was to join the output data with predefined reference data and run md5sum in a process that would return a non-zero exit status if the files don't match.

Then this is a very good argument for exporting results from the Analyze/Reanalyze workflow. If you export results, you can add on the MD5 test after running the core logic and be sure you aren't introducing a deviation between the Analyze and Test workflow. Here is a modification to my previous workflow:

// example workflow checking MD5s
workflow Test {
    take:
        analyzed_files

    main:
        ch_validity_check = CHECK_MD5S(analyzed_files)
        
    emit:
        checks = ch_validity_check
}

workflow {
    ch_sequencer  = Channel.of(params.sequencerOutputs)

    ch_inputs    = params.input ? 
                    Channel.fromPath(params.input, checkIfExists: true) : 
                    Channel.empty()

    ch_reanalyze = params.resultFolderToReanalyze ? 
                    Channel.fromPath(params.resultFolderToReanalyze, type: 'dir', checkIfExists: true) :
                    Channel.empty()
    
    ch_analysed   = Analyze(ch_sequencer)
    ch_reanalyzed = Reanalyze(ch_reanalyze)
    ch_test       = Test(ch_reanalyzed)
}

I'm not sure what's missing with nf-test. It has support for a range of parameters, profiles, md5s, snapshots and lots of other features that make it very convenient for testing workflows without modifying the core behaviour of the pipeline. The only limitation I have found so far is lack of cloud support. Note: nf-test is maintained by community members and not a core Nextflow product.

xsvato01 Aug 5, 2025
Author

@bentsherman I remembered seeing this kind of design—where each script has its own config—in some nf-core pipelines, but I wasn’t sure it was still a valid approach. Thank you for confirming! I already have the individual "modes" in separate folders, each with its own main.nf.

@adamrtalbot Thank you, Adam, for helping me think about the workflow from a different angle. It's easy to get locked into one approach after spending a lot of time with it, so your perspective was really helpful.

I really appreciate both of your insights—and the incredible response times!

bentsherman · 2025-08-04T18:55:53Z

bentsherman
Aug 4, 2025
Maintainer

You would need to propagate the output channels for each subworkflow or leave them empty:

workflow {
    main:
    println "$workflow.profile pipeline"

    if( workflow.profile == "analyze" )
        ch_analyze = Analyze()
    else
        ch_analyze = channel.empty()

    if( workflow.profile == "reanalyze" )
        ch_reanalyze = Reanalyze(params.datatype, params.resultFolderToReanalyze)
    else
        ch_reanalyze = channel.empty()

    if( workflow.profile == "test" )
        ch_test = Test()
    else
        ch_test = channel.empty()

    publish:
    analyze = ch_analyze
    reanalyze = ch_reanalyze
    test = ch_test
}

Of course this example assumes one output channel per subworkflow. I understand that this becomes cumbersome when each subworkflow has many output channels.

I have noticed from studying nf-core pipelines that subworkflows often have many related outputs (e.g. BAM_MARKDUPLICATES_PICARD). I'm guessing that you follow a similar pattern?

I think this is a huge source of verbosity in general, but especially in your case of propagating outputs through multiple levels of subworkflows. Ideally I think this subworkflow would output a single channel of records with the following structure:

[
  id: meta.id,
  // other meta properties...
  bam: /* ... */ ,
  cram: /* ... */ ,
  metrics: /* ... */ ,
  bai: /* ... */ ,
  crai: /* ... */ ,
  csi: /* ... */ ,
  stats: /* ... */ ,
  flagstat: /* ... */ ,
  idxstats: /* ... */ ,
]

This is much easier to emit back up to the entry workflow and publish. Nextflow can even write out the channel into a CSV or JSON file now that you have the map keys. And I know that many people have asked for this kind of pattern just because it's better in general.

This is currently difficult to do because of how process inputs/outputs work. We are working on some possible improvements to the process syntax to make it possible.

In the meantime, you could try to create a single output channel with the join operator:

result = bam
  .join(cram)
  .join(metrics)
  // ...

But until we improve the process input/output syntax, I think this pattern will be ugly either way 🙁
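
As a rough illustration of the join pattern above, here is a minimal sketch that joins two keyed channels and maps the result into a single record per sample. The sample ids and file names are placeholder strings rather than real files, and the closure parameter names are assumptions made for this example; in a real pipeline the channels would come from process outputs shaped as [key, file] tuples.

```nextflow
workflow {
    // Placeholder channels: each item is a [sample id, value] tuple
    bam     = Channel.of(['s1', 's1.bam'], ['s2', 's2.bam'])
    metrics = Channel.of(['s1', 's1.metrics.txt'], ['s2', 's2.metrics.txt'])

    // join matches items on the first element (the sample id),
    // then map turns each joined tuple into a record with named keys
    result = bam
        .join(metrics)
        .map { id, bam_file, metrics_file ->
            [id: id, bam: bam_file, metrics: metrics_file]
        }

    result.view()
}
```

Each emitted item is a map keyed by `id`, `bam`, and `metrics`, which is the shape that can be emitted up to the entry workflow as one channel instead of one channel per file type.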

5 replies

xsvato01 Aug 5, 2025
Author

Hi Ben,
perhaps using the ifEmpty operator to handle the unused subworkflows might do the job as well:

workflow {
    println "$workflow.profile pipeline"

    switch (workflow.profile) {
        case "analyze":
            ch_analyze = Analyze().ifEmpty('')
            break
        case "reanalyze":
            ch_reanalyze = Reanalyze(params.datatype, params.resultFolderToReanalyze).ifEmpty('')
            break
        case "test":
            ch_test = Test().ifEmpty('')
            break
        default:
            println "Please select run profile, e.g. '-profile analyze'"
    }
}

As you mentioned, the key might be to join the outputs from multiple subworkflow levels and emit them as a single channel. In that case, I assume I could publish the files to a single target folder — even if it’s dynamically set per sample or subject.

In my scenario, however, the complication is that the files need to be distributed across different disks. That would likely mean I need to create a separate channel for each publish destination — is that correct?

adamrtalbot Aug 5, 2025
Collaborator

In my scenario, however, the complication is that the files need to be distributed across different disks. That would likely mean I need to create a separate channel for each publish destination — is that correct?

One of the current limitations of the workflow outputs syntax is that it assumes one root for publishing, e.g. all outputs are relative to ./outputs. If your disks are all mounted at a common path (e.g. /mnt) then you can use this as the root and assign each one to different disks:

output {
    outputs1 {
        path "${params.disk1}/"
    }
    output2 {
        path "${params.disk2}/"
    }
}

or calculate it dynamically:

output {
    outputs1 {
        path { sample -> sample.disk  ?: "disk1/" }
    }
    output2 {
        path { sample -> sample.disk  ?: "disk2/" }
    }
}

The common root is probably something that needs to change because publishing to separate directories is relatively common.

xsvato01 Aug 5, 2025
Author

Another constraint I see with this otherwise elegant approach arises when not all analyses are performed for all samples. In that case, I assume I would have to normalize the sample object so that it contains nulls for the keys whose result files were not produced. Of course, this only applies if I join many channels in the nested subworkflows to reduce the number of channels propagated to the entry script.

output {
    samples {
        path { sample ->
            sample.fastq_1 >> "fastq/${sample.id}/"
            sample.fastq_2 >> "fastq/${sample.id}/"
        }
    }
}

E.g., in this example from the docs, "fastq_1" or "fastq_2" might simply not exist for some samples (imagine not FASTQ files here, but additional result files that are only computed depending on a complex input CSV/TSV/JSON file that serves as the analysis samplesheet).

adamrtalbot Aug 5, 2025
Collaborator

Nextflow will just handle this. In your example above, if sample.fastq_2 were missing, it wouldn't be published.

This is a bit contrived, but all the complicated logic is in getting the channels into the right shape; the actual publishing logic is relatively simple. It makes more sense when combined with Ben's data types syntax.

nextflow.preview.output = true

process CREATE_FILES {

    input:
        val n

    output:
        tuple val(n), path("*file1.txt"), emit: file1
        tuple val(n), path("*file2.txt"), emit: file2, optional: true
    
    script:
    // If number is odd create file2
    def create_file2_cmd = (n % 2 == 1) ? "touch ${n}_file2.txt" : "" 
    """
    touch ${n}_file1.txt
    $create_file2_cmd
    """
}

workflow {
    main:

        ch_vals = Channel.of(1..5)
        myFiles = CREATE_FILES(ch_vals)

        outfiles = myFiles.file1
            .join(myFiles.file2, remainder: true)
            .map { n, file1, file2 ->
                [
                    n: n, 
                    file1: file1 ? file(file1) : null,
                    file2: file2 ? file(file2) : null
                ]
            }
    
    publish:
        outfiles = outfiles
}

output {
    outfiles {
        path { myFiles ->
            myFiles.file1 >> "file1/"
            myFiles.file2 >> "file2/"
        }
        index {
            path 'samples.csv'
        }
    }
}

Index file:

"5","results/file1/5_file1.txt","results/file2/5_file2.txt"
"1","results/file1/1_file1.txt","results/file2/1_file2.txt"
"3","results/file1/3_file1.txt","results/file2/3_file2.txt"
"4","results/file1/4_file1.txt"
"2","results/file1/2_file1.txt"

bentsherman Aug 5, 2025
Maintainer

Thanks Adam. Setting up the input channels with a ternary is much cleaner.

Like Adam said, the publish >> operator automatically skips null values so that it "just works".

In fact the conversion logic is even simpler because file1 and file2 are already path types:

        outfiles = myFiles.file1
            .join(myFiles.file2, remainder: true)
            .map { n, file1, file2 ->
                [
                    n: n, 
                    file1: file1,
                    file2: file2
                ]
            }

Also, you might have found a bug because the index file should still contain null for null values...
