Refresh troubleshooting docs #5856

Open · christopher-hakkaart wants to merge 20 commits into nextflow-io:master from docs-troubleshoot-2

20 commits

eb7ee64
Move troubleshooting to new section
christopher-hakkaart Mar 5, 2025
f7c391a
Language improvements
christopher-hakkaart Mar 5, 2025
d3b54ba
Revert sections
christopher-hakkaart Mar 10, 2025
26f3c2e
Merge branch 'nextflow-io:master' into docs-troubleshoot-2
christopher-hakkaart Mar 10, 2025
951fabf
Fix placeholder
christopher-hakkaart Mar 10, 2025
929cb56
Update link
christopher-hakkaart Mar 10, 2025
aa94299
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart Mar 11, 2025
fda0d10
Merge branch 'docs-troubleshoot-2' of https://github.com/christopher-…
christopher-hakkaart Mar 11, 2025
3dc2b9b
Add tip section
christopher-hakkaart Mar 11, 2025
45842dc
Move headings down a level
christopher-hakkaart Mar 11, 2025
e32c39c
Fix missed heading
christopher-hakkaart Mar 11, 2025
dcae181
Prepare for review
christopher-hakkaart Mar 11, 2025
ceba3f4
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart Mar 11, 2025
8c19601
Merge with master@00a53b97 [ci fast]
pditommaso Mar 15, 2025
8cfe891
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart Mar 17, 2025
eb20622
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart Apr 9, 2025
d35ff8b
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart May 13, 2025
c77c0f7
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart May 20, 2025
9170f99
Merge branch 'master' into docs-troubleshoot-2
christopher-hakkaart May 21, 2025
aafa33a
Remove link to removed section
christopher-hakkaart May 21, 2025
40 changes: 19 additions & 21 deletions docs/aws.md
@@ -476,42 +476,40 @@ The above snippet defines two volume mounts for the jobs executed in your pipeline

### Troubleshooting

**Problem**: The Pipeline execution terminates with an AWS error message similar to the one shown below:
<h4>Job queue not found</h4>

```
JobQueue <your queue> not found
```
**`JobQueue <QUEUE> not found`**

Make sure you have defined a AWS region in the Nextflow configuration file and it matches the region in which your Batch environment has been created.
This error occurs when Nextflow cannot locate the specified AWS Batch job queue. It usually happens when the job queue does not exist, is not enabled, or there is a region mismatch between the configuration and the AWS Batch environment.

**Problem**: A process execution fails reporting the following error message:
To resolve this error, ensure you have defined an AWS region in your `nextflow.config` file and that it matches your Batch environment region.
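
For example, a minimal configuration sketch (the region and queue names below are placeholders for your own values):

```groovy
// Hypothetical values -- replace with your own region and job queue
process.executor = 'awsbatch'
process.queue    = 'my-batch-queue'
aws.region       = 'eu-west-1'
```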

```
Process <your task> terminated for an unknown reason -- Likely it has been terminated by the external system
```
<h4>Process terminated for an unknown reason</h4>

This may happen when Batch is unable to execute the process script. A common cause of this problem is that the Docker container image you have specified uses a non standard [entrypoint](https://docs.docker.com/engine/reference/builder/#entrypoint) which does not allow the execution of the Bash launcher script required by Nextflow to run the job.
**`Process terminated for an unknown reason -- Likely it has been terminated by the external system`**

This may also happen if the AWS CLI doesn't run correctly.
This error typically occurs when AWS Batch is unable to execute the process script. The most common reason is that the specified Docker container image has a non-standard entrypoint that prevents the execution of the Bash launcher script required by Nextflow to run the job. Another possible cause is an issue with the AWS CLI failing to run correctly within the job environment.

Other places to check for error information:
To resolve this error, ensure that the Docker container image used for the job does not define a custom entrypoint that prevents the Bash launcher script from running, and that the AWS CLI is installed and working correctly in the job environment.
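
If the AWS CLI is provided by a custom AMI rather than by the container image, you can point Nextflow at its location with the `aws.batch.cliPath` option. A sketch, assuming the CLI is installed under a Conda environment on the host:

```groovy
// Hypothetical path -- set this to where the AWS CLI is installed on your compute environment AMI
aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
```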

- The `.nextflow.log` file.
- The Job execution log in the AWS Batch dashboard.
- The [CloudWatch](https://aws.amazon.com/cloudwatch/) logs found in the `/aws/batch/job` log group.
Check the following logs for more detailed error information:

**Problem**: A process execution is stalled in the `RUNNABLE` status and the pipeline output is similar to the one below:
- The `.nextflow.log` file
- The Job execution log in the AWS Batch dashboard
- The CloudWatch logs found in the `/aws/batch/job` log group

<h4>Process stalled in RUNNABLE status</h4>

If a process execution is stalled in the `RUNNABLE` status, you may see output similar to the following:

```
executor > awsbatch (1)
process > <your process> (1) [ 0%] 0 of ....
process > <PROCESS> (1) [ 0%] 0 of ....
```

It may happen that the pipeline execution hangs indefinitely because one of the jobs is held in the queue and never gets executed. In AWS Console, the queue reports the job as `RUNNABLE` but it never moves from there.

There are multiple reasons why this can happen. They are mainly related to the Compute Environment workload/configuration, the docker service or container configuration, network status, etc.
This error occurs when a job remains stuck in the RUNNABLE state in AWS Batch and never progresses to execution. In the AWS Console, the job will be listed as RUNNABLE indefinitely, indicating that it’s waiting to be scheduled but cannot proceed. The root cause is often related to issues with the Compute Environment, Docker configuration, or network settings.

This [AWS page](https://aws.amazon.com/premiumsupport/knowledge-center/batch-job-stuck-runnable-status/) provides several resolutions and tips to investigate and work around the issue.
See [Why is my AWS Batch job stuck in RUNNABLE status?](https://repost.aws/knowledge-center/batch-job-stuck-runnable-status) for several resolutions and tips to investigate this error.

(aws-fargate)=

137 changes: 87 additions & 50 deletions docs/cache-and-resume.md
@@ -69,120 +69,138 @@ For this reason, it is important to preserve both the task cache (`.nextflow/cache`)

## Troubleshooting

Cache failures happen when either (1) a task that was supposed to be cached was re-executed, or (2) a task that was supposed to be re-executed was cached.
Cache failures occur when a task that was supposed to be cached was re-executed or a task that was supposed to be re-executed was cached. This page provides an overview of common causes for cache failures and strategies to identify them.

When this happens, consider the following questions:
Common causes of cache failures include:

- Is resume enabled via `-resume`?
- Is the {ref}`process-cache` directive set to a non-default value?
- Is the task still present in the task cache and work directory?
- Were any of the task inputs changed?
- {ref}`Resume not being enabled <cache-failure-resume>`
- {ref}`Non-default cache directives <cache-failure-directives>`
- {ref}`Modified inputs <cache-failure-modified>`
- {ref}`Inconsistent file attributes <cache-failure-inconsistent>`
- {ref}`Race condition on a global variable <cache-global-var-race-condition>`
- {ref}`Non-deterministic process inputs <cache-nondeterministic-inputs>`

Changing any of the inputs included in the [task hash](#task-hash) will invalidate the cache, for example:
The causes of these cache failures and the solutions to resolve them are described in detail below.

- Resuming from a different session ID
- Changing the process name
- Changing the task container image or Conda environment
- Changing the task script
- Changing an input file or bundled script used by the task
(cache-failure-resume)=

### Resume not enabled

The `-resume` option is required to resume a pipeline. Ensure that `-resume` is included in your run command or enabled in your Nextflow configuration file.
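
For example, resume is typically enabled on the command line with `-resume`. As a sketch, it can also be enabled for every run through the unscoped `resume` configuration option:

```groovy
// Equivalent to passing -resume on the command line
resume = true
```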

(cache-failure-directives)=

### Non-default cache directives

The `cache` directive is enabled by default. However, you can disable or modify its behavior for a specific process. For example:

```nextflow
process FOO {
    cache false
    // ...
}
```

While the following examples would not invalidate the cache:
Ensure that the `cache` directive has not been set to a non-default value for the process. See {ref}`process-cache` for more information.

- Changing the value of a directive (other than {ref}`process-ext`), even if that directive is used in the task script
(cache-failure-modified)=

In many cases, cache failures happen because of a change to the pipeline script or configuration, or because the pipeline itself has some non-deterministic behavior.
### Modified inputs

Here are some common reasons for cache failures:
Modifying inputs that are used in the task hash will invalidate the cache. Common causes of modified inputs include:

### Modified input files
- Changing input files
- Resuming from a different session ID
- Changing the process name
- Changing the task container image or Conda environment
- Changing the task script
- Changing a bundled script used by the task

Make sure that your input files have not been changed. Keep in mind that the default caching mode uses the complete file path, the last modified timestamp, and the file size. If any of these attributes change, the task will be re-executed, even if the file content is unchanged.
:::{note}
Changing the value of any directive, except {ref}`process-ext`, will not invalidate the task cache.
:::
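
For example, in the following configuration sketch (the process name and arguments are hypothetical), changing `ext.args` would invalidate the cache for `FOO`, while changing `cpus` would not:

```groovy
process {
    withName: 'FOO' {
        cpus     = 4                   // changing this does not invalidate the cache
        ext.args = '--min-qual 20'     // changing this does invalidate the cache
    }
}
```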

### Process that modifies its inputs
An input file's hash is calculated from its complete path, last modified timestamp, and size. If any of these attributes change, the task will be re-executed, even if the file content is unchanged. A process that modifies its own input files therefore cannot be resumed. Processes that modify their own input files are considered an anti-pattern and should be avoided.

If a process modifies its own input files, it cannot be resumed for the reasons described in the previous point. As a result, processes that modify their own input files are considered an anti-pattern and should be avoided.
(cache-failure-inconsistent)=

### Inconsistent file attributes

Some shared file systems, such as NFS, may report inconsistent file timestamps, which can invalidate the cache. If you encounter this problem, you can avoid it by using the `'lenient'` {ref}`caching mode <process-cache>`, which ignores the last modified timestamp and uses only the file path and size.
Some shared file systems, such as NFS, may report inconsistent file timestamps. If you encounter this problem, use the `'lenient'` {ref}`caching mode <process-cache>`, which ignores the last modified timestamp and uses only the file path and size.
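
For example, a sketch of a process that uses the lenient caching mode:

```nextflow
process FOO {
    cache 'lenient'
    // ...
}
```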

(cache-global-var-race-condition)=

### Race condition on a global variable

While Nextflow tries to make it easy to write safe concurrent code, it is still possible to create race conditions, which can in turn impact the caching behavior of your pipeline.

Consider the following example:
Race conditions can disrupt the caching behavior of your pipeline. For example:

```nextflow
channel.of(1,2,3) | map { v -> X=v; X+=2 } | view { v -> "ch1 = $v" }
channel.of(1,2,3) | map { v -> X=v; X*=2 } | view { v -> "ch2 = $v" }
```

The problem here is that `X` is declared in each `map` closure without the `def` keyword (or other type qualifier). Using the `def` keyword makes the variable local to the enclosing scope; omitting the `def` keyword makes the variable global to the entire script.

Because `X` is global, and operators are executed concurrently, there is a *race condition* on `X`, which means that the emitted values will vary depending on the particular order of the concurrent operations. If the values were passed as inputs into a process, the process would execute different tasks on each run due to the race condition.
In the above example, `X` is declared in each `map` closure. Without the `def` keyword, or another type qualifier, the variable `X` is global to the entire script. Because operators are executed concurrently and `X` is global, there is a *race condition* that causes the emitted values to vary depending on the order of the concurrent operations. If these values were passed to a process as inputs, the process would execute different tasks on each run due to the race condition.

The solution is to not use a global variable where a local variable is enough (or in this simple example, avoid the variable altogether):
To resolve this failure, declare the variable with the `def` keyword so that it is local to the closure:

```nextflow
// local variable
channel.of(1,2,3) | map { v -> def X=v; X+=2 } | view { v -> "ch1 = $v" }
```

Alternatively, remove the variable:

```nextflow
// no variable
channel.of(1,2,3) | map { v -> v * 2 } | view { v -> "ch2 = $v" }
```

(cache-nondeterministic-inputs)=

### Non-deterministic process inputs

Sometimes a process needs to merge inputs from different sources. Consider the following example:
A process that receives inputs from multiple channels may combine them non-deterministically, which can invalidate the cache. For example:

```nextflow
workflow {
    ch_foo = channel.of( ['1', '1.foo'], ['2', '2.foo'] )
    ch_bar = channel.of( ['2', '2.bar'], ['1', '1.bar'] )
    gather(ch_foo, ch_bar)
}

process gather {
    input:
    tuple val(id), file(foo)
    tuple val(id), file(bar)

    script:
    """
    merge_command $foo $bar
    """
}
```

It is tempting to assume that the process inputs will be matched by `id` like the {ref}`operator-join` operator. But in reality, they are simply merged like the {ref}`operator-merge` operator. As a result, not only will the process inputs be incorrect, they will also be non-deterministic, thus invalidating the cache.
In the above example, it is tempting to assume that the inputs will be matched by `id`, as with the {ref}`operator-join` operator. In reality, they are combined in the order they arrive, in the same way as the {ref}`operator-merge` operator. As a result, the process inputs are both incorrect and non-deterministic, which invalidates the cache.

The solution is to explicitly join the two channels before the process invocation:
To resolve this failure, join the two channels before invoking the process so that the inputs are matched deterministically:

```nextflow
workflow {
    ch_foo = channel.of( ['1', '1.foo'], ['2', '2.foo'] )
    ch_bar = channel.of( ['2', '2.bar'], ['1', '1.bar'] )
    gather(ch_foo.join(ch_bar))
}

process gather {
    input:
    tuple val(id), file(foo), file(bar)

    script:
    """
    merge_command $foo $bar
    """
}
```

(cache-compare-hashes)=

## Tips

### Resuming from a specific run
### Resume from a specific run

Nextflow resumes from the previous run by default. If you want to resume from an earlier run, simply specify the session ID for that run with the `-resume` option:

@@ -192,39 +210,58 @@ nextflow run rnaseq-nf -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf

You can use the {ref}`cli-log` command to view all previous runs as well as the task executions for each run.

(cache-compare-hashes)=
### Compare task hashes

### Comparing the hashes of two runs
By identifying differences between the task hashes of two runs, you can detect the changes that may be causing cache failures.

One way to debug a resumed run is to compare the task hashes of each run using the `-dump-hashes` option.
To compare the task hashes for a resumed run:

1. Perform an initial run: `nextflow -log run_initial.log run <pipeline> -dump-hashes`
2. Perform a resumed run: `nextflow -log run_resumed.log run <pipeline> -dump-hashes -resume`
3. Extract the task hash lines from each log (search for `cache hash:`)
4. Compare the runs with a diff viewer
1. Run your pipeline with the `-log` and `-dump-hashes` options:

While some manual effort is required, the final diff can often reveal the exact change that caused a task to be re-executed.
```bash
nextflow -log run_initial.log run <PIPELINE> -dump-hashes
```

2. Run your pipeline with the `-log`, `-dump-hashes`, and `-resume` options:

```bash
nextflow -log run_resumed.log run <PIPELINE> -dump-hashes -resume
```

3. Extract the task hash lines from each log:

```bash
cat run_initial.log | grep 'INFO.*TaskProcessor.*cache hash' | cut -d ' ' -f 10- | sort | awk '{ print; print ""; }' > run_initial.tasks.log
cat run_resumed.log | grep 'INFO.*TaskProcessor.*cache hash' | cut -d ' ' -f 10- | sort | awk '{ print; print ""; }' > run_resumed.tasks.log
```

4. Compare the runs:

```bash
diff run_initial.tasks.log run_resumed.tasks.log
```

:::{tip}
You can also compare the hash lines using a graphical diff viewer.
:::

:::{versionadded} 23.10.0
:::

When using `-dump-hashes json`, the task hashes can be more easily extracted into a diff. Here is an example Bash script to perform two runs and produce a diff:
Task hashes can also be extracted into a diff using `-dump-hashes json`. The following is an example Bash script to compare two runs and produce a diff:

```bash
nextflow -log run_1.log run $pipeline -dump-hashes json
nextflow -log run_2.log run $pipeline -dump-hashes json -resume

get_hashes() {
    cat $1 \
        | grep 'cache hash:' \
        | cut -d ' ' -f 10- \
        | sort \
        | awk '{ print; print ""; }'
}

get_hashes run_1.log > run_1.tasks.log
get_hashes run_2.log > run_2.tasks.log

diff run_1.tasks.log run_2.tasks.log
```

1 change: 0 additions & 1 deletion docs/google.md
@@ -289,4 +289,3 @@ Nextflow will automatically manage the transfer of input and output files between
- Compute resources in Google Cloud are subject to [resource quotas](https://cloud.google.com/compute/quotas), which may affect your ability to run pipelines at scale. You can request quota increases, and your quotas may automatically increase over time as you use the platform. In particular, GPU quotas are initially set to 0, so you must explicitly request a quota increase in order to use GPUs. You can initially request an increase to 1 GPU at a time, and after one billing cycle you may be able to increase it further.

- Currently, it's not possible to specify a disk type different from the default one assigned by the service depending on the chosen instance type.

2 changes: 1 addition & 1 deletion docs/reference/cli.md
@@ -1088,7 +1088,7 @@ The `run` command is used to execute a local pipeline script or remote pipeline
`-dump-hashes`
: Dump task hash keys for debugging purposes.
: :::{versionadded} 23.10.0
You can use `-dump-hashes json` to dump the task hash keys as JSON for easier post-processing. See the {ref}`caching and resuming tips <cache-compare-hashes>` for more details.
You can use `-dump-hashes json` to dump the task hash keys as JSON for easier post-processing. See {ref}`cache-compare-hashes` for more details.
:::

`-e.<key>=<value>`
48 changes: 47 additions & 1 deletion docs/vscode.md
@@ -58,12 +58,58 @@ The extension can generate a workflow DAG that includes the workflow inputs, out

To preview the DAG of a workflow, select the **Preview DAG** CodeLens above the workflow definition.

:::{note}
The **Preview DAG** CodeLens is only available when the script does not contain any errors.
:::

## Troubleshooting

In the event of a language server error, you can use the **Nextflow: Restart language server** command in the command palette to restart the language server.
### Stop and restart

In the event of an error, stop or restart the language server from the Command Palette. The following stop and restart commands are available:

- `Nextflow: Stop language server`
- `Nextflow: Restart language server`

### View logs

Error logs can be useful when troubleshooting issues with the extension.

To view logs in VS Code:

1. Open the **Output** tab in your console.
2. Select **Nextflow Language Server** from the dropdown.

To show additional log messages in VS Code:

1. Open the **Extensions** view in the left-hand menu.
2. Select the **Nextflow** extension.
3. Select the **Manage** icon.
4. Enable **Nextflow > Debug** in the extension settings.

### Common errors

<h4>Filesystem changes</h4>

The language server does not detect certain filesystem changes, such as changing the current Git branch.

To resolve this issue, restart the language server from the Command Palette to sync it with your workspace. See [Stop and restart](#stop-and-restart) for more information.

<h4>Third-party plugins</h4>

The language server does not recognize configuration options from third-party plugins and will report unrecognized config option warnings. There is currently no solution to suppress them.

<h4>Groovy scripts</h4>

The language server provides limited support for Groovy scripts in the `lib` directory. Errors in Groovy scripts are not reported as diagnostics, and changing a Groovy script does not automatically re-compile the Nextflow scripts that reference it.

To resolve this issue, edit or close and re-open the Nextflow script to refresh the diagnostics.
Report issues at [nextflow-io/vscode-language-nextflow](https://github.com/nextflow-io/vscode-language-nextflow) or [nextflow-io/language-server](https://github.com/nextflow-io/language-server). When reporting, include a minimal code snippet that reproduces the issue and any error logs from the server. To view logs, open the **Output** tab and select **Nextflow Language Server** from the dropdown. Enable **Nextflow > Debug** in the extension settings to show additional log messages while debugging.

### Reporting issues

Report issues at [nextflow-io/vscode-language-nextflow](https://github.com/nextflow-io/vscode-language-nextflow) or [nextflow-io/language-server](https://github.com/nextflow-io/language-server). When reporting issues, include a minimal code snippet that reproduces the issue and any error logs from the server.

## Limitations

- The language server does not detect certain filesystem changes, such as changing the current Git branch. Restart the language server from the command palette to sync it with your workspace.