diff --git a/docs/aws.md b/docs/aws.md index 73ee232b65..333e3276ca 100644 --- a/docs/aws.md +++ b/docs/aws.md @@ -476,42 +476,40 @@ The above snippet defines two volume mounts for the jobs executed in your pipeli ### Troubleshooting -**Problem**: The Pipeline execution terminates with an AWS error message similar to the one shown below: +

+#### Job queue not found
+
-``` -JobQueue not found -``` +**`JobQueue not found`** -Make sure you have defined a AWS region in the Nextflow configuration file and it matches the region in which your Batch environment has been created. +This error occurs when Nextflow cannot locate the specified AWS Batch job queue. It usually happens when the job queue does not exist, is not enabled, or there is a region mismatch between the configuration and the AWS Batch environment. -**Problem**: A process execution fails reporting the following error message: +To resolve this error, ensure you have defined an AWS region in your `nextflow.config` file and that it matches your Batch environment region. -``` -Process terminated for an unknown reason -- Likely it has been terminated by the external system -``` +
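+For example, a minimal AWS Batch configuration might look like the following (the queue name and region values are illustrative and must match your own Batch environment):
+
+```groovy
+process.executor = 'awsbatch'
+process.queue    = 'my-batch-queue'   // illustrative job queue name
+aws.region       = 'eu-west-1'        // must match the region of your Batch environment
+```
+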

+#### Process terminated for an unknown reason
+
-This may happen when Batch is unable to execute the process script. A common cause of this problem is that the Docker container image you have specified uses a non standard [entrypoint](https://docs.docker.com/engine/reference/builder/#entrypoint) which does not allow the execution of the Bash launcher script required by Nextflow to run the job.
+**`Process terminated for an unknown reason -- Likely it has been terminated by the external system`**
-This may also happen if the AWS CLI doesn't run correctly.
+This error typically occurs when AWS Batch is unable to execute the process script. The most common reason is that the specified Docker container image has a non-standard entrypoint that prevents the execution of the Bash launcher script required by Nextflow to run the job. Another possible cause is the AWS CLI failing to run correctly within the job environment.
-Other places to check for error information:
+To resolve this error, ensure that the Docker container image used for the job does not define a custom entrypoint that prevents the Bash launcher script from running, and that the AWS CLI is available in the job environment (see the configuration example below).
-- The `.nextflow.log` file.
-- The Job execution log in the AWS Batch dashboard.
-- The [CloudWatch](https://aws.amazon.com/cloudwatch/) logs found in the `/aws/batch/job` log group.
+Check the following logs for more detailed error information:
-**Problem**: A process execution is stalled in the `RUNNABLE` status and the pipeline output is similar to the one below:
+- The `.nextflow.log` file
+- The Job execution log in the AWS Batch dashboard
+- The CloudWatch logs found in the `/aws/batch/job` log group
+
+
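+If the AWS CLI is provided by a custom AMI rather than by the container image, you can point Nextflow to its location with the `aws.batch.cliPath` setting. The path below is illustrative and depends on where the CLI is installed in your AMI:
+
+```groovy
+// Path to the AWS CLI tool on the custom AMI (illustrative; adjust to your image)
+aws.batch.cliPath = '/home/ec2-user/miniconda/bin/aws'
+```
+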

+#### Process stalled in RUNNABLE status
+
+If a process execution is stalled in the RUNNABLE status, you may see output similar to the following:

```
executor > awsbatch (1)
-process > (1) [  0%] 0 of ....
+process > (1) [  0%] 0 of ....
```

-It may happen that the pipeline execution hangs indefinitely because one of the jobs is held in the queue and never gets executed. In AWS Console, the queue reports the job as `RUNNABLE` but it never moves from there.
-
-There are multiple reasons why this can happen. They are mainly related to the Compute Environment workload/configuration, the docker service or container configuration, network status, etc.
+This error occurs when a job remains stuck in the RUNNABLE state in AWS Batch and never progresses to execution. In the AWS Console, the job will be listed as RUNNABLE indefinitely, indicating that it’s waiting to be scheduled but cannot proceed. The root cause is often related to issues with the Compute Environment, Docker configuration, or network settings.

-This [AWS page](https://aws.amazon.com/premiumsupport/knowledge-center/batch-job-stuck-runnable-status/) provides several resolutions and tips to investigate and work around the issue.
+See [Why is my AWS Batch job stuck in RUNNABLE status?](https://repost.aws/knowledge-center/batch-job-stuck-runnable-status) for several resolutions and tips to investigate this error.

(aws-fargate)=

diff --git a/docs/cache-and-resume.md b/docs/cache-and-resume.md
index 273da8eb3e..7f4fae4ee1 100644
--- a/docs/cache-and-resume.md
+++ b/docs/cache-and-resume.md
@@ -69,67 +69,87 @@ For this reason, it is important to preserve both the task cache (`.nextflow/cac

## Troubleshooting

-Cache failures happen when either (1) a task that was supposed to be cached was re-executed, or (2) a task that was supposed to be re-executed was cached.
+Cache failures occur when a task that was supposed to be cached was re-executed, or a task that was supposed to be re-executed was cached. This page provides an overview of common causes for cache failures and strategies to identify them.

-When this happens, consider the following questions:
+Common causes of cache failures include:

-- Is resume enabled via `-resume`?
-- Is the {ref}`process-cache` directive set to a non-default value?
-- Is the task still present in the task cache and work directory?
-- Were any of the task inputs changed?
+- {ref}`Resume not being enabled <cache-failure-resume>`
+- {ref}`Non-default cache directives <cache-failure-directives>`
+- {ref}`Modified inputs <cache-failure-modified>`
+- {ref}`Inconsistent file attributes <cache-failure-inconsistent>`
+- {ref}`Race condition on a global variable <cache-global-var-race-condition>`
+- {ref}`Non-deterministic process inputs `

-Changing any of the inputs included in the [task hash](#task-hash) will invalidate the cache, for example:
+The causes of these cache failures and the solutions to resolve them are described in detail below.

-- Resuming from a different session ID
-- Changing the process name
-- Changing the task container image or Conda environment
-- Changing the task script
-- Changing an input file or bundled script used by the task
+(cache-failure-resume)=
+
+### Resume not enabled
+
+The `-resume` option is required to resume a pipeline. Ensure `-resume` has been enabled in your run command or your Nextflow configuration file.
+
+(cache-failure-directives)=
+
+### Non-default cache directives
+
+The `cache` directive is enabled by default. However, you can disable or modify its behavior for a specific process. For example:
+
+```nextflow
+process FOO {
+    cache false
+    // ...
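+    // The cache directive also accepts non-default caching modes, for example:
+    //   cache 'lenient'   // hash input files by path and size only
+    //   cache 'deep'      // hash input file content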
+}
+```

-While the following examples would not invalidate the cache:
+Ensure that the `cache` directive has not been set to a non-default value. See {ref}`process-cache` for more information.

-- Changing the value of a directive (other than {ref}`process-ext`), even if that directive is used in the task script
+(cache-failure-modified)=

-In many cases, cache failures happen because of a change to the pipeline script or configuration, or because the pipeline itself has some non-deterministic behavior.
+### Modified inputs

-Here are some common reasons for cache failures:
+Modifying inputs that are used in the task hash will invalidate the cache. Common causes of modified inputs include:

-### Modified input files
+- Changing input files
+- Resuming from a different session ID
+- Changing the process name
+- Changing the task container image or Conda environment
+- Changing the task script
+- Changing a bundled script used by the task

-Make sure that your input files have not been changed. Keep in mind that the default caching mode uses the complete file path, the last modified timestamp, and the file size. If any of these attributes change, the task will be re-executed, even if the file content is unchanged.
+:::{note}
+Changing the value of any directive, except {ref}`process-ext`, will not invalidate the task cache.
+:::

-### Process that modifies its inputs
+A hash for an input file is calculated from the complete file path, the last modified timestamp, and the file size. If any of these attributes change, the task will be re-executed. If a process modifies its own input files, it cannot be resumed. Processes that modify their own input files are considered an anti-pattern and should be avoided.

-If a process modifies its own input files, it cannot be resumed for the reasons described in the previous point. As a result, processes that modify their own input files are considered an anti-pattern and should be avoided.
+(cache-failure-inconsistent)=

### Inconsistent file attributes

-Some shared file systems, such as NFS, may report inconsistent file timestamps, which can invalidate the cache. If you encounter this problem, you can avoid it by using the `'lenient'` {ref}`caching mode <process-cache>`, which ignores the last modified timestamp and uses only the file path and size.
+Some shared file systems, such as NFS, may report inconsistent file timestamps. If you encounter this problem, use the `'lenient'` {ref}`caching mode <process-cache>` to ignore the last modified timestamp and use only the file path and size.

(cache-global-var-race-condition)=

### Race condition on a global variable

-While Nextflow tries to make it easy to write safe concurrent code, it is still possible to create race conditions, which can in turn impact the caching behavior of your pipeline.
-
-Consider the following example:
+Race conditions can disrupt the caching behavior of your pipeline. For example:

```nextflow
channel.of(1,2,3) | map { v -> X=v; X+=2 } | view { v -> "ch1 = $v" }
channel.of(1,2,3) | map { v -> X=v; X*=2 } | view { v -> "ch2 = $v" }
```

-The problem here is that `X` is declared in each `map` closure without the `def` keyword (or other type qualifier). Using the `def` keyword makes the variable local to the enclosing scope; omitting the `def` keyword makes the variable global to the entire script.
-
-Because `X` is global, and operators are executed concurrently, there is a *race condition* on `X`, which means that the emitted values will vary depending on the particular order of the concurrent operations. If the values were passed as inputs into a process, the process would execute different tasks on each run due to the race condition.
+In the above example, `X` is declared in each `map` closure without the `def` keyword or another type qualifier, which makes the variable global to the entire script. Operators are executed concurrently and, because `X` is global, there is a *race condition* that causes the emitted values to vary depending on the order of the concurrent operations. If these values were passed to a process as inputs, the process would execute different tasks on each run due to the race condition.

-The solution is to not use a global variable where a local variable is enough (or in this simple example, avoid the variable altogether):
+To resolve this failure type, ensure the variable is not global by using a local variable:

```nextflow
-// local variable
channel.of(1,2,3) | map { v -> def X=v; X+=2 } | view { v -> "ch1 = $v" }
+```
+
+Alternatively, remove the variable:

-// no variable
+```nextflow
channel.of(1,2,3) | map { v -> v * 2 } | view { v -> "ch2 = $v" }
```

@@ -137,7 +157,7 @@ channel.of(1,2,3) | map { v -> v * 2 } | view { v -> "ch2 = $v" }

### Non-deterministic process inputs

-Sometimes a process needs to merge inputs from different sources. Consider the following example:
+A process that merges inputs from different sources non-deterministically may invalidate the cache. For example:

```nextflow
workflow {
@@ -145,12 +165,10 @@ workflow {
    ch_bar = channel.of( ['2', '2.bar'], ['1', '1.bar'] )
    gather(ch_foo, ch_bar)
}
-
process gather {
    input:
    tuple val(id), file(foo)
    tuple val(id), file(bar)
-
    script:
    """
    merge_command $foo $bar
@@ -158,9 +176,9 @@ process gather {
}
```

-It is tempting to assume that the process inputs will be matched by `id` like the {ref}`operator-join` operator. But in reality, they are simply merged like the {ref}`operator-merge` operator. As a result, not only will the process inputs be incorrect, they will also be non-deterministic, thus invalidating the cache.
+In the above example, the inputs are merged without being matched by `id`, in the same way as the {ref}`operator-merge` operator. As a result, the process inputs are incorrect and non-deterministic, which invalidates the cache.

-The solution is to explicitly join the two channels before the process invocation:
+To resolve this failure type, ensure channels are deterministic by joining them before invoking the process:

```nextflow
workflow {
@@ -168,11 +186,9 @@ workflow {
    ch_bar = channel.of( ['2', '2.bar'], ['1', '1.bar'] )
    gather(ch_foo.join(ch_bar))
}
-
process gather {
    input:
    tuple val(id), file(foo), file(bar)
-
    script:
    """
    merge_command $foo $bar
@@ -180,9 +196,11 @@ process gather {
}
```

+(cache-compare-hashes)=
+
## Tips

-### Resuming from a specific run
+### Resume from a specific run

Nextflow resumes from the previous run by default. If you want to resume from an earlier run, simply specify the session ID for that run with the `-resume` option:

```
nextflow run rnaseq-nf -resume 4dc656d2-c410-44c8-bc32-7dd0ea87bebf
```

@@ -192,28 +210,49 @@
You can use the {ref}`cli-log` command to view all previous runs as well as the task executions for each run.
-(cache-compare-hashes)=
+### Compare task hashes

-### Comparing the hashes of two runs
+By identifying differences between hashes, you can detect changes that may be causing cache failures.

-One way to debug a resumed run is to compare the task hashes of each run using the `-dump-hashes` option.
+To compare the task hashes for a resumed run:

-1. Perform an initial run: `nextflow -log run_initial.log run <pipeline> -dump-hashes`
-2. Perform a resumed run: `nextflow -log run_resumed.log run <pipeline> -dump-hashes -resume`
-3. Extract the task hash lines from each log (search for `cache hash:`)
-4. Compare the runs with a diff viewer
+1. Run your pipeline with the `-log` and `-dump-hashes` options:

-While some manual effort is required, the final diff can often reveal the exact change that caused a task to be re-executed.
+
+    ```bash
+    nextflow -log run_initial.log run <pipeline> -dump-hashes
+    ```
+
+2. Run your pipeline with the `-log`, `-dump-hashes`, and `-resume` options:
+
+    ```bash
+    nextflow -log run_resumed.log run <pipeline> -dump-hashes -resume
+    ```
+
+3. Extract the task hash lines from each log:
+
+    ```bash
+    cat run_initial.log | grep 'INFO.*TaskProcessor.*cache hash' | cut -d ' ' -f 10- | sort | awk '{ print; print ""; }' > run_initial.tasks.log
+    cat run_resumed.log | grep 'INFO.*TaskProcessor.*cache hash' | cut -d ' ' -f 10- | sort | awk '{ print; print ""; }' > run_resumed.tasks.log
+    ```
+
+4. Compare the runs:
+
+    ```bash
+    diff run_initial.tasks.log run_resumed.tasks.log
+    ```
+
+    :::{tip}
+    You can also compare the hash lines using a graphical diff viewer.
+    :::

:::{versionadded} 23.10.0
:::

-When using `-dump-hashes json`, the task hashes can be more easily extracted into a diff. Here is an example Bash script to perform two runs and produce a diff:
+Task hashes can also be extracted into a diff using `-dump-hashes json`. The following is an example Bash script to compare two runs and produce a diff:

```bash
nextflow -log run_1.log run $pipeline -dump-hashes json
nextflow -log run_2.log run $pipeline -dump-hashes json -resume
-
get_hashes() {
    cat $1 \
    | grep 'cache hash:' \
@@ -221,10 +260,8 @@ get_hashes() {
    | sort \
    | awk '{ print; print ""; }'
}
-
get_hashes run_1.log > run_1.tasks.log
get_hashes run_2.log > run_2.tasks.log
-
diff run_1.tasks.log run_2.tasks.log
```

diff --git a/docs/google.md b/docs/google.md
index 421a6c5eb1..5f232841de 100644
--- a/docs/google.md
+++ b/docs/google.md
@@ -289,4 +289,3 @@ Nextflow will automatically manage the transfer of input and output files betwee
 - Compute resources in Google Cloud are subject to [resource quotas](https://cloud.google.com/compute/quotas), which may affect your ability to run pipelines at scale. You can request quota increases, and your quotas may automatically increase over time as you use the platform. In particular, GPU quotas are initially set to 0, so you must explicitly request a quota increase in order to use GPUs. You can initially request an increase to 1 GPU at a time, and after one billing cycle you may be able to increase it further.
 - Currently, it's not possible to specify a disk type different from the default one assigned by the service depending on the chosen instance type.
-

diff --git a/docs/reference/cli.md b/docs/reference/cli.md
index c030df46bb..fb02636d2b 100644
--- a/docs/reference/cli.md
+++ b/docs/reference/cli.md
@@ -1088,7 +1088,7 @@ The `run` command is used to execute a local pipeline script or remote pipeline

`-dump-hashes`
: Dump task hash keys for debugging purposes.
: :::{versionadded} 23.10.0
-  You can use `-dump-hashes json` to dump the task hash keys as JSON for easier post-processing. See the {ref}`caching and resuming tips <cache-compare-hashes>` for more details.
+  You can use `-dump-hashes json` to dump the task hash keys as JSON for easier post-processing. See {ref}`cache-compare-hashes` for more details.
  :::

`-e.<key>=<value>`

diff --git a/docs/vscode.md b/docs/vscode.md
index 80df4e89e1..83a107a0cb 100644
--- a/docs/vscode.md
+++ b/docs/vscode.md
@@ -58,12 +58,58 @@ The extension can generate a workflow DAG that includes the workflow inputs, out

To preview the DAG of a workflow, select the **Preview DAG** CodeLens above the workflow definition.

+:::{note}
+The **Preview DAG** CodeLens is only available when the script does not contain any errors.
+:::
+
## Troubleshooting

-In the event of a language server error, you can use the **Nextflow: Restart language server** command in the command palette to restart the language server.
+### Stop and restart
+
+In the event of an error, stop or restart the language server from the Command Palette. The following stop and restart commands are available:
+
+- `Nextflow: Stop language server`
+- `Nextflow: Restart language server`
+
+### View logs
+
+Error logs can be useful for troubleshooting.
+
+To view logs in VS Code:
+
+1. Open the **Output** tab in your console.
+2. Select **Nextflow Language Server** from the dropdown.
+
+To show additional log messages in VS Code:
+
+1. Open the **Extensions** view in the left-hand menu.
+2. Select the **Nextflow** extension.
+3. Select the **Manage** icon.
+4. Enable **Nextflow > Debug** in the extension settings.
+
+### Common errors
+

+#### Filesystem changes
+
+The language server does not detect certain filesystem changes, such as switching the current Git branch.
+
+To resolve this issue, restart the language server from the Command Palette to sync it with your workspace. See [Stop and restart](#stop-and-restart) for more information.
+

+#### Third-party plugins
+
+The language server does not recognize configuration options from third-party plugins and will report unrecognized config option warnings for them. There is currently no way to suppress these warnings.
+
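+For example, a configuration such as the following (the plugin and option names here are hypothetical) is valid at runtime but may still be flagged by the language server:
+
+```groovy
+plugins {
+    id 'nf-example'             // hypothetical third-party plugin
+}
+
+// hypothetical option provided by the plugin; may be reported as an unrecognized config option
+example.greeting = 'Hello'
+```
+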

+#### Groovy scripts
+
+The language server provides limited support for Groovy scripts in the `lib` directory. Errors in Groovy scripts are not reported as diagnostics, and changing a Groovy script does not automatically re-compile the Nextflow scripts that reference it.
+
+To resolve this issue, edit or close and re-open the Nextflow script to refresh the diagnostics.
+
+### Reporting issues
+
+Report issues at [nextflow-io/vscode-language-nextflow](https://github.com/nextflow-io/vscode-language-nextflow) or [nextflow-io/language-server](https://github.com/nextflow-io/language-server). When reporting issues, include a minimal code snippet that reproduces the issue and any error logs from the server.
+

## Limitations

 - The language server does not detect certain filesystem changes, such as changing the current Git branch. Restart the language server from the command palette to sync it with your workspace.