diff --git a/docs/cache-and-resume.md b/docs/cache-and-resume.md index 7184909aa2..556fed62eb 100644 --- a/docs/cache-and-resume.md +++ b/docs/cache-and-resume.md @@ -26,7 +26,7 @@ The task hash is computed from the following metadata: - Task {ref}`inputs ` - Task {ref}`script ` - Any global variables referenced in the task script -- Any {ref}`bundled scripts ` used in the task script +- Any {ref}`bundled scripts ` used in the task script - Whether the task is a {ref}`stub run ` - Task attempt diff --git a/docs/index.md b/docs/index.md index f1585124d4..4b2d60d027 100644 --- a/docs/index.md +++ b/docs/index.md @@ -77,6 +77,7 @@ workflow module notifications secrets +structure sharing vscode dsl1 diff --git a/docs/module.md b/docs/module.md index c9dae998ff..e8f900fb65 100644 --- a/docs/module.md +++ b/docs/module.md @@ -184,41 +184,25 @@ Ciao world! ## Module templates -Process script {ref}`templates ` can be included alongside a module in the `templates` directory. - -For example, suppose we have a project L with a module that defines two processes, P1 and P2, both of which use templates. The template files can be made available in the local `templates` directory: +Template files can be stored in the `templates` directory alongside a module. ``` -Project L -|── myModules.nf -└── templates - |── P1-template.sh - └── P2-template.sh +Project A +├── main.nf +└── modules + └── sayhello + ├── sayhello.nf + └── templates + └── sayhello.sh ``` -Then, we have a second project A with a workflow that includes P1 and P2: - -``` -Pipeline A -└── main.nf -``` +Template files can be invoked like regular scripts from a process in your pipeline using the `template` function. Variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template file is executed by Nextflow. -Finally, we have a third project B with a workflow that also includes P1 and P2: +See {ref}`process-template` for more information about utilizing template files. 
-``` -Pipeline B -└── main.nf -``` +Storing template files with the module that utilizes them encourages sharing of modules across pipelines. For example, future projects would be able to include the module from above by cloning the modules directory and including the module without needing to modify the process or template. -With the possibility to keep the template files inside the project L, A and B can use the modules defined in L without any changes. A future project C would do the same, just cloning L (if not available on the system) and including its module. - -Beside promoting the sharing of modules across pipelines, there are several advantages to keeping the module template under the script path: - -1. Modules are self-contained -2. Modules can be tested independently from the pipeline(s) that import them -3. Modules can be made into libraries - -Having multiple template locations enables a structured project organization. If a project has several modules, and they all use templates, the project could group module scripts and their templates as needed. For example: +Beyond facilitating module sharing across pipelines, organizing template locations allows for a well-structured project. For example, complex projects with multiple modules that rely on templates can be organized into logical groups: ``` baseDir @@ -240,10 +224,11 @@ baseDir |── mymodules6.nf └── templates |── P5-template.sh - |── P6-template.sh + └── P6-template.sh ``` +Template files can also be stored in the project `templates` directory. See {ref}`structure-templates` for more information about the project directory structure. + (module-binaries)= ## Module binaries :::{versionadded} 22.10.0 ::: -Modules can define binary scripts that are locally scoped to the processes defined by the tasks. +Modules can define binary scripts that are locally scoped to the processes. 
-To enable this feature, set the following flag in your pipeline script or configuration file: - -```nextflow -nextflow.enable.moduleBinaries = true -``` - -The binary scripts must be placed in the module directory names `/resources/usr/bin`: +Binary scripts must be placed in the module directory named `/resources/usr/bin`. For example: ``` @@ -267,16 +246,23 @@ The binary scripts must be placed in the module directory names `/re └── resources └── usr └── bin - |── your-module-script1.sh - └── another-module-script2.py + └── script.py ``` -Those scripts will be made accessible like any other command in the task environment, provided they have been granted the Linux execute permissions. +Binary scripts can be invoked like regular commands from the locally scoped module without modifying the `PATH` environment variable or using an absolute path. Each script should include a shebang to specify the interpreter and inputs should be supplied as arguments. See {ref}`structure-bin` for more information about custom scripts in `bin` directories. + +To use this feature, module binaries must be enabled in your pipeline script or configuration file: + +```nextflow +nextflow.enable.moduleBinaries = true +``` :::{note} -This feature requires the use of a local or shared file system for the pipeline work directory, or {ref}`wave-page` when using cloud-based executors. +Module binary scripts require a local or shared file system for the pipeline work directory or {ref}`wave-page` when using cloud-based executors. ::: +Scripts can also be stored in the project-level `bin` directory. See {ref}`structure-bin` for more information. + ## Sharing modules Modules are designed to be easy to share and re-use across different pipelines, which helps eliminate duplicate work and spread improvements throughout the community. 
While Nextflow does not provide an explicit mechanism for sharing modules, there are several ways to do it: diff --git a/docs/process.md b/docs/process.md index 871ecea093..d8eea5131f 100644 --- a/docs/process.md +++ b/docs/process.md @@ -24,11 +24,11 @@ See {ref}`syntax-process` for a full description of the process syntax. ## Script -The `script` block defines, as a string expression, the script that is executed by the process. +The `script` block defines the string expression that is executed by the process. -A process may contain only one script, and if the `script` guard is not explicitly declared, the script must be the final statement in the process block. +The process can contain only one script block. If the `script` guard is not explicitly declared it must be the final statement in the process block. -The script string is executed as a [Bash]() script in the host environment. It can be any command or script that you would normally execute on the command line or in a Bash script. Naturally, the script may only use commands that are available in the host environment. +The script string is executed as a [Bash]() script in the host environment. It can be any command or script that you would execute on the command line or in a Bash script and can only use commands that are available in the host environment. The script block can be a simple string or a multi-line string. The latter approach makes it easier to write scripts with multiple commands spanning multiple lines. For example: @@ -42,19 +42,17 @@ process doMoreThings { } ``` -As explained in the script tutorial section, strings can be defined using single-quotes or double-quotes, and multi-line strings are defined by three single-quote or three double-quote characters. +Strings can be defined using single-quotes or double-quotes. Multi-line strings are defined by three single-quote or three double-quote characters. -There is a subtle but important difference between them. 
Like in Bash, strings delimited by a `"` character support variable substitutions, while strings delimited by `'` do not. +There is a subtle but important difference between single-quote (`'`) and double-quote (`"`) characters. Like in Bash, strings delimited by the `"` character support variable substitutions, while strings delimited by `'` do not. -In the above code fragment, the `$db` variable is replaced by the actual value defined elsewhere in the pipeline script. +For example, in the above code fragment, the `$db` variable is replaced by the actual value defined elsewhere in the pipeline script. :::{warning} -Since Nextflow uses the same Bash syntax for variable substitutions in strings, you must manage them carefully depending on whether you want to evaluate a *Nextflow* variable or a *Bash* variable. +Nextflow uses the same Bash syntax for variable substitutions in strings. You must manage them carefully depending on whether you want to evaluate a *Nextflow* variable or a *Bash* variable. ::: -When you need to access a system environment variable in your script, you have two options. - -If you don't need to access any Nextflow variables, you can define your script block with single-quotes: +System environment variables and Nextflow variables can be accessed by your script. If you don't need to access any Nextflow variables, you can define your script block with single-quotes and use the dollar character (`$`) to access system environment variables. For example: ```nextflow process printPath { @@ -64,7 +62,7 @@ } ``` -Otherwise, you can define your script with double-quotes and escape the system environment variables by prefixing them with a back-slash `\` character, as shown in the following example: +Otherwise, you can define your script with double-quotes and escape the system environment variables by prefixing them with a back-slash `\` character. 
For example: ```nextflow process doOtherThings { @@ -76,26 +74,22 @@ process doOtherThings { } ``` -In this example, `$MAX` is a Nextflow variable that must be defined elsewhere in the pipeline script. Nextflow replaces it with the actual value before executing the script. Meanwhile, `$DB` is a Bash variable that must exist in the execution environment, and Bash will replace it with the actual value during execution. - -:::{tip} -Alternatively, you can use the {ref}`process-shell` block definition, which allows a script to contain both Bash and Nextflow variables without having to escape the first. -::: +In this example, `$MAX` is a Nextflow variable that is defined elsewhere in the pipeline script. Nextflow replaces it with the actual value before executing the script. In contrast, `$DB` is a Bash variable that must exist in the execution environment. Bash will replace it with the actual value during execution. ### Scripts *à la carte* -The process script is interpreted by Nextflow as a Bash script by default, but you are not limited to Bash. +The process script is interpreted as Bash by default. -You can use your favourite scripting language (Perl, Python, R, etc), or even mix them in the same pipeline. +However, you can use your favorite scripting language (Perl, Python, R, etc) for each process. You can also mix languages in the same pipeline. -A pipeline may be composed of processes that execute very different tasks. With Nextflow, you can choose the scripting language that best fits the task performed by a given process. For example, for some processes R might be more useful than Perl, whereas for others you may need to use Python because it provides better access to a library or an API, etc. +A pipeline may be composed of processes that execute very different tasks. You can choose the scripting language that best fits the task performed by a given process. 
For example, R might be more useful than Perl for some processes, whereas for others you may need to use Python because it provides better access to a library or an API. -To use a language other than Bash, simply start your process script with the corresponding [shebang](). For example: +To use a language other than Bash, start your process script with the corresponding [shebang](). For example: ```nextflow process perlTask { """ - #!/usr/bin/perl + #!/usr/bin/env perl print 'Hi there!' . '\n'; """ @@ -103,7 +97,7 @@ process perlTask { process pythonTask { """ - #!/usr/bin/python + #!/usr/bin/env python x = 'Hello' y = 'world!' @@ -118,12 +112,12 @@ workflow { ``` :::{tip} -Since the actual location of the interpreter binary file can differ across platforms, it is wise to use the `env` command followed by the interpreter name, e.g. `#!/usr/bin/env perl`, instead of the absolute path, in order to make your script more portable. +Use `env` to resolve the interpreter's location instead of hard-coding the interpreter path. ::: ### Conditional scripts -The `script` block is like a function that returns a string. This means that you can write arbitrary code to determine the script, as long as the final statement is a string. +The `script` block is like a function that returns a string. You can write arbitrary code to determine the script as long as the final statement is a string. If-else statements based on task inputs can be used to produce a different script. For example: @@ -155,57 +149,58 @@ process align { } ``` -In the above example, the process will execute one of several scripts depending on the value of the `mode` parameter. By default it will execute the `tcoffee` command. +In the above example, the process will execute one of several scripts depending on the value of the `mode` parameter. By default, the process will execute the `tcoffee` command. 
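Because a script block is simply an expression that returns a string, the branch selection above can be exercised outside Nextflow. A minimal Python sketch of the same idea (the `make_script` helper and the non-default command string are illustrative, not taken from the example above):

```python
# Sketch: a conditional script block is a function of its inputs that
# returns a different command string per branch.
def make_script(mode: str) -> str:
    if mode == "tcoffee":
        # Default branch, mirroring the docs example
        return "t_coffee -in sequences.fasta"
    # Hypothetical fallback for any other aligner mode
    return f"aligner --mode {mode} -in sequences.fasta"

print(make_script("tcoffee"))  # t_coffee -in sequences.fasta
```

Whichever branch is taken, the process ultimately receives a single string to execute as its task script.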
(process-template)= -### Template +### Template files -Process scripts can be externalized to **template** files, which allows them to be reused across different processes and tested independently from the pipeline execution. +Process scripts can be externalized to **template** files and reused across multiple processes. Template files can be stored in the project or module `templates` directory. See {ref}`structure-templates` and {ref}`module-templates` for more information about directory structures. -A template can be used in place of an embedded script using the `template` function in the script section: +In template files, variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template script is executed by Nextflow. -```nextflow -process templateExample { +``` +#!/usr/bin/env bash + +echo "Hello ${x}" +``` + +Template files can be invoked like regular scripts from any process in your pipeline using the `template` function. + +``` +process sayHello { + input: - val STR + val x + + output: + stdout script: - template 'my_script.sh' + template 'sayhello.sh' } workflow { - Channel.of('this', 'that') | templateExample + Channel.of("Foo") | sayHello | view } ``` -By default, Nextflow looks for the template script in the `templates` directory located alongside the Nextflow script in which the process is defined. An absolute path can be used to specify a different location. However, this practice is discouraged because it hinders pipeline portability. +All template variables must be defined. The pipeline will fail if a template variable is missing, regardless of where it occurs in the template. -An example template script is provided below: +Templates can be tested independently of pipeline execution by providing each input as an environment variable. 
For example: ```bash -#!/bin/bash -echo "process started at `date`" -echo $STR -echo "process completed" +x='foo' bash templates/sayhello.sh ``` -Variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template script is executed by Nextflow and Bash variables when executed directly. For example, the above script can be executed from the command line by providing each input as an environment variable: - -```bash -STR='foo' bash templates/my_script.sh -``` +Template scripts are only recommended for Bash scripts. Languages that do not prefix variables with `$` (e.g., Python and R) can't be executed directly as a template script from the command line as variables prefixed with `$` are interpreted as Bash variables. Similarly, template variables escaped with `\$` will be interpreted as Bash variables when executed by Nextflow, but not when the script is run directly from the command line. -The following caveats should be considered: - -- Template scripts are recommended only for Bash scripts. Languages that do not prefix variables with `$` (e.g. Python and R) can't be executed directly as a template script. - -- Variables escaped with `\$` will be interpreted as Bash variables when executed by Nextflow, but will not be interpreted as variables when executed from the command line. This practice should be avoided to ensure that the template script behaves consistently. - -- Template variables are evaluated even if they are commented out in the template script. If a template variable is missing, it will cause the pipeline to fail regardless of where it occurs in the template. +:::{warning} +Template variables are evaluated even if they are commented out in the template script. +::: :::{tip} -Template scripts are generally discouraged due to the caveats described above. The best practice for using a custom script is to embed it in the process definition at first and move it to a separate file with its own command line interface once the code matures. 
+The best practice for using a custom script is to first embed it in the process definition and transfer it to a separate file with its own command line interface once the code matures. ::: (process-shell)= diff --git a/docs/sharing.md b/docs/sharing.md index 61a840e9f8..d1c1755779 100644 --- a/docs/sharing.md +++ b/docs/sharing.md @@ -93,32 +93,6 @@ Read the {ref}`container-page` page to learn more about how to use containers wi For maximal reproducibility, make sure to define a specific version for each tool. Otherwise, your pipeline might use different versions across subsequent runs, which can introduce subtle differences to your results. ::: -(bundling-executables)= - -#### The `bin` directory - -As for custom scripts, you can include executable scripts in the `bin` directory of your pipeline repository. When configured correctly, these scripts can be executed like a regular command from any process script (i.e. without modifying the `PATH` environment variable or using an absolute path), and changing the script will cause the task to be re-executed on a resumed run (i.e. just like changing the process script itself). - -To configure a custom script: - -1. Save the script in the `bin` directory (relative to the pipeline repository root). -2. Specify a portable shebang (see note below for details). -3. Make the script executable. For example: `chmod a+x bin/my_script.py` - -:::{tip} -To maximize the portability of your bundled script, use `env` to dynamically resolve the location of the interpreter instead of hard-coding it in the shebang line. - -For example, shebang definitions `#!/usr/bin/python` and `#!/usr/local/bin/python` both hard-code specific paths to the Python interpreter. Instead, the following approach is more portable: - -```bash -#!/usr/bin/env python -``` -::: - -#### The `lib` directory - -Any Groovy scripts or JAR files in the `lib` directory will be automatically loaded and made available to your pipeline scripts. 
The `lib` directory is a useful way to provide utility code or external libraries without cluttering the pipeline scripts. - ### Data In general, input data should be provided by external sources using parameters which can be controlled by the user. This way, a pipeline can be easily reused to process different datasets which are appropriate for the pipeline. diff --git a/docs/structure.md b/docs/structure.md new file mode 100644 index 0000000000..a026e1919c --- /dev/null +++ b/docs/structure.md @@ -0,0 +1,156 @@ +(structure-page)= + +# Project structure + +(structure-templates)= + +## The `templates` directory + +The `templates` directory in the Nextflow project root can be used to store template files. + +``` +├── templates +│ └── sayhello.sh +└── main.nf +``` + +Template files can be invoked like regular scripts from any process in your pipeline using the `template` function. Variables prefixed with the dollar character (`$`) are interpreted as Nextflow variables when the template file is executed by Nextflow. + +See {ref}`process-template` for more information about utilizing template files. + +(structure-bin)= + +## The `bin` directory + +The `bin` directory in the Nextflow project root can be used to store executable scripts. + +``` +├── bin +│ └── sayhello.py +└── main.nf +``` + +The `bin` directory allows binary scripts to be invoked like regular commands from any process in your pipeline without using an absolute path or modifying the `PATH` environment variable. Each script should include a shebang to specify the interpreter and inputs should be supplied as arguments to the executable. 
For example: ```python #!/usr/bin/env python import argparse def main(): parser = argparse.ArgumentParser(description="A simple argparse example.") parser.add_argument("--name", type=str, help="Person to greet.") args = parser.parse_args() print(f"Hello {args.name}!") if __name__ == "__main__": main() ``` :::{tip} Use `env` to resolve the interpreter's location instead of hard-coding the interpreter path. ::: Binary scripts placed in the `bin` directory must have executable permissions. Use `chmod` to grant the required permissions. For example: ``` chmod a+x bin/sayhello.py ``` Binary scripts in the `bin` directory can then be invoked like regular commands. ``` process sayHello { input: val x output: stdout script: """ sayhello.py --name $x """ } workflow { Channel.of("Foo") | sayHello | view } ``` Like modifying a process script, modifying the binary script will cause the task to be re-executed on a resumed run. :::{note} Binary scripts require a local or shared file system for the pipeline work directory or {ref}`wave-page` when using cloud-based executors. ::: :::{warning} When using containers and the Wave service, Nextflow will send the project-level `bin` directory to the Wave service for inclusion as a layer in the container. Any changes to scripts in the `bin` directory will change the layer md5sum and the hash for the final container. The container identity is a component of the task hash calculation and will force re-calculation of all tasks in the workflow. When using the Wave service, use module-specific bin directories instead. See {ref}`module-binaries` for more information. ::: ## The `lib` directory The `lib` directory can be used to add utility code or external libraries without cluttering the pipeline scripts. The `lib` directory in the Nextflow project root is added to the classpath by default. 
+ ``` ├── lib │ └── DNASequence.groovy └── main.nf ``` Classes or packages defined in the `lib` directory will be available in the execution context. Scripts or functions defined outside of classes will not be available in the execution context. For example, `lib/DNASequence.groovy` defines the `DNASequence` class: ```groovy // lib/DNASequence.groovy class DNASequence { String sequence // Constructor DNASequence(String sequence) { this.sequence = sequence.toUpperCase() // Ensure sequence is in uppercase for consistency } // Method to calculate melting temperature using the Wallace rule double getMeltingTemperature() { int g_count = sequence.count('G') int c_count = sequence.count('C') int a_count = sequence.count('A') int t_count = sequence.count('T') // Wallace rule calculation double tm = 4 * (g_count + c_count) + 2 * (a_count + t_count) return tm } String toString() { return "DNA[$sequence]" } } ``` The `DNASequence` class is available in the execution context: ```nextflow // main.nf workflow { Channel.of('ACGTTGCAATGCCGTA', 'GCGTACGGTACGTTAC') .map { seq -> new DNASequence(seq) } .view { dna -> def meltTemp = dna.getMeltingTemperature() "Found sequence '$dna' with melting temperature ${meltTemp}°C" } } ``` It returns: ``` Found sequence 'DNA[ACGTTGCAATGCCGTA]' with melting temperature 48.0°C Found sequence 'DNA[GCGTACGGTACGTTAC]' with melting temperature 50.0°C ```
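The melting temperatures in the output above follow directly from the Wallace rule used by `getMeltingTemperature()`, Tm = 4(G + C) + 2(A + T). The arithmetic can be sanity-checked outside Nextflow with a standalone Python sketch (not part of the pipeline):

```python
# Wallace rule check: Tm = 4 * (G + C) + 2 * (A + T),
# with base counts taken from the uppercased sequence.
def wallace_tm(sequence: str) -> float:
    seq = sequence.upper()
    gc = seq.count('G') + seq.count('C')
    at = seq.count('A') + seq.count('T')
    return float(4 * gc + 2 * at)

print(wallace_tm('ACGTTGCAATGCCGTA'))  # 48.0
print(wallace_tm('GCGTACGGTACGTTAC'))  # 50.0
```

Both values match the pipeline output shown above.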