diff --git a/docs/nextflow_run/01_orientation.md b/docs/nextflow_run/00_orientation.md similarity index 87% rename from docs/nextflow_run/01_orientation.md rename to docs/nextflow_run/00_orientation.md index 1be4247f1..a5aef8fd4 100644 --- a/docs/nextflow_run/01_orientation.md +++ b/docs/nextflow_run/00_orientation.md @@ -7,7 +7,7 @@ If you have not yet done so, please follow [this link](../../envsetup/) before g ## Materials provided -Throughout this training course, we'll be working in the `run-nextflow/` directory. +Throughout this training course, we'll be working in the `nextflow-run/` directory. This directory contains all the code files, test data and accessory files you will need. Feel free to explore the contents of this directory; the easiest way to do so is to use the file explorer on the left-hand side of the GitHub Codespaces workspace. @@ -20,7 +20,7 @@ Here we generate a table of contents to the second level down: tree . -L 2 ``` -If you run this inside `run-nextflow`, you should see the following output: [TODO] +If you run this inside `nextflow-run`, you should see the following output: [TODO] ```console title="Directory contents" . @@ -33,14 +33,14 @@ If you run this inside `run-nextflow`, you should see the following output: [TOD **Here's a summary of what you should know to get started:** -[TODO] +TODO: update when content is final !!!tip If for whatever reason you move out of this directory, you can always run this command to return to it: ```bash - cd /workspaces/training/run-nextflow + cd /workspaces/training/nextflow-run ``` Now, to begin the course, click on the arrow in the bottom right corner of this page. diff --git a/docs/nextflow_run/01_basics.md b/docs/nextflow_run/01_basics.md new file mode 100644 index 000000000..96aebe8bd --- /dev/null +++ b/docs/nextflow_run/01_basics.md @@ -0,0 +1,511 @@ +# Part 1: Run basic operations + +In this first part of the Nextflow Run training course, we ease into the topic with a very basic domain-agnostic Hello World example, which we'll use to demonstrate essential operations and point out the corresponding Nextflow code components. + +!!! note + + A "Hello World!" is a minimalist example that is meant to demonstrate the basic syntax and structure of a programming language or software framework. + The example typically consists of printing the phrase "Hello, World!" to the output device, such as the console or terminal, or writing it to a file. + +## 0. Warmup: Run Hello World directly + +Let's demonstrate this with a simple command that we run directly in the terminal, to show what it does before we wrap it in Nextflow. + +!!! tip + + Remember that you should now be inside the `hello-nextflow/` directory as described in the Orientation. + +### 0.1. Make the terminal say hello + +```bash +echo 'Hello World!' +``` + +This outputs the text 'Hello World' to the terminal. + +```console title="Output" +Hello World! +``` + +### 0.2. Now make it write the text output to a file + +```bash +echo 'Hello World!' > output.txt +``` + +This does not output anything to the terminal. + +```console title="Output" + +``` + +### 0.3. Show the file contents + +```bash +cat output.txt +``` + +The text 'Hello World' is now in the output file we specified. + +```console title="output.txt" linenums="1" +Hello World! +``` + +!!! tip + + In the training environment, you can also find the output file in the file explorer, and view its contents by clicking on it. Alternatively, you can use the `code` command to open the file for viewing. 
+
+    ```bash
+    code output.txt
+    ```
+
+### Takeaway
+
+You now know how to run a simple command in the terminal that outputs some text, and optionally, how to make it write the output to a file.
+
+### What's next?
+
+Find out what it takes to run a Nextflow workflow that achieves the same result.
+
+---
+
+## 1. Run the workflow
+
+We provide you with a workflow script named `hello-world.nf` that produces a text file containing the greeting 'Hello World!'.
+We're not going to look at the code yet; first let's see what it looks like to run it.
+
+### 1.1. Launch the workflow and monitor execution
+
+In the terminal, run the following command:
+
+```bash
+nextflow run hello-world.nf
+```
+
+Your console output should look something like this:
+
+```console title="Output" linenums="1"
+ N E X T F L O W ~ version 24.10.0
+
+Launching `hello-world.nf` [goofy_torvalds] DSL2 - revision: c33d41f479
+
+executor > local (1)
+[a3/7be2fa] sayHello | 1 of 1 ✔
+```
+
+Congratulations, you just ran your first Nextflow workflow!
+
+The most important output here is the last line (line 6):
+
+```console title="Output" linenums="6"
+[a3/7be2fa] sayHello | 1 of 1 ✔
+```
+
+This tells us that the `sayHello` process was successfully executed once (`1 of 1 ✔`).
+
+Importantly, this line also tells you where to find the output of the `sayHello` process call.
+Let's look at that now.
+
+### 1.2. Find the output and logs in the `work` directory
+
+When you run Nextflow for the first time in a given directory, it creates a directory called `work` where it will write all files (and any symlinks) generated in the course of execution.
+
+Within the `work` directory, Nextflow organizes outputs and logs per process call.
+For each process call, Nextflow creates a nested subdirectory, named with a hash in order to make it unique, where it will stage all necessary inputs (using symlinks by default), write helper files, and write out logs and any outputs of the process.
+
+The path to that subdirectory is shown in truncated form in square brackets in the console output.
+Looking at what we got for the run shown above, the console log line for the `sayHello` process starts with `[a3/7be2fa]`. That corresponds to the following directory path: `work/a3/7be2fad5e71e5f49998f795677fd68`.
+
+Let's take a look at what's in there.
+
+!!! tip
+
+    If you browse the contents of the task subdirectory in the VSCode file explorer, you'll see all the files right away.
+    However, the log files are set to be invisible in the terminal, so if you want to use `ls` or `tree` to view them, you'll need to set the relevant option for displaying invisible files.
+ + ```bash + tree -a work + ``` + +You should see something like this, though the exact subdirectory names will be different on your system: + +```console title="Directory contents" +work +└── a3 + └── 7be2fad5e71e5f49998f795677fd68 + ├── .command.begin + ├── .command.err + ├── .command.log + ├── .command.out + ├── .command.run + ├── .command.sh + ├── .exitcode + └── output.txt +``` + +These are the helper and log files: + +- **`.command.begin`**: Metadata related to the beginning of the execution of the process call +- **`.command.err`**: Error messages (`stderr`) emitted by the process call +- **`.command.log`**: Complete log output emitted by the process call +- **`.command.out`**: Regular output (`stdout`) by the process call +- **`.command.run`**: Full script run by Nextflow to execute the process call +- **`.command.sh`**: The command that was actually run by the process call +- **`.exitcode`**: The exit code resulting from the command + +The `.command.sh` file is especially useful because it tells you what command Nextflow actually executed. +In this case it's very straightforward, but later in the course you'll see commands that involve some interpolation of variables. +When you're dealing with that, you need to be able to check exactly what was run, especially when troubleshooting an issue. + +The actual output of the `sayHello` process is `output.txt`. +Open it and you will find the `Hello World!` greeting, which was the expected result of our minimalist workflow. + +```console title="output.txt" linenums="1" +Hello World! +``` + +### Takeaway + +You know how to run a simple Nextflow script, monitor its execution and find its outputs. + +### What's next? + +Learn how to read a basic Nextflow script and identify how its components relate to its functionality. + +--- + +## 2. Examine the Hello World workflow starter script + +What we did there was basically treating the workflow script like a black box. +Now that we've seen what it does, let's open the box and look inside at how the code is organized. + +### 2.1. Examine the overall code structure + +Let's open the `hello-world.nf` script in the editor pane. + +!!! note + + The file is in the `hello-nextflow` directory, which should be your current working directory. + You can either click on the file in the file explorer, or type `ls` in the terminal and Cmd+Click (MacOS) or Ctrl+Click (PC) on the file to open it. + +```groovy title="hello-world.nf" linenums="1" +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + output: + path 'output.txt' + + script: + """ + echo 'Hello World!' > output.txt + """ +} + +workflow { + + // emit a greeting + sayHello() +} +``` + +As you can see, a Nextflow script involves two main types of core components: one or more **processes**, and the **workflow** itself. +Each **process** describes what operation(s) the corresponding step in the pipeline should accomplish, while the **workflow** describes the dataflow logic that connects the various steps. + +Let's take a closer look at the **process** block first, then we'll look at the **workflow** block. + +### 2.2. The `process` definition + +The first block of code describes a **process**. +The process definition starts with the keyword `process`, followed by the process name and finally the process body delimited by curly braces. +The process body must contain a script block which specifies the command to run, which can be anything you would be able to run in a command line terminal. 
+ +Here we have a **process** called `sayHello` that writes its **output** to a file named `output.txt`. + +```groovy title="hello-world.nf" linenums="3" +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + output: + path 'output.txt' + + script: + """ + echo 'Hello World!' > output.txt + """ +} +``` + +This is a very minimal process definition that just contains an `output` definition and the `script` to execute. + +The `output` definition includes the `path` qualifier, which tells Nextflow this should be handled as a path (includes both directory paths and files). +Another common qualifier is `val`. + +!!! note + + The output definition does not _determine_ what output will be created. + It simply _declares_ what is the expected output, so that Nextflow can look for it once execution is complete. + + This is necessary for verifying that the command was executed successfully and for passing the output to downstream processes if needed. + Output produced that doesn't match what is declared in the output block will not be passed to downstream processes. + +In a real-world pipeline, a process usually contains additional blocks such as directives and inputs, which we'll introduce in a little bit. + +### 2.3. The `workflow` definition + +The second block of code describes the **workflow** itself. +The workflow definition starts with the keyword `workflow`, followed by an optional name, then the workflow body delimited by curly braces. + +Here we have a **workflow** that consists of one call to the `sayHello` process. + +```groovy title="hello-world.nf" linenums="17" +workflow { + + // emit a greeting + sayHello() +} +``` + +This is a very minimal **workflow** definition. +In a real-world pipeline, the workflow typically contains multiple calls to **processes** connected by **channels**, and the processes expect one or more variable **input(s)**. + +We'll look into that next. + +### Takeaway + +You now know how a simple Nextflow workflow is structured and how the basic components relate to its functionality. + +### What's next? + +Learn to recognize and utilize two more key features of real-world pipelines: inputs parameters and the `publishDir` directive, which provide flexibility for managing inputs and outputs, respectively. + +--- + +## 3. A more flexible Hello World + +An important requirement of real-world pipelines is to be able to feed inputs to the workflow from the command-line, and be able to retrieve outputs efficiently. + +Let's look at a slightly upgraded version of our Hello World workflow called `hello-world-plus.nf` that accepts an arbitrary greeting string from the command-line and writes its output to a more easily accessible directory. + +### 3.1. Examine the code of the upgraded workflow + +This time we're going to look at the code _before_ we run it. +As you can see, we've highlighted the differences compared to the previous version. + +```groovy title="hello-world-plus.nf" linenums="1" +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path 'output.txt' + + script: + """ + echo '$greeting' > output.txt + """ +} + +/* + * Pipeline parameters + */ +params.greeting = 'Holà mundo!' + +workflow { + + // emit a greeting + sayHello(params.greeting) +} +``` + +That may seem like a lot, so let's break it down. + +#### 3.1.1. 
Variable inputs
+
+First let's look at the components that allow us to pass an input from the command line.
+
+In the process definition, we now have an input block that specifies a value called `greeting`.
+
+```groovy title="hello-world-plus.nf" linenums="10"
+    input:
+        val greeting
+```
+
+That tells Nextflow that the `sayHello()` process now expects an input value.
+
+Below that, we have a `params` definition that specifies a command-line parameter called `greeting`, with the default value set to `Holà mundo!`:
+
+```groovy title="hello-world-plus.nf" linenums="25"
+params.greeting = 'Holà mundo!'
+```
+
+And finally, bringing it all together in the `workflow` block, we are now giving the `sayHello()` process call an input, which is the `greeting` parameter defined above:
+
+```groovy title="hello-world-plus.nf" linenums="29"
+    // emit a greeting
+    sayHello(params.greeting)
+```
+
+This means we'll be able to set a greeting from the command line using `--greeting` as the parameter name.
+You'll see that in action in a minute.
+
+#### 3.1.2. Conveniently accessible outputs
+
+The other notable addition here is just one line, but it's an important one:
+
+```groovy title="hello-world-plus.nf" linenums="8"
+    publishDir 'results', mode: 'copy'
+```
+
+This is a directive that tells Nextflow to write a copy of the output to the specified directory.
+Here we've called it `results` but you can call it anything you want.
+
+It is possible to use a symbolic link instead of copying the file; this will be discussed later.
+
+### 3.2. Run the upgraded workflow
+
+Let's see that in action!
+In your terminal, run the following command:
+
+```bash
+nextflow run hello-world-plus.nf --greeting 'Bonjour le monde'
+```
+
+Your console output should look something like this:
+
+```console title="Output" linenums="1"
+ N E X T F L O W ~ version 24.10.0
+
+Launching `hello-world-plus.nf` [goofy_torvalds] DSL2 - revision: c33d41f479
+
+executor > local (1)
+[a3/7be2fa] sayHello | 1 of 1 ✔
+```
+
+You should see a new directory called `results` appear.
+Look inside and you will find your `output.txt` file.
+The contents should match the string you specified on the command line.
+If you try running this again without specifying the `--greeting` parameter, the output should match the default value specified in the workflow script.
+
+In either case, it should match the output that is produced in the work subdirectory.
+This is how we conveniently publish results files outside of the working directories.
+
+### Takeaway
+
+You now know how input parameters and the `publishDir` directive provide flexibility for managing inputs and outputs.
+
+### What's next?
+
+Learn to manage your workflow executions conveniently.
+
+---
+
+## 4. Manage workflow executions
+
+Knowing how to launch workflows and retrieve outputs is great, but you'll quickly find there are a few other aspects of workflow management that will make your life easier.
+
+Here we show you how to take advantage of the `-resume` feature for when you need to re-launch the same workflow, and how to delete older work directories with `nextflow clean`.
+
+### 4.1. Re-launch a workflow with `-resume`
+
+Sometimes, you're going to want to re-run a pipeline that you've already launched previously without redoing any steps that already completed successfully.
+
+Nextflow has an option called `-resume` that allows you to do this.
+Specifically, in this mode, any processes that have already been run with the exact same code, settings and inputs will be skipped.
+This means Nextflow will only run processes that you've added or modified since the last run, or to which you're providing new settings or inputs. + +There are two key advantages to doing this: + +- If you're in the middle of developing a pipeline, you can iterate more rapidly since you only have to run the process(es) you're actively working on in order to test your changes. +- If you're running a pipeline in production and something goes wrong, in many cases you can fix the issue and relaunch the pipeline, and it will resume running from the point of failure, which can save you a lot of time and compute. + +To use it, simply add `-resume` to your command and run it: + +```bash +nextflow run hello-world-plus.nf -resume +``` + +The console output should look similar. + +```console title="Output" linenums="1" + N E X T F L O W ~ version 24.10.0 + +Launching `hello-world-plus.nf` [golden_cantor] DSL2 - revision: 35bd3425e5 + +[62/49a1f8] sayHello | 1 of 1, cached: 1 ✔ +``` + +Look for the `cached:` bit that has been added in the process status line (line 5), which means that Nextflow has recognized that it has already done this work and simply re-used the result from the previous successful run. + +You can also see that the work subdirectory hash is the same as in the previous run. +Nextflow is literally pointing you to the previous execution and saying "I already did that over there." + +!!! note + + When your re-run a pipeline with `resume`, Nextflow does not overwrite any files written to a `publishDir` directory by any process call that was previously run successfully. + +### 4.2. Delete older work directories + +During the development process, you'll typically run your draft pipelines a large number of times, which can lead to an accumulation of very many files across many subdirectories. +Since the subdirectories are named randomly, it is difficult to tell from their names what are older vs. more recent runs. + +Nextflow includes a convenient `clean` subcommand that can automatically delete the work subdirectories for past runs that you no longer care about, with several [options](https://www.nextflow.io/docs/latest/reference/cli.html#clean) to control what will be deleted. + +Here we show you an example that deletes all subdirectories from runs before a given run, specified using its run name. +The run name is the machine-generated two-part string shown in square brackets in the `Launching (...)` console output line. + +First we use the dry run flag `-n` to check what will be deleted given the command: + +```bash +nextflow clean -before golden_cantor -n +``` + +The output should look like this: + +```console title="Output" +Would remove /workspaces/training/nextflow-run/work/a3/7be2fad5e71e5f49998f795677fd68 +``` + +If you don't see any lines output, you either did not provide a valid run name or there are no past runs to delete. + +If the output looks as expected and you want to proceed with the deletion, re-run the command with the `-f` flag instead of `-n`: + +```bash +nextflow clean -before golden_cantor -f +``` + +You should now see the following: + +```console title="Output" +Removed /workspaces/training/nextflow-run/work/a3/7be2fad5e71e5f49998f795677fd68 +``` + +!!! Warning + + Deleting work subdirectories from past runs removes them from Nextflow's cache and deletes any outputs that were stored in those directories. + That means it breaks Nextflow's ability to resume execution without re-running the corresponding processes. 
+ + You are responsible for saving any outputs that you care about or plan to rely on! If you're using the `publishDir` directive for that purpose, make sure to use the `copy` mode, not the `symlink` mode. + +### Takeaway + +You know how to relaunch a pipeline without repeating steps that were already run in an identical way, and use the `nextflow clean` command to clean up old work directories. + +### What's next? + +Take a little break! You've just absorbed a big pile of Nextflow syntax and usage instructions. + +In the next section of this training, we're going to look at three successively more realistic versions of the Hello World pipeline that will demonstrate how Nextflow allows you to process multiple inputs efficiently, run workflows composed of multiple steps chained together, and leverage modular code components. diff --git a/docs/nextflow_run/02_pipeline.md b/docs/nextflow_run/02_pipeline.md new file mode 100644 index 000000000..146b39752 --- /dev/null +++ b/docs/nextflow_run/02_pipeline.md @@ -0,0 +1,654 @@ +# Part 2: Run pipelines + +In Part 1 of this course (Run Basic Operations), we started with an example workflow that had only minimal features in order to keep the code complexity low. +However, most real-world pipelines use more sophisticated features in order to enable efficient processing of large amounts of data at scale, and apply multiple processing steps chained together by sometimes complex logic. + +In this part of the training, we demonstrate key features of real-world pipelines through a set of example workflows that build on the original Hello World pipeline. + +## 1. Processing multiple inputs + +Let's start with the question of how to process not a single greeting at a time, but a batch of greetings, to emulate realistic high-throughout data processing. + +The `hello-world-plus.nf` workflow we ran in Part 1 used a command-line parameter to provide a single value at a time, which was passed directly to the process call using `sayHello(params.greeting)`. +That was a deliberately simplified approach that won't work for processing multiple values. + +In order to process multiple values (experimental data for multiple samples, for example), we have to upgrade the workflow to use Nextflow's powerful system of **channels** and **operators**. + +We've prepared a workflow for you that does exactly that, called `channels.nf`, as well as a CSV file called `greetings.csv` containing some input greetings, emulating the kind of columnar data you might want to process in a real data analysis. + +```csv title="greetings.csv" linenums="1" +Hello,English,123 +Bonjour,French,456 +Holà,Spanish,789 +``` + +(The numbers are not significant, they are just there for illustrative purposes.) + +Let's run the workflow first, and we'll take a look at what has changed in the code after. + +### 1.1. Run the workflow + +Run the following command in your terminal. + +```bash +nextflow run channels.nf --greeting greetings.csv +``` + +This should run without error. + +```console title="Output" linenums="1" + N E X T F L O W ~ version 25.04.3 + +Launching `channels.nf` [tiny_heisenberg] DSL2 - revision: 845b471427 + +executor > local (3) +[1a/1d19ab] sayHello (2) | 3 of 3 ✔ +``` + +Excitingly, this seems to indicate that '3 of 3' calls were made for the process, which is encouraging! +But this only shows us a single run of the process, with one subdirectory path (`1a/1d19ab`). +What's going on? 
+ +By default, the ANSI logging system writes the logging from multiple calls to the same process on the same line. +Fortunately, we can disable that behavior to see the full list of process calls. + +### 1.2. Run the command again with the `-ansi-log false` option + +To expand the logging to display one line per process call, add `-ansi-log false` to the command. + +```bash +nextflow run channels.nf -ansi-log false +``` + +This time we see all three process runs and their associated work subdirectories listed in the output: + +```console title="Output" linenums="1" +N E X T F L O W ~ version 25.04.3 +Launching `channels.nf` [pensive_poitras] DSL2 - revision: 778deadaea +[76/f61695] Submitted process > sayHello (1) +[6e/d12e35] Submitted process > sayHello (3) +[c1/097679] Submitted process > sayHello (2) +``` + +That's much better; at least for a simple workflow. +For a complex workflow, or a large number of inputs, having the full list output to the terminal might get a bit overwhelming, so you might not choose to use `-ansi-log false` in those cases. + +!!! note + + The way the status is reported is a bit different between the two logging modes. + In the condensed mode, Nextflow reports whether calls were completed successfully or not. + In this expanded mode, it only reports that they were submitted. + +### 1.3. Find the outputs + +Ok, so this shows us that the process got run three times. +Let's look for the outputs in the individual work directories first, since we've got them listed. + +TODO: show example work directory output + +There is an output file there but the name has changed, it's no longer just `output.txt`. +File that away in your brain for later. + +Now let's look at the 'results' directory to see if our workflow is still writing a copy of our outputs there. + +TODO: show results directory contents + +Yes! We see all three expected outputs, conveniently with differentiating names. + +### 1.4. Examine the code + +Let's take a look at what has changed in the workflow code. + +```groovy title="channels.nf" linenums="1" +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path "${greeting}-output.txt" + + script: + """ + echo '$greeting' > '$greeting-output.txt' + """ +} + +/* + * Pipeline parameters + */ +params.greeting = 'greetings.csv' + +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } + + // emit a greeting + sayHello(greeting_ch) +} +``` + +#### 1.4.1. Name the outputs dynamically + +Let's start with the output naming since that's conceptually the simplest change. + +```groovy title="channels.nf" linenums="13" + output: + path "${greeting}-output.txt" + + script: + """ + echo '$greeting' > '$greeting-output.txt' + """ +``` + +You see that the output declaration and the relevant bit of the command have changed to include the greeting value in the output file name. +This is one way to ensure that the output file names won't collide when they get published to the common `results` directory. + +And that's the only change we've had to make inside the process declaration. + +#### 1.4.2. Load the inputs from the CSV + +This is the really interesting part: how did we switch from taking a single value from the command-line, to taking a CSV file, parsing it and processing the individual greetings it contains? 
+ +That is what Nextflow **channels** are for. +Channels are queues designed to handle inputs efficiently and shuttle them from one step to another in multi-step workflows, while providing built-in parallelism and many additional benefits. +They are complemented by **operators** that allow us to transform channel contents as needed. + +Confused? Let's break it down. + +```groovy title="channels.nf" linenums="25" +params.greeting = 'greetings.csv' + +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } +``` + +This is where the magic happens, starting at line 30. +Here's what that line means in plain English: + +Channel = create a **channel**, i.e. a queue that will hold the data +.fromPath = from the filepath provided in parenthesis +(params.greeting) = the filepath provided with `--greeting` on the command line + +Then the next two lines apply **operators** that transform the contents of the newly created channel as follows: + +.splitCsv() = parse the CSV file into an array representing rows and columns +.map { line -> line[0] } = for each row (line), take only the element in the first column + +So in practice, starting from the following CSV file: + +```csv title="greetings.csv" linenums="1" +Hello,English,123 +Bonjour,French,456 +Holà,Spanish,789 +``` + +We have transformed that into an array that looks like this: + +```txt title="Array contents" +[[Hello,English,123],[Bonjour,French,456],[Holà,Spanish,789]] +``` + +And then we've taken the first element from each of the three rows and loaded them into a Nextflow channel that now contains: `Hello`, `Bonjour`, and `Holà`. + +In other words, the result of this very short snippet of code is a channel called `greeting_ch` loaded with the three individual greetings from the CSV file, ready for processing. + +#### 1.4.3. Call the process on each greeting + +Then in the last line of the workflow block, we call the `sayHello()` process on the loaded `greeting_ch` channel. + +```groovy title="channels.nf" linenums="35" + sayHello(greeting_ch) +} +``` + +This tells Nextflow to run the process _individually_ on each element in the channel, i.e. on each greeting. + +And because Nextflow is smart like that, it will run these process calls in parallel if possible, depending on the available computing infrastructure. + +That is how you can achieve efficient and scalable processing of a lot of data (many samples, or data points, whatever is your unit of research) with comparatively very little code. + +### 1.5. Optional: Add `view()` to inspect channel contents + +If you're interested in getting into the guts of channels and operators, you can use [`view()`](https://www.nextflow.io/docs/latest/reference/operator.html#view) as described below to inspect the contents of the channel. +You can think of `view()` as a debugging tool, like a `print()` statement in Python, or its equivalent in other languages. + +In the workflow block, make the following code change: + +```groovy title="channels.nf" linenums="29" hl_lines="3,5,7" + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .view { thing -> "Before splitCsv: $thing" } + .splitCsv() + .view { thing -> "After splitCsv: $thing" } + .map { line -> line[0] } + .view { thing -> "After map: $thing" } +``` + +Here we are using an operator **closure**, denoted by the curly brackets, to specify what to do within the scope of the `view()` operator. 
+This code will be executed for each item in the channel. +We define a temporary variable for the inner value, here called `thing` to be generic (it could be anything), representing each individual item loaded in a channel. +This variable is only used within the scope of that closure. + +You can then run the workflow again: + +```bash +nextflow run channels.nf --greeting greetings.csv +``` + +This should once again run without error and produce the following output: + +```console title="Output" linenums="1" + N E X T F L O W ~ version 25.04.3 + +Launching `channels.nf` [tiny_heisenberg] DSL2 - revision: 845b471427 + +executor > local (3) +[1a/1d19ab] sayHello (2) | 3 of 3 ✔ +Before splitCsv: /workspaces/training/nextflow-run/greetings.csv +After splitCsv: [Hello,English,123] +After splitCsv: [Bonjour,French,456] +After splitCsv: [Holà,Spanish,789] +After map: Hello +After map: Bonjour +After map: Holà +``` + +This time you see the extra lines at the end showing you what are the contents of the channel at each stage. +Feel free to play around with the contents of the CSV and change the number in the `line -> line[0]` bit that controls which column's value the `map()` operator will pull out. +See what happens! + +### Takeaway + +You understand at a basic level how channels and operators enable us to process multiple inputs efficiently. + +### What's next? + +Discover how multi-step workflows are constructed and operate. + +--- + +## 2. Multi-step workflows + +Most real-world workflows involve more than one step. +Let's build on what we just learned about channels, and look at how Nextflow uses channels and operators to connect processes together in a multi-step workflow. + +To that end, we provide you with an example workflow that chains together three separate steps and demonstrates the following: + +1. Making data flow from one process to the next +2. Collecting outputs from multiple process calls into a single process call + +Specifically, we made a version of the Hello World workflow that takes each input greeting, converts it to uppercase, then collects all the uppercased greetings into a single output file. + +As previously, we'll run the workflow first then look at the code to see what's changed. + +### 2.1. Run the workflow + +Run the following command in your terminal: + +```bash +nextflow run flow.nf --greeting greetings.csv +``` + +Once again this should run successfully. + +```console title="Output" linenums="1" + N E X T F L O W ~ version 25.04.3 + +Launching `flow.nf` [soggy_franklin] DSL2 - revision: bc8e1b2726 + +[d6/cdf466] sayHello (1) | 3 of 3 ✔ +[99/79394f] convertToUpper (2) | 3 of 3 ✔ +[1e/83586c] collectGreetings | 1 of 1 ✔ +There were 3 greetings in this batch +``` + +You see that as promised, multiple steps were run as part of the workflow; the first two (`sayHello` and `convertToUpper`) were presumably run on each individual greeting, and the third (`collectGreetings`) will have been run only once, on the outputs of all three of the `convertToUpper` calls. + +### 2.2. Find the outputs + +If you'd like to verify that that is in fact what happened (good scientist; have a biscuit), you can take a look in the `results` directory. 
+ +```console title="Directory contents" +results +├── Bonjour-output.txt +├── COLLECTED-output.txt +├── COLLECTED-test-batch-output.txt +├── COLLECTED-trio-output.txt +├── Hello-output.txt +├── Holà-output.txt +├── UPPER-Bonjour-output.txt +├── UPPER-Hello-output.txt +└── UPPER-Holà-output.txt +``` + +Look at the file names and check their contents to confirm that they are what you expect, for example: + +```console title="bash" +cat results/COLLECTED-trio-output.txt +``` + +```console title="Output" +HELLO +BONJOUR +HOLà +``` + +That is the expected final result of our multi-step pipeline. + +### 2.3. Examine the code + +Let's look at the code and see what we can tie back to what we just observed. + +```groovy title="channels.nf" linenums="1" +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path "${greeting}-output.txt" + + script: + """ + echo '$greeting' > '$greeting-output.txt' + """ +} + +/* + * Use a text replacement tool to convert the greeting to uppercase + */ +process convertToUpper { + + publishDir 'results', mode: 'copy' + + input: + path input_file + + output: + path "UPPER-${input_file}" + + script: + """ + cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}' + """ +} + +/* + * Collect uppercase greetings into a single output file + */ +process collectGreetings { + + publishDir 'results', mode: 'copy' + + input: + path input_files + val batch_name + + output: + path "COLLECTED-${batch_name}-output.txt" , emit: outfile + val count_greetings , emit: count + + script: + count_greetings = input_files.size() + """ + cat ${input_files} > 'COLLECTED-${batch_name}-output.txt' + """ +} + +/* + * Pipeline parameters + */ +params.greeting = 'greetings.csv' +params.batch = 'test-batch' + +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } + + // emit a greeting + sayHello(greeting_ch) + + // convert the greeting to uppercase + convertToUpper(sayHello.out) + + // collect all the greetings into one file + collectGreetings(convertToUpper.out.collect(), params.batch) + + // emit a message about the size of the batch + collectGreetings.out.count.view { num -> "There were $num greetings in this batch" } +} +``` + +The most obvious difference compared to the previous version of the workflow is that now there are multiple process definitions, and correspondingly, several process calls in the workflow block. + +#### 2.3.1. Multiple process definitions + +In addition to the original `sayHello` process, we now also have `convertToUpper` and `collectGreetings`, which match the names of the processes we saw in the console output. + +All three are structured in the same way and follow roughly the same logic, though you may notice that the `collectGreetings` process takes two inputs and outputs two outputs. +We won't go into that in detail, but it shows how a process can be given additional parameters and emit multiple outputs. + +#### 2.3.2. Processes chained via channels + +The really interesting thing to look at here is how the process calls are chained together in the workflow block. 
+ +```groovy title="channels.nf" linenums="69" +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } + + // emit a greeting + sayHello(greeting_ch) + + // convert the greeting to uppercase + convertToUpper(sayHello.out) + + // collect all the greetings into one file + collectGreetings(convertToUpper.out.collect(), params.batch) + + // emit a message about the size of the batch + collectGreetings.out.count.view { num -> "There were $num greetings in this batch" } +} +``` + +You can see that the first process call, to `sayHello()`, is unchanged. + +Then the next process call, to `convertToUpper`, _refers_ to the output of `sayHello` as `sayHello.out`: + +```groovy title="channels.nf" linenums="79" + // convert the greeting to uppercase + convertToUpper(sayHello.out) +``` + +This means 'call `convertToUpper` on the output of `sayHello()`'. + +Then the next call is doing the same thing, with a little twist (or two): + +```groovy title="channels.nf" linenums="82" + // collect all the greetings into one file + collectGreetings(convertToUpper.out.collect(), params.batch) +``` + +First, you'll note this one has two inputs provided to the `collectGreetings()` call: `convertToUpper.out.collect()` and `params.batch`. +The latter is just a parameter value that the process expects (in second position because it is declared in second position in the process definition). + +The other one, `convertToUpper.out.collect()`, is a bit more complicated and deserves its own discussion. + +#### 2.3.3. Operators provide plumbing options + +What we're seeing in `convertToUpper.out.collect()` is the use of another operator, called `collect()`. +This operator is used to collect the outputs from multiple parallel calls to the same process and package them into a single channel element. + +Specifically, +TODO: finish explanation + +There are many other operators available to apply transformations to the contents of channels between process calls. + +This gives pipeline developers a lot of flexibility for customizing the flow logic of their pipeline. +The downside is that it can sometimes make it harder to decipher what the pipeline is doing. + +### 2.4. Use the graph preview + +One very helpful tool for understanding what a pipeline does, if it's not adequately documented, is the graph preview functionality available in VSCode. You can see this in the training environment by clicking on the small `DAG preview` link displayed just above the workflow block in any Nextflow script. + +TODO: add picture + +This does not show operators, but it does give a useful representation of how process calls are connected and what are their inputs. + +### Takeaway + +You understand at a basic level how multi-step workflows are constructed and operate, using channels and operators, and you can manage their execution. + +### What's next? + +Learn how Nextflow pipelines are often modularized to promote code reuse and maintainability. + +--- + +## 3. Modular code components + +So far, all the workflows we've looked at have consisted of one single workflow file containing all the relevant code. + +However, real-world pipelines typically benefit from being _modularized_, meaning that the code is split into different files. +This can make their development and maintenance more efficient and sustainable. + +Here we are going to demonstrate the most common form of code modularity in Nextflow, which is the use of **modules**. 
+
+In Nextflow, a **module** is a single process definition that is encapsulated by itself in a standalone code file.
+To use a module in a workflow, you just add a single-line import statement to your workflow code file; then you can integrate the process into the workflow the same way you normally would.
+
+Putting processes into individual modules makes it possible to reuse process definitions in multiple workflows without producing multiple copies of the code.
+This makes the code more shareable, flexible and maintainable.
+
+We have of course once again prepared a suitable workflow for demonstration purposes, called `modular.nf`, along with a set of modules located in the `modules/` directory.
+
+### 3.1. Examine the code
+
+This time we're going to look at the code first.
+
+TODO: show directory contents
+
+Importantly, the processes and workflow logic are exactly the same as in the previous version of the workflow.
+However, the process code now lives in the modules instead of in the main workflow file, and there are import statements in the workflow file telling Nextflow to pull them in at runtime.
+
+```groovy title="modular.nf" linenums="9" hl_lines="2-4"
+// Include modules
+include { sayHello } from './modules/sayHello.nf'
+include { convertToUpper } from './modules/convertToUpper.nf'
+include { collectGreetings } from './modules/collectGreetings.nf'
+
+workflow {
+```
+
+You can look inside one of the modules to satisfy yourself that the process definition is unchanged; it has literally just been copy-pasted into a standalone file.
+
+TODO: show module code for `sayHello`
+
+So let's see what it looks like to run this new version.
+
+### 3.2. Run the workflow
+
+Run this command in your terminal, with the `-resume` flag:
+
+```bash
+nextflow run modular.nf --greeting greetings.csv -resume
+```
+
+Once again this should run successfully.
+
+```console title="Output" linenums="1"
+ N E X T F L O W ~ version 25.04.3
+
+Launching `modular.nf` [soggy_franklin] DSL2 - revision: bc8e1b2726
+
+[d6/cdf466] sayHello (1) | 3 of 3, cached: 3 ✔
+[99/79394f] convertToUpper (2) | 3 of 3, cached: 3 ✔
+[1e/83586c] collectGreetings | 1 of 1, cached: 1 ✔
+There were 3 greetings in this batch
+```
+
+You'll notice that these all cached successfully, meaning that Nextflow recognized that it had already done the requested work, even though the code has been split up and the main workflow file has been renamed.
+
+None of that matters to Nextflow; what matters is the job script that is generated once all the code has been pulled together and evaluated.
+
+!!!note
+
+    It is also possible to encapsulate a section of a workflow as a 'subworkflow' that can be imported into a larger pipeline, but that is outside the scope of this course.
+
+    TODO: add links to learn more about composable workflows
+
+### Takeaway
+
+You know how processes can be stored in standalone modules to promote code reuse and improve maintainability.
+
+### What's next?
+
+Learn to use containers for managing software dependencies.
+
+---
+
+## 4. Using containerized software
+
+So far the workflows we've been using as examples only needed to run very basic text processing operations using UNIX tools available in our environment.
+
+However, real-world pipelines typically require specialized tools and packages that are not included by default in most environments.
+Usually, you'd need to install these tools, manage their dependencies, and resolve any conflicts.
+
+That is all very tedious and annoying.
+A much better way to address this problem is to use **containers**. + +A **container** is a lightweight, standalone, executable unit of software created from a container **image** that includes everything needed to run an application including code, system libraries and settings. + +!!! note + + We teach this using the technology [Docker](https://www.docker.com/get-started/), but Nextflow supports [several other container technologies](https://www.nextflow.io/docs/latest/container.html#) as well. + +### 4.1. Use a container directly + +First, let's try interacting with a container directly. +This will help solidify your understanding of what containers are before we start using them in Nextflow. + +TODO: clone the content from hello_containers.md + +### 4.2. Use a container in a workflow + +TODO: clone the content from hello_containers.md + +### Takeaway + +You understand what role containers play in managing software tool versions and ensuring reproducibility. + +More generally, you have a basic understanding of the most common and most important components of real-world Nexflow pipelines. + +### What's next? + +Take another break! +TODO: finalize the transition text diff --git a/docs/nextflow_run/02_run_basics.md b/docs/nextflow_run/02_run_basics.md deleted file mode 100644 index 14f0b7e56..000000000 --- a/docs/nextflow_run/02_run_basics.md +++ /dev/null @@ -1,10 +0,0 @@ -# Part 1: Run Basics - -[TODO] - -Should cover: - -- basic project structure (main.nf, modules, nextflow.config) -- run from CLI (re-use from Hello-World) -- basic config elements (refer to hello-config) including profiles -- running resource profiling and adapting the config diff --git a/docs/nextflow_run/03_config.md b/docs/nextflow_run/03_config.md new file mode 100644 index 000000000..60a4bb515 --- /dev/null +++ b/docs/nextflow_run/03_config.md @@ -0,0 +1,615 @@ +# Part 3: Configuration + +This section will explore how to manage the configuration of a Nextflow pipeline in order to customize its behavior, adapt it to different environments, and optimize resource usage _without altering a single line of the workflow code itself_. + +There are multiple ways to do this; here we are going to use the simplest and most common configuration file mechanism, the `nextflow.config` file. +Whenever there is a file named `nextflow.config` in the current directory, Nextflow will automatically load configuration from it. + +TODO: pare down and streamline some more + +!!!note + + Anything you put into the `nextflow.config` can be overridden at runtime by providing the relevant process directives or parameters and values on the command line, or by importing another configuration file, according to the order of precedence described [here](https://www.nextflow.io/docs/latest/config.html). + +In this part of the training, we're going to use the `nextflow.config` file to demonstrate essential components of Nextflow configuration such as process directives, executors, profiles, and parameter files. + +By learning to utilize these configuration options effectively, you can enhance the flexibility, scalability, and performance of your pipelines. + +--- + +## 0. Warmup: Check that Docker is enabled and run the Hello Config workflow + +First, a quick check. There is a `nextflow.config` file in the current directory that contains the line `docker.enabled = `, where `` is either `true` or `false` depending on whether or not you've worked through Part 5 of this course in the same environment. 
+ +If it is set to `true`, you don't need to do anything. + +If it is set to `false`, switch it to `true` now. + +```console title="nextflow.config" linenums="1" +docker.enabled = true +``` + +Once you've done that, verify that the initial workflow runs properly: + +```bash +nextflow run hello-config.nf +``` + +```console title="Output" + N E X T F L O W ~ version 25.04.3 + +Launching `hello-config.nf` [reverent_heisenberg] DSL2 - revision: 028a841db1 + +executor > local (8) +[7f/0da515] sayHello (1) | 3 of 3 ✔ +[f3/42f5a5] convertToUpper (3) | 3 of 3 ✔ +[04/fe90e4] collectGreetings | 1 of 1 ✔ +[81/4f5fa9] cowpy | 1 of 1 ✔ +There were 3 greetings in this batch +``` + +If everything works, you're ready to learn how to modify basic configuration properties to adapt to your compute environment's requirements. + +--- + +## 1. Determine what software packaging technology to use + +The first step toward adapting your workflow configuration to your compute environment is specifying where the software packages that will get run in each step are going to be coming from. +Are they already installed in the local compute environment? Do we need to retrieve images and run them via a container system? Or do we need to retrieve Conda packages and build a local Conda environment? + +In the very first part of this training course (Parts 1-4) we just used locally installed software in our workflow. +Then in Part 5, we introduced Docker containers and the `nextflow.config` file, which we used to enable the use of Docker containers. + +In the warmup to this section, you checked that Docker was enabled in `nextflow.config` file and ran the workflow, which used a Docker container to execute the `cowpy()` process. + +!!! note + + If that doesn't sound familiar, you should probably go back and work through Part 5 before continuing. + +Now let's see how we can configure an alternative software packaging option via the `nextflow.config` file. + +### 1.1. Disable Docker and enable Conda in the config file + +Let's pretend we're working on an HPC cluster and the admin doesn't allow the use of Docker for security reasons. + +Fortunately for us, Nextflow supports multiple other container technologies such as including Singularity (which is more widely used on HPC), and software package managers such as Conda. + +We can change our configuration file to use Conda instead of Docker. +To do so, we switch the value of `docker.enabled` to `false`, and add a directive enabling the use of Conda: + +=== "After" + + ```groovy title="nextflow.config" linenums="1" hl_lines="1-2" + docker.enabled = false + conda.enabled = true + ``` + +=== "Before" + + ```groovy title="nextflow.config" linenums="1" + + docker.enabled = true + ``` + +This will allow Nextflow to create and utilize Conda environments for processes that have Conda packages specified. +Which means we now need to add one of those to our `cowpy` process! + +### 1.2. Specify a Conda package in the process definition + +We've already retrieved the URI for a Conda package containing the `cowpy` tool: `conda-forge::cowpy==1.1.5` + +!!! note + + There are a few different ways to get the URI for a given conda package. + We recommend using the [Seqera Containers](https://seqera.io/containers/) search query, which will give you a URI that you can copy and paste, even if you're not planning to create a container from it. 
+ +Now we add the URI to the `cowpy` process definition using the `conda` directive: + +=== "After" + + ```console title="modules/cowpy.nf" linenums="4" hl_lines="4" + process cowpy { + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + conda 'conda-forge::cowpy==1.1.5' + + publishDir 'results', mode: 'copy' + ``` + +=== "Before" + + ```console title="modules/cowpy.nf" linenums="4" + process cowpy { + + container 'community.wave.seqera.io/library/cowpy:1.1.5--3db457ae1977a273' + + publishDir 'results', mode: 'copy' + ``` + +To be clear, we're not _replacing_ the `docker` directive, we're _adding_ an alternative option. + +### 1.3. Run the workflow to verify that it can use Conda + +Let's try it out. + +```bash +nextflow run hello-config.nf +``` + +This should work without issue. + +```console title="Output" + N E X T F L O W ~ version 25.04.3 + +Launching `hello-config.nf` [trusting_lovelace] DSL2 - revision: 028a841db1 + +executor > local (8) +[ee/4ca1f2] sayHello (3) | 3 of 3 ✔ +[20/2596a7] convertToUpper (1) | 3 of 3 ✔ +[b3/e15de5] collectGreetings | 1 of 1 ✔ +[c5/af5f88] cowpy | 1 of 1 ✔ +There were 3 greetings in this batch +``` + +Behind the scenes, Nextflow has retrieved the Conda packages and created the environment, which normally takes a bit of work; so it's nice that we don't have to do any of that ourselves! + +!!! note + + This runs quickly because the `cowpy` package is quite small, but if you're working with large packages, it may take a bit longer than usual the first time, and you might see the console output stay 'stuck' for a minute or so before completing. + This is normal and is due to the extra work Nextflow does the first time you use a new package. + +From our standpoint, it looks like it works exactly the same as running with Docker, even though on the backend the mechanics are a bit different. + +This means we're all set to run with Conda environments if needed. + +!!!note + + Since these directives are assigned per process, it is possible 'mix and match', _i.e._ configure some of the processes in your workflow to run with Docker and others with Conda, for example, if the compute infrastructure you are using supports both. + In that case, you would enable both Docker and Conda in your configuration file. + If both are available for a given process, Nextflow will prioritize containers. + + And as noted earlier, Nextflow supports multiple other software packaging and container technologies, so you are not limited to just those two. + +### Takeaway + +You know how to configure which software package each process should use, and how to switch between technologies. + +### What's next? + +Learn how to change the executor used by Nextflow to actually do the work. + +--- + +## 2. Allocate compute resources with process directives + +Most high-performance computing platforms allow (and sometimes require) that you specify certain resource allocation parameters such as number of CPUs and memory. + +By default, Nextflow will use a single CPU and 2GB of memory for each process. +The corresponding process directives are called `cpus` and `memory`, so the following configuration is implied: + +```groovy title="Built-in configuration" linenums="1" +process { + cpus = 1 + memory = 2.GB +} +``` + +You can modify these values, either for all processes or for specific named processes, using additional process directives in your configuration file. +Nextflow will translate them into the appropriate instructions for the chosen executor. 
+ +But how do you know what values to use? + +### 2.1. Run the workflow to generate a resource utilization report + +If you don't know up front how much CPU and memory your processes are likely to need, you can do some resource profiling, meaning you run the workflow with some default allocations, record how much each process used, and from there, estimate how to adjust the base allocations. + +Conveniently, Nextflow includes built-in tools for doing this, and will happily generate a report for you on request. + +To do so, add `-with-report .html` to your command line. + +```bash +nextflow run hello-config.nf -with-report report-config-1.html +``` + +The report is an html file, which you can download and open in your browser. You can also right click it in the file explorer on the left and click on `Show preview` in order to view it in the training environment. + +Take a few minutes to look through the report and see if you can identify some opportunities for adjusting resources. +Make sure to click on the tabs that show the utilization results as a percentage of what was allocated. +There is some [documentation](https://www.nextflow.io/docs/latest/reports.html) describing all the available features. + + + +### 2.2. Set resource allocations for all processes + +The profiling shows that the processes in our training workflow are very lightweight, so let's reduce the default memory allocation to 1GB per process. + +Add the following to your `nextflow.config` file: + +```groovy title="nextflow.config" linenums="4" +process { + memory = 1.GB +} +``` + +### 2.3. Set resource allocations for an individual process + +At the same time, we're going to pretend that the `cowpy` process requires more resources than the others, just so we can demonstrate how to adjust allocations for an individual process. + +=== "After" + + ```groovy title="nextflow.config" linenums="4" hl_lines="3-6" + process { + memory = 1.GB + withName: 'cowpy' { + memory = 2.GB + cpus = 2 + } + } + ``` + +=== "Before" + + ```groovy title="nextflow.config" linenums="14" + process { + memory = 1.GB + } + ``` + +With this configuration, all processes will request 1GB of memory and a single CPU (the implied default), except the `cowpy` process, which will request 2GB and 2 CPUs. + +!!! note + + If you have a machine with few CPUs and you allocate a high number per process, you might see process calls getting queued behind each other. + This is because Nextflow ensures we don't request more CPUs than are available. + +### 2.4. Run the workflow with the modified configuration + +Let's try that out, supplying a different filename for the profiling report so we can compare performance before and after the configuration changes. + +```bash +nextflow run hello-config.nf -with-report report-config-2.html +``` + +You will probably not notice any real difference since this is such a small workload, but this is the approach you would use to analyze the performance and resource requirements of a real-world workflow. + +It is very useful when your processes have different resource requirements. It empowers you to right-size the resource allocations you set up for each process based on actual data, not guesswork. + +!!!note + + This is just a tiny taster of what you can do to optimize your use of resources. + Nextflow itself has some really neat [dynamic retry logic](https://training.nextflow.io/basic_training/debugging/#dynamic-resources-allocation) built in to retry jobs that fail due to resource limitations. 
+

    Additionally, the Seqera Platform offers AI-driven tooling for optimizing your resource allocations automatically as well.

    We'll cover both of those approaches in an upcoming part of this training course.

### 2.5. Add resource limits

Depending on what executor and compute infrastructure you're using, there may be some constraints on what you can (or must) allocate.
For example, your cluster may require you to stay within certain limits.

You can use the `resourceLimits` directive to set the relevant limitations. The syntax looks like this when it's by itself in a process block:

```groovy title="Syntax example"
process {
    resourceLimits = [
        memory: 750.GB,
        cpus: 200,
        time: 30.d
    ]
}
```

Nextflow will translate these values into the appropriate instructions depending on the executor that you specified.

We're not going to run this, since we don't have access to relevant infrastructure in the training environment.
However, if you were to run the workflow with resource allocations that exceed these limits, then look up the `sbatch` command in the `.command.run` script file, you would see that the requests actually sent to the executor are capped at the values specified by `resourceLimits`.

!!!note

    The nf-core project has compiled a [collection of configuration files](https://nf-co.re/configs/) shared by various institutions around the world, covering a wide range of HPC and cloud executors.

    Those shared configs are valuable both for people who work at those institutions, who can simply use their institution's configuration out of the box, and as a model for people who are looking to develop a configuration for their own infrastructure.

### Takeaway

You know how to generate a profiling report to assess resource utilization and how to modify resource allocations for all processes and/or for individual processes, as well as set resource limitations for running on HPC.

### What's next?

Learn how to use a parameter file to store workflow parameters.

---

## 3. Use a parameter file to store workflow parameters

So far we've been looking at configuration from the technical point of view of the compute infrastructure.
Now let's consider another aspect of workflow configuration that is very important for reproducibility: the configuration of the workflow parameters.

Currently, our workflow is set up to accept several parameter values via the command line, with default values set in the workflow script itself.
This is fine for a simple workflow with very few parameters that need to be set for a given run.
However, many real-world workflows will have many more parameters that may be run-specific, and putting all of them in the command line would be tedious and error-prone.

Nextflow allows us to specify parameters via a parameter file in JSON format, which makes it very convenient to manage and distribute alternative sets of default values as well as run-specific parameter values.

We provide an example parameter file in the current directory, called `test-params.json`:

```json title="test-params.json" linenums="1"
{
    "greeting": "greetings.csv",
    "batch": "Trio",
    "character": "turkey"
}
```

This parameter file contains a key-value pair for each of the inputs our workflow expects.

### 3.1. Run the workflow using a parameter file

To run the workflow with this parameter file, simply add `-params-file <filename>` to the base command.
+

```bash
nextflow run hello-config.nf -params-file test-params.json
```

It works! And as expected, this produces the same outputs as before.

```console title="Output"
 N E X T F L O W ~ version 25.04.3

Launching `hello-config.nf` [disturbed_sammet] DSL2 - revision: ede9037d02

executor > local (8)
[f0/35723c] sayHello (2) | 3 of 3 ✔
[40/3efd1a] convertToUpper (3) | 3 of 3 ✔
[17/e97d32] collectGreetings | 1 of 1 ✔
[98/c6b57b] cowpy | 1 of 1 ✔
There were 3 greetings in this batch
```

This may seem like overkill when you only have a few parameters to specify, but some pipelines expect dozens of parameters.
In those cases, using a parameter file allows us to provide parameter values at runtime without having to type massive command lines and without modifying the workflow script.

### Takeaway

You know how to manage parameter defaults and override them at runtime using a parameter file.

### What's next?

Learn how to change the executor used by Nextflow to actually do the work.

---

## 4. Determine what executor(s) should be used to do the work

Until now, we have been running our pipeline with the local executor.
This executes each task on the machine that Nextflow is running on.
When Nextflow begins, it looks at the available CPUs and memory.
If the resources of the tasks ready to run exceed the available resources, Nextflow will hold the last tasks back from execution until one or more of the earlier tasks have finished, freeing up the necessary resources.

For very large workloads, you may discover that your local machine is a bottleneck, either because you have a single task that requires more resources than you have available, or because you have so many tasks that waiting for a single machine to run them would take too long.
The local executor is convenient and efficient, but it is limited to that single machine.
Nextflow supports [many different execution backends](https://www.nextflow.io/docs/latest/executor.html), including HPC schedulers (Slurm, LSF, SGE, PBS, Moab, OAR, Bridge, HTCondor and others) as well as cloud execution backends such as AWS Batch, Google Cloud Batch, Azure Batch, Kubernetes and more.

Each of these systems uses different technologies, syntaxes and configurations for defining how a job should be run. For example, _if we didn't have Nextflow_, a job requiring 8 CPUs and 4GB of RAM to be executed on the queue "my-science-work" would need to include the following configuration when running on Slurm, and would be submitted using `sbatch`:

```bash
#SBATCH -o /path/to/my/task/directory/my-task-1.log
#SBATCH --no-requeue
#SBATCH -c 8
#SBATCH --mem 4096M
#SBATCH -p my-science-work
```

If I wanted to make the workflow available to a colleague running on PBS, I'd need to remember to use a different submission program, `qsub`, and I'd need to change my scripts to use a new syntax for resources:

```bash
#PBS -o /path/to/my/task/directory/my-task-1.log
#PBS -j oe
#PBS -q my-science-work
#PBS -l nodes=1:ppn=8
#PBS -l mem=4gb
```

If I wanted to use SGE, the configuration would be slightly different again:

```bash
#$ -o /path/to/my/task/directory/my-task-1.log
#$ -j y
#$ -terse
#$ -notify
#$ -q my-science-work
#$ -l slots=8
#$ -l h_rss=4096M,mem_free=4096M
```

Running on a cloud execution engine would require yet another approach, likely using an SDK that calls the cloud platform's APIs.
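For comparison, here is a minimal sketch of how those same requirements could be expressed just once using Nextflow's `cpus`, `memory` and `queue` process directives, leaving the scheduler-specific syntax for Nextflow to generate (the values simply mirror the example above):

```groovy title="Syntax example"
process {
    cpus = 8
    memory = 4.GB
    queue = 'my-science-work'
}
```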
+

Nextflow makes it easy to write a single workflow that can be run on each of these different infrastructures and systems, without having to modify the workflow.
The executor is controlled by a process directive called `executor`.
By default it is set to `local`, so the following configuration is implied:

```groovy title="Built-in configuration"
process {
    executor = 'local'
}
```

### 4.1. Target a different backend

By default, this training environment does not include a running HPC scheduler, but if you were running on a system with Slurm installed, for example, you could have Nextflow convert the `cpus`, `memory`, `queue` and other process directives into the correct syntax at runtime by adding the following lines to the `nextflow.config` file:

```groovy title="nextflow.config"
process {
    executor = 'slurm'
}
```

And... that's it! As noted before, this does assume that Slurm itself is already set up for you, but this is really all Nextflow itself needs to know.

Basically we are telling Nextflow to generate a Slurm submission script and submit it using an `sbatch` command.

### Takeaway

You now know how to change the executor to use different kinds of computing infrastructure.

### What's next?

Learn how to use profiles to conveniently switch between alternative configurations.

---

## 5. Use profiles to select preset configurations

You may want to switch between alternative settings depending on what computing infrastructure you're using. For example, you might want to develop and run small-scale tests locally on your laptop, then run full-scale workloads on HPC or cloud.

Nextflow lets you set up profiles that describe different configurations, which you can then select at runtime using a command-line argument, rather than having to modify the configuration file itself.

### 5.1. Create profiles for switching between local development and execution on HPC

Let's set up two alternative profiles: one for running small-scale loads on a regular computer, where we'll use Docker containers, and one for running on a university HPC with a Slurm scheduler, where we'll use Conda packages.

Add the following to your `nextflow.config` file:

```groovy title="nextflow.config" linenums="3"
profiles {
    my_laptop {
        process.executor = 'local'
        docker.enabled = true
    }
    univ_hpc {
        process.executor = 'slurm'
        conda.enabled = true
        process.resourceLimits = [
            memory: 750.GB,
            cpus: 200,
            time: 30.d
        ]
    }
}
```

Note that for the university HPC, we're also specifying resource limitations.

### 5.2. Run the workflow with a profile

To specify a profile in our Nextflow command line, we use the `-profile` argument.

Let's try running the workflow with the `my_laptop` configuration.

```bash
nextflow run hello-config.nf -profile my_laptop
```

This still produces the following output:

```console title="Output"
 N E X T F L O W ~ version 25.04.3

Launching `hello-config.nf` [gigantic_brazil] DSL2 - revision: ede9037d02

executor > local (8)
[58/da9437] sayHello (3) | 3 of 3 ✔
[35/9cbe77] convertToUpper (2) | 3 of 3 ✔
[67/857d05] collectGreetings | 1 of 1 ✔
[37/7b51b5] cowpy | 1 of 1 ✔
There were 3 greetings in this batch
```

As you can see, this allows us to toggle between configurations very conveniently at runtime.

!!! warning

    The `univ_hpc` profile will not run properly in the training environment since we do not have access to a Slurm scheduler.
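For reference, on a system where Slurm is actually available, switching over would simply be a matter of selecting the other profile on the command line. This is shown for illustration only, since it won't run in this environment:

```bash
nextflow run hello-config.nf -profile univ_hpc
```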
+

If in the future we find other elements of configuration that always co-occur with these, we can simply add them to the corresponding profile(s).
We can also create additional profiles if there are other elements of configuration that we want to group together.

### 5.3. Create a test profile

Profiles are not only for infrastructure configuration.
We can also use them to set default values for workflow parameters, to make it easier for others to try out the workflow without having to gather appropriate input values themselves.
This is intended as an alternative to using a parameter file.

The syntax for expressing default values is the same as when writing them into the workflow file itself, except we wrap them in a block named `test`:

```groovy title="Syntax example"
    test {
        params.<parameter1> = <value1>
        params.<parameter2> = <value2>
        ...
    }
```

If we add a test profile for our workflow, the `profiles` block becomes:

```groovy title="nextflow.config" linenums="4"
profiles {
    my_laptop {
        process.executor = 'local'
        docker.enabled = true
    }
    univ_hpc {
        process.executor = 'slurm'
        conda.enabled = true
        process.resourceLimits = [
            memory: 750.GB,
            cpus: 200,
            time: 30.d
        ]
    }
    test {
        params.greeting = 'greetings.csv'
        params.batch = 'test-batch'
        params.character = 'turkey'
    }
}
```

Just like for technical configuration profiles, you can set up multiple profiles specifying parameters, under any arbitrary name you like.

### 5.4. Run the workflow locally with the test profile

Conveniently, profiles are not mutually exclusive, so we can specify multiple profiles in our command line using the syntax `-profile <profile1>,<profile2>` (for any number of profiles).

!!! note

    If you combine profiles that set values for the same elements of configuration and are described in the same configuration file, Nextflow will resolve the conflict by using whichever value it read in last (_i.e._ whatever comes later in the file).
    If the conflicting settings are set in different configuration sources, the default [order of precedence](https://www.nextflow.io/docs/latest/config.html) applies.

Let's try adding the test profile to our previous command:

```bash
nextflow run hello-config.nf -profile my_laptop,test
```

This should produce the following:

```console title="Output"
 N E X T F L O W ~ version 25.04.3

Launching `hello-config.nf` [gigantic_brazil] DSL2 - revision: ede9037d02

executor > local (8)
[58/da9437] sayHello (3) | 3 of 3 ✔
[35/9cbe77] convertToUpper (2) | 3 of 3 ✔
[67/857d05] collectGreetings | 1 of 1 ✔
[37/7b51b5] cowpy | 1 of 1 ✔
There were 3 greetings in this batch
```

This means that as long as we distribute any test data files with the workflow code, anyone can quickly try out the workflow without having to supply their own inputs via the command line or a parameter file.

!!! note

    We can even point to URLs for larger files that are stored externally.
    Nextflow will download them automatically as long as there is an open connection.

### Takeaway

You know how to use profiles to select a preset configuration at runtime with minimal hassle. More generally, you know how to configure your workflow executions to suit different compute platforms and enhance the reproducibility of your analyses.

### What's next?
+ +TODO: update next steps diff --git a/docs/nextflow_run/03_run_nf-core.md b/docs/nextflow_run/03_run_nf-core.md deleted file mode 100644 index 0b78d3b73..000000000 --- a/docs/nextflow_run/03_run_nf-core.md +++ /dev/null @@ -1,1380 +0,0 @@ -# Part 3: Run nf-core - -nf-core is a community effort to develop and maintain a curated set of analysis pipelines built using Nextflow. - -![nf-core logo](../nf_customize/img/nf-core-logo.png) - -nf-core provides a standardized set of best practices, guidelines, and templates for building and sharing scientific pipelines. -These pipelines are designed to be modular, scalable, and portable, allowing researchers to easily adapt and execute them using their own data and compute resources. - -One of the key benefits of nf-core is that it promotes open development, testing, and peer review, ensuring that the pipelines are robust, well-documented, and validated against real-world datasets. -This helps to increase the reliability and reproducibility of scientific analyses and ultimately enables researchers to accelerate their scientific discoveries. - -nf-core is published in Nature Biotechnology: [Nat Biotechnol 38, 276–278 (2020). Nature Biotechnology](https://www.nature.com/articles/s41587-020-0439-x). An updated preprint is available at [bioRxiv](https://www.biorxiv.org/content/10.1101/2024.05.10.592912v1). - -## nf-core pipelines and other components - -The nf-core collection currently offers [over 100 pipelines](https://nf-co.re/pipelines/) in various stages of development, [72 subworkflows](https://nf-co.re/subworkflows/) and [over 1300 modules](https://nf-co.re/modules/) that you can use to build your own pipelines. - -Each released pipeline has a dedicated page that includes 6 documentation sections: - -- **Introduction:** An introduction and overview of the pipeline -- **Usage:** Descriptions of how to execute the pipeline -- **Parameters:** Grouped pipeline parameters with descriptions -- **Output:** Descriptions and examples of the expected output files -- **Results:** Example output files generated from the full test dataset -- **Releases & Statistics:** Pipeline version history and statistics - -You should read the pipeline documentation carefully to understand what a given pipeline does and how it can be configured before attempting to run it. - -### Pulling an nf-core pipeline - -One really cool aspect of how Nextflow manages pipelines is that you can pull a pipeline from a GitHub repository without cloning the repository. -This is really convenient if you just want to run a pipeline without modifying the code. - -So if you want to try out an nf-core pipeline with minimal effort, you can start by pulling it using the `nextflow pull` command. - -!!!tip - - You can run this from anywhere, but if you feel like being consistent with previous exercises, you can create a `nf-core-demo` directory under `hello-nextflow`. If you were working through Part 7 (Hello nf-test) before this, you may need to go up one level first. - - ```bash - mkdir nf-core-demo - cd nf-core-demo - ``` - -Whenever you're ready, run the command: - -```bash -nextflow pull nf-core/demo -``` - -Nextflow will `pull` the pipeline's default GitHub branch. -For nf-core pipelines with a stable release, that will be the master branch. -You select a specific branch with `-r`; we'll cover that later. - -```console title="Output" -Checking nf-core/demo ... 
- downloaded from https://github.com/nf-core/demo.git - revision: 04060b4644 [master] -``` - -To be clear, you can do this with any Nextflow pipeline that is appropriately set up in GitHub, not just nf-core pipelines. -However nf-core is the largest open curated collection of Nextflow pipelines. - -!!!tip - - Pulled pipelines are stored in a hidden assets folder. By default, this folder is `$HOME/.nextflow/assets`, but in this training environment the folder has been set to `$NXF_HOME/assets`: - - ```bash - tree $NXF_HOME/assets/ -L 2 - ``` - - ```console title="Output" - /home/gitpod/.nextflow/assets/ - └── nf-core - └── demo - ``` - - So you don't actually see them listed in your working directory. - However, you can view a list of your cached pipelines using the `nextflow list` command: - - ```bash - nextflow list - ``` - - ```console title="Output" - nf-core/demo - ``` - -Now that we've got the pipeline pulled, we can try running it! - -### Trying out an nf-core pipeline with the test profile - -Conveniently, every nf-core pipeline comes with a `test` profile. -This is a minimal set of configuration settings for the pipeline to run using a small test dataset that is hosted on the [nf-core/test-datasets](https://github.com/nf-core/test-datasets) repository. It's a great way to try out a pipeline at small scale. - -The `test` profile for `nf-core/demo` is shown below: - -```groovy title="conf/test.config" linenums="1" -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Nextflow config file for running minimal tests -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - Defines input files and everything required to run a fast and simple pipeline test. - - Use as follows: - nextflow run nf-core/demo -profile test, --outdir - ----------------------------------------------------------------------------------------- -*/ - -process { - resourceLimits = [ - cpus: 4, - memory: '15.GB', - time: '1.h' - ] -} - -params { - config_profile_name = 'Test profile' - config_profile_description = 'Minimal test dataset to check pipeline function' - - // Input data - input = 'https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv' - -} -``` - -This tells us that the `nf-core/demo` `test` profile already specifies the input parameter, so you don't have to provide any input yourself. -However, the `outdir` parameter is not included in the `test` profile, so you have to add it to the execution command using the `--outdir` flag. - -Here, we're also going to specify `-profile docker`, which by nf-core convention enables the use of Docker. - -Lets' try it! - -```bash -nextflow run nf-core/demo -profile docker,test --outdir results -``` - -!!! hint "Changing Nextflow version" - - Depending on the Nextflow version you have installed, this command might fail due to a version mismatch. - If that happens, you can temporarily run the pipeline with a different version than you have installed by adding `NXF_VER=version` to the start of your command as shown below: - - ```bash - NXF_VER=24.09.2-edge nextflow run nf-core/demo -profile docker,test --outdir results - ``` - -Here's the console output from the pipeline: - -```console title="Output" - N E X T F L O W ~ version 24.09.2-edge - -Launching `https://github.com/nf-core/demo` [naughty_bell] DSL2 - revision: 04060b4644 [master] - - ------------------------------------------------------- - ,--./,-. 
- ___ __ __ __ ___ /,-._.--~' - |\ | |__ __ / ` / \ |__) |__ } { - | \| | \__, \__/ | \ |___ \`-._,-`-, - `._,._,' - nf-core/demo 1.0.1 ------------------------------------------------------- -Input/output options - input : https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv - outdir : results - -Institutional config options - config_profile_name : Test profile - config_profile_description: Minimal test dataset to check pipeline function - -Core Nextflow options - revision : master - runName : naughty_bell - containerEngine : docker - launchDir : /workspaces/training/hello-nextflow - workDir : /workspaces/training/hello-nextflow/work - projectDir : /home/gitpod/.nextflow/assets/nf-core/demo - userName : gitpod - profile : docker,test - configFiles : - -!! Only displaying parameters that differ from the pipeline defaults !! -------------------------------------------------------* The pipeline - https://doi.org/10.5281/zenodo.12192442 - -* The nf-core framework - https://doi.org/10.1038/s41587-020-0439-x - -* Software dependencies - https://github.com/nf-core/demo/blob/master/CITATIONS.md - -executor > local (7) -[0a/e694d8] NFCORE_DEMO:DEMO:FASTQC (SAMPLE3_SE) [100%] 3 of 3 ✔ -[85/4198c1] NFCORE_DEMO:DEMO:SEQTK_TRIM (SAMPLE1_PE) [100%] 3 of 3 ✔ -[d8/fe153e] NFCORE_DEMO:DEMO:MULTIQC [100%] 1 of 1 ✔ --[nf-core/demo] Pipeline completed successfully- -Completed at: 28-Oct-2024 03:24:58 -Duration : 1m 13s -CPU hours : (a few seconds) -Succeeded : 7 -``` - -Isn't that neat? - -You can also explore the `results` directory produced by the pipeline. - -```console title="Output" -results -├── fastqc -│ ├── SAMPLE1_PE -│ ├── SAMPLE2_PE -│ └── SAMPLE3_SE -├── fq -│ ├── SAMPLE1_PE -│ ├── SAMPLE2_PE -│ └── SAMPLE3_SE -├── multiqc -│ ├── multiqc_data -│ ├── multiqc_plots -│ └── multiqc_report.html -└── pipeline_info - ├── execution_report_2024-10-28_03-23-44.html - ├── execution_timeline_2024-10-28_03-23-44.html - ├── execution_trace_2024-10-28_03-14-32.txt - ├── execution_trace_2024-10-28_03-19-33.txt - ├── execution_trace_2024-10-28_03-20-57.txt - ├── execution_trace_2024-10-28_03-22-39.txt - ├── execution_trace_2024-10-28_03-23-44.txt - ├── nf_core_pipeline_software_mqc_versions.yml - ├── params_2024-10-28_03-23-49.json - └── pipeline_dag_2024-10-28_03-23-44.html -``` - -If you're curious about what that all means, check out [the nf-core/demo pipeline documentation page](https://nf-co.re/demo/1.0.1/)! - -And that's all you need to know for now. -Congratulations! You have now run your first nf-core pipeline. - -### Takeaway - -You have a general idea of what nf-core offers and you know how to run an nf-core pipeline using its built-in test profile. - -### What's next? - -Celebrate and take another break! Next, we'll show you how to use nf-core tooling to build your own pipeline. - -## Create a basic pipeline from template - -We will now start developing our own nf-core style pipeline. The nf-core community provides a [command line tool](https://nf-co.re/docs/nf-core-tools) with helper functions to use and develop pipelines. -We have pre-installed nf-core tools, and here, we will use them to create and develop a new pipeline. - -View all of the tooling using the `nf-core --help` argument. - -```bash -nf-core --help -``` - -### Creating your pipeline - -Before we start, let's navigate into the `hello-nf-core` directory: - -``` -cd .. -cd hello-nf-core -``` - -!!! 
hint "Open a new window in VSCode" - - If you are working with VS Code you can open a new window to reduce visual clutter: - - ```bash - code . - ``` - -Let's start by creating a new pipeline with the `nf-core pipelines create` command: - -All nf-core pipelines are based on a common template, a standardized pipeline skeleton that can be used to streamline development with shared features and components. - -The `nf-core pipelines create` command creates a new pipeline using the nf-core base template with a pipeline name, description, and author. It is the first and most important step for creating a pipeline that will integrate with the wider Nextflow ecosystem. - -```bash -nf-core pipelines create -``` - -Running this command will open a Text User Interface (TUI) for pipeline creation. - -
- -
- -Template features can be flexibly included or excluded at the time of creation, follow these steps create your first pipeline using the `nf-core pipelines create` TUI: - -1. Run the `nf-core pipelines create` command -2. Select **Let's go!** on the welcome screen -3. Select **Custom** on the Choose pipeline type screen -4. Enter your pipeline details, replacing < YOUR NAME > with your own name, then select **Next** - - - **GitHub organisation:** myorg - - **Workflow name:** myfirstpipeline - - **A short description of your pipeline:** My first pipeline - - **Name of the main author / authors:** < YOUR NAME > - -5. On the Template features screen, turn **off**: - - - `Use a GitHub repository` - - `Add GitHub CI tests` - - `Use reference genomes` - - `Add GitHub badges` - - `Include citations` - - `Include a gitpod environment` - - `Include GitHub Codespaces` - - `Use fastqc` - - `Add a changelog` - - `Support Microsoft Teams notifications` - - `Support Slack notifications` - -6. Select **Finish** on the Final details screen -7. Wait for the pipeline to be created, then select **Continue** -8. Select **Finish without creating a repo** on the Create GitHub repository screen -9. Select **Close** on the HowTo create a GitHub repository page - -If run successfully, you will see a new folder in your current directory named `myorg-myfirstpipeline`. - -### Testing your pipeline - -Let's try to run our new pipeline: - -```bash -cd myorg-myfirstpipeline -nextflow run . -profile docker,test --outdir results -``` - -The pipeline should run successfully! - -Here's the console output from the pipeline: - -```console title="Output" -Launching `./main.nf` [marvelous_saha] DSL2 - revision: a633aedb88 - -Input/output options - input : https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv - outdir : results - -Institutional config options - config_profile_name : Test profile - config_profile_description: Minimal test dataset to check pipeline function - -Core Nextflow options - runName : marvelous_saha - containerEngine : docker - launchDir : /workspaces/training/hello-nextflow/hello-nf-core/myorg-myfirstpipeline - workDir : /workspaces/training/hello-nextflow/hello-nf-core/myorg-myfirstpipeline/work - projectDir : /workspaces/training/hello-nextflow/hello-nf-core/myorg-myfirstpipeline - userName : gitpod - profile : docker,test - configFiles : - -!! Only displaying parameters that differ from the pipeline defaults !! ------------------------------------------------------- -executor > local (1) -[ba/579181] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ --[myorg/myfirstpipeline] Pipeline completed successfully- -``` - -Let's dissect what we are seeing. - -The nf-core pipeline template is a working pipeline and comes preconfigured with some modules. Here, we only run [MultiQC](https://multiqc.info/) - -At the top, you see all parameters displayed that differ from the pipeline defaults. Most of these are default or were set by applying the `test` profile. - -Additionally we used the `docker` profile to use docker for software packaging. nf-core provides this as a profile for convenience to enable the docker feature but we could do it with configuration as we did with the earlier module. - -### Template tour - -The nf-core pipeline template comes packed with a lot of files and folders. While creating the pipeline, we selected a subset of the nf-core features. 
The features we selected are now included as files and directories in our repository. - -While the template may feel overwhelming, a complete understanding isn't required to start developing your pipeline. Let's look at the important places that we need to touch during pipeline development. - -#### Workflows, subworkflows, and modules - -The nf-core pipeline template has a `main.nf` script that calls `myfirstpipeline.nf` from the `workflows` folder. The `myfirstpipeline.nf` file inside the workflows folder is the central pipeline file that is used to bring everything else together. - -Instead of having one large monolithic pipeline script, it's broken up into smaller script components, namely, modules and subworkflows: - -- **Modules:** Wrappers around a single process -- **Subworkflows:** Two or more modules that are packaged together as a mini workflow - -
- --8<-- "docs/hello_nextflow/img/nested.excalidraw.svg" -
- -Within your pipeline repository, `modules` and `subworkflows` are stored within `local` and `nf-core` folders. The `nf-core` folder is for components that have come from the nf-core GitHub repository while the `local` folder is for components that have been developed independently (usually things very specific to a pipeline): - -```console -modules/ -├── local -│ └── .nf -│ . -│ -└── nf-core - ├── - │ ├── environment.yml - │ ├── main.nf - │ ├── meta.yml - │ └── tests - │ ├── main.nf.test - │ ├── main.nf.test.snap - │ └── tags.yml - . -``` - -Modules from nf-core follow a similar structure and contain a small number of additional files for testing using [nf-test](https://www.nf-test.com/) and documentation about the module. - -!!!note - - Some nf-core modules are also split into command specific directories: - - ```console - │ - └── - └── - ├── environment.yml - ├── main.nf - ├── meta.yml - └── tests - ├── main.nf.test - ├── main.nf.test.snap - └── tags.yml - ``` - -!!!note - - The nf-core template does not come with a local modules folder by default. - -#### Configuration files - -The nf-core pipeline template utilizes Nextflow's flexible customization options and has a series of configuration files throughout the template. - -In the template, the `nextflow.config` file is a central configuration file and is used to set default values for parameters and other configuration options. The majority of these configuration options are applied by default while others (e.g., software dependency profiles) are included as optional profiles. - -There are several configuration files that are stored in the `conf` folder and are added to the configuration by default or optionally as profiles: - -- `base.config`: A 'blank slate' config file, appropriate for general use on most high-performance computing environments. This defines broad bins of resource usage, for example, which are convenient to apply to modules. -- `modules.config`: Additional module directives and arguments. -- `test.config`: A profile to run the pipeline with minimal test data. -- `test_full.config`: A profile to run the pipeline with a full-sized test dataset. - -#### `nextflow_schema.json` - -The `nextflow_schema.json` is a file used to store parameter related information including type, description and help text in a machine readable format. The schema is used for various purposes, including automated parameter validation, help text generation, and interactive parameter form rendering in UI interfaces. - -#### `assets/schema_input.json` - -The `schema_input.json` is a file used to define the input samplesheet structure. Each column can have a type, pattern, description and help text in a machine readable format. The schema is used for various purposes, including automated validation, and providing helpful error messages. - -### Takeaway - -You have an example pipeline, and learned about important template files. - -### What's next? - -Congratulations! In the next step, we will check the input data. - ---- - -## Check the input data - -Above, we said that the `test` profile comes with small test files that are stored in the nf-core. Let's check what type of files we are dealing with to plan our expansion. Remember that we can inspect any channel content using the `view` operator: - -```groovy title="workflows/myfirstpipeline.nf" linenums="27" -ch_samplesheet.view() -``` - -and the run command: - -```bash -nextflow run . -profile docker,test --outdir results -``` - -The output should look like the below. 
We see that we have FASTQ files as input and each set of files is accompanied by some metadata: the `id` and whether or not they are single end: - -```console title="Output" -[['id':'SAMPLE1_PE', 'single_end':false], [/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz, /nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz]] -[['id':'SAMPLE2_PE', 'single_end':false], [/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz, /nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz]] -[['id':'SAMPLE3_SE', 'single_end':true], [/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz, /nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz]] -``` - -You can comment the `view` statement for now. We will use later during this training to inspect the channel content again. - -### Takeaway - -You have learned how input data is supplied via a samplesheet. - -### What's next? - -In the next step we will start changing the code and add new tools to the pipeline. - ---- - -## Add an nf-core module - -nf-core provides a large library of modules and subworkflows: pre-made nextflow wrappers around tools that can be installed into nextflow pipelines. They are designed to be flexible but may require additional configuration to suit different use cases. - -Currently, there are more than [1300 nf-core modules](https://nf-co.re/modules) and [60 nf-core subworkflows](https://nf-co.re/subworkflows) (November 2024) available. Modules and subworkflows can be listed, installed, updated, removed, and patched using nf-core tooling. - -While you could develop a module for this tool independently, you can save a lot of time and effort by leveraging nf-core modules and subworkflows. - -Let's see which modules are available: - -```console -nf-core modules list remote -``` - -This command lists all currently available modules, > 1300. An easier way to find them is to go to the nf-core website and visit the modules subpage [https://nf-co.re/modules](https://nf-co.re/modules). Here you can search for modules by name or tags, find documentation for each module, and see which nf-core pipeline are using the module: - -![nf-core/modules](img/nf-core-modules.png) - -### Install an nf-core module - -Now let's add another tool to the pipeline. - -`Seqtk` is a fast and lightweight tool for processing sequences in the FASTA or FASTQ format. Here, you will use the [`seqtk trim`](https://github.com/lh3/seqtk) command to trim FASTQ files. - -In your pipeline, you will add a new step that will take FASTQ files from the sample sheet as inputs and will produce trimmed fastq files that can be used as an input for other tools and version information about the seqtk tools to mix into the inputs for the MultiQC process. - -
- --8<-- "docs/hello_nextflow/img/pipeline.excalidraw.svg" -
- -The `nf-core modules install` command can be used to install the `seqtk/trim` module directly from the nf-core repository: - -``` -nf-core modules install -``` - -!!!warning - - You need to be in the myorg-myfirstpipeline directory when executing `nf-core modules install` - -You can follow the prompts to find and install the module you are interested in: - -```console -? Tool name: seqtk/trim -``` - -Once selected, the tooling will install the module in the `modules/nf-core/` folder and suggest code that you can add to your main workflow file (`workflows/myfirstpipeline.nf`). - -```console -INFO Installing 'seqtk/trim' -INFO Use the following statement to include this module: - -include { SEQTK_TRIM } from '../modules/nf-core/seqtk/trim/main' -``` - -To enable reporting and reproducibility, modules and subworkflows from the nf-core repository are tracked using hashes in the `modules.json` file. When modules are installed or removed using the nf-core tooling the `modules.json` file will be automatically updated. - -When you open the `modules.json`, you will see an entry for each module that is currently installed from the nf-core modules repository. You can open the file with the VS Code user interface by clicking on it in `myorg-myfirstpipeline/modules.json`: - -```console -"nf-core": { - "multiqc": { - "branch": "master", - "git_sha": "cf17ca47590cc578dfb47db1c2a44ef86f89976d", - "installed_by": ["modules"] - }, - "seqtk/trim": { - "branch": "master", - "git_sha": "666652151335353eef2fcd58880bcef5bc2928e1", - "installed_by": ["modules"] - } -} -``` - -### Add the module to your pipeline - -Although the module has been installed in your local pipeline repository, it is not yet added to your pipeline. - -The suggested `include` statement needs to be added to your `workflows/myfirstpipeline.nf` file and the process call (with inputs) needs to be added to the workflow block. - -```groovy title="workflows/myfirstpipeline.nf" linenums="6" -include { SEQTK_TRIM } from '../modules/nf-core/seqtk/trim/main' -include { MULTIQC } from '../modules/nf-core/multiqc/main' -``` - -To add the `SEQTK_TRIM` module to your workflow you will need to check what inputs are required. - -You can view the input channels for the module by opening the `./modules/nf-core/seqtk/trim/main.nf` file. - -```groovy title="modules/nf-core/seqtk/trim/main.nf" linenums="11" -input: -tuple val(meta), path(reads) -``` - -Each nf-core module also has a `meta.yml` file which describes the inputs and outputs. This meta file is rendered on the [nf-core website](https://nf-co.re/modules/seqtk_trim), or can be viewed using the `nf-core modules info` command: - -```console -nf-core modules info seqtk/trim -``` - -It outputs a table with all defined inputs and outputs of the module: - -```console title="Output" - -╭─ Module: seqtk/trim ─────────────────────────────────────────────────────────────────────────────╮ -│ Location: modules/nf-core/seqtk/trim │ -│ 🔧 Tools: seqtk │ -│ 📖 Description: Trim low quality bases from FastQ files │ -╰───────────────────────────────────────────────────────────────────────────────────────────────────╯ - ╷ ╷ - 📥 Inputs │Description │ Pattern -╺━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━╸ - input[0] │ │ -╶──────────────┼───────────────────────────────────────────────────────────────────────┼────────────╴ - meta (map) │Groovy Map containing sample information e.g. 
[ id:'test', │ - │single_end:false ] │ -╶──────────────┼───────────────────────────────────────────────────────────────────────┼────────────╴ - reads (file)│List of input FastQ files │*.{fastq.gz} - ╵ ╵ - ╷ ╷ - 📥 Outputs │Description │ Pattern -╺━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┿━━━━━━━━━━━━╸ - reads │ │ -╶─────────────────────┼────────────────────────────────────────────────────────────────┼────────────╴ - meta (map) │Groovy Map containing sample information e.g. [ id:'test', │ - │single_end:false ] │ -╶─────────────────────┼────────────────────────────────────────────────────────────────┼────────────╴ - *.fastq.gz (file) │Filtered FastQ files │*.{fastq.gz} -╶─────────────────────┼────────────────────────────────────────────────────────────────┼────────────╴ - versions │ │ -╶─────────────────────┼────────────────────────────────────────────────────────────────┼────────────╴ - versions.yml (file)│File containing software versions │versions.yml - ╵ ╵ - - Use the following statement to include this module: - - include { SEQTK_TRIM } from '../modules/nf-core/seqtk/trim/main' -``` - -Using this module information you can work out what inputs are required for the `SEQTK_TRIM` process: - -1. `tuple val(meta), path(reads)` - - - A tuple with a meta _map_ and a list of FASTQ _files_ - - The channel `ch_samplesheet` used by the `FASTQC` process can be used as the reads input. - -Only one input channel is required, and it already exists, so it can be added to your `firstpipeline.nf` file without any additional channel creation or modifications. - -_Before:_ - -```groovy title="workflows/myfirstpipeline.nf" linenums="30" -// -// Collate and save software versions -// -``` - -_After:_ - -```groovy title="workflows/myfirstpipeline.nf" linenums="29" -// -// MODULE: Run SEQTK_TRIM -// -SEQTK_TRIM ( - ch_samplesheet -) -// -// Collate and save software versions -// -``` - -Let's test it: - -```bash -nextflow run . -profile docker,test --outdir results -``` - -```console title="Output" -Launching `./main.nf` [drunk_waddington] DSL2 - revision: a633aedb88 - -Input/output options - input : https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/samplesheet/samplesheet_test_illumina_amplicon.csv - outdir : results - -Institutional config options - config_profile_name : Test profile - config_profile_description: Minimal test dataset to check pipeline function - -Core Nextflow options - runName : drunk_waddington - containerEngine : docker - launchDir : /workspaces/training/hello-nextflow/hello-nf-core/myorg-myfirstpipeline - workDir : /workspaces/training/hello-nextflow/hello-nf-core/myorg-myfirstpipeline/work - projectDir : /workspaces/training/hello-nextflow/hello-nf-core/myorg-myfirstpipeline - userName : gitpod - profile : docker,test - configFiles : - -!! Only displaying parameters that differ from the pipeline defaults !! ------------------------------------------------------- -executor > local (4) -[74/9b2e7b] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:SEQTK_TRIM (SAMPLE2_PE) [100%] 3 of 3 ✔ -[ea/5ca001] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ --[myorg/myfirstpipeline] Pipeline completed successfully- -``` - -### Inspect results folder - -Default nf-core configuration directs the output of each process into the `/`. 
After running the previous command, you -should have a `results` folder that looks something like this: - -```console -results -├── multiqc -│ ├── multiqc_data -│ └── multiqc_report.html -├── pipeline_info -│ ├── execution_report_2024-11-14_12-07-43.html -│ ├── execution_report_2024-11-14_12-12-42.html -│ ├── execution_report_2024-11-14_12-13-58.html -│ ├── execution_report_2024-11-14_12-28-59.html -│ ├── execution_timeline_2024-11-14_12-07-43.html -│ ├── execution_timeline_2024-11-14_12-12-42.html -│ ├── execution_timeline_2024-11-14_12-13-58.html -│ ├── execution_timeline_2024-11-14_12-28-59.html -│ ├── execution_trace_2024-11-14_12-07-43.txt -│ ├── execution_trace_2024-11-14_12-12-42.txt -│ ├── execution_trace_2024-11-14_12-13-58.txt -│ ├── execution_trace_2024-11-14_12-28-59.txt -│ ├── params_2024-11-14_12-07-44.json -│ ├── params_2024-11-14_12-12-43.json -│ ├── params_2024-11-14_12-13-59.json -│ ├── params_2024-11-14_12-29-00.json -│ ├── pipeline_dag_2024-11-14_12-07-43.html -│ ├── pipeline_dag_2024-11-14_12-12-42.html -│ ├── pipeline_dag_2024-11-14_12-13-58.html -│ ├── pipeline_dag_2024-11-14_12-28-59.html -│ └── pipeline_software_mqc_versions.yml -└── seqtk - ├── SAMPLE1_PE_sample1_R1.fastq.gz - ├── SAMPLE1_PE_sample1_R2.fastq.gz - ├── SAMPLE2_PE_sample2_R1.fastq.gz - ├── SAMPLE2_PE_sample2_R2.fastq.gz - ├── SAMPLE3_SE_sample1_R1.fastq.gz - └── SAMPLE3_SE_sample2_R1.fastq.gz -``` - -The outputs from the `multiqc` and `seqtk` modules are published in their respective subdirectories. In addition, by default,`nf-core' pipelines generate a set of reports. These files are stored in the`pipeline_info` subdirectory and time-stamped so that runs don't overwrite each other. - -### Handle modules output - -As with the inputs, you can view the outputs for the module by opening the `/modules/nf-core/seqtk/trim/main.nf` file and viewing the module metadata. - -```groovy title="modules/nf-core/seqtk/trim/main.nf" linenums="13" -output: -tuple val(meta), path("*.fastq.gz"), emit: reads -path "versions.yml" , emit: versions -``` - -To help with organization and readability it is beneficial to create named output channels. - -For `SEQTK_TRIM`, the `reads` output could be put into a channel named `ch_trimmed`. - -```groovy title="workflows/myfirstpipeline.nf" linenums="35" -ch_trimmed = SEQTK_TRIM.out.reads -``` - -Similarly, it is beneficial to immediately mix the tool versions into the `ch_versions` channel so they can be used as input for the `MULTIQC` process and passed to the final report. - -```groovy title="workflows/myfirstpipeline.nf" linenums="35" -ch_versions = ch_versions.mix(SEQTK_TRIM.out.versions.first()) -``` - -!!! note - - The `first` operator is used to emit the first item from `SEQTK_TRIM.out.versions` to avoid duplication. - -### Add a parameter to the `seqtk/trim` tool - -nf-core modules should be flexible and usable across many different pipelines. Therefore, tool parameters are typically not set in an nf-core/module. Instead, additional configuration options on how to run the tool, like its parameters or filename, can be applied to a module using the `conf/modules.config` file on the pipeline level. Process selectors (e.g., `withName`) are used to apply configuration options to modules selectively. Process selectors must be used within the `process` scope. - -The parameters or arguments of a tool can be changed using the directive `args`. 
You can find many examples of how arguments are added to modules in nf-core pipelines, for example, the nf-core/demo [modules.config](https://github.com/nf-core/demo/blob/master/conf/modules.config) file. - -Add this snippet to your `conf/modules.config` file (using the `params` scope) to call the `seqtk/trim` tool with the argument `-b 5` to trim 5 bp from the left end of each read: - -```console title="conf/modules.config" linenums="21" -withName: 'SEQTK_TRIM' { - ext.args = "-b 5" -} -``` - -Run the pipeline again and check if the new parameter is applied: - -```bash -nextflow run . -profile docker,test --outdir results - -[6c/34e549] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:SEQTK_TRIM (SAMPLE1_PE) [100%] 3 of 3 ✔ -[27/397ccf] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ -``` - -Copy the hash you see in your console output (here `6c/34e549`; it is different for _each_ run). You can `ls` using tab-completion in your `work` directory to expand the complete hash. -In this folder you will find various log files. The `.command.sh` file contains the resolved command: - -```bash -less work/6c/34e549912696b6757f551603d135bb/.command.sh -``` - -We can see, that the parameter `-b 5`, that we set in the `modules.config` is applied to the task: - -```console title="Output" -#!/usr/bin/env bash - -set -e # Exit if a tool returns a non-zero status/exit code -set -u # Treat unset variables and parameters as an error -set -o pipefail # Returns the status of the last command to exit with a non-zero status or zero if all successfully execute -set -C # No clobber - prevent output redirection from overwriting files. - -printf "%s\n" sample1_R1.fastq.gz sample1_R2.fastq.gz | while read f; -do - seqtk \ - trimfq \ - -b 5 \ - $f \ - | gzip --no-name > SAMPLE1_PE_$(basename $f) -done - -cat <<-END_VERSIONS > versions.yml -"MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:SEQTK_TRIM": - seqtk: $(echo $(seqtk 2>&1) | sed 's/^.*Version: //; s/ .*$//') -END_VERSIONS -``` - -### Takeaway - -You have now added a nf-core/module to your pipeline, configured it with a particular parameter, and made the output available in the workflow. - -### What's next? - -In the next step we will add a pipeline parameter to allow users to skip the trimming step. - ---- - -## Adding parameters to your pipeline - -Anything that a pipeline user may want to configure regularly should be made into a parameter so it can easily be overridden. nf-core defines some standards for providing parameters. - -Here, as a simple example, you will add a new parameter to your pipeline that will skip the `SEQTK_TRIM` process. - -Parameters are accessible in the pipeline script. - -### Default values - -In the nf-core template the default values for parameters are set in the `nextflow.config` in the base repository. - -Any new parameters should be added to the `nextflow.config` with a default value within the `params` scope. - -Parameter names should be unique and easily identifiable. - -We can a new parameter `skip_trim` to your `nextflow.config` file and set it to `false`. - -```groovy title="nextflow.config" linenums="16" -// Trimming -skip_trim = false -``` - -### Adding parameters to your pipeline - -Here, an `if` statement that is depended on the `skip_trim` parameter can be used to control the execution of the `SEQTK_TRIM` process. An `!` can be used to imply the logical "not". - -Thus, if the `skip_trim` parameter is **not** `true`, the `SEQTK_TRIM` will be be executed. 
- -```groovy title="workflows/myfirstpipeline.nf" linenums="29" -// -// MODULE: Run SEQTK_TRIM -// -if (!params.skip_trim) { - SEQTK_TRIM ( - ch_samplesheet - ) - ch_trimmed = SEQTK_TRIM.out.reads - ch_versions = ch_versions.mix(SEQTK_TRIM.out.versions.first()) -} -``` - -Now your if statement has been added to your main workflow file and has a default setting in your `nextflow.config` file, you will be able to flexibly skip the new trimming step using the `skip_trim` parameter. - -We can now run the pipeline with the new `skip_trim` parameter to check it is working: - -```console -nextflow run . -profile test,docker --outdir results --skip_trim -``` - -You should see that the `SEQTK_TRIM` process has been skipped in your execution: - -```console title="Output" -!! Only displaying parameters that differ from the pipeline defaults !! ------------------------------------------------------- -WARN: The following invalid input values have been detected: - -* --skip_trim: true - - -executor > local (1) -[7b/8b60a0] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ --[myorg/myfirstpipeline] Pipeline completed successfully- -``` - -### Validate input parameters - -When we ran the pipeline, we saw a warning message: - -```console -WARN: The following invalid input values have been detected: - -* --skip_trim: true -``` - -Parameters are validated through the `nextflow_schema.json` file. This file is also used by the nf-core website (for example, in [nf-core/mag](https://nf-co.re/mag/3.2.1/parameters/)) to render the parameter documentation and print the pipeline help message (`nextflow run . --help`). If you have added parameters and they have not been documented in the `nextflow_schema.json` file, then the input validation does not recognize the parameter. - -The `nextflow_schema.json` file can get very big and very complicated very quickly. - -The `nf-core pipelines schema build` command is designed to support developers write, check, validate, and propose additions to your `nextflow_schema.json` file. - -```console -nf-core pipelines schema build -``` - -It will enable you to launch a web builder to edit this file in your web browser rather than trying to edit this file manually. - -```console -INFO [✓] Default parameters match schema validation -INFO [✓] Pipeline schema looks valid (found 20 params) -✨ Found 'params.skip_trim' in the pipeline config, but not in the schema. Add to pipeline schema? [y/n]: y -INFO Writing schema with 21 params: 'nextflow_schema.json' -🚀 Launch web builder for customization and editing? [y/n]: y -``` - -Using the web builder you can add add details about your new parameters. - -The parameters that you have added to your pipeline will be added to the bottom of the `nf-core pipelines schema build` file. Some information about these parameters will be automatically filled based on the default value from your `nextflow.config`. You will be able to categorize your new parameters into a group, add icons, and add descriptions for each. - -![Pipeline parameters](img/pipeline_schema.png) - -!!!note - - Ungrouped parameters in schema will cause a warning. - -Once you have made your edits you can click `Finished` and all changes will be automatically added to your `nextflow_schema.json` file. - -If you rerun the previous command, the warning should disappear: - -```console -nextflow run . -profile test,docker --outdir results --skip_trim - - -!! Only displaying parameters that differ from the pipeline defaults !! 
------------------------------------------------------- -executor > local (1) -[6c/c78d0c] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ --[myorg/myfirstpipeline] Pipeline completed successfully- -``` - -### Takeaway - -You have added a new parameter to the pipeline, and learned how to use nf-core tools to describe it in the pipeline schema. - -### What's next? - -In the next step we will take a look at how we track metadata related to an input file. - ---- - -## Meta maps - -Datasets often contain additional information relevant to the analysis, such as a sample name, information about sequencing protocols, or other conditions needed in the pipeline to process certain samples together, determine their output name, or adjust parameters. - -By convention, nf-core tracks this information as `meta` maps. These are `key`-`value` pairs that are passed into modules together with the files. We already saw this briefly when inspecting the `input` for `seqtk`: - -```groovy title="modules/nf-core/seqtk/trim/main.nf" linenums="11" -input: -tuple val(meta), path(reads) -``` - -If we uncomment our earlier `view` statement: - -```groovy title="workflows/myfirstpipeline.nf" linenums="27" -ch_samplesheet.view() -``` - -and run the pipeline again, we can see the current content of the `meta` maps: - -```console title="meta map" -[[id:SAMPLE1_PE, single_end:false], ....] -``` - -You can add any field that you require to the `meta` map. By default, nf-core modules expect an `id` field. - -### Takeaway - -You know that a `meta` map is used to pass along additional information for a sample. - -### What's next? - -In the next step we will take a look how we can add a new key to the `meta` map using the samplesheet. - ---- - -## Simple Samplesheet adaptations - -nf-core pipelines typically use samplesheets as inputs to the pipelines. This allows us to: - -- validate each entry and print specific error messages. -- attach information to each input file. -- track which datasets are processed. - -Samplesheets are comma-separated text files with a header row specifying the column names, followed by one entry per row. For example, the samplesheet that we have been using during this teaching module looks like this: - -```csv title="samplesheet_test_illumina_amplicon.csv" -sample,fastq_1,fastq_2 -SAMPLE1_PE,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R2.fastq.gz -SAMPLE2_PE,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R2.fastq.gz -SAMPLE3_SE,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample1_R1.fastq.gz, -SAMPLE3_SE,https://raw.githubusercontent.com/nf-core/test-datasets/viralrecon/illumina/amplicon/sample2_R1.fastq.gz, -``` - -The structure of the samplesheet is specified in its own schema file in `assets/schema_input.json`. 
Each column has its own entry together with information about the column: - -```json title="schema_input.json" -"properties": { - "sample": { - "type": "string", - "pattern": "^\\S+$", - "errorMessage": "Sample name must be provided and cannot contain spaces", - "meta": ["id"] - }, - "fastq_1": { - "type": "string", - "format": "file-path", - "exists": true, - "pattern": "^\\S+\\.f(ast)?q\\.gz$", - "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" - }, - "fastq_2": { - "type": "string", - "format": "file-path", - "exists": true, - "pattern": "^\\S+\\.f(ast)?q\\.gz$", - "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" - } -}, -"required": ["sample", "fastq_1"] -``` - -This validates that the samplesheet has at least two columns: `sample` and `fastq1` (`"required": ["sample", "fastq_1"]`). It also checks that `fastq1` and `fastq2` are files, and that the file endings match a particular pattern. -Lastly, `sample` is information about the files that we want to attach and pass along the pipeline. nf-core uses `meta` maps for this: objects that have a key and a value. We can indicate this in the schema file directly by using the meta field: - -```json title="Sample column" - "sample": { - "type": "string", - "pattern": "^\\S+$", - "errorMessage": "Sample name must be provided and cannot contain spaces", - "meta": ["id"] - }, -``` - -This sets the key name as `id` and the value that is in the `sample` column, for example `SAMPLE1_PE`: - -```console title="meta" -[id: SAMPLE1_PE] -``` - -By adding a new entry into the JSON schema, we can attach additional meta information that we want to track. This will automatically validate it for us and add it to the meta map. - -Let's add some new meta information, like the `sequencer` as an optional column: - -```json title="assets/schema_input.json" -"properties": { - "sample": { - "type": "string", - "pattern": "^\\S+$", - "errorMessage": "Sample name must be provided and cannot contain spaces", - "meta": ["id"] - }, - "sequencer": { - "type": "string", - "pattern": "^\\S+$", - "meta": ["sequencer"] - }, - "fastq_1": { - "type": "string", - "format": "file-path", - "exists": true, - "pattern": "^\\S+\\.f(ast)?q\\.gz$", - "errorMessage": "FastQ file for reads 1 must be provided, cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" - }, - "fastq_2": { - "type": "string", - "format": "file-path", - "exists": true, - "pattern": "^\\S+\\.f(ast)?q\\.gz$", - "errorMessage": "FastQ file for reads 2 cannot contain spaces and must have extension '.fq.gz' or '.fastq.gz'" - } -}, -"required": ["sample", "fastq_1"] -``` - -We can now run our normal tests with the old samplesheet: - -```console -nextflow run . -profile docker,test --outdir results -``` - -The meta map now has a new key `sequencer`, that is empty because we did not specify a value yet: - -```console title="output" -[['id':'SAMPLE1_PE', 'sequencer':[], 'single_end':false], ... ] -[['id':'SAMPLE2_PE', 'sequencer':[], 'single_end':false], ... ] -[['id':'SAMPLE3_SE', 'sequencer':[], 'single_end':true], ... ] -``` - -We have also prepared a new samplesheet, that has the `sequencer` column. You can overwrite the existing input with this command: - -```console -nextflow run . 
-profile docker,test --outdir results --input ../data/sequencer_samplesheet.csv -``` - -This populates the `sequencer` field, and we can see it in the pipeline when `view`ing the samplesheet channel: - -```console title="output" -[['id':'SAMPLE1_PE', 'sequencer':'sequencer1', 'single_end':false], ... ] -[['id':'SAMPLE2_PE', 'sequencer':'sequencer2', 'single_end':false], ... ] -[['id':'SAMPLE3_SE', 'sequencer':'sequencer3', 'single_end':true], ... ] -``` - -We can comment out the `ch_samplesheet.view()` line or remove it. We are not going to use it anymore in this training section. - -### Use the new meta key in the pipeline - -We can access this new meta value in the pipeline and use it to, for example, only enable trimming for samples from a particular sequencer. The [branch operator](https://www.nextflow.io/docs/stable/reference/operator.html#branch) lets us split -an input channel into several new output channels based on selection criteria: - -```groovy title="workflows/myfirstpipeline.nf" linenums="35" -ch_seqtk_in = ch_samplesheet.branch { meta, reads -> - to_trim: meta["sequencer"] == "sequencer2" - other: true -} - -SEQTK_TRIM ( - ch_seqtk_in.to_trim -) -``` - -If we now rerun our default test, no reads are trimmed (even though we did not specify `--skip_trim`): - -```console -nextflow run . -profile docker,test --outdir results - -[- ] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:SEQTK_TRIM - -[5a/f580bc] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ -``` - -If we use the samplesheet with the `sequencer` set, only one sample will be trimmed: - -```console -nextflow run . -profile docker,test --outdir results --input ../data/sequencer_samplesheet.csv -resume - -[47/fdf9de] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:SEQTK_TRIM (SAMPLE2_PE) [100%] 1 of 1 ✔ -[2a/a742ae] process > MYORG_MYFIRSTPIPELINE:MYFIRSTPIPELINE:MULTIQC [100%] 1 of 1 ✔ -``` - -If you want to learn more about how to fine-tune and develop the samplesheet schema further, visit [nf-schema](https://nextflow-io.github.io/nf-schema/2.2/nextflow_schema/sample_sheet_schema_specification/). - -### Takeaway - -You know how to adapt the samplesheet to add new meta information to your files. - -### What's next? - -In the next step we will add a module that is not yet in nf-core. - ---- - -## Create a custom module for your pipeline - -nf-core offers a comprehensive set of modules that have been created and curated by the community. However, as a developer, you may be interested in bespoke pieces of software that are not part of the nf-core repository, or in customizing a module that already exists. - -In this instance, we will write a local module for the QC tool [FastQE](https://fastqe.com/), which computes stats for FASTQ files and prints those stats as emoji. - -This section should feel similar to the `hello_modules` section. - -### Create the module - -!!! note "New module contributions are always welcome and encouraged!" - - If you have a module that you would like to contribute back to the community, reach out on the nf-core Slack or open a pull request to the modules repository. - -Start by using the nf-core tooling to create a skeleton local module: - -```console -nf-core modules create -``` - -It will ask you to enter the tool name and some configurations for the module. 
We will use the defaults here: - - - Specify the tool name: `Name of tool/subtool: fastqe` - - Add the author name: `GitHub Username: (@):` - - Accept the defaults for the remaining prompts by pressing `enter` - -This will create a new file in `modules/local/fastqe.nf` that already contains the container and conda definitions, the general structure of the process, and a number of TODO statements to guide you through the adaptation. - -!!! warning - - If the module already exists locally, the command will fail to prevent you from accidentally overwriting existing work: - - ```console - INFO Repository type: pipeline - INFO Press enter to use default values (shown in brackets) or type your own responses. ctrl+click underlined text to open links. - CRITICAL Module file exists already: 'modules/local/fastqe.nf'. Use '--force' to overwrite - ``` - -You will notice that it still calls `samtools` and that the inputs are `bam` files. - -From our samplesheet, we know we have FastQ files instead, so let's change the input definition accordingly: - -```groovy title="modules/local/fastqe.nf" linenums="38" -tuple val(meta), path(reads) -``` - -The output of this tool is a TSV file with the emoji annotation, so let's adapt the output as well: - -```groovy title="modules/local/fastqe.nf" linenums="42" -tuple val(meta), path("*.tsv"), emit: tsv -``` - -The script section still calls `samtools`. Let's change this to the proper call of the tool: - -```groovy title="modules/local/fastqe.nf" linenums="62" - fastqe \\ - $args \\ - $reads \\ - --output ${prefix}.tsv -``` - -Finally, we need to adapt the version retrieval. This tool does not have a version command, so we will add the release number manually: - -```groovy title="modules/local/fastqe.nf" linenums="52" - def VERSION = '0.3.3' -``` - -and write it to a file in the script section: - -```groovy title="modules/local/fastqe.nf" linenums="70" - fastqe: $VERSION -``` - -We will not cover [`stubs`](https://www.nextflow.io/docs/latest/process.html#stub) in this training. They are not necessary to run a module, so let's remove them for now: - -```groovy title="modules/local/fastqe.nf" linenums="74" -stub: - def args = task.ext.args ?: '' - def prefix = task.ext.prefix ?: "${meta.id}" - // TODO nf-core: A stub section should mimic the execution of the original module as best as possible - // Have a look at the following examples: - // Simple example: https://github.com/nf-core/modules/blob/818474a292b4860ae8ff88e149fbcda68814114d/modules/nf-core/bcftools/annotate/main.nf#L47-L63 - // Complex example: https://github.com/nf-core/modules/blob/818474a292b4860ae8ff88e149fbcda68814114d/modules/nf-core/bedtools/split/main.nf#L38-L54 - """ - touch ${prefix}.bam - - cat <<-END_VERSIONS > versions.yml - "${task.process}": - fastqe: \$(samtools --version |& sed '1!d ; s/samtools //') - END_VERSIONS - """ -``` - -If you think this looks a bit messy and just want to add a complete final version, here's one we made earlier with all the commented-out instructions removed: - -```groovy title="modules/local/fastqe.nf" linenums="1" -process FASTQE { - tag "$meta.id" - label 'process_single' - - conda "${moduleDir}/environment.yml" - container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ? 
- 'https://depot.galaxyproject.org/singularity/fastqe:0.3.3--pyhdfd78af_0': - 'biocontainers/fastqe:0.3.3--pyhdfd78af_0' }" - - input: - tuple val(meta), path(reads) - - output: - tuple val(meta), path("*.tsv"), emit: tsv - path "versions.yml" , emit: versions - - when: - task.ext.when == null || task.ext.when - - script: - def args = task.ext.args ?: '' - def prefix = task.ext.prefix ?: "${meta.id}" - def VERSION = '0.3.3' - """ - fastqe \\ - $args \\ - $reads \\ - --output ${prefix}.tsv - - cat <<-END_VERSIONS > versions.yml - "${task.process}": - fastqe: $VERSION - END_VERSIONS - """ -} -``` - -### Include the module into the pipeline - -The module is now ready in your `modules/local` folder, but not yet included in your pipeline. Similar to `seqtk/trim` we need to add it to `workflows/myfirstpipeline.nf`: - -_Before:_ - -```groovy title="workflows/myfirstpipeline.nf" linenums="1" -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - IMPORT MODULES / SUBWORKFLOWS / FUNCTIONS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ -include { SEQTK_TRIM } from '../modules/nf-core/seqtk/trim/main' -include { MULTIQC } from '../modules/nf-core/multiqc/main' -``` - -_After:_ - -```groovy title="workflows/myfirstpipeline.nf" linenums="1" -/* -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - IMPORT MODULES / SUBWORKFLOWS / FUNCTIONS -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ -*/ -include { FASTQE } from '../modules/local/fastqe' -include { SEQTK_TRIM } from '../modules/nf-core/seqtk/trim/main' -include { MULTIQC } from '../modules/nf-core/multiqc/main' -``` - -and call it on our input data: - -```groovy title="workflows/myfirstpipeline.nf" linenums="47" - FASTQE(ch_samplesheet) - ch_versions = ch_versions.mix(FASTQE.out.versions.first()) -``` - -Let's run the pipeline again: - -```console -nextflow run . -profile docker,test --outdir results -``` - -In the results folder, you should now see a new subdirectory `fastqe/`, with the mean read qualities: - -```console title="SAMPLE1_PE.tsv" -Filename Statistic Qualities -sample1_R1.fastq.gz mean 😝 😝 😝 😝 😝 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😉 😜 😜 😜 😉 😉 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😁 😉 😛 😜 😉 😉 😉 😉 😜 😜 😉 😉 😉 😉 😉 😁 😁 😁 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😜 😉 😉 😉 😉 😉 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😛 😜 😜 😛 😛 😛 😚 -sample1_R2.fastq.gz mean 😌 😌 😌 😝 😝 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😜 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😉 😜 😉 😉 😜 😜 😉 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😜 😛 😜 😜 😜 😛 😜 😜 😜 😜 😛 😜 😛 😛 😛 😛 😛 😛 😛 😛 😛 😛 😛 😛 😝 😛 😝 😝 😝 😝 😝 😝 😝 😝 😝 😝 😝 😝 😝 😝 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😌 😋 😋 😋 😋 😋 😋 😋 😋 😀 -``` - -### Takeaway - -You know how to add a local module. 
- -And summarise your sequencing data as emojis. - ---- - -## Takeaway - -You know how to use the nf-core tooling to create a new pipeline, add modules to it, apply tool and pipeline parameters, and adapt the samplesheet. - -## What's next? - -Celebrate and take another break! Next, we'll show you how to take advantage of Seqera Platform to launch and monitor your workflows more conveniently and efficiently on any compute infrastructure. diff --git a/docs/nextflow_run/04_nf-core.md b/docs/nextflow_run/04_nf-core.md new file mode 100644 index 000000000..6ae09f4a4 --- /dev/null +++ b/docs/nextflow_run/04_nf-core.md @@ -0,0 +1,7 @@ +# Part 3: Run nf-core + +1. Meet the nf-core style Hello World +2. Run it (with test profile) and interpret the console output +3. Locate the outputs (results) +4. Find a 'real' nf-core pipeline +5. Run nf-core/demo diff --git a/docs/nextflow_run/04_run_seqera.md b/docs/nextflow_run/05_seqera.md similarity index 87% rename from docs/nextflow_run/04_run_seqera.md rename to docs/nextflow_run/05_seqera.md index 9da4c1d2b..0411ad37e 100644 --- a/docs/nextflow_run/04_run_seqera.md +++ b/docs/nextflow_run/05_seqera.md @@ -1,9 +1,4 @@ ---- -title: "Part 9: Hello Seqera" -description: Get started with Seqera Platform ---- - -# Part 9: Hello Seqera +# Part 4: Run on Seqera So far we've been running Nextflow workflows on our local machine using the command line interface. In this section, we'll introduce you to Seqera Platform, a powerful cloud-based platform for running, monitoring, and sharing Nextflow workflows. diff --git a/docs/nextflow_run/index.md b/docs/nextflow_run/index.md index 9b4c3837b..ea21cb1ae 100644 --- a/docs/nextflow_run/index.md +++ b/docs/nextflow_run/index.md @@ -1,14 +1,14 @@ --- -title: Run Nextflow +title: Nextflow Run hide: - toc --- -# Run Nextflow +# Nextflow Run Hello! You are now on the path to running reproducible and scalable scientific workflows using Nextflow. -[TODO] NEED TO DISTINGUISH CLEARLY FROM HELLO NEXTFLOW +TODO: Improve overview to differentiate from Hello Nextflow The rise of big data has made it increasingly necessary to be able to analyze and perform experiments on large datasets in a portable and reproducible manner. Parallelization and distributed computing are the best ways to tackle this challenge, but the tools commonly available to computational scientists often lack good support for these techniques, or they provide a model that fits poorly with the needs of computational scientists. Nextflow was particularly created to address these challenges. 
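For orientation, launching a workflow in this course always comes down to a single `nextflow run` command, whether it points at a local script or at a shared pipeline such as the `nf-core/demo` pipeline listed in the Part 3 outline above. The sketch below is illustrative rather than prescriptive: the exact profiles and parameters are covered later in the course, and the nf-core example assumes Docker is available in your environment.

```bash
# Run a local workflow script from the current directory
nextflow run hello-world.nf

# Run a small community pipeline using its built-in test profile
nextflow run nf-core/demo -profile docker,test --outdir results
```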
@@ -27,7 +27,7 @@ By the end of this workshop you will be able to: - Launch a Nextflow workflow locally - Find and interpret outputs (results) and log files generated by Nextflow - Troubleshoot basic issues -- [TODO] +- TODO: update summary of learnings once content is final ## Audience & prerequisites diff --git a/mkdocs.yml b/mkdocs.yml index acf974a30..7d59f8fd1 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -10,6 +10,12 @@ nav: - envsetup/01_setup.md - envsetup/02_local.md - envsetup/03_devcontainer.md + - Nextflow Run: + - nextflow_run/index.md + - nextflow_run/00_orientation.md + - nextflow_run/01_basics.md + - nextflow_run/02_pipeline.md + - nextflow_run/03_config.md - Hello Nextflow: - hello_nextflow/index.md - hello_nextflow/00_orientation.md @@ -183,6 +189,7 @@ plugins: - enumerate-headings: restart_increment_after: - envsetup/01_setup.md + - nextflow_run/00_orientation.md - hello_nextflow/00_orientation.md - hello_nf-core/00_orientation.md - nf4_science/genomics/00_orientation.md @@ -195,6 +202,7 @@ plugins: - index*.md - help*.md - envsetup/*.md + - nextflow_run/*.md - hello_nextflow/*.md - hello_nf-core/*.md - nf4_science/genomics/*.md diff --git a/nextflow-run/channels.nf b/nextflow-run/channels.nf new file mode 100644 index 000000000..0f54be80c --- /dev/null +++ b/nextflow-run/channels.nf @@ -0,0 +1,36 @@ +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path "${greeting}-output.txt" + + script: + """ + echo '$greeting' > '$greeting-output.txt' + """ +} + +/* + * Pipeline parameters + */ +params.greeting = 'greetings.csv' + +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } + + // emit a greeting + sayHello(greeting_ch) +} diff --git a/nextflow-run/flow.nf b/nextflow-run/flow.nf new file mode 100644 index 000000000..9f2ab6dc3 --- /dev/null +++ b/nextflow-run/flow.nf @@ -0,0 +1,87 @@ +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' 
to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path "${greeting}-output.txt" + + script: + """ + echo '$greeting' > '$greeting-output.txt' + """ +} + +/* + * Use a text replacement tool to convert the greeting to uppercase + */ +process convertToUpper { + + publishDir 'results', mode: 'copy' + + input: + path input_file + + output: + path "UPPER-${input_file}" + + script: + """ + cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}' + """ +} + +/* + * Collect uppercase greetings into a single output file + */ +process collectGreetings { + + publishDir 'results', mode: 'copy' + + input: + path input_files + val batch_name + + output: + path "COLLECTED-${batch_name}-output.txt" , emit: outfile + val count_greetings , emit: count + + script: + count_greetings = input_files.size() + """ + cat ${input_files} > 'COLLECTED-${batch_name}-output.txt' + """ +} + +/* + * Pipeline parameters + */ +params.greeting = 'greetings.csv' +params.batch = 'test-batch' + +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } + + // emit a greeting + sayHello(greeting_ch) + + // convert the greeting to uppercase + convertToUpper(sayHello.out) + + // collect all the greetings into one file + collectGreetings(convertToUpper.out.collect(), params.batch) + + // emit a message about the size of the batch + collectGreetings.out.count.view { num -> "There were $num greetings in this batch" } +} diff --git a/nextflow-run/hello-world-plus.nf b/nextflow-run/hello-world-plus.nf new file mode 100644 index 000000000..6236eea21 --- /dev/null +++ b/nextflow-run/hello-world-plus.nf @@ -0,0 +1,31 @@ +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path 'output.txt' + + script: + """ + echo '$greeting' > output.txt + """ +} + +/* + * Pipeline parameters + */ +params.greeting = 'Holà mundo!' + +workflow { + + // emit a greeting + sayHello(params.greeting) +} diff --git a/nextflow-run/hello-world.nf b/nextflow-run/hello-world.nf new file mode 100644 index 000000000..4f672da2b --- /dev/null +++ b/nextflow-run/hello-world.nf @@ -0,0 +1,21 @@ +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + output: + path 'output.txt' + + script: + """ + echo 'Hello World!' 
> output.txt + """ +} + +workflow { + + // emit a greeting + sayHello() +} diff --git a/nextflow-run/modular.nf b/nextflow-run/modular.nf new file mode 100644 index 000000000..8764606bf --- /dev/null +++ b/nextflow-run/modular.nf @@ -0,0 +1,32 @@ +#!/usr/bin/env nextflow + +/* + * Pipeline parameters + */ +params.greeting = 'greetings.csv' +params.batch = 'test-batch' + +// Include modules +include { sayHello } from './modules/sayHello.nf' +include { convertToUpper } from './modules/convertToUpper.nf' +include { collectGreetings } from './modules/collectGreetings.nf' + +workflow { + + // create a channel for inputs from a CSV file + greeting_ch = Channel.fromPath(params.greeting) + .splitCsv() + .map { line -> line[0] } + + // emit a greeting + sayHello(greeting_ch) + + // convert the greeting to uppercase + convertToUpper(sayHello.out) + + // collect all the greetings into one file + collectGreetings(convertToUpper.out.collect(), params.batch) + + // emit a message about the size of the batch + collectGreetings.out.count.view { "There were $it greetings in this batch" } +} diff --git a/nextflow-run/modules/collectGreetings.nf b/nextflow-run/modules/collectGreetings.nf new file mode 100644 index 000000000..849bba4b6 --- /dev/null +++ b/nextflow-run/modules/collectGreetings.nf @@ -0,0 +1,21 @@ +/* + * Collect uppercase greetings into a single output file + */ +process collectGreetings { + + publishDir 'results', mode: 'copy' + + input: + path input_files + val batch_name + + output: + path "COLLECTED-${batch_name}-output.txt" , emit: outfile + val count_greetings , emit: count + + script: + count_greetings = input_files.size() + """ + cat ${input_files} > 'COLLECTED-${batch_name}-output.txt' + """ +} diff --git a/nextflow-run/modules/convertToUpper.nf b/nextflow-run/modules/convertToUpper.nf new file mode 100644 index 000000000..b2689e8e9 --- /dev/null +++ b/nextflow-run/modules/convertToUpper.nf @@ -0,0 +1,20 @@ +#!/usr/bin/env nextflow + +/* + * Use a text replacement tool to convert the greeting to uppercase + */ +process convertToUpper { + + publishDir 'results', mode: 'copy' + + input: + path input_file + + output: + path "UPPER-${input_file}" + + script: + """ + cat '$input_file' | tr '[a-z]' '[A-Z]' > 'UPPER-${input_file}' + """ +} diff --git a/nextflow-run/modules/sayHello.nf b/nextflow-run/modules/sayHello.nf new file mode 100644 index 000000000..6005ad54c --- /dev/null +++ b/nextflow-run/modules/sayHello.nf @@ -0,0 +1,20 @@ +#!/usr/bin/env nextflow + +/* + * Use echo to print 'Hello World!' to a file + */ +process sayHello { + + publishDir 'results', mode: 'copy' + + input: + val greeting + + output: + path "${greeting}-output.txt" + + script: + """ + echo '$greeting' > '$greeting-output.txt' + """ +}
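To close the loop on the scripts above: the modular version of the workflow is launched exactly like the single-file scripts, and its two parameters can be overridden on the command line. Below is a minimal sketch, assuming you are inside the `nextflow-run/` directory and that a `greetings.csv` file is present there (as the default value of `params.greeting` implies).

```bash
# Run the modular workflow with its defaults (greetings.csv, test-batch)
nextflow run modular.nf

# Override the input file and batch name, resuming any cached work
nextflow run modular.nf --greeting greetings.csv --batch my-batch -resume
```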