Commit e103c1c

feat!: generalize away condition (#66)
* consistent ordering is ensured right after data import here, so this comment is obsolete
* apply R lints
* refactor variable names
* apply more lints
* overhaul config setup and make tests more complex
* make deseq-init.R configurable via new config setup
* apply lints to deseq2.R
* actually use config.diffexp.model formula if specified
* extend doc comments in config files
* make deseq2.R contrast definitions use config.yaml specifications
* fix rule align (star) parameters and make gtf an actual input
* make samples.tsv and units.tsv complex enough for test case
* add trimming explanation to config.yaml
* overhaul config/README.md for snakemake workflow catalog explanations
* apply lints to plot-pca.R
* overhaul PCA plots (plots for all relevant variables, further ones requestable, one separate plot per variable)
* fix typos in .test/config_complex/config.yaml
* fix syntax for complex contrasts
* provide more useful error message for typos
* add missing comma
* add example adapters and strandedness entries to .test/config_basic/units.tsv to actually test the respective code
* final check of config/README.md
* snakefmt
* fix schema for pca: labels:
* use config_complex workflow for linting
* fix formatting that is only caught in CI
* fix cutadapt wrapper params from others: to extra: (originally suggested by @kilpert: f816550)
* update actions, hopefully getting newest snakefmt to avoid spurious errors
* add basic wildcard_constraints for safer sample and unit names (originally suggested by @kilpert: 5431e08)
* snakefmt
* fix copy-pasta oversight
* fix config/README.md and units.tsv strandedness to read `none`
* catch case of no batch_effects ("") in config
1 parent 896007e commit e103c1c

23 files changed

+453
-123
lines changed

.github/workflows/main.yml

Lines changed: 21 additions & 9 deletions
Original file line number | Diff line number | Diff line change
@@ -12,12 +12,12 @@ jobs:
1212
runs-on: ubuntu-latest
1313
steps:
1414
- name: Checkout with submodules
15-
uses: actions/checkout@v2
15+
uses: actions/checkout@v3
1616
with:
1717
submodules: recursive
1818
fetch-depth: 0
1919
- name: Formatting
20-
uses: github/super-linter@v4
20+
uses: github/super-linter@v5
2121
env:
2222
VALIDATE_ALL_CODEBASE: false
2323
DEFAULT_BRANCH: master
@@ -26,13 +26,13 @@ jobs:
2626
linting:
2727
runs-on: ubuntu-latest
2828
steps:
29-
- uses: actions/checkout@v2
29+
- uses: actions/checkout@v3
3030
- name: Linting
3131
uses: snakemake/snakemake-github-action@v1.22.0
3232
with:
3333
directory: .test
3434
snakefile: workflow/Snakefile
35-
args: "--lint"
35+
args: "--configfile .test/config_complex/config.yaml --lint"
3636

3737
run-workflow:
3838
runs-on: ubuntu-latest
@@ -41,18 +41,30 @@ jobs:
4141
- formatting
4242
steps:
4343
- name: Checkout repository with submodules
44-
uses: actions/checkout@v2
44+
uses: actions/checkout@v3
4545
with:
4646
submodules: recursive
47-
- name: Test workflow
47+
- name: Test workflow (basic model, no batch_effects)
4848
uses: snakemake/snakemake-github-action@v1.22.0
4949
with:
5050
directory: .test
5151
snakefile: workflow/Snakefile
52-
args: "--use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
53-
- name: Test report
52+
args: "--configfile .test/config_basic/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
53+
- name: Test report (basic model, no batch_effects)
5454
uses: snakemake/snakemake-github-action@v1.22.0
5555
with:
5656
directory: .test
5757
snakefile: workflow/Snakefile
58-
args: "--report report.zip"
58+
args: "--configfile .test/config_basic/config.yaml --report report.zip"
59+
- name: Test workflow (multiple variables_of_interest, include batch_effects)
60+
uses: snakemake/snakemake-github-action@v1.22.0
61+
with:
62+
directory: .test
63+
snakefile: workflow/Snakefile
64+
args: "--configfile .test/config_complex/config.yaml --use-conda --show-failed-logs --cores 2 --conda-cleanup-pkgs cache"
65+
- name: Test report (multiple variables_of_interest, include batch_effects)
66+
uses: snakemake/snakemake-github-action@v1.22.0
67+
with:
68+
directory: .test
69+
snakefile: workflow/Snakefile
70+
args: "--configfile .test/config_complex/config.yaml --report report.zip"

.test/config/config.yaml

Lines changed: 0 additions & 41 deletions
This file was deleted.

.test/config_basic/config.yaml

Lines changed: 64 additions & 0 deletions
@@ -0,0 +1,64 @@
1+
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
2+
samples: config_basic/samples.tsv
3+
# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2)
4+
# Units are technical replicates (e.g. lanes, or resequencing of the same biological
5+
# sample).
6+
units: config_basic/units.tsv
7+
8+
9+
ref:
10+
# Ensembl species name
11+
species: saccharomyces_cerevisiae
12+
# Ensembl release
13+
release: 100
14+
# Genome build
15+
build: R64-1-1
16+
17+
18+
trimming:
19+
# If you activate trimming by setting this to `True`, you will have to
20+
# specify the respective cutadapt adapter trimming flag for each unit
21+
# in the `units.tsv` file's `adapters` column
22+
activate: True
23+
24+
mergeReads:
25+
activate: False
26+
27+
pca:
28+
activate: True
29+
# Per default, a separate PCA plot is generated for each of the
30+
# `variables_of_interest` and the `batch_effects`, coloring according to
31+
# that variable's groups.
32+
# If you want PCA plots for further columns in the samples.tsv sheet, you
33+
# can request them under labels as a list, for example:
34+
# - relatively_uninteresting_variable_X
35+
# - possible_batch_effect_Y
36+
labels:
37+
- condition
38+
39+
diffexp:
40+
# variables where you are interested in whether they have
41+
# an effect on expression levels
42+
variables_of_interest:
43+
condition:
44+
# any fold change will be relative to this factor level
45+
base_level: untreated
46+
batch_effects: ""
47+
# contrasts for the deseq2 results method to determine fold changes
48+
contrasts:
49+
treated-vs-untreated:
50+
# must be one of the variables_of_interest
51+
variable_of_interest: condition
52+
level_of_interest: treated
53+
# The default model includes all interactions among variables_of_interest
54+
# and batch_effects added on. For the example above this implicitly is:
55+
# model: ~condition
56+
# For the default model to be used, simply specify an empty `model: ""`
57+
# With more variables_of_interest or batch_effects, you could introduce different
58+
# assumptions into your model, by specifying a different model here.
59+
model: ~condition
60+
61+
params:
62+
cutadapt-pe: ""
63+
cutadapt-se: ""
64+
star: ""
File renamed without changes.
Lines changed: 4 additions & 4 deletions
@@ -1,5 +1,5 @@
11
sample_name unit_name fq1 fq2 sra adapters strandedness
2-
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq
3-
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
4-
A2 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
5-
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq
2+
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA yes
3+
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA none
4+
A2 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA none
5+
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA reverse

.test/config_complex/config.yaml

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
1+
# path or URL to sample sheet (TSV format, columns: sample, condition, ...)
2+
samples: config_complex/samples.tsv
3+
# path or URL to sequencing unit sheet (TSV format, columns: sample, unit, fq1, fq2)
4+
# Units are technical replicates (e.g. lanes, or resequencing of the same biological
5+
# sample).
6+
units: config_complex/units.tsv
7+
8+
9+
ref:
10+
# Ensembl species name
11+
species: saccharomyces_cerevisiae
12+
# Ensembl release
13+
release: 100
14+
# Genome build
15+
build: R64-1-1
16+
17+
18+
trimming:
19+
# If you activate trimming by setting this to `True`, you will have to
20+
# specify the respective cutadapt adapter trimming flag for each unit
21+
# in the `units.tsv` file's `adapters` column
22+
activate: False
23+
24+
mergeReads:
25+
activate: False
26+
27+
pca:
28+
activate: True
29+
# Per default, a separate PCA plot is generated for each of the
30+
# `variables_of_interest` and the `batch_effects`, coloring according to
31+
# that variable's groups.
32+
# If you want PCA plots for further columns in the samples.tsv sheet, you
33+
# can request them under labels as a list, for example:
34+
# - relatively_uninteresting_variable_X
35+
# - possible_batch_effect_Y
36+
labels:
37+
# columns of sample sheet to use for PCA
38+
- jointly_handled
39+
40+
diffexp:
41+
# variables where you are interested in whether they have
42+
# an effect on expression levels
43+
variables_of_interest:
44+
treatment_1:
45+
# any fold change will be relative to this factor level
46+
base_level: untreated
47+
treatment_2:
48+
# any fold change will be relative to this factor level
49+
base_level: untreated
50+
batch_effects:
51+
- jointly_handled
52+
# contrasts for the deseq2 results method to determine fold changes
53+
contrasts:
54+
treatment_1_alone:
55+
# must be one of the variables_of_interest
56+
variable_of_interest: treatment_1
57+
# the variable's level to test against the base_level
58+
level_of_interest: treated
59+
treatment_2_alone:
60+
# must be one of the variables_of_interest
61+
variable_of_interest: treatment_2
62+
# the variable's level to test against the base_level
63+
level_of_interest: treated
64+
# Must be a valid expression for option two in the contrasts description
65+
# of ?results in the DESeq2 package. For a more detailed intro, also see:
66+
# https://github.com/tavareshugo/tutorial_DESeq2_contrasts/blob/main/DESeq2_contrasts.md
67+
both_treatments: 'list(c("treatment_1_treated_vs_untreated", "treatment_2_treated_vs_untreated", "treatment_1treated.treatment_2treated"))'
68+
# The default model includes all interactions among variables_of_interest,
69+
# and batch_effects added on. For the example above this implicitly is:
70+
# model: ~jointly_handled + treatment_1 * treatment_2
71+
# For the default model to be used, simply specify an empty `model: ""` below.
72+
# If you want to introduce different assumptions into your model, you can
73+
# specify a different model to use, for example skipping the interaction:
74+
# model: ~jointly_handled + treatment_1 + treatment_2
75+
model: ""
76+
77+
params:
78+
cutadapt-pe: ""
79+
cutadapt-se: ""
80+
star: ""
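
The comments in this config describe the default model as all interactions among the `variables_of_interest`, with the `batch_effects` added on. The following is a minimal Python sketch of that rule, not the workflow's own R code: the function name `default_model` is hypothetical, and the handling of an empty `batch_effects: ""` entry is an assumption based on the basic test config.

```python
def default_model(variables_of_interest, batch_effects):
    """Build a design formula string from the diffexp config entries.

    Hypothetical helper for illustration only; the workflow itself
    builds its model in R.
    """
    # batch_effects may be an empty string (basic config) or a list
    # (complex config); normalize both cases to a list of names.
    if isinstance(batch_effects, str):
        batch_effects = [batch_effects] if batch_effects else []
    batches = [name for name in (batch_effects or []) if name]

    # All interactions among the variables of interest: a * b * ...
    interaction = " * ".join(list(variables_of_interest))

    # Batch effects are added on as plain additive terms.
    return "~" + " + ".join(batches + [interaction])


basic = default_model({"condition": {"base_level": "untreated"}}, "")
complex_ = default_model(
    {
        "treatment_1": {"base_level": "untreated"},
        "treatment_2": {"base_level": "untreated"},
    },
    ["jointly_handled"],
)
print(basic)
print(complex_)
```

For the two test configs, this reproduces the formulas named in their comments, `~condition` and `~jointly_handled + treatment_1 * treatment_2`.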

.test/config_complex/samples.tsv

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
sample_name treatment_1 treatment_2 jointly_handled
2+
A1 treated treated 1
3+
A2 treated treated 2
4+
A3 treated untreated 1
5+
A4 treated untreated 2
6+
B1 untreated treated 1
7+
B2 untreated treated 2
8+
B3 untreated untreated 1
9+
B4 untreated untreated 2
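
The complex config names `treatment_1` and `treatment_2` as `variables_of_interest` and `jointly_handled` as a batch effect, and each of these appears as a column in this sample sheet. A quick consistency check along these lines could look like the sketch below. It is not part of this commit: the function name `missing_sample_columns` is hypothetical, and the inline sheet is a shortened, tab-separated copy of the one above.

```python
import csv
import io


def missing_sample_columns(samples_tsv, diffexp):
    """Return the config-requested columns that the sample sheet lacks.

    Hypothetical helper for illustration only.
    """
    reader = csv.reader(io.StringIO(samples_tsv), delimiter="\t")
    header = next(reader)

    # batch_effects may be "" (no batch effects) or a list of column names.
    batch_effects = diffexp.get("batch_effects") or []
    if isinstance(batch_effects, str):
        batch_effects = [batch_effects]

    required = ["sample_name", *diffexp["variables_of_interest"], *batch_effects]
    return [column for column in required if column not in header]


samples_tsv = (
    "sample_name\ttreatment_1\ttreatment_2\tjointly_handled\n"
    "A1\ttreated\ttreated\t1\n"
    "B3\tuntreated\tuntreated\t1\n"
)
diffexp = {
    "variables_of_interest": {
        "treatment_1": {"base_level": "untreated"},
        "treatment_2": {"base_level": "untreated"},
    },
    "batch_effects": ["jointly_handled"],
}
print(missing_sample_columns(samples_tsv, diffexp))
```

An empty list means every requested variable has a column; any name in the result points to a typo or a missing annotation.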

.test/config_complex/units.tsv

Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
1+
sample_name unit_name fq1 fq2 sra adapters strandedness
2+
A1 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq
3+
A2 1 ngs-test-data/reads/a.scerevisiae.1.fq ngs-test-data/reads/a.scerevisiae.2.fq
4+
A3 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
5+
A4 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
6+
B1 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq
7+
B2 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq
8+
B3 1 ngs-test-data/reads/b.scerevisiae.1.fq ngs-test-data/reads/b.scerevisiae.2.fq
9+
B4 1 ngs-test-data/reads/c.scerevisiae.1.fq ngs-test-data/reads/c.scerevisiae.2.fq

config/README.md

Lines changed: 47 additions & 9 deletions
@@ -1,16 +1,54 @@
1-
# General settings
2-
To configure this workflow, modify ``config/config.yaml`` according to your needs, following the explanations provided in the file.
1+
# General configuration
32

4-
# Sample and unit sheet
3+
To configure this workflow, modify `config/config.yaml` according to your needs, following the explanations provided in the file.
54

6-
* Add samples to `config/samples.tsv`. For each sample, the columns `sample_name`, and `condition` have to be defined. The `condition` (healthy/tumor, before Treatment / after Treatment) will be used as contrast for the DEG analysis in DESeq2. To include other relevant variables such as batches, add a new column to the sheet.
7-
* For each sample, add one or more sequencing units (runs, lanes or replicates) to the unit sheet `config/units.tsv`. By activating or deactivating `mergeReads` in the `config/config.yaml`, you can decide wether to merge replicates or run them individually. For each unit, define adapters, and either one (column `fq1`) or two (columns `fq1`, `fq2`) FASTQ files (these can point to anywhere in your system). Alternatively, you can define an SRA (sequence read archive) accession (starting with e.g. ERR or SRR) by using a column `sra`. In the latter case, the pipeline will automatically download the corresponding paired end reads from SRA. If both local files and SRA accession are available, the local files will be preferred.
8-
To choose the correct geneCounts produced by STAR, you can define the strandedness of a unit. STAR produces counts for unstranded ('None' - default), forward oriented ('yes') and reverse oriented ('reverse') protocols.
5+
## `DESeq2` differential expression analysis configuration
96

7+
To successfully run the differential expression analysis, you will need to tell DESeq2 which sample annotations to use (annotations are columns in the `samples.tsv` file described below).
8+
This is done in the `config.yaml` file with the entries under `diffexp:`.
9+
The comments for the entries should give all the necessary information and links.
10+
But if in doubt, please also consult the [`DESeq2` manual](https://www.bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html).
11+
12+
# Sample and unit setup
13+
14+
The sample and unit setup is specified via tab-separated tabular files (`.tsv`).
1015
Missing values can be specified by empty columns or by writing `NA`.
1116

12-
# DESeq scenario
17+
## sample sheet
18+
19+
The default sample sheet is `config/samples.tsv` (as configured in `config/config.yaml`).
20+
Each sample refers to an actual physical sample, and replicates (both biological and technical) may be specified as separate samples.
21+
For each sample, you will always have to specify a `sample_name`.
22+
In addition, all `variables_of_interest` and `batch_effects` specified under the `diffexp:` entry in `config/config.yaml` need corresponding columns in `config/samples.tsv`.
23+
Finally, the sample sheet can contain any number of additional columns.
24+
So if in doubt about whether you might at some point need some metadata you already have at hand, just put it into the sample sheet already---your future self will thank you.
25+
26+
## unit sheet
27+
28+
The default unit sheet is `config/units.tsv` (as configured in `config/config.yaml`).
29+
For each sample, add one or more sequencing units (for example if you have several runs or lanes per sample).
30+
31+
### `.fastq` file source
32+
33+
For each unit, you will have to define a source for your `.fastq` files.
34+
This can be done via the columns `fq1`, `fq2` and `sra`, with either of:
35+
1. A single `.fastq` file for single-end reads (`fq1` column only; `fq2` and `sra` columns present, but empty).
36+
The entry can be any path on your system, but we suggest something like a `raw/` data directory within your analysis directory.
37+
2. Two `.fastq` files for paired-end reads (columns `fq1` and `fq2`; column `sra` present, but empty).
38+
As for the `fq1` column, the `fq2` column can also point to anywhere on your system.
39+
3. A sequence read archive (SRA) accession number (`sra` column only; `fq1` and `fq2` columns present, but empty).
40+
The workflow will automatically download the corresponding `.fastq` data (currently assumed to be paired-end).
41+
The accession numbers usually start with SRR or ERR and you can find accession numbers for studies of interest with the [SRA Run Selector](https://trace.ncbi.nlm.nih.gov/Traces/study/).
42+
If both local files and an SRA accession are specified for the same unit, the local files will be used.
43+
44+
### adapter trimming
45+
46+
If you set `trimming: activate:` in the `config/config.yaml` to `True`, you will have to provide at least one `cutadapt` adapter argument for each unit in the `adapters` column of the `units.tsv` file.
47+
You will need to find out the adapters used in the sequencing protocol that generated a unit: from your sequencing provider, or for published data from the study's metadata (or its authors).
48+
Then, enter the adapter sequences into the `adapters` column of that unit, preceded by the [correct `cutadapt` adapter argument](https://cutadapt.readthedocs.io/en/stable/guide.html#adapter-types).
1349

14-
To initialize the DEG analysis, you need to define a model in the `config/config.yaml`. The model can include all variables introduced as columns in `config/samples.tsv`.
15-
* The standard model is `~condition` - to include a batch variable, write `~batch + condition`.
50+
### strandedness of library preparation protocol
1651

52+
To get the correct `geneCounts` from `STAR` output, you can provide information on the strandedness of the library preparation protocol used for a unit.
53+
`STAR` can produce counts for unstranded (`none` - this is the default), forward oriented (`yes`) and reverse oriented (`reverse`) protocols.
54+
Enter the respective value into a `strandedness` column in the `units.tsv` file.
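
As a rough illustration of how a `strandedness` entry can select counts: STAR's `ReadsPerGene.out.tab` has four columns, namely the gene ID, unstranded counts, counts for reads on the forward strand, and counts for reads on the reverse strand. The Python sketch below maps the three values to a 0-based column index. The names `STAR_COUNT_COLUMN` and `star_count_column` are hypothetical, treating an empty or `NA` entry as `none` is an assumption based on the default stated above, and the workflow's own selection code may differ.

```python
# 0-based column indices in STAR's ReadsPerGene.out.tab
# (column 0 holds the gene ID).
STAR_COUNT_COLUMN = {"none": 1, "yes": 2, "reverse": 3}


def star_count_column(strandedness):
    """Map a units.tsv strandedness entry to a geneCounts column index.

    Hypothetical helper for illustration only.
    """
    value = (strandedness or "").strip().lower()
    # Missing values (empty or NA) fall back to the unstranded default.
    if value in ("", "na"):
        value = "none"
    if value not in STAR_COUNT_COLUMN:
        raise ValueError(
            f"strandedness must be one of {sorted(STAR_COUNT_COLUMN)}, "
            f"got {strandedness!r}"
        )
    return STAR_COUNT_COLUMN[value]


for entry in ("yes", "none", "reverse", ""):
    print(repr(entry), star_count_column(entry))
```

Raising an error for unknown values, rather than silently falling back to unstranded counts, makes typos in the unit sheet visible early.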
