The Sci-ModoM integrator & amalgamator pipeline: a workflow to integrate dataset-level information, combining and unifying evidence to support the construction of transcript models with RNA modifications.
```
python3 -m venv scintillator
source scintillator/bin/activate
pip install -r requirements.txt
```
To configure this workflow, modify the following files:

- `config/config.yaml`: workflow-specific parameters
- `pep/config.yaml`: data-related configuration
- `pep/dataset.tsv`: input data
To run this workflow on a SLURM cluster, modify the default profile under `workflow/profiles/default/config.yaml`. The option `--use-envmodules` does not work with the SLURM executor, cf. this issue.
Copy the draft implementation under the environment directory and patch your installation:

```
patch -p1 < snakemake.patch
```

Add this to your profile:

```
envmodules-precommand: ". /etc/profile.d/modules.sh && . /biosw/__modules__/modules.rc"
use-envmodules: true
```
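Putting the SLURM and environment-module options together, `workflow/profiles/default/config.yaml` could look like the following sketch (the `jobs`, partition, and resource values are placeholders; adjust them for your cluster):

```yaml
executor: slurm
jobs: 100                  # placeholder: max concurrent jobs
default-resources:         # placeholder values, adjust for your cluster
  slurm_partition: "basic"
  mem_mb: 4000
  runtime: 120
envmodules-precommand: ". /etc/profile.d/modules.sh && . /biosw/__modules__/modules.rc"
use-envmodules: true
```

Note that `envmodules-precommand` is only recognized after applying the patch described above.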
Because of this other issue, configuration files must be passed as arguments. Go to `workflow` and run

```
snakemake --executor slurm --config pepfile=/full/path/to/pep/config.yaml --configfile config/config.yaml [snakemake options] 2> log.txt &
```
Scintillator extracts evidence from Sci-ModoM for the parameters specified in `pep/config.yaml`. For a given species, modification sites are downloaded from the online database for every chromosome. The list of chromosomes, the annotation, assembly information, and the genome sequence are downloaded from Ensembl. To restrict the list of chromosomes, use
```
chroms:
  include:
    - "1"
    - "2"
  exclude:
    - "X"
    - "Y"
    - "MT"
```
The `include` key overrides the `exclude` key, e.g. in the example above only chromosomes "1" and "2" would be included in the output. The `chroms` key is optional; if not present, all chromosomes are used (chromosomes only, never scaffolds or contigs).
Next, Scintillator uses the primary data given in `pep/dataset.tsv` to (i) restrict the evidence to the available datasets, and (ii) estimate abundance, for every available dataset, for every matching transcript where at least one modification has been reported. If `Run` is given, the corresponding archive is downloaded from the Sequence Read Archive (`FileType` must be FASTQ) and dumped to one or two FASTQ files, depending on `LibraryLayout`; otherwise `FilePath` must be a path pointing to a BAM file on the file system (`FileType` must be BAM).
Note: If `Platform` is LONG, Salmon is used in alignment-mode. BAM files are first converted to FASTQ (we assume that they are mapped to the genome), and FASTQ files are mapped with minimap2. For SHORT, only FASTQ input is allowed, and Salmon is used in mapping-mode. In mapping-mode, FASTQ files are passed as is, i.e. without adapter removal and/or (quality) trimming.
Note: To provide replicates, add different `Run` identifiers for the same `EUFID`; `FileType`, `LibraryLayout`, and `Platform` must all be identical. If `FileType` is BAM, no replicates are allowed.
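To illustrate, a sketch of `pep/dataset.tsv` with two FASTQ replicates for one dataset and a single BAM dataset (the EUFIDs, Run accessions, path, and column order are made-up placeholders; the real file may require additional columns):

```
EUFID	Run	FilePath	FileType	LibraryLayout	Platform
aBcDeF123456	SRR0000001		FASTQ	PAIRED	SHORT
aBcDeF123456	SRR0000002		FASTQ	PAIRED	SHORT
gHiJkL789012		/full/path/to/sample.bam	BAM	SINGLE	LONG
```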
Output files are written to the `output_directory` given in `config/config.yaml`. The final output is written for each chromosome to a sub-directory under `scintillator`, e.g. under `chr1` there will be
- `chr1.fa.gz`: multi-FASTA with the original transcript sequences
- `chr1.modified.fa.gz`: multi-FASTA with the modified transcript sequences
- `chr1.json`: metadata including transcript attributes, modifications and associated evidence (score, frequency, coverage), and abundance (TPM and reads)
The modified transcript sequences use the Unicode characters from MODOMICS (RNAMods code 2023). Both the transcript and the genomic positions in the metadata file are 1-based.
Note: Scintillator does not currently check that the assembly version used by the pipeline and downloaded from Ensembl matches that used in Sci-ModoM.
Check the logs, in particular the `annotate` logs, to make sure that any unused evidence is mostly from intergenic regions. Intronic regions are not reported; they are simply ignored.
We currently handle only single-resolution modifications, i.e. those given by `chromEnd = chromStart + 1`. Any evidence with a context site is discarded.
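The single-resolution condition on BED-style records can be expressed as a simple filter. A minimal sketch (illustrative only, not the pipeline's actual implementation; assumes tab-separated records with `chromStart` in column 2 and `chromEnd` in column 3):

```shell
# Keep only single-resolution records, i.e. chromEnd == chromStart + 1.
# Columns: chrom, chromStart, chromEnd, ... (BED-style, tab-separated).
awk -F'\t' '$3 == $2 + 1' evidence.bed > evidence.single.bed
```

A record such as `1 20 25` spans five nucleotides and would be dropped.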
```
# incl. pytest and pytest-datafiles
pip install -r requirements-dev.txt
```

To run the tests, comment out the snakemake-related content from `workflow/scripts/annotate.py`, adjust the imports, and run

```
python -m pytest tests/
```
- The CSV output from Sci-ModoM is in fact BED-style (0-based, half-open). These files are under `evidence/chroms`. They are converted to 1-based indexing by `sanitize` and used by `annotate`; these files are under `evidence/chr`.
- The pipeline will fail if e.g. a dataset has no reads mapping to a certain chromosome. Handling this exception in snakemake does not seem trivial. In general, this is not much of an issue if we only work with autosomal chromosomes (use `chroms/exclude`).
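For reference, the 0-based, half-open to 1-based, inclusive conversion performed by `sanitize` amounts to incrementing `chromStart`. A minimal sketch (not the actual implementation):

```shell
# Convert BED-style 0-based, half-open coordinates to 1-based, inclusive:
# increment chromStart; chromEnd is unchanged, since the half-open end
# already equals the 1-based inclusive end.
awk -F'\t' 'BEGIN{OFS=FS} {$2 = $2 + 1; print}' chrom.bed > chrom.1based.bed
```

A single-resolution record `1 9 10` thus becomes `1 10 10`, i.e. the one modified nucleotide at 1-based position 10.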