The Sci-ModoM integrator & amalgamator pipeline: a workflow to integrate dataset-level information, combining and unifying evidence to support the construction of transcript models with RNA modifications.
```
python3 -m venv scintillator
source scintillator/bin/activate
pip install -r requirements.txt
```
To configure this workflow, modify the following files:

- `config/config.yaml`: workflow-specific parameters
- `pep/config.yaml`: data-related configuration
- `pep/dataset.tsv`: input data
To run this workflow on a SLURM cluster, modify the default profile under `workflow/profiles/default/config.yaml`. The option `--use-envmodules` does not work with the SLURM executor, cf. this issue.
Copy the draft implementation under the environment directory and patch your installation:

```
patch -p1 < snakemake.patch
```

Add this to your profile:

```
envmodules-precommand: ". /etc/profile.d/modules.sh && . /biosw/__modules__/modules.rc"
use-envmodules: true
```
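Putting the SLURM and environment-module options together, `workflow/profiles/default/config.yaml` could look like the following sketch (the `jobs`, partition, and resource values are placeholders; adjust them for your cluster):

```yaml
executor: slurm
jobs: 100                  # placeholder: max concurrent jobs
default-resources:         # placeholder values, adjust for your cluster
  slurm_partition: "basic"
  mem_mb: 4000
  runtime: 120
envmodules-precommand: ". /etc/profile.d/modules.sh && . /biosw/__modules__/modules.rc"
use-envmodules: true
```

Note that `envmodules-precommand` is only recognized after applying the patch described above.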
Because of this other issue, configuration files must be passed as arguments. Go to `workflow` and run

```
snakemake --executor slurm --config pepfile=/full/path/to/pep/config.yaml --configfile config/config.yaml [snakemake options] 2> log.txt &
```
Scintillator extracts evidence from Sci-ModoM for the parameters specified in `pep/config.yaml`. For a given species, modification sites are downloaded from the online database for every chromosome. The list of chromosomes, the annotation, assembly information, and the genome sequence are downloaded from Ensembl. To restrict the list of chromosomes, use
```
chroms:
  include:
    - "1"
    - "2"
  exclude:
    - "X"
    - "Y"
    - "MT"
```
The `include` key overrides the `exclude` key, e.g. in the example above only chromosomes "1" and "2" would be included in the output. The `chroms` key is optional; if not present, all chromosomes are used (chromosomes only, never scaffolds or contigs).
Next, Scintillator uses the primary data given in `pep/dataset.tsv` to (i) restrict the evidence to the available datasets, and (ii) estimate abundance, for every available dataset, for every matching transcript where at least one modification has been reported. If `Run` is given, the corresponding archive is downloaded from the Sequence Read Archive (`FileType` must be FASTQ) and dumped to one or two FASTQ files, depending on `LibraryLayout`; otherwise `FilePath` must be a path pointing to a BAM file on the file system (`FileType` must be BAM).
Note: If `Platform` is LONG, Salmon is used in alignment-mode. BAM files are first converted to FASTQ (we assume that they are mapped to the genome), and FASTQ files are mapped with minimap2. For SHORT, only FASTQ input is allowed, and Salmon is used in mapping-mode. In mapping-mode, FASTQ files are passed as is, i.e. without adapter removal and/or (quality) trimming.
Note: To provide replicates, add different `Run` identifiers for the same `EUFID`; `FileType`, `LibraryLayout`, and `Platform` must all be identical. If `FileType` is BAM, no replicates are allowed.
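To illustrate, a sketch of `pep/dataset.tsv` with two FASTQ replicates for one dataset and a single BAM dataset (the EUFIDs, Run accessions, path, and column order are made-up placeholders; the real file may require additional columns):

```
EUFID	Run	FilePath	FileType	LibraryLayout	Platform
aBcDeF123456	SRR0000001		FASTQ	PAIRED	SHORT
aBcDeF123456	SRR0000002		FASTQ	PAIRED	SHORT
gHiJkL789012		/full/path/to/sample.bam	BAM	SINGLE	LONG
```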
Output files are written to the `output_directory` given in `config/config.yaml`. The final output is written for each chromosome to a sub-directory under `scintillator`, e.g. under `chr1` there will be
- `chr1.fa.gz`: multi-FASTA with the original transcript sequences
- `chr1.modified.fa.gz`: multi-FASTA with the modified transcript sequences
- `chr1.json`: metadata including transcript attributes, modifications and associated evidence (score, frequency, coverage), and abundance (TPM and reads)
The modified transcript sequences use the Unicode characters from MODOMICS (RNAMods code 2023). Both the transcript and the genomic positions in the metadata file are 1-based.
Note: Scintillator does not currently check that the assembly version used by the pipeline and downloaded from Ensembl matches that used in Sci-ModoM.
Check the logs, in particular the `annotate` logs, to make sure that any unused evidence is mostly from intergenic regions. Intronic regions are not reported; they are simply ignored.
We currently handle only single-resolution modifications, i.e. those given by `chromEnd = chromStart + 1`. Any evidence with a context site is discarded.
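The single-resolution condition on BED-style records can be expressed as a simple filter. A minimal sketch (illustrative only, not the pipeline's actual implementation; assumes tab-separated records with `chromStart` in column 2 and `chromEnd` in column 3):

```shell
# Keep only single-resolution records, i.e. chromEnd == chromStart + 1.
# Columns: chrom, chromStart, chromEnd, ... (BED-style, tab-separated).
awk -F'\t' '$3 == $2 + 1' evidence.bed > evidence.single.bed
```

A record such as `1 20 25` spans five nucleotides and would be dropped.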
```
# incl. pytest and pytest-datafiles
pip install -r requirements-dev.txt
```

To run the tests, comment out the snakemake-related content from `workflow/scripts/annotate.py`, adjust the imports, and run

```
python -m pytest tests/
```
- The CSV output from Sci-ModoM is in fact BED-style (0-based, half-open). These files are under `evidence/chroms`. They are converted to 1-based indexing by `sanitize` and used by `annotate`; these files are under `evidence/chr`.
- The pipeline will fail if e.g. a dataset has no reads mapping to a certain chromosome. Handling this exception in snakemake does not seem trivial. In general, this is not much of an issue if we only work with autosomal chromosomes (use `chroms/exclude`).
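For reference, the 0-based, half-open to 1-based, inclusive conversion performed by `sanitize` amounts to incrementing `chromStart`. A minimal sketch (not the actual implementation):

```shell
# Convert BED-style 0-based, half-open coordinates to 1-based, inclusive:
# increment chromStart; chromEnd is unchanged, since the half-open end
# already equals the 1-based inclusive end.
awk -F'\t' 'BEGIN{OFS=FS} {$2 = $2 + 1; print}' chrom.bed > chrom.1based.bed
```

A single-resolution record `1 9 10` thus becomes `1 10 10`, i.e. the one modified nucleotide at 1-based position 10.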