dieterich-lab/Scintillator

Scintillator

The Sci-ModoM integrator & amalgamator pipeline. A workflow that integrates dataset-level information, and combines and unifies evidence, to support building transcript models with RNA modifications.

Installation

python3 -m venv scintillator
source scintillator/bin/activate
pip install -r requirements.txt

Dependencies

Usage

Configuration

To configure this workflow, modify the following files:

Profile

To run this workflow on a SLURM cluster, modify the default profile under workflow/profiles/default/config.yaml.

How to run the workflow

The option --use-envmodules does not work with the SLURM executor, cf. this issue. Copy the draft implementation and patch your installation by running patch -p1 <snakemake.patch under the environment directory. Then add the following to your profile:

envmodules-precommand: ". /etc/profile.d/modules.sh && . /biosw/__modules__/modules.rc"
use-envmodules: true

Because of this other issue, configuration files must be passed as arguments. Change to the workflow directory and run:

snakemake --executor slurm --config pepfile=/full/path/to/pep/config.yaml --configfile config/config.yaml [snakemake options] 2> log.txt &

What to expect

Scintillator extracts evidence from Sci-ModoM for the parameters specified in pep/config.yaml. For a given species, modification sites are downloaded from the online database for every chromosome. The list of chromosomes, annotations, assembly information, and the genome sequence are downloaded from Ensembl. To restrict the list of chromosomes, use:

chroms:
    include:
        - "1"
        - "2"
    exclude:
        - "X"
        - "Y"
        - "MT"

The include key overrides the exclude key: in the example above, only chromosomes "1" and "2" would be included in the output. The chroms key is optional; if it is not present, all chromosomes are used (chromosomes only).
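The precedence rule can be sketched in a few lines of Python. This is an illustration only, not the pipeline's own code: the function name select_chroms and its arguments are hypothetical, and it assumes chromosome names are strings, as in the quoted YAML values above.

```python
def select_chroms(available, chroms=None):
    """Return the chromosomes to process, keeping the order of `available`.

    `chroms` mirrors the optional `chroms` mapping from the configuration.
    Hypothetical helper: the real workflow may resolve this differently.
    """
    if not chroms:
        # Key absent: use every available chromosome.
        return list(available)
    include = chroms.get("include") or []
    exclude = chroms.get("exclude") or []
    if include:
        # `include` overrides `exclude`.
        wanted = set(include)
        return [c for c in available if c in wanted]
    unwanted = set(exclude)
    return [c for c in available if c not in unwanted]
```

With both keys set as in the example, only "1" and "2" are returned; with exclude alone, everything except "X", "Y", and "MT" is kept.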

Next, Scintillator uses the primary data given in pep/dataset.tsv to (i) restrict the evidence to the available datasets, and (ii) estimate abundance for every matching transcript, where at least one modification has been reported, for every available dataset. If Run is given, the corresponding archive is downloaded from the Sequence Read Archive (FileType must be FASTQ) and dumped to one or two FASTQ files, depending on LibraryLayout; otherwise FilePath must be a path pointing to a BAM file (FileType must be BAM) on the file system.

Note: If Platform is LONG, Salmon is used in alignment-mode. BAM files are first converted to FASTQ (we assume that they are mapped to the genome). FASTQ files are mapped with minimap2. For SHORT, only FASTQ is allowed, and Salmon is used in mapping-mode. In mapping-mode, FASTQ files are passed as is, i.e. without adapter removal and/or (quality) trimming.
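As a rough summary of the note above, the following Python sketch maps Platform and FileType to the Salmon mode. The function salmon_mode is hypothetical and only restates the rules in this README; it does not call Salmon or minimap2, and it assumes the values are spelled exactly as shown (upper case).

```python
def salmon_mode(platform: str, file_type: str) -> str:
    """Return "alignment" or "mapping" for a dataset row.

    Hypothetical helper that encodes the rules stated in the note:
    LONG accepts FASTQ or BAM (alignment-mode), SHORT accepts FASTQ only
    (mapping-mode).
    """
    if platform == "LONG":
        if file_type not in ("FASTQ", "BAM"):
            raise ValueError(f"unsupported FileType for LONG: {file_type}")
        return "alignment"
    if platform == "SHORT":
        if file_type != "FASTQ":
            raise ValueError("SHORT only accepts FASTQ")
        return "mapping"
    raise ValueError(f"unknown Platform: {platform}")
```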

Note: To provide replicates, add different Run identifiers for the same EUFID, but FileType, LibraryLayout, and Platform must all be identical. If FileType is BAM, no replicates are allowed.
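The replicate rules can be checked before launching the workflow. The sketch below is a hypothetical pre-flight check, not part of Scintillator: it assumes the rows of pep/dataset.tsv have already been read into dictionaries keyed by the column names used above (EUFID, FileType, LibraryLayout, Platform).

```python
from collections import defaultdict


def check_replicates(rows):
    """Return a list of human-readable problems found among replicate rows.

    Rows sharing an EUFID are treated as replicates. Hypothetical helper;
    the column names come from this README, everything else is assumed.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["EUFID"]].append(row)

    errors = []
    for eufid, members in groups.items():
        if len(members) < 2:
            continue  # a single row is not a replicate group
        for key in ("FileType", "LibraryLayout", "Platform"):
            if len({m[key] for m in members}) > 1:
                errors.append(f"{eufid}: replicates differ in {key}")
        if any(m["FileType"] == "BAM" for m in members):
            errors.append(f"{eufid}: replicates are not allowed for BAM")
    return errors
```

An empty list means the table satisfies the constraints described in the note.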

Output files are written to the output_directory given in config/config.yaml. The final output is written for each chromosome to a sub-directory named scintillator, e.g. under chr1 there will be:

  • chr1.fa.gz: Multi-FASTA with original transcript sequences
  • chr1.modified.fa.gz: Multi-FASTA with modified transcript sequences
  • chr1.json: Metadata including transcript attributes, modifications and associated evidence (score, frequency, coverage), and abundance (TPM and reads)

The modified transcript sequences use the Unicode characters from MODOMICS (RNAMods code 2023). Both the transcript and the genomic positions in the metadata file are 1-based.
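Because the modified sequences contain non-ASCII characters, the FASTA files should be read as UTF-8 text. The sketch below is a minimal reader using only the standard library. The helper names read_fasta and base_at are hypothetical, and two assumptions are made that this README does not confirm: the record identifier is the first whitespace-delimited token of the header line, and every residue (modified or not) is a single Unicode code point, so that a 1-based position maps to string index position - 1.

```python
import gzip


def read_fasta(path):
    """Read a gzipped multi-FASTA file into a dict {identifier: sequence}."""
    records = {}
    name = None
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: identifier is the first token of the header.
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def base_at(sequence, position):
    """Return the residue at a 1-based transcript position."""
    if position < 1 or position > len(sequence):
        raise IndexError(f"position {position} outside 1..{len(sequence)}")
    return sequence[position - 1]
```

Reading chr1.fa.gz and chr1.modified.fa.gz with the same function and comparing residues at a position from the metadata file is one way to spot-check a reported modification.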

Note: Scintillator does not currently check that the assembly version used by the pipeline and downloaded from Ensembl matches that used in Sci-ModoM.

Output

Check the logs, in particular the annotate logs, to make sure that any unused evidence is mostly from intergenic regions. Intronic regions are not reported; they are simply ignored.

We currently only handle single-nucleotide resolution modifications, i.e. those given by chromEnd = chromStart + 1. Any evidence with a context site is discarded.
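The width condition can be expressed as a one-line filter. The sketch below is illustrative only: keep_single_nucleotide is a hypothetical name, records are assumed to be dictionaries with integer chromStart and chromEnd values, and only the chromEnd = chromStart + 1 check is covered.

```python
def keep_single_nucleotide(records):
    """Keep records spanning exactly one nucleotide in BED coordinates."""
    return [r for r in records if r["chromEnd"] == r["chromStart"] + 1]
```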

Development notes

Testing

# incl. pytest and pytest-datafiles
pip install -r requirements-dev.txt

To run the tests, comment out the snakemake-related content in workflow/scripts/annotate.py, adjust the imports, and run:

python -m pytest tests/

Notes

  • The CSV output from Sci-ModoM is in fact BED-style (0-based, half-open). These files are under evidence/chroms. They are converted to 1-based indexing by sanitize, and the result is used by annotate. The converted files are under evidence/chr.
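For reference, the coordinate conversion mentioned in the note is the standard BED to 1-based mapping. The sketch below shows the arithmetic only; bed_to_one_based is a hypothetical name and this is not the sanitize implementation.

```python
def bed_to_one_based(chrom_start, chrom_end):
    """Convert a 0-based half-open interval [start, end) to 1-based closed.

    The start shifts by one and the end is unchanged, so a single-nucleotide
    BED entry (chromEnd == chromStart + 1) becomes one position.
    """
    if chrom_end <= chrom_start:
        raise ValueError("chromEnd must be greater than chromStart")
    return chrom_start + 1, chrom_end
```

For example, the BED entry with chromStart 9 and chromEnd 10 corresponds to 1-based position 10.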

Issues

  • The pipeline will fail if e.g. a dataset has no reads mapping to a certain chromosome. Handling this exception in snakemake doesn't seem trivial. In general, this is not much of an issue if we only work with autosomal chromosomes (use chroms/exclude).
