RiboKastIndex is a bioinformatics workflow designed for processing Ribo-seq data.
The RiboKastIndex pipeline automates the key steps of the analysis, providing a streamlined and reproducible approach to ribosome profiling. It processes data from raw sequencing reads through to ribosome profiling outputs and k-mer analyses.
Once phased ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to input datasets.
Using KaMRaT, the pipeline constructs a comprehensive k-mer index (the RS index) and generates a contig count matrix (contigs are merged k-mers).
These outputs enable further analyses, such as determining which sequences from a list are actively translated. You can query RNA sequences in the RS index to assess their translation status.
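As a rough illustration of what such a query means conceptually, the sketch below computes the fraction of a sequence's k-mers that occur in a set of indexed k-mers. It is a toy under stated assumptions: a plain Python set stands in for the RS index, the function names `kmers` and `fraction_indexed` are hypothetical, and KaMRaT's own query command and scoring may work differently.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def fraction_indexed(seq, index, k):
    """Fraction of the k-mers of seq that are present in index (a set of k-mers)."""
    query_kmers = kmers(seq, k)
    if not query_kmers:  # sequence shorter than k
        return 0.0
    return sum(kmer in index for kmer in query_kmers) / len(query_kmers)


# Toy index with k = 5 (hypothetical values; a real index would hold many more k-mers)
toy_index = {"ATGGC", "TGGCC", "GGCCA"}
fully_covered = fraction_indexed("ATGGCCA", toy_index, 5)
partly_covered = fraction_indexed("ATGGCTT", toy_index, 5)
print(fully_covered, partly_covered)
```

A sequence whose k-mers are mostly found in the index is a candidate for being actively translated; where to set the cutoff is left to the analysis.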
Framed (phased) reads are defined per read length. For example, when more than 50% of the reads of a given length map to the same frame of the CDS (coding sequence), the P-site of those reads aligns with the codons, that is, with P0 of the CDS. The translated frame is chosen among the three possible frames by applying this threshold: if, for instance, more than 50% of the reads of a given length translate in frame 1, all reads of that length are considered phased.
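The 50% rule can be sketched in a few lines. This is only an illustration of the criterion, not the pipeline's code: the input format (pairs of read length and frame), the frame numbering 0 to 2, and the function name `phased_lengths` are assumptions made for the example.

```python
from collections import Counter, defaultdict


def phased_lengths(reads, threshold=0.5):
    """Map each phased read length to its dominant frame.

    reads: iterable of (read_length, frame) pairs, frame in {0, 1, 2}.
    A length is phased when one frame holds strictly more than
    `threshold` of the reads of that length.
    """
    per_length = defaultdict(Counter)
    for length, frame in reads:
        per_length[length][frame] += 1

    phased = {}
    for length, counts in per_length.items():
        frame, n_reads = counts.most_common(1)[0]
        if n_reads / sum(counts.values()) > threshold:
            phased[length] = frame
    return phased


# Toy data: 70% of the 28-nt reads share frame 0; the 30-nt reads are spread out.
toy_reads = (
    [(28, 0)] * 7 + [(28, 1)] * 2 + [(28, 2)] * 1
    + [(30, 0)] * 4 + [(30, 1)] * 3 + [(30, 2)] * 3
)
result = phased_lengths(toy_reads)
print(result)
```

In this toy input only the 28-nt reads pass the threshold, so only that length would contribute phased k-mers.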
From the phased reads, the pipeline extracts phased k-mers, which are aligned to the same translation frame. Once these phased ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to the input datasets. Using KaMRaT, the pipeline builds a comprehensive k-mer index and generates a contig count matrix (with contigs being merged k-mers), facilitating downstream analysis.
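To make the shape of the phased k-mer matrix concrete, here is a minimal sketch that extracts in-frame k-mers and tabulates them per dataset. Several things are assumptions for illustration: each read is given with the position of its first codon boundary, in-frame k-mers are taken as those starting on a codon boundary, and the names `in_frame_kmers` and `kmer_matrix` are hypothetical. The pipeline itself delegates counting and merging to its own tools.

```python
from collections import Counter


def in_frame_kmers(seq, first_codon_start, k):
    """Yield the k-mers of seq that start on a codon boundary."""
    for start in range(first_codon_start, len(seq) - k + 1, 3):
        yield seq[start:start + k]


def kmer_matrix(samples, k):
    """Build a matrix with k-mers as rows and datasets as columns.

    samples: {dataset_name: [(read_sequence, first_codon_start), ...]}
    Returns (column_names, {kmer: [count per dataset, in column order]}).
    """
    names = sorted(samples)
    counts = {name: Counter() for name in names}
    for name in names:
        for seq, offset in samples[name]:
            counts[name].update(in_frame_kmers(seq, offset, k))
    all_kmers = sorted(set().union(*counts.values()))
    return names, {kmer: [counts[n][kmer] for n in names] for kmer in all_kmers}


# Toy input with k = 5 (the example configuration uses kmerSize 25)
toy_samples = {
    "A": [("ATGGCCAAG", 0)],
    "B": [("ATGGCCAAG", 0), ("GGCCAAGTT", 1)],
}
columns, matrix = kmer_matrix(toy_samples, 5)
print(columns)
for kmer, row in matrix.items():
    print(kmer, row)
```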
Before running the RiboKastIndex pipeline, make sure the following prerequisites are in place: the required Ribodoc Conda environment, KaMRaT for k-mer analysis, and joinCounts for merging k-mer counts.
The pipeline relies on a Conda environment defined in the RiboKastIndex.yaml file. Follow the steps below to set up and activate the environment.
If the environment is not already created, follow these steps to create it:
- Install Miniconda or Conda if it's not already installed.
- Create the environment from the RiboKastIndex.yaml file:
    conda env create -f /path/to/RS_Framed_kmers/RiboKastIndex.yaml
- List your environments: After creating the environment, run the following to get the path of the created environment:
    conda info --envs
- Activate the environment: Using the path or the environment name from the previous command, activate your environment:
    source /home/yourusername/miniconda3/bin/activate your_environment_name
The pipeline uses KaMRaT for k-mer analysis. Follow these steps to download and configure the KaMRaT Singularity image:
- Download the KaMRaT Singularity image (.sif) from the official GitHub repository.
- Use the following command to download the image:
    singularity pull KaMRaT.sif docker://transipedia/kamrat:latest
- Configure the path to the downloaded image in the config.yaml file under the kamratImg key:
    kamratImg: "/path/to/KaMRaT.sif"
  Replace /path/to/KaMRaT.sif with the actual path where the Singularity image is located.
joinCounts is used for merging k-mer counts. It is available on GitHub at https://github.com/Transipedia/dekupl-joinCounts.
- Clone and install joinCounts:
    git clone https://github.com/Transipedia/dekupl-joinCounts.git
    cd dekupl-joinCounts
    make
- Set the path to joinCounts in the config.yaml file:
    pathJoinCounts: "$PATH:/path/to/dekupl-joinCounts"
  Replace /path/to/dekupl-joinCounts with the actual path to the directory that contains the joinCounts executable.
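Before launching a run, it can help to confirm that the prerequisite tools are reachable. The sketch below is a convenience check and not part of the pipeline: the function name `missing_tools` is hypothetical, and it assumes the tools are expected on the current PATH (joinCounts may instead be located through the pathJoinCounts setting).

```python
import shutil


def missing_tools(tools=("conda", "singularity", "joinCounts")):
    """Return the tools from `tools` that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


not_found = missing_tools()
if not_found:
    print("Not found on PATH:", ", ".join(not_found))
else:
    print("All prerequisite tools were found.")
```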
The results generated by the RS_Framed_kmers pipeline are organized into several key directories:
- BAM_transcriptome.25-35/: Contains BAM and BAM index files of the reads aligned to the transcriptome.
- adapter_lists/: Stores adapter sequences detected or used for trimming.
- annex_database/: Contains reference indices (Bowtie2 and Hisat2), GFF files, and other annotations used for the analysis.
- cutadapt/: Trimmed FastQ files for each sample after adapter removal.
- fastqc/: FastQC quality reports before and after trimming.
- kmerCount/: Results of k-mer counting, including individual k-mer counts for each sample, the final merged k-mer result (merged-res.tsv), and the k-mer index.
- no-outRNA/: FastQ files with rRNA reads removed.
- riboWaltz.25-35/: Results from riboWaltz analysis, including P-site offset data, periodicity plots, and frame-shift analysis.
Below is a concise tree structure of the key output directories:
|-- BAM_transcriptome.25-35/
|-- adapter_lists/
|-- annex_database/
|   |-- NamedCDS_human.gff3
|   |-- index_files/
|-- cutadapt/
|-- fastqc/
|   |-- fastqc_after_trimming/
|   |-- fastqc_before_trimming/
|-- kmerCount/
|   |-- Kamrat/
|   |   |-- index/
|   |   |   |-- idx-mat.bin
|   |   |   |-- idx-meta.bin
|   |   |   |-- idx-pos.bin
|   |   |-- merged-res.tsv
|   |-- Kmer/
|   |   |-- SRR2146892_1.tsv
|   |-- Matrix/
|   |   |-- matrixFilteredHeader.tsv
|-- no-outRNA/
|-- riboWaltz.25-35/
|   |-- best_offset.csv
|   |-- frame_psite.tiff
|   |-- frame_psite_length.csv
|   |-- frame_psite_length.tiff
|   |-- psite_offset.csv
|   |-- psite_table_forKmerCount.txt
|   |-- psite_table_offset.csv
|   |-- transcriptome_elongated.SRR2146892_1/
|   |-- transcriptome_elongated.SRR23563666/
The main outputs are the RS index (kmerCount/Kamrat/index), the k-mer count table (kmerCount/Matrix/matrixFilteredHeader.tsv), and the contig count table (kmerCount/Kamrat/merged-res.tsv), where contigs represent merged k-mers.
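A quick way to confirm that a run produced the three main outputs is to test for them under the results directory. This is a small sketch, assuming the paths listed above are relative to the results directory; the helper name `missing_outputs` is hypothetical, and the temporary directory only simulates an incomplete run.

```python
import tempfile
from pathlib import Path

# Main outputs, relative to the results directory
MAIN_OUTPUTS = [
    "kmerCount/Kamrat/index",
    "kmerCount/Matrix/matrixFilteredHeader.tsv",
    "kmerCount/Kamrat/merged-res.tsv",
]


def missing_outputs(results_dir):
    """Return the main outputs that do not exist under results_dir."""
    root = Path(results_dir)
    return [rel for rel in MAIN_OUTPUTS if not (root / rel).exists()]


# Demonstration on a temporary directory where only the index folder exists
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "kmerCount/Kamrat/index").mkdir(parents=True)
    demo_missing = missing_outputs(tmp)
print(demo_missing)
```

On a real run, pass the directory configured as results_path and expect an empty list.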
The pipeline uses a configuration file (config.yaml) that defines project-specific settings, including paths, reference files, trimming parameters, k-mer analysis settings, and more. This file must be tailored to your specific environment and data.
Your working directory should contain a fastq directory for your FASTQ files, as well as a database directory for the reference files specified in the configuration file.
Below is an example of the key sections from the config.yaml file:
# Project Configuration
project_name: "Pancreas ribosome profiling"
# Path Settings
paths:
local_path: "/data/projects/RS_Framed_kmers/"
ribodoc_kmerIndex_tools: "/data/tools/RS_Framed_kmers/tools/"
results_path: "/data/projects/RS_Framed_kmers/results/"
stats_path: "/data/projects/RS_Framed_kmers/stats/"
logs_path: "/data/projects/RS_Framed_kmers/logs/"
snakemake_log_path: "/data/projects/RS_Framed_kmers/.snakemake/log/"
fastq_path: "/data/projects/RS_Framed_kmers/fastq/"
conda_env: "/data/envs/RiboKastIndex.yaml"
# Reference Files
fasta: "/data/references/human_genome.fa"
gff: "/data/references/human_annotations.gff3"
fasta_outRNA: "/data/references/rRNA_exclusion.fasta"
# Adapter Trimming Settings
already_trimmed: "no"
adapt_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Length Selection for Profiling
readsLength_min: "25"
readsLength_max: "35"
# K-mer Index Construction Settings
# Path to joinCounts
pathJoinCounts: "$PATH:/data/tools/dekupl-joinCounts"
kmerSize: "25"
mode: "phase"
kamratImg: "/data/tools/KaMRaT/KaMRaT.sif"
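Because the numeric settings are written as quoted strings, a small sanity check before launching can catch typos early. The sketch below is not part of the pipeline: it assumes the settings have already been loaded into a flat Python dict, and the function name `check_config` and the specific checks are illustrative choices. The last check rests on the fact that a read shorter than kmerSize cannot contain a full k-mer.

```python
def check_config(cfg):
    """Return a list of problems found in the numeric settings of cfg."""
    problems = []
    values = {}
    for key in ("readsLength_min", "readsLength_max", "kmerSize"):
        try:
            values[key] = int(cfg[key])
        except (KeyError, ValueError, TypeError):
            problems.append(f"{key} is missing or not an integer")
    if problems:
        return problems

    if values["readsLength_min"] > values["readsLength_max"]:
        problems.append("readsLength_min is greater than readsLength_max")
    if values["kmerSize"] > values["readsLength_max"]:
        problems.append("kmerSize exceeds readsLength_max, so no retained read can contain a full k-mer")
    return problems


# Values taken from the example configuration above
example = {"readsLength_min": "25", "readsLength_max": "35", "kmerSize": "25"}
example_problems = check_config(example)
print(example_problems or "no problems found")
```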