RiboKastIndex

Makes k-mer indexes from ribo-seq data

Introduction

RiboKastIndex is a bioinformatics workflow for processing Ribo-seq data.
The pipeline automates the key steps of ribosome profiling in a reproducible way, from raw sequencing reads to ribosome profiling outputs and k-mer analyses. Once phased Ribo-seq k-mers are generated, they are organized into a phased k-mer matrix in which rows represent k-mers and columns correspond to input datasets. Using KaMRaT, the pipeline then builds a comprehensive k-mer index (the RS index) and generates a contig count matrix, where contigs are merged k-mers. These outputs support further analyses, such as determining which sequences in a list are actively translated: you can query RNA sequences against the RS index to assess their translation status.

Additional Information on RS_Framed_kmers:

Framed (phased) reads are identified per read length. When more than 50% of the reads of a given length map to the same frame of the CDS (coding sequence), their P-sites align with codon position P0 of the CDS. For example, if more than 50% of the reads of a given length fall in frame 1 out of the three possible frames, all reads of that length are considered phased.
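The 50% rule above can be sketched in a few lines of Python. This is only an illustration of the criterion, not the pipeline's implementation (the results directory suggests the pipeline relies on riboWaltz for P-site and frame assignment). The function name `phased_read_lengths` and the `(read_length, frame)` input format are assumptions made for this example.

```python
from collections import defaultdict


def phased_read_lengths(psite_frames, threshold=0.5):
    """Return {read_length: dominant_frame} for every read length whose
    dominant frame holds more than `threshold` of the reads of that length.

    psite_frames: iterable of (read_length, frame) pairs, frame in {0, 1, 2}.
    This input format is hypothetical, chosen only for the illustration.
    """
    counts = defaultdict(lambda: [0, 0, 0])
    for length, frame in psite_frames:
        counts[length][frame] += 1

    phased = {}
    for length, per_frame in counts.items():
        total = sum(per_frame)
        best = max(range(3), key=lambda f: per_frame[f])
        # Strictly "more than" the threshold, as stated in the text.
        if per_frame[best] / total > threshold:
            phased[length] = best
    return phased


# Toy data: 28-nt reads are 80% in frame 0; 31-nt reads peak at only 40%.
toy_reads = (
    [(28, 0)] * 8 + [(28, 1)] * 1 + [(28, 2)] * 1
    + [(31, 0)] * 4 + [(31, 1)] * 3 + [(31, 2)] * 3
)
print(phased_read_lengths(toy_reads))  # {28: 0}
```

In this toy input only the 28-nt reads pass the threshold, so only that read length would contribute phased k-mers.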

From the phased reads, the pipeline extracts phased k-mers, which are aligned to the same translation frame. Once these phased ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to the input datasets. Using KaMRaT, the pipeline builds a comprehensive k-mer index and generates a contig count matrix (with contigs being merged k-mers), facilitating downstream analysis.
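To make the matrix layout concrete, here is a minimal Python sketch that counts k-mers per sample and arranges them with k-mers as rows and datasets as columns. It is a simplified stand-in: the pipeline uses joinCounts for the merge, and this sketch does not reproduce the frame-aware extraction of phased k-mers. The function names and the toy sequences are made up for the example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k across a list of read sequences."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def build_matrix(per_sample_counts):
    """Arrange per-sample k-mer counts as (header, rows).

    Rows are k-mers, columns are samples; absent k-mers get a count of 0.
    """
    samples = sorted(per_sample_counts)
    kmers = sorted(set().union(*per_sample_counts.values()))
    rows = [
        [kmer] + [per_sample_counts[s].get(kmer, 0) for s in samples]
        for kmer in kmers
    ]
    return ["kmer"] + samples, rows


# Toy example with k=4 (the example configuration uses kmerSize 25).
per_sample = {
    "sample_A": count_kmers(["ATGGCC"], k=4),
    "sample_B": count_kmers(["TGGCCA"], k=4),
}
header, rows = build_matrix(per_sample)
print("\t".join(header))
for row in rows:
    print("\t".join(str(value) for value in row))
```

Each printed line after the header is one k-mer followed by its count in each sample, which is the same orientation as the phased k-mer matrix described above.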

Requirements and Setup

Before running the RiboKastIndex pipeline, make sure the following prerequisites are in place: the required Ribodoc Conda environment, KaMRaT for k-mer analysis, and joinCounts for merging k-mer counts.

1. Set Up and Activate Conda Environment

The pipeline relies on a Conda environment defined in the RiboKastIndex.yaml file. Follow the steps below to set up and activate the environment.

Step 1: Create the Conda Environment

If the environment is not already created, follow these steps to create it:

Install Miniconda or Conda if it's not already installed:
- Miniconda Installation Guide

Create the environment from the RiboKastIndex.yaml file:

conda env create -f /path/to/RS_Framed_kmers/RiboKastIndex.yaml

List your environments: After creating the environment, run the following to get the path of the created environment:
```
conda info --envs
```
Activate the environment: Using the path or the environment name from the previous command, activate your environment:
```
source /home/yourusername/miniconda3/bin/activate your_environment_name
```

2. Install and Configure KaMRaT

The pipeline uses KaMRaT for k-mer analysis. Follow these steps to download and configure the KaMRaT Singularity image:

Download the KaMRaT Singularity image (.sif) from the official GitHub repository:
- KaMRaT GitHub

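Use the following command to pull the image. It fetches the transipedia/kamrat Docker image and converts it into a Singularity image file named KaMRaT.sif: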

singularity pull KaMRaT.sif docker://transipedia/kamrat:latest

Configure the path to the downloaded image in the config.yaml file under the kamratImg key:
```
kamratImg: "/path/to/KaMRaT.sif"
```
Replace /path/to/KaMRaT.sif with the actual path where the Singularity image is located.

3. Install and Configure joinCounts

joinCounts is used for merging k-mer counts. You can find it on GitHub at the following link:

joinCounts GitHub

Clone and install joinCounts:

git clone https://github.com/Transipedia/dekupl-joinCounts.git
cd dekupl-joinCounts
make

Set the path to joinCounts in the config.yaml file:
```
pathJoinCounts: "$PATH:/path/to/dekupl-joinCounts"
```
Replace /path/to/dekupl-joinCounts with the actual path to the directory that contains the joinCounts executable.

Results Directory Structure

The results generated by the RS_Framed_kmers pipeline are organized into several key directories:

BAM_transcriptome.25-35/: Contains BAM and BAM index files of reads aligned to the transcriptome.
adapter_lists/: Stores adapter sequences detected or used for trimming.
annex_database/: Contains reference indices (Bowtie2 and HISAT2), GFF files, and other annotations used for the analysis.
cutadapt/: Trimmed FASTQ files for each sample after adapter removal.
fastqc/: FastQC quality reports before and after trimming.
kmerCount/: Results of k-mer counting, including individual k-mer counts for each sample, the final merged k-mer result (merged-res.tsv), and the k-mer index.
no-outRNA/: FastQ files with rRNA reads removed.
riboWaltz.25-35/: Results from riboWaltz analysis, including P-site offset data, periodicity plots, and frame-shift analysis.

Below is a concise tree structure of the key output directories:

|-- BAM_transcriptome.25-35/
|-- adapter_lists/
|-- annex_database/
|   |-- NamedCDS_human.gff3
|   |-- index_files/
|-- cutadapt/
|-- fastqc/
|   |-- fastqc_after_trimming/
|   |-- fastqc_before_trimming/
|-- kmerCount/
|   |-- Kamrat/
|   |   |-- index/
|   |   |   |-- idx-mat.bin
|   |   |   |-- idx-meta.bin
|   |   |   |-- idx-pos.bin
|   |   |-- merged-res.tsv
|   |-- Kmer/
|   |   |-- SRR2146892_1.tsv
|   |-- Matrix/
|   |   |-- matrixFilteredHeader.tsv
|-- no-outRNA/
|-- riboWaltz.25-35/
|   |-- best_offset.csv
|   |-- frame_psite.tiff
|   |-- frame_psite_length.csv
|   |-- frame_psite_length.tiff
|   |-- psite_offset.csv
|   |-- psite_table_forKmerCount.txt
|   |-- psite_table_offset.csv
|   |-- transcriptome_elongated.SRR2146892_1/
|   |-- transcriptome_elongated.SRR23563666/

The main outputs are the RS index (kmerCount/Kamrat/index), the k-mer count table (kmerCount/Matrix/matrixFilteredHeader.tsv), and the contig count table (kmerCount/Kamrat/merged-res.tsv), where contigs represent merged k-mers.
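For a quick look at one of the count tables, a short Python sketch such as the following may help. It assumes the table is tab-separated with a header row, feature names (k-mers or contigs) in the first column and one count column per sample; that layout is an assumption to check against your own files. The helper names and the toy file written to a temporary directory are illustrative only.

```python
import csv
import tempfile
from pathlib import Path


def load_count_table(path):
    """Read a tab-separated count table into (sample_names, {feature: counts}).

    Assumes a header row and the feature name in the first column.
    """
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        header = next(reader)
        samples = header[1:]
        table = {row[0]: [float(v) for v in row[1:]] for row in reader if row}
    return samples, table


def detected_in_all_samples(table):
    """Return the features that have a non-zero count in every sample."""
    return [f for f, counts in table.items() if all(c > 0 for c in counts)]


# Toy file standing in for a real table such as matrixFilteredHeader.tsv.
demo_file = Path(tempfile.mkdtemp()) / "toy_matrix.tsv"
demo_file.write_text("kmer\tsample_A\tsample_B\nATGG\t3\t0\nGGCC\t2\t5\n")

samples, table = load_count_table(demo_file)
print(samples)                         # ['sample_A', 'sample_B']
print(detected_in_all_samples(table))  # ['GGCC']
```

For real tables with millions of k-mers, loading everything into a dictionary may be too memory-hungry; streaming the rows instead would be the safer choice.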

Configuration

The pipeline uses a configuration file (config.yaml) that defines project-specific settings, including paths, reference files, trimming parameters, k-mer analysis settings, and more. This file must be tailored to your specific environment and data. Your working directory should contain a fastq directory for your FASTQ files, as well as a database directory for the reference files specified in the configuration file. Below is an example of the key sections from the config.yaml file:

# Project Configuration
project_name: "Pancreas ribosome profiling"

# Path Settings
paths:
  local_path: "/data/projects/RS_Framed_kmers/"
  ribodoc_kmerIndex_tools: "/data/tools/RS_Framed_kmers/tools/"
  results_path: "/data/projects/RS_Framed_kmers/results/"
  stats_path: "/data/projects/RS_Framed_kmers/stats/"
  logs_path: "/data/projects/RS_Framed_kmers/logs/"
  snakemake_log_path: "/data/projects/RS_Framed_kmers/.snakemake/log/"
  fastq_path: "/data/projects/RS_Framed_kmers/fastq/"
  conda_env: "/data/envs/RiboKastIndex.yaml"

# Reference Files
fasta: "/data/references/human_genome.fa"
gff: "/data/references/human_annotations.gff3"
fasta_outRNA: "/data/references/rRNA_exclusion.fasta"

# Adapter Trimming Settings
already_trimmed: "no"
adapt_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Length Selection for Profiling
readsLength_min: "25"
readsLength_max: "35"

# K-mer Index Construction Settings
# Path to joinCounts
pathJoinCounts: "$PATH:/data/tools/dekupl-joinCounts"
kmerSize: "25"
mode: "phase"
kamratImg: "/data/tools/KaMRaT/KaMRaT.sif"
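
Because the numeric settings in the example are quoted strings, a small sanity check before launching the pipeline can catch typos early. The sketch below works on a plain Python dictionary that mirrors the example above; it is not part of the pipeline, and the list of keys it checks is taken from the example rather than from the Snakefile, so treat it as an assumption.

```python
# Keys taken from the example configuration above (illustrative list only).
CHECKED_KEYS = [
    "fasta", "gff", "fasta_outRNA",
    "readsLength_min", "readsLength_max",
    "kmerSize", "kamratImg", "pathJoinCounts",
]


def check_config(cfg):
    """Return a list of human-readable problems found in a config dict."""
    problems = [f"missing key: {key}" for key in CHECKED_KEYS if key not in cfg]

    numbers = {}
    for key in ("readsLength_min", "readsLength_max", "kmerSize"):
        try:
            numbers[key] = int(cfg[key])
        except KeyError:
            continue  # already reported as missing
        except (TypeError, ValueError):
            problems.append(f"{key} is not an integer: {cfg[key]!r}")

    if (
        {"readsLength_min", "readsLength_max"} <= numbers.keys()
        and numbers["readsLength_min"] > numbers["readsLength_max"]
    ):
        problems.append("readsLength_min is greater than readsLength_max")
    return problems


example_cfg = {
    "fasta": "/data/references/human_genome.fa",
    "gff": "/data/references/human_annotations.gff3",
    "fasta_outRNA": "/data/references/rRNA_exclusion.fasta",
    "readsLength_min": "25",
    "readsLength_max": "35",
    "kmerSize": "25",
    "kamratImg": "/data/tools/KaMRaT/KaMRaT.sif",
    "pathJoinCounts": "$PATH:/data/tools/dekupl-joinCounts",
}
print(check_config(example_cfg))  # []

# A deliberately broken variant: swapped length bounds and a bad k-mer size.
broken_cfg = dict(example_cfg, readsLength_min="35", readsLength_max="25", kmerSize="abc")
for problem in check_config(broken_cfg):
    print(problem)
```

To run the same check on a real config.yaml, the file would first need to be parsed into a dictionary, for example with a YAML library of your choice.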
