RiboKastIndex

Makes k-mer indexes from ribo-seq data

Introduction

RiboKastIndex is a bioinformatics workflow for processing Ribo-seq data.
The pipeline automates the key steps of a ribosome profiling analysis, from raw sequencing reads to ribosome profiling outputs and k-mer analyses, in a streamlined and reproducible way. Once phased Ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to input datasets. Using KaMRaT, the pipeline builds a comprehensive k-mer index (the RS index) and generates a contig count matrix (contigs are merged k-mers). These outputs enable further analyses, such as determining which sequences from a list are actively translated: you can query RNA sequences against the RS index to assess their translation status.

Additional Information on RS_Framed_kmers:

Reads of a given length are considered framed (phased) when, for example, more than 50% of them map to the same frame of the CDS (coding sequence), which indicates that their P-sites align with the codons of the CDS (P0). The translated frame is identified among the three possible frames: if, for instance, more than 50% of the reads of a specific length fall in frame 1, the reads of that length are considered phased.
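The majority rule described above can be illustrated with a minimal Python sketch. It is not the riboWaltz implementation used by the pipeline: the input format (a list of `(read_length, frame)` pairs), the function name `phased_lengths`, and the toy data are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def phased_lengths(reads, threshold=0.5):
    """Return {read_length: dominant_frame} for every read length whose
    most frequent frame holds strictly more than `threshold` of its reads."""
    per_length = defaultdict(Counter)
    for length, frame in reads:
        per_length[length][frame] += 1

    phased = {}
    for length, frames in per_length.items():
        frame, count = frames.most_common(1)[0]
        if count / sum(frames.values()) > threshold:
            phased[length] = frame
    return phased


# Hypothetical toy data: (read length, frame of the P-site within the CDS).
reads = (
    [(28, 0)] * 7 + [(28, 1)] * 2 + [(28, 2)] * 1    # 70% of 28-nt reads in frame 0
    + [(30, 0)] * 4 + [(30, 1)] * 3 + [(30, 2)] * 3  # no frame above 50% for 30 nt
)
print(phased_lengths(reads))
```

With this toy input, only the 28-nt reads pass the threshold; the 30-nt reads are spread too evenly across frames to be called phased.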

From the phased reads, the pipeline extracts phased k-mers, which are aligned to the same translation frame. Once these phased ribo-seq k-mers are generated, they are organized into a phased k-mer matrix, where rows represent k-mers and columns correspond to the input datasets. Using KaMRaT, the pipeline builds a comprehensive k-mer index and generates a contig count matrix (with contigs being merged k-mers), facilitating downstream analysis.
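The matrix layout described above (rows are k-mers, columns are datasets) can be sketched in a few lines of Python. This is only an illustration of the data structure, not of joinCounts itself: the function name `join_counts`, the in-memory dictionaries, and the shortened 6-nt k-mers are assumptions, and any filtering the real tool may apply is not modeled.

```python
def join_counts(per_sample):
    """Merge {sample: {kmer: count}} into a header plus one row per k-mer.

    K-mers absent from a sample get a count of 0 in that column.
    """
    samples = sorted(per_sample)
    kmers = sorted(set().union(*(counts.keys() for counts in per_sample.values())))
    header = ["kmer"] + samples
    rows = [[kmer] + [per_sample[s].get(kmer, 0) for s in samples] for kmer in kmers]
    return header, rows


# Hypothetical per-sample counts; real k-mers would have length kmerSize.
per_sample = {
    "SRR2146892_1": {"ATGGCC": 5, "GCCAAA": 2},
    "SRR23563666": {"ATGGCC": 1, "TTTGGG": 4},
}
header, rows = join_counts(per_sample)
for line in [header] + rows:
    print("\t".join(str(field) for field in line))
```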

Requirements and Setup

Before running the RiboKastIndex pipeline, make sure the following prerequisites are in place: the required Ribodoc Conda environment, KaMRaT for k-mer analysis, and joinCounts for merging k-mer counts.

1. Set Up and Activate Conda Environment

The pipeline relies on a Conda environment defined in the RiboKastIndex.yaml file. Follow the steps below to set up and activate the environment.

Step 1: Create the Conda Environment

If the environment has not been created yet, follow these steps:

  1. Install Miniconda or Conda if it is not already installed.

  2. Create the environment from the RiboKastIndex.yaml file:

    conda env create -f /path/to/RS_Framed_kmers/RiboKastIndex.yaml

  3. List your environments. After creating the environment, run the following to get its name and path:

    conda info --envs

  4. Activate the environment, using the path or the name reported by the previous command:

    source /home/yourusername/miniconda3/bin/activate your_environment_name

2. Install and Configure KaMRaT

The pipeline uses KaMRaT for k-mer analysis. Follow these steps to download and configure the KaMRaT Singularity image:

  1. Download the KaMRaT Singularity image (.sif) distributed by the KaMRaT project (see its official GitHub repository) with the following command:

    singularity pull KaMRaT.sif docker://transipedia/kamrat:latest

  2. Configure the path to the downloaded image in the config.yaml file under the kamratImg key:

    kamratImg: "/path/to/KaMRaT.sif"

    Replace /path/to/KaMRaT.sif with the actual path where the Singularity image is located.


3. Install and Configure joinCounts

joinCounts is used for merging k-mer counts. It is available on GitHub at https://github.com/Transipedia/dekupl-joinCounts.

  1. Clone and install joinCounts:

    git clone https://github.com/Transipedia/dekupl-joinCounts.git
    cd dekupl-joinCounts
    make
  2. Set the path to joinCounts in the config.yaml file:

    pathJoinCounts: "$PATH:/path/to/dekupl-joinCounts"

    Replace /path/to/dekupl-joinCounts with the actual path to the directory that contains the joinCounts executable (the value is appended to PATH).
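Before launching the pipeline, it can help to confirm that the three prerequisites are actually discoverable. The sketch below is not part of RiboKastIndex: the helper name `find_tools` is hypothetical, and it assumes the executables are named `conda`, `singularity`, and `joinCounts`. It mirrors the `$PATH:/path/to/dekupl-joinCounts` convention by appending extra directories to the search path.

```python
import os
import shutil


def find_tools(names, extra_dirs=()):
    """Map each tool name to its resolved path, or None when it is not found.

    `extra_dirs` are searched after the directories already listed in PATH.
    """
    search_path = os.pathsep.join([os.environ.get("PATH", ""), *extra_dirs])
    return {name: shutil.which(name, path=search_path) for name in names}


# The joinCounts directory is a placeholder, as in config.yaml.
found = find_tools(
    ["conda", "singularity", "joinCounts"],
    extra_dirs=["/path/to/dekupl-joinCounts"],
)
for name, location in found.items():
    print(f"{name}: {location or 'NOT FOUND'}")
```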

Results Directory Structure

The results generated by the RS_Framed_kmers pipeline are organized into several key directories:

  • BAM_transcriptome.25-35/: Contains BAM and BAM index files of reads aligned to the transcriptome.
  • adapter_lists/: Stores adapter sequences detected or used for trimming.
  • annex_database/: Contains reference indices (Bowtie2 and Hisat2), GFF files, and other annotations used for the analysis.
  • cutadapt/: Trimmed FastQ files for each sample after adapter removal.
  • fastqc/: FastQC quality reports before and after trimming.
  • kmerCount/: Results of k-mer counting, including individual k-mer counts for each sample, the filtered k-mer count matrix, the final merged contig result (merged-res.tsv), and the k-mer index.
  • no-outRNA/: FastQ files with rRNA reads removed.
  • riboWaltz.25-35/: Results from riboWaltz analysis, including P-site offset data, periodicity plots, and frame-shift analysis.

Below is a concise tree structure of the key output directories:

|-- BAM_transcriptome.25-35/
|-- adapter_lists/
|-- annex_database/
|   |-- NamedCDS_human.gff3
|   |-- index_files/
|-- cutadapt/
|-- fastqc/
|   |-- fastqc_after_trimming/
|   |-- fastqc_before_trimming/
|-- kmerCount/
|   |-- Kamrat/
|   |   |-- index/
|   |   |   |-- idx-mat.bin
|   |   |   |-- idx-meta.bin
|   |   |   |-- idx-pos.bin
|   |   |-- merged-res.tsv
|   |-- Kmer/
|   |   |-- SRR2146892_1.tsv
|   |-- Matrix/
|   |   |-- matrixFilteredHeader.tsv
|-- no-outRNA/
|-- riboWaltz.25-35/
|   |-- best_offset.csv
|   |-- frame_psite.tiff
|   |-- frame_psite_length.csv
|   |-- frame_psite_length.tiff
|   |-- psite_offset.csv
|   |-- psite_table_forKmerCount.txt
|   |-- psite_table_offset.csv
|   |-- transcriptome_elongated.SRR2146892_1/
|   |-- transcriptome_elongated.SRR23563666/

The main outputs are the RS index (kmerCount/Kamrat/index), the k-mer count table (kmerCount/Matrix/matrixFilteredHeader.tsv), and the contig count table (kmerCount/Kamrat/merged-res.tsv), where contigs represent merged k-mers.
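As a rough illustration of how the k-mer count table could be used downstream, the sketch below loads a tab-separated table and reports how many k-mers of a query sequence appear in it. Querying the RS index itself is done with KaMRaT and is not shown here. The file layout (a header row, the k-mer in the first column, one numeric column per sample) is an assumption about matrixFilteredHeader.tsv, the function names are hypothetical, and reverse complements are not considered.

```python
import csv
import io


def load_kmer_table(handle):
    """Read a TSV with a header row into (sample_names, {kmer: [counts]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    table = {row[0]: [float(value) for value in row[1:]] for row in reader if row}
    return header[1:], table


def query_sequence(seq, table, k):
    """Return (k-mers of `seq` found in `table`, total k-mers in `seq`)."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return sum(1 for kmer in kmers if kmer in table), len(kmers)


# Toy table with 4-nt k-mers standing in for the real file.
toy_tsv = "kmer\tS1\tS2\nATGG\t3\t0\nTGGC\t1\t2\n"
samples, table = load_kmer_table(io.StringIO(toy_tsv))
found, total = query_sequence("ATGGCA", table, k=4)
print(f"{found} of {total} k-mers present across samples {samples}")
```

For a real run, the `io.StringIO` object would be replaced by an open file handle on the matrix and `k` by the configured kmerSize.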

Configuration

The pipeline uses a configuration file (config.yaml) that defines project-specific settings, including paths, reference files, trimming parameters, k-mer analysis settings, and more. This file must be tailored to your specific environment and data. Your working directory should contain a fastq directory for your FASTQ files, as well as a database directory for the reference files specified in the configuration file. Below is an example of the key sections from the config.yaml file:

# Project Configuration
project_name: "Pancreas ribosome profiling"

# Path Settings
paths:
  local_path: "/data/projects/RS_Framed_kmers/"
  ribodoc_kmerIndex_tools: "/data/tools/RS_Framed_kmers/tools/"
  results_path: "/data/projects/RS_Framed_kmers/results/"
  stats_path: "/data/projects/RS_Framed_kmers/stats/"
  logs_path: "/data/projects/RS_Framed_kmers/logs/"
  snakemake_log_path: "/data/projects/RS_Framed_kmers/.snakemake/log/"
  fastq_path: "/data/projects/RS_Framed_kmers/fastq/"
  conda_env: "/data/envs/RiboKastIndex.yaml"

# Reference Files
fasta: "/data/references/human_genome.fa"
gff: "/data/references/human_annotations.gff3"
fasta_outRNA: "/data/references/rRNA_exclusion.fasta"

# Adapter Trimming Settings
already_trimmed: "no"
adapt_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Length Selection for Profiling
readsLength_min: "25"
readsLength_max: "35"

# K-mer Index Construction Settings
# Path to joinCounts
pathJoinCounts: "$PATH:/data/tools/dekupl-joinCounts"
kmerSize: "25"
mode: "phase"
kamratImg: "/data/tools/KaMRaT/KaMRaT.sif"
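
A few settings in this file constrain each other: the minimum read length cannot exceed the maximum, and a k-mer longer than the longest selected read can never be extracted. The sketch below shows one way to check this before starting a run. It is not part of the pipeline: `check_config` is a hypothetical helper, and it assumes the YAML file has already been loaded into a dictionary (for example with a YAML parser) with the top-level keys shown above.

```python
def check_config(cfg):
    """Return a list of human-readable problems found in the settings."""
    try:
        min_len = int(cfg["readsLength_min"])
        max_len = int(cfg["readsLength_max"])
        kmer_size = int(cfg["kmerSize"])
    except (KeyError, ValueError) as exc:
        return [f"missing or non-integer setting: {exc}"]

    problems = []
    if min_len > max_len:
        problems.append("readsLength_min is greater than readsLength_max")
    if kmer_size < 1:
        problems.append("kmerSize must be a positive integer")
    if kmer_size > max_len:
        problems.append("kmerSize exceeds readsLength_max, so no read can contain a k-mer")
    return problems


# Values copied from the example configuration above (stored as strings there).
example = {"readsLength_min": "25", "readsLength_max": "35", "kmerSize": "25"}
print(check_config(example) or "configuration looks consistent")
```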
