- Chicken Muscle Shear Force RNA-seq Analysis
- Project Overview
- Data Source
- Reference Genome
- Directory Structure
- Installation and Setup
- Analysis Pipeline
- Results
- Data Processing and Analysis Details
- Acknowledgments
This repository contains an RNA-seq analysis pipeline that examines gene expression patterns in chicken breast muscles under different shear force conditions. The data appears to be from Piórkowska, et al. (2016) in Animal Genetics, comparing transcriptional profiles between high and low shear force groups, though this needs to be confirmed as the original article is not publicly accessible.
- Original Study: The data is believed to be from Piórkowska K, et al. (2016). Genome-wide RNA-Seq analysis of breast muscles of two broiler chicken groups differing in shear force. Animal Genetics, 47(1):68-80. However, this needs to be confirmed as the article is not publicly available.
- BioProject: PRJNA297364
- Sample Groups:
- High shear force (4 replicates):
- SRR2554364 (replicate 1)
- SRR2554365 (replicate 2)
- SRR2554366 (replicate 3)
- SRR2554367 (replicate 4)
- Low shear force (4 replicates):
- SRR2554362 (replicate 1)
- SRR2554363 (replicate 2)
- SRR2554345 (replicate 3)
- SRR2554344 (replicate 4)
- High shear force (4 replicates):
- Experimental Design: Four biological replicates per condition
- Sequencing Platform: Illumina HiScanSQ
- Library Type: RNA-seq, single-end reads
- Data Processing: RNA-seq analysis performed on IFB cluster
- Important Note: While the BioProject and SRA accession numbers are confirmed, the association with the cited publication needs verification due to limited access to the original article. Users are encouraged to verify the data source independently.
- Version: Gallus gallus bGalGal1.mat.broiler.GRCg7b (Ensembl Release 110)
- Source: Ensembl
- Files:
- Genome FASTA: Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.dna.toplevel.fa
- Gene Annotations: Gallus_gallus.bGalGal1.mat.broiler.GRCg7b.110.gtf
- URLs:
- Genome Indexing:
- Indexed using STAR aligner
- Parameters:
- sjdbOverhang: 99
- Multi-threaded processing enabled
- Memory-optimized for cluster environment
Due to the large size of the raw sequencing files, they are not included in this repository. To download the data:
- Using direct SRA links (faster method):
# Create data directory
mkdir -p data/raw
cd data/raw
# Download each sample using direct links
# High shear force samples
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554364/SRR2554364 -O SRR2554364.sra
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554365/SRR2554365 -O SRR2554365.sra
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554366/SRR2554366 -O SRR2554366.sra
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554367/SRR2554367 -O SRR2554367.sra
# Low shear force samples
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554362/SRR2554362 -O SRR2554362.sra
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554363/SRR2554363 -O SRR2554363.sra
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554345/SRR2554345 -O SRR2554345.sra
wget https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR2554344/SRR2554344 -O SRR2554344.sra
# Convert SRA to FASTQ format
for sra in *.sra; do
fastq-dump --gzip "$sra"
done
# Clean up SRA files (optional)
rm *.sra
- Alternative method using SRA toolkit:
# Install SRA toolkit if not already installed
# conda install -c bioconda sra-tools
# Download samples
prefetch SRR2554364 SRR2554365 SRR2554366 SRR2554367 \
SRR2554362 SRR2554363 SRR2554345 SRR2554344
# Convert to FASTQ format
for acc in SRR2554364 SRR2554365 SRR2554366 SRR2554367 \
SRR2554362 SRR2554363 SRR2554345 SRR2554344; do
fastq-dump --gzip "$acc"
done
Note: The direct download method is generally faster than using the SRA toolkit. Choose the method that works best for your setup.
.
├── scripts/ # Analysis scripts
│ ├── download_reference_ifb.sh # Script to download and prepare reference
│ ├── run_chicken_analysis_ifb.sh # Main analysis pipeline for IFB
│ └── create_expression_atlas.R # R script for expression analysis
│
├── results/ # Analysis results (generated during pipeline run)
│ ├── qc/ # Quality control reports
│ │ ├── raw/ # FastQC reports for raw data
│ │ └── trimmed/ # FastQC reports for trimmed data
│ ├── alignment/ # STAR alignment outputs
│ │ ├── logs/ # Alignment log files
│ │ └── bam/ # Sorted and indexed BAM files
│ ├── counts/ # Feature count matrices
│ └── expression_atlas/ # Final expression analysis results
│
├── plots/ # Generated figures
│ ├── correlation/ # Sample correlation plots
│ │ ├── raw/ # Raw sample correlation plots
│ │ └── trimmed/ # Trimmed sample correlation plots
│ ├── pca/ # PCA analysis plots
│ └── expression/ # Expression pattern plots
│
├── requirements.txt # Python package dependencies
├── environment.yml # Conda environment specification
└── README.md # Project documentation
Note: The data/ and genome/ directories are meaned to be created during analysis when running the pipeline.
See the Data Download Instructions and Reference Genome sections for details on how these are set up.
-
Quality Control and Preprocessing
- Raw data QC using FastQC (v0.11.9)
- Quality score distribution analysis
- Sequence duplication assessment
- Adapter contamination detection
- Base composition analysis
- Adapter trimming with Trimmomatic (v0.39)
- Removal of Illumina adapters
- Quality filtering (leading/trailing bases)
- Sliding window quality trimming
- Minimum length filtering
- Post-trimming QC with FastQC and MultiQC
- Verification of adapter removal
- Quality metrics visualization
- Sample comparison reports
- Raw data QC using FastQC (v0.11.9)
-
Alignment and Quantification
- Reference genome alignment using HISAT2 (v2.2.1)
- Splice-aware alignment
- Multi-threaded processing
- Optimized for RNA-seq data
- SAMtools (v1.15) for BAM processing
- SAM to BAM conversion
- BAM sorting and indexing
- Alignment statistics
- Expression quantification with featureCounts (v2.0.1)
- Gene-level counting
- Multi-mapping handling
- Strand-specific counting support
- Reference genome alignment using HISAT2 (v2.2.1)
-
Differential Expression Analysis
- DESeq2 (v1.32.0) for statistical analysis
- Normalization for sequencing depth
- Dispersion estimation
- Differential expression testing
- Log2 fold change shrinkage
- Statistical thresholds:
- Adjusted p-value < 0.05
- |Log2 fold change| > 1
- DESeq2 (v1.32.0) for statistical analysis
-
Visualization (R v4.1.0)
- Correlation plots using pheatmap
- Sample correlation heatmaps
- Hierarchical clustering
- MA plots with ggplot2
- Log fold change vs mean expression
- Significance highlighting
- PCA plots using ggplot2
- Dimension reduction
- Sample clustering visualization
- Custom expression pattern plots
- Gene-specific visualizations
- Group comparisons
- Correlation plots using pheatmap
The analysis reveals expression patterns between high and low shear force groups in chicken breast muscle. Key findings include:
The correlation heatmap shows the pairwise correlation coefficients between samples from different shear force conditions:
This visualization helps identify:
- Clear clustering of samples by shear force condition
- High correlation between replicates within each condition
- Distinct expression patterns between high and low shear force groups
The PCA plot demonstrates the separation between high and low shear force groups in reduced dimensional space:
Key observations:
- Clear segregation between conditions along principal components
- Tight clustering of biological replicates
- Explained variance for each principal component
MA plots showing the relationship between mean expression and log fold change:
This plot reveals:
- Highlights differentially expressed genes between conditions
- Shows the distribution of up and down-regulated genes
- Indicates statistical significance thresholds
To identify genes associated with meat tenderness (shear force), we performed differential expression analysis between high and low shear force samples. The analysis revealed:
- Total genes analyzed: 30,108
- 827 genes showed significantly higher expression in high shear force samples
- 184 genes showed significantly higher expression in low shear force samples
The top upregulated genes in high shear force samples include genes involved in:
- Muscle structure and organization
- Collagen synthesis and extracellular matrix formation
- Muscle fiber type composition
- Protein metabolism and turnover
Top 10 genes with highest expression in high shear force samples:
Gene ID High Shear Mean Low Shear Mean Log2 Fold Change
ENSGALG00010029558 204.25 3.25 5.59
ENSGALG00010019253 289.25 58.00 2.30
ENSGALG00010003868 288.25 58.00 2.29
ENSGALG00010019732 892.50 208.50 2.09
ENSGALG00010025644 374.75 100.00 1.90
The top upregulated genes in low shear force samples include genes involved in:
- Muscle protein degradation
- Calcium signaling
- Energy metabolism
- Membrane transport
Top 10 genes with highest expression in low shear force samples:
Gene ID High Shear Mean Low Shear Mean Log2 Fold Change
ENSGALG00010014488 32.50 123.75 -1.90
ENSGALG00010016987 208.00 601.00 -1.53
ENSGALG00010016978 840.75 2331.00 -1.47
ENSGALG00010028364 134.75 350.00 -1.37
ENSGALG00010007900 84.50 210.75 -1.31
These expression patterns suggest that meat tenderness in chicken breast muscle is influenced by a complex interplay of genes involved in muscle structure, protein turnover, and energy metabolism. The higher number of upregulated genes in high shear force samples (827 vs 184) suggests that tough meat may result from the activation of multiple pathways that increase muscle density and protein content.
To identify genes most strongly associated with each condition, we performed a detailed expression analysis using the following criteria:
- Calculated mean expression levels for each condition
- Computed log2 fold change between conditions
- Filtered out lowly expressed genes (mean expression < 100)
- Ranked genes by their log2 fold change
Results:
-
High Shear Force-Associated Genes:
- Top gene: ENSGALG00010029558 (log2FC = 5.59)
- Notable genes include:
- ENSGALG00010019253 (log2FC = 2.30)
- ENSGALG00010003868 (log2FC = 2.29)
- ENSGALG00010019732 (log2FC = 2.09)
- 827 genes showed significantly higher expression (log2FC > 1)
-
Low Shear Force-Associated Genes:
- Top gene: ENSGALG00010014488 (log2FC = -1.90)
- Notable genes include:
- ENSGALG00010016987 (log2FC = -1.53)
- ENSGALG00010016978 (log2FC = -1.47)
- ENSGALG00010028364 (log2FC = -1.37)
- 184 genes showed significantly higher expression (log2FC < -1)
The analysis script (scripts/find_top_genes.R
) performs the following steps:
- Reads and processes the normalized expression matrix
- Calculates mean expression for each condition
- Computes log2 fold changes
- Filters and ranks genes based on expression levels
- Outputs top differentially expressed genes for each condition
Complete results are available in:
results/expression_atlas/top_high_shear_genes.txt
results/expression_atlas/top_low_shear_genes.txt
Correlation plots between biological replicates demonstrate the quality and reproducibility of the data:
The correlations show:
- Pairwise comparisons between all high shear force replicates
- Strong correlation coefficients (R² > 0.95)
- Consistent expression patterns across replicates
The correlations demonstrate:
- Pairwise comparisons between all low shear force replicates
- High technical reproducibility (R² > 0.95)
- Consistent expression patterns across biological replicates
The analysis reveals expression patterns between high and low shear force groups in chicken breast muscle. Key findings include:
The correlation heatmap shows the pairwise correlation coefficients between samples from different shear force conditions:
This visualization helps identify:
- Clear clustering of samples by shear force condition
- High correlation between replicates within each condition
- Distinct expression patterns between high and low shear force groups
The PCA plot demonstrates the separation between high and low shear force groups in reduced dimensional space:
Key observations:
- Clear segregation between conditions along principal components
- Tight clustering of biological replicates
- Explained variance for each principal component
MA plots showing the relationship between mean expression and log fold change:
This plot reveals:
- Highlights differentially expressed genes between conditions
- Shows the distribution of up and down-regulated genes
- Indicates statistical significance thresholds
To identify genes associated with meat tenderness (shear force), we performed differential expression analysis between high and low shear force samples. The analysis revealed:
- Total genes analyzed: 30,108
- 827 genes showed significantly higher expression in high shear force samples
- 184 genes showed significantly higher expression in low shear force samples
The top upregulated genes in high shear force samples include genes involved in:
- Muscle structure and organization
- Collagen synthesis and extracellular matrix formation
- Muscle fiber type composition
- Protein metabolism and turnover
Top 10 genes with highest expression in high shear force samples:
Gene ID High Shear Mean Low Shear Mean Log2 Fold Change
ENSGALG00010029558 204.25 3.25 5.59
ENSGALG00010019253 289.25 58.00 2.30
ENSGALG00010003868 288.25 58.00 2.29
ENSGALG00010019732 892.50 208.50 2.09
ENSGALG00010025644 374.75 100.00 1.90
The top upregulated genes in low shear force samples include genes involved in:
- Muscle protein degradation
- Calcium signaling
- Energy metabolism
- Membrane transport
Top 10 genes with highest expression in low shear force samples:
Gene ID High Shear Mean Low Shear Mean Log2 Fold Change
ENSGALG00010014488 32.50 123.75 -1.90
ENSGALG00010016987 208.00 601.00 -1.53
ENSGALG00010016978 840.75 2331.00 -1.47
ENSGALG00010028364 134.75 350.00 -1.37
ENSGALG00010007900 84.50 210.75 -1.31
These expression patterns suggest that meat tenderness in chicken breast muscle is influenced by a complex interplay of genes involved in muscle structure, protein turnover, and energy metabolism. The higher number of upregulated genes in high shear force samples (827 vs 184) suggests that tough meat may result from the activation of multiple pathways that increase muscle density and protein content.
To identify genes most strongly associated with each condition, we performed a detailed expression analysis using the following criteria:
- Calculated mean expression levels for each condition
- Computed log2 fold change between conditions
- Filtered out lowly expressed genes (mean expression < 100)
- Ranked genes by their log2 fold change
Results:
-
High Shear Force-Associated Genes:
- Top gene: ENSGALG00010029558 (log2FC = 5.59)
- Notable genes include:
- ENSGALG00010019253 (log2FC = 2.30)
- ENSGALG00010003868 (log2FC = 2.29)
- ENSGALG00010019732 (log2FC = 2.09)
- 827 genes showed significantly higher expression (log2FC > 1)
-
Low Shear Force-Associated Genes:
- Top gene: ENSGALG00010014488 (log2FC = -1.90)
- Notable genes include:
- ENSGALG00010016987 (log2FC = -1.53)
- ENSGALG00010016978 (log2FC = -1.47)
- ENSGALG00010028364 (log2FC = -1.37)
- 184 genes showed significantly higher expression (log2FC < -1)
The analysis script (scripts/find_top_genes.R
) performs the following steps:
- Reads and processes the normalized expression matrix
- Calculates mean expression for each condition
- Computes log2 fold changes
- Filters and ranks genes based on expression levels
- Outputs top differentially expressed genes for each condition
Complete results are available in:
results/expression_atlas/top_high_shear_genes.txt
results/expression_atlas/top_low_shear_genes.txt
Correlation plots between biological replicates demonstrate the quality and reproducibility of the data:
The correlations show:
- Pairwise comparisons between all high shear force replicates
- Strong correlation coefficients (R² > 0.95)
- Consistent expression patterns across replicates
The correlations demonstrate:
- Pairwise comparisons between all low shear force replicates
- High technical reproducibility (R² > 0.95)
- Consistent expression patterns across biological replicates
-
SRA Tools
- Purpose: Download and handle SRA format sequencing data
- Features:
- Direct download from NCBI SRA
- FASTQ format conversion
- Compression handling
-
FastQC (v0.11.9)
- Purpose: Quality control of raw sequencing data
- Features:
- Per base quality scores
- Sequence duplication levels
- Adapter content detection
- Base composition analysis
- Usage: Initial and post-trimming QC
-
Trimmomatic (v0.39)
- Purpose: Adapter and quality trimming
- Features:
- Illumina adapter removal
- Sliding window trimming
- Minimum length filtering
- Parameters:
- ILLUMINACLIP: TruSeq3-SE.fa:2:30:10
- LEADING: 3
- TRAILING: 3
- SLIDINGWINDOW: 4:15
- MINLEN: 36
-
STAR (v2.7+)
- Purpose: RNA-seq read alignment
- Features:
- Splice-aware alignment
- Multi-threaded processing
- Memory-efficient operation
- Parameters:
- Genome indexing with sjdbOverhang 99
- BAM output sorted by coordinate
- Multi-threading support
-
SAMtools (v1.15)
- Purpose: SAM/BAM manipulation
- Features:
- Format conversion
- BAM sorting and indexing
- Alignment statistics
- BAM file compression
-
Subread package/featureCounts (v2.0.1)
- Purpose: Read counting and quantification
- Features:
- Gene-level counting
- Multi-threading support
- GTF/GFF format support
- Strand-specific counting
- Parameters:
- Multi-threaded processing
- Gene-level summarization
-
MultiQC
- Purpose: Aggregate QC reports
- Features:
- Combines reports from multiple tools
- Interactive visualizations
- Sample comparison
- Quality metric summaries
- R (v4.2)
- Purpose: Statistical analysis and visualization
- Environment: Local installation (not on IFB cluster)
- Usage: Analysis performed locally using
create_expression_atlas.R
after transferring count data from IFB cluster - Key packages:
- DESeq2: Differential expression analysis
- edgeR: RNA-seq analysis
- ggplot2: Data visualization
- pheatmap: Heatmap generation
- RColorBrewer: Color palettes
- gridExtra: Plot layouts
Note: The RNA-seq data processing (quality control, alignment, and counting) is performed on the IFB cluster, but the final statistical analysis and visualization using R is done locally. This requires transferring the count data from the IFB cluster to your local machine using:
# From your local machine
scp USERNAME@core.cluster.france-bioinformatique.fr:~/workspace/project_name/results/counts/expression_matrix.txt ./
- SLURM Workload Manager
- Purpose: Job scheduling and resource allocation
- Features:
- Job queuing and management
- Resource monitoring
- Multi-node support
- Parameters:
- CPUs per task: 8
- Memory: 64GB
- Runtime: 24 hours
- Partition: fast
- Git (v2.34+)
- Purpose: Version control and collaboration
- Features:
- Change tracking
- Branch management
- Collaborative development
All tools are configured with parameters optimized for chicken RNA-seq analysis on the IFB cluster. For specific version requirements and compatibility, see requirements.txt
.
See requirements.txt for Python package dependencies. Main requirements include:
- Python 3.6+
- NumPy
- Pandas
- Matplotlib
- Seaborn
- DESeq2
The analysis is performed in two parts:
-
Main RNA-seq Processing (On IFB Cluster)
- Quality control
- Read alignment
- Feature counting
- See
IFB_GUIDE.md
orIFB_GUIDE_UNIVERSAL.md
for detailed instructions
-
Statistical Analysis (Local)
- Transfer count data from IFB:
# From your local machine scp USERNAME@core.cluster.france-bioinformatique.fr:~/workspace/project_name/results/counts/expression_matrix.txt ./
- Run R analysis:
Rscript scripts/create_expression_atlas.R
Note: The main data processing is performed on the IFB cluster due to computational requirements, while the final statistical analysis and visualization are done locally using R.
This analysis can be run on the IFB (Institut Français de Bioinformatique) cluster for better performance and resource management. Two versions of the IFB guide are available:
IFB_GUIDE.md
- A personalized guide with specific paths and settings for the original setupIFB_GUIDE_UNIVERSAL.md
- A universal guide that can be adapted for any user
The analysis uses the following IFB-specific scripts:
scripts/download_reference_ifb.sh
- Downloads and prepares the reference genome on the IFB cluster, with SLURM job configurationscripts/run_chicken_analysis_ifb.sh
- Runs the complete RNA-seq analysis pipeline on the IFB cluster, with SLURM job configuration
These scripts have been specifically configured for the IFB cluster environment, including:
- SLURM job scheduling parameters
- IFB-specific module loading
- Proper resource allocation (CPU, memory, time)
- IFB-specific paths and environment setup
- Checkpoint system for long-running jobs
- An IFB account (request at https://my.cluster.france-bioinformatique.fr/)
- Basic command line knowledge
- SSH client installed on your local machine
- Your IFB username and password
- Connect to the cluster:
ssh USERNAME@core.cluster.france-bioinformatique.fr
- Start a tmux session (for disconnect protection):
tmux new -s rnaseq
# To reattach if disconnected: tmux attach -t rnaseq
- On your local machine, organize and compress your data:
# Create a project directory and organize files
mkdir -p chicken_rnaseq/{data,genome,scripts,status}
# Create an archive
tar czf chicken_rnaseq.tar.gz chicken_rnaseq/
- Transfer files to the cluster:
# Transfer the archive
scp chicken_rnaseq.tar.gz USERNAME@core.cluster.france-bioinformatique.fr:~/workspace/
# For large raw data files, use compression:
scp -C data/*.fastq.gz USERNAME@core.cluster.france-bioinformatique.fr:~/workspace/chicken_rnaseq/data/
- Set up your working directory:
# Go to your workspace
cd ~/workspace
mkdir -p chicken_rnaseq/{data,genome,scripts,status,results}
cd chicken_rnaseq
# Extract your files
tar xzf ../chicken_rnaseq.tar.gz
# Make scripts executable
chmod +x scripts/*.sh
- Download and index reference genome:
# Submit the genome preparation job
sbatch scripts/download_reference_ifb.sh
# Monitor the job
watch -n 30 'squeue -u $USER'
tail -f download_ref_*.out
- Launch the RNA-seq analysis:
# Check if genome preparation is complete
if [ -f "genome/reference_preparation_complete" ]; then
# Submit the analysis job
sbatch scripts/run_chicken_analysis_ifb.sh
# Monitor the job
watch -n 30 'squeue -u $USER'
tail -f chicken_rnaseq_*.out
else
echo "Genome preparation not complete"
fi
- Monitor progress in the results directory:
# Check the progress of different steps
ls -l results/{qc,alignment,counts}/
# View MultiQC report when available
firefox results/qc/multiqc_report.html # or your preferred browser
If you use this analysis pipeline, please cite both this repository and the original data source:
Original Data:
Piórkowska K, et al. (2016). Genome-wide RNA-Seq analysis of breast muscles of two broiler
chicken groups differing in shear force. Animal Genetics, 47(1):68-80.
Analysis Pipeline:
Nguyen A (2024) - Chicken Muscle Shear Force RNA-seq Analysis: A reproduction study
This project was developed with the assistance of:
- Cursor - The AI-powered code editor that provided intelligent coding assistance throughout the project
- Claude (Anthropic) - The AI model powering Cursor that helped with code development, documentation, and analysis pipeline design
For questions about this reproduction analysis, please open a GitHub issue or contact Alexis NGUYEN (alexisnguyen97@yahoo.fr)
The RNA-seq data undergoes several processing steps to generate the final expression matrix for analysis:
-
Initial Count Matrix (
expression_matrix.txt
):- Generated from STAR alignment outputs using featureCounts
- Raw count data for each gene across all samples
- Format: Rows = Ensembl gene IDs, Columns = Sample replicates
- Contains unprocessed integer counts
-
Normalized Matrix (
expression_matrix_normalized.txt
):- Normalization method: TMM (Trimmed Mean of M-values)
- Implemented using edgeR package in R
- Accounts for:
- Library size differences between samples
- RNA composition biases
- Technical variation
- Values transformed to log2 counts per million (log2CPM)
-
Clean Matrix (
expression_matrix_clean.txt
):- Final processed matrix used for differential expression analysis
- Proper tab-delimited format
- Filtered to remove:
- Low count genes
- Genes with missing values
- Technical artifacts
- Column headers standardized for sample identification
- Ready for downstream statistical analysis
-
Matrix Format Standardization:
# Ensure proper tab-separation between columns sed 's/[[:space:]]\+/\t/g' expression_matrix_clean.txt > expression_matrix_clean_fixed.txt
-
Final Analysis Format:
# R code for importing the cleaned matrix expr_data <- read.delim("expression_matrix_clean_fixed.txt", header=TRUE, stringsAsFactors=FALSE, check.names=FALSE)
This processing pipeline ensures data quality and proper normalization for accurate differential expression analysis between high and low shear force conditions.