This repository contains a generalized and reusable pipeline for performing Gene Ontology (GO) enrichment analysis using the R package topGO.
Originally designed for Enrico's lynx introgression project, this pipeline has been modularized to be applicable to any gene set.
- R (β₯ 4.0 recommended)
- R packages:
topGO
,tidyverse
,optparse
andbiomaRt
if used with ensembl annotation
Automatically installed and load when running the run_enrichment.R script
A plain text file listing one gene ID per line:
GeneA
GeneB
GeneC
...
A tab-delimited file with two columns:
gene_id
: the gene namego_terms
: semicolon-separated GO terms
This file is ignored if --annotation_source ensembl
is used, in which case you must specify --ensembl_dataset
to select the species.
Example:
gene_id go_terms
GeneA GO:0008150,GO:0003674
GeneB GO:0009987
You can directly use the gene_to_go_gff3.py Python script to extract the gene_to_go.tsv
file. The script inspects the gff3 looking for transcript rows and takes the Parent ID (gene_id) and the GO ontology terms associated (go_terms).
Usage:
python gene_to_go_gff3.py <input.gff3> gene_to_go.tsv
From the project root, run the script via Rscript:
Rscript scripts/run_enrichment.R \
--genes data/candidate_genes.txt \
--annotation data/gene_to_GO.tsv \
--annotation_source custom \
--out results/my_project
Rscript scripts/run_enrichment.R \
--genes data/candidate_genes.txt \
--annotation_source ensembl \
--ensembl_dataset fcatus_gene_ensembl \
--out results/my_project_ensembl
Use the --ensembl_dataset
argument to specify the species dataset from Ensembl. Example datasets:
fcatus_gene_ensembl
β domestic cat (Felis catus)hsapiens_gene_ensembl
β humanmmusculus_gene_ensembl
β mouse
You can find the full list using the listDatasets(ensembl) function of biomaRt package.
--annotation_source ensembl
, you do not need to specify a --annotation
file. The script will download the gene-to-GO mappings directly from Ensembl using the biomaRt
package.
In this case, the script will fetch GO terms for Felis catus from Ensembl using the biomaRt
package.
results/my_project_over.tsv
: Full enrichment results for all GO termsresults/my_project_significant.tsv
: Filtered results with Fisher < 0.05 and >1 significant gene
Each row includes:
GO.ID
: GO term identifierTerm
: Description of the GO termSignificant
: Number of candidate genes associated with the termFisher
: Raw p-value from Fisher's exact testGenes
: Semicolon-separated list of candidate genes contributing to the term, using gene names when available (otherwise Ensembl IDs)
- This script uses the "Biological Process" (BP) ontology by default.
- You can change it to "MF" (Molecular Function) or "CC" (Cellular Component) by modifying the line
ontology = "BP"
inside the script.
- You can change it to "MF" (Molecular Function) or "CC" (Cellular Component) by modifying the line
- GO terms are considered significantly enriched if:
- Fisher's exact test p-value < 0.05, and
- More than 1 candidate gene is annotated to the term.
- The output includes a
Genes
column with contributing gene names (or Ensembl IDs if names are unavailable). - When using Ensembl annotations, only protein-coding genes are considered.