
Methods

Jaime Huerta-Cepas edited this page Sep 19, 2022 · 2 revisions

(Methods description of eggNOG v6.0)

Orthologous Groups

The initial step in the eggNOG pipeline is the clustering of the 59 million proteins from 12,535 genomes. Homology comparisons are computed by the SIMAP initiative and processed by the eggNOG orthology prediction pipeline. Orthologous groups are generated automatically by dividing the species space into ‘core’ species, which are central for defining orthologous groups under the strict triangular criterion, and ‘periphery’ species.
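The triangular criterion can be illustrated with a small sketch. This is not the eggNOG pipeline code: it assumes the common COG-style reading of the criterion (reciprocal best hits among three species form a triangle, and triangles sharing a side are merged), it ignores the core/periphery distinction, and the `species|id` naming scheme, the `best_hit` table and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def species_of(protein):
    # Assumed naming scheme for this sketch: "<species>|<protein id>".
    return protein.split("|")[0]


def reciprocal_pairs(best_hit):
    """best_hit maps (query protein, target species) -> best-scoring protein there."""
    pairs = set()
    for (query, _target_species), target in best_hit.items():
        if target != query and best_hit.get((target, species_of(query))) == query:
            pairs.add(frozenset((query, target)))
    return pairs


def find_triangles(pairs):
    """Return sets of three proteins from three species that are all reciprocal best hits."""
    neighbours = defaultdict(set)
    for pair in pairs:
        a, b = sorted(pair)
        neighbours[a].add(b)
        neighbours[b].add(a)
    triangles = set()
    for a, linked in neighbours.items():
        for b, c in combinations(sorted(linked), 2):
            three_species = len({species_of(a), species_of(b), species_of(c)}) == 3
            if three_species and c in neighbours[b]:
                triangles.add(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into groups, via union-find."""
    ordered = sorted(triangles, key=sorted)
    parent = list(range(len(ordered)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edge_owner = {}
    for idx, triangle in enumerate(ordered):
        for edge in combinations(sorted(triangle), 2):
            if edge in edge_owner:
                parent[find(idx)] = find(edge_owner[edge])
            else:
                edge_owner[edge] = idx
    groups = defaultdict(set)
    for idx, triangle in enumerate(ordered):
        groups[find(idx)] |= triangle
    return sorted(sorted(group) for group in groups.values())


# Toy data: A, B, C and A, B, D form two triangles sharing the side A-B.
# E|1 is a reciprocal best hit of A|1 only, so it joins no triangle.
best_hit = {}
for a, b in [("A|1", "B|1"), ("A|1", "C|1"), ("B|1", "C|1"),
             ("A|1", "D|1"), ("B|1", "D|1"), ("A|1", "E|1")]:
    best_hit[(a, species_of(b))] = b
    best_hit[(b, species_of(a))] = a

groups = merge_triangles(find_triangles(reciprocal_pairs(best_hit)))
print(groups)
```

In this toy example the two triangles collapse into a single group of four proteins, while the protein supported by only one reciprocal hit is left out.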

Functional annotation

EggNOG 6 proteins have been linked to reference databases, including RefSeq, UniProtKB, and KEGG. Furthermore, functional annotation was enriched with BiGG reactions, CARD and CAZy families, and sequence domains from PFAM and SMART. Gene Ontology terms were obtained from UniProtKB, and further summarized to GO Slim terms.

Protein annotations are then propagated to the OGs the proteins belong to, and summarized by computing the frequencies of identifiers, names, terms and descriptions. The main description of each OG is derived from UniProtKB protein names, RefSeq products and KEGG KO names. The three most frequent terms from each annotation source are shown, along with the percentage of OG members that carry each term. The full list of annotations for each OG can be inspected in the “Functional profile” section of that OG. These annotations can also be found in the phylogenetic tree under the “Tree and alignment” section, which also allows exploring the domain architecture of OGs.
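The frequency summary described above can be sketched as follows. This is a minimal illustration, not the eggNOG implementation: the function name, the input layout (one set of terms per OG member, for a single annotation source) and the alphabetical tie-breaking are assumptions, and the term identifiers are made up.

```python
from collections import Counter


def summarize_annotations(member_terms, top_n=3):
    """Return the top_n most frequent terms with the percentage of members carrying each.

    member_terms maps each OG member to the terms assigned to it by one source.
    """
    n_members = len(member_terms)
    if n_members == 0:
        return []
    counts = Counter()
    for terms in member_terms.values():
        counts.update(set(terms))  # count each term at most once per member
    # Most frequent first; ties are broken alphabetically (an assumption of this sketch).
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(term, 100.0 * count / n_members) for term, count in ranked[:top_n]]


# Hypothetical OG with four members and made-up term identifiers.
og_members = {
    "prot1": {"termA", "termB"},
    "prot2": {"termA", "termB", "termC"},
    "prot3": {"termA", "termD"},
    "prot4": set(),  # unannotated members still count towards the percentage
}
summary = summarize_annotations(og_members)
print(summary)
```

Percentages are computed over all OG members, including unannotated ones, which matches the wording "percentage of OG members"; whether the pipeline treats unannotated members this way is not stated in the text.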

Phylogenetic reconstruction and evolutionary analysis

For the eggNOG 6 phylogenetic trees, we built a multiple sequence alignment for each OG with MAFFT when the OG had fewer than 1000 members, and with FAMSA v2 for OGs with more than 1000 sequences. Alignments were lightly trimmed with an in-house script that removes columns with a gap content greater than 90%, and the trees were built with FastTree 2.1. Taxonomic annotations from NCBI Taxonomy were added to the trees with ete4. To provide duplication profiles and fine-grained orthologs, we ran the species-overlap algorithm from ete4 to detect all duplication and speciation events.
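Two of the steps above lend themselves to a short sketch: gap-based column trimming and species-overlap event detection. Neither function is the in-house trimming script or the ete4 API. Both are simplified stand-ins that assume an alignment stored as a dictionary of equal-length strings, a binary tree written as nested 2-tuples, and leaf names of the hypothetical form `species|id`. The species-overlap rule used here is the basic one: a node is a duplication if its two child clades share at least one species, and a speciation otherwise.

```python
def trim_gappy_columns(alignment, max_gap_fraction=0.9):
    """Drop alignment columns whose gap fraction exceeds max_gap_fraction."""
    sequences = list(alignment.values())
    if not sequences:
        return {}
    n_rows = len(sequences)
    keep = [
        col for col in range(len(sequences[0]))
        if sum(seq[col] == "-" for seq in sequences) / n_rows <= max_gap_fraction
    ]
    return {name: "".join(seq[col] for col in keep) for name, seq in alignment.items()}


def species_overlap_events(node, events):
    """Post-order walk over a binary tree given as nested 2-tuples of leaf names.

    Returns the set of species under node and appends one
    (event type, sorted species) entry to events for every internal node.
    """
    if isinstance(node, str):
        return {node.split("|")[0]}  # assumed "<species>|<protein id>" leaf names
    left, right = (species_overlap_events(child, events) for child in node)
    event = "duplication" if left & right else "speciation"
    events.append((event, sorted(left | right)))
    return left | right


# Toy alignment: the third column is all gaps and is removed (100% > 90%);
# the second column is 50% gaps and is kept.
alignment = {"A|1": "MK-L", "B|1": "M--L", "C|1": "MR-L", "A|2": "M--V"}
trimmed = trim_gappy_columns(alignment)

# Toy tree: species A appears on both sides of the root, so the root is a duplication.
tree = (("A|1", "B|1"), ("A|2", "C|1"))
events = []
species_overlap_events(tree, events)
print(trimmed)
print(events)
```

Once duplication nodes are known, pairs of sequences separated only by speciation nodes can be read off as orthologs, which is the basis of the fine-grained orthology calls mentioned above.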
