
Iterative Gene clUster Analysis, a high-throughput method for gene cluster family identification.


zellerlab/IGUA


🦎 IGUA

Iterative Gene clUster Analysis, a high-throughput method for gene cluster family identification.


🗺️ Overview

IGUA is a method for high-throughput content-agnostic identification of Gene Cluster Families (GCFs) from gene clusters of genomic and metagenomic origin. It assigns GCFs in three successive clustering iterations:

  • Fragment mapping identification: Reduce the input sequence space by identifying which gene clusters are fragments of each other.
  • Nucleotide deduplication: Find similar gene clusters in genomic space, using linear clustering with lower sequence identity and coverage.
  • Protein representation: Compute a numerical representation of gene clusters in terms of protein composition, using representatives from a protein sequence clustering, to identify more distant relatives not captured by the previous step.

Compared to similar methods such as BiG-SLiCE or BiG-SCAPE, IGUA does not use Pfam domains to represent gene cluster composition, using instead representatives from an unsupervised clustering. This allows IGUA to accurately account for proteins that may not be covered by Pfam, and avoids performing a costly annotation step. The resulting protein representatives can later be annotated independently to transfer annotations to the GCFs.
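To make the idea of a compositional representation more concrete, here is a minimal Python sketch. It is not IGUA's implementation: it assumes each gene cluster has already been reduced to a list of protein representative identifiers (the names `cluster_1`, `rep_A`, etc. are hypothetical), and it compares clusters with a Bray-Curtis dissimilarity chosen purely for illustration.

```python
from collections import Counter


def composition(protein_reps, vocabulary):
    """Count how often each protein representative occurs in one gene cluster."""
    counts = Counter(protein_reps)
    return [counts[rep] for rep in vocabulary]


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors (0 = identical)."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical gene clusters, each described by its protein representatives.
clusters = {
    "cluster_1": ["rep_A", "rep_B", "rep_C"],
    "cluster_2": ["rep_A", "rep_B", "rep_D"],
    "cluster_3": ["rep_E", "rep_F"],
}

# One column per representative observed anywhere in the input.
vocabulary = sorted({rep for reps in clusters.values() for rep in reps})
vectors = {name: composition(reps, vocabulary) for name, reps in clusters.items()}

for name, vector in vectors.items():
    print(name, vector)
```

In this toy input, `cluster_1` and `cluster_2` share two of their three representatives and end up close to each other, while `cluster_3` shares none and sits at the maximum dissimilarity. No Pfam annotation is needed at any point, only the identity of the representative each protein maps to.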

🔧 Installing

IGUA can be downloaded directly from PyPI, which hosts pre-compiled distributions for Linux, macOS and Windows. Simply install with pip:

$ pip install igua

Note that you will need to install MMseqs2 yourself through other means.

💡 Running

📥 Inputs

The gene clusters to pass to IGUA must be in GenBank format, with gene annotations inside of CDS features. Several GenBank files can be passed to the same pipeline run.

$ igua -i clusters1.gbk -i clusters2.gbk ...

The GenBank locus identifier will be used as the name of each gene cluster. This may cause problems with gene clusters obtained with some tools, such as antiSMASH. If the input contains duplicate identifiers, the first gene cluster with a given identifier will be used, and a warning will be displayed.
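The "first one wins" rule for duplicate identifiers can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not IGUA's code: the `deduplicate` helper and the `(identifier, record)` input shape are assumptions made for the example.

```python
import warnings


def deduplicate(records):
    """Keep the first record seen for each identifier, warning on repeats.

    `records` is an iterable of (identifier, record) pairs in input order.
    """
    kept = {}
    for identifier, record in records:
        if identifier in kept:
            # A later record with a known identifier is skipped, not merged.
            warnings.warn(f"duplicate gene cluster identifier {identifier!r}, skipping")
            continue
        kept[identifier] = record
    return kept


# Hypothetical input where two gene clusters share the same locus identifier.
example = [("contig_1", "first"), ("contig_2", "second"), ("contig_1", "third")]
print(deduplicate(example))
```

If your gene clusters come from a tool that reuses locus identifiers, renaming the records to unique identifiers before running IGUA avoids silently dropping clusters.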

📤 Outputs

The main output of IGUA is a TSV file which assigns a Gene Cluster Family to each gene cluster found in the input. The GCF identifiers are arbitrary, and the prefix can be changed with the --prefix flag. The table will also record the original file from which each record was obtained to facilitate resource management. The table is written to the filename given with the --output flag.

The sequences of the representative proteins extracted from each cluster can be saved to a FASTA file with the --features flag. These proteins are used for compositional representation of gene clusters, and can be used to transfer annotations to the GCF representatives. The final compositional matrix for each GCF representative, which can be useful for computing distances between GCFs, can be saved as an anndata sparse matrix to a filename given with the --compositions flag.

📝 Workspace

MMseqs2 needs a fast scratch space to work with intermediate files while running linear clustering. By default, this will use a temporary folder obtained with tempfile.TemporaryDirectory, which typically lies inside /tmp. To use a different folder, use the --workdir flag.
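For readers unfamiliar with `tempfile.TemporaryDirectory`, the sketch below shows where the standard library places the folder by default and how its `dir=` argument redirects it. How IGUA wires `--workdir` internally is not described here, so the `scratch_parent` helper is a hypothetical stand-in that only demonstrates the standard library behaviour.

```python
import tempfile
from pathlib import Path


def scratch_parent(workdir=None):
    """Create a throwaway directory and report the folder it was created in.

    `workdir=None` mimics the default behaviour; any other value is passed
    as `dir=` so the scratch space lands inside that folder instead.
    """
    with tempfile.TemporaryDirectory(dir=workdir) as scratch:
        scratch_path = Path(scratch)
        # Intermediate files would be written here while the block is active.
        (scratch_path / "intermediate.tmp").write_text("placeholder")
        parent = scratch_path.resolve().parent
    # Leaving the `with` block deletes the directory and its contents.
    assert not scratch_path.exists()
    return parent


# Default location: the system temporary folder (often /tmp on Linux).
print(scratch_parent())

# Custom location: a stand-in folder plays the role of a fast scratch disk.
with tempfile.TemporaryDirectory() as fast_disk:
    print(scratch_parent(fast_disk))
```

Since the default location follows the system temporary folder, pointing the work directory at a larger or faster disk is mostly useful when /tmp is small or memory-backed.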

🫧 Clustering

By default, IGUA will use average linkage clustering and a relative distance threshold of 0.8, which corresponds to clusters inside a GCF having at most 20% of estimated difference at the amino-acid level. These two options can be changed with the --clustering-method and --clustering-distance flags.
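As a reminder of what average linkage with a distance cut-off does, here is a naive pure-Python sketch on a toy distance matrix. It is an illustration under assumptions, not IGUA's implementation: the matrix values are made up, and how IGUA computes or normalises its "relative" distances is not covered here.

```python
def average_linkage(distances, threshold):
    """Naive agglomerative clustering with average linkage.

    `distances` is a symmetric matrix given as a list of lists. Groups are
    merged greedily until the closest pair is farther apart than `threshold`.
    """
    groups = [[i] for i in range(len(distances))]

    def group_distance(a, b):
        # Average of all pairwise distances between members of both groups.
        return sum(distances[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(groups) > 1:
        # Find the closest pair of groups.
        d, x, y = min(
            (group_distance(groups[x], groups[y]), x, y)
            for x in range(len(groups))
            for y in range(x + 1, len(groups))
        )
        if d > threshold:
            break
        groups[x] = groups[x] + groups[y]
        del groups[y]
    return sorted(sorted(group) for group in groups)


# Toy distances between four hypothetical gene cluster representatives.
toy = [
    [0.00, 0.10, 0.90, 0.95],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.95, 0.90, 0.20, 0.00],
]
print(average_linkage(toy, threshold=0.8))
```

With this toy matrix the two tight pairs are merged, but the average distance between the pairs is above the cut-off, so they stay in separate families. Raising the threshold merges more aggressively, lowering it yields more and smaller GCFs.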

Additionally, the precision of the distance matrix used for the clustering can be lowered to reduce memory usage, using single or half precision floating point numbers instead of the double precision used by default. Use the --precision flag to control numerical precision.
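To get a feel for the memory at stake, the sketch below estimates the size of a pairwise distance matrix at each precision. It assumes the matrix is stored in condensed form (one value per unordered pair), which may not match IGUA's internal layout; the `double`, `single` and `half` keys are labels for this example and not necessarily the values accepted by `--precision`.

```python
import struct

# Size in bytes of one floating point value at each precision.
BYTES_PER_VALUE = {
    "double": struct.calcsize("d"),
    "single": struct.calcsize("f"),
    "half": struct.calcsize("e"),
}


def condensed_matrix_bytes(n, precision="double"):
    """Bytes needed to store the n * (n - 1) / 2 pairwise distances."""
    return n * (n - 1) // 2 * BYTES_PER_VALUE[precision]


# Hypothetical run with 100,000 representatives left to cluster.
n = 100_000
for precision in BYTES_PER_VALUE:
    gigabytes = condensed_matrix_bytes(n, precision) / 1e9
    print(f"{precision:>6}: {gigabytes:.1f} GB")
```

Because the number of pairs grows quadratically with the number of gene clusters, halving the size of each value is often the simplest way to fit a large run in memory, at the cost of coarser distances.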

💭 Feedback

⚠️ Issue Tracker

Found a bug? Have an enhancement request? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing a bug report, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

🏗️ Contributing

Contributions are more than welcome! See CONTRIBUTING.md for more details.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the GNU General Public License v3.0.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory and the Leiden University Medical Center in the Zeller team.
