PAIS-NN

PAIS-NN is a tool for classifying prokaryotic insertion sequences using cluster labels defined by the Prokaryotic Atlas of Insertion Sequences (PAIS). It applies a neural network model to predict PAIS cluster labels based on k-mer composition and then estimates the ecosystem proportions of a sample using an expectation-maximization (EM) approach. If you use PAIS-NN in your work, please cite the associated paper: (TODO: add citation).

Features

  • Sequence embedding using k-mer frequencies
  • PAIS cluster prediction via a calibrated feedforward neural network
  • Expectation-Maximization (EM) estimation of ecosystem proportions
  • Support for multiple input sites and configurable output formats
  • Plotting of ecosystem composition results
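As an illustration of the k-mer embedding listed above, here is a minimal sketch of turning a nucleotide sequence into a normalized k-mer frequency vector. The function name `kmer_frequencies`, the choice of k, and the handling of ambiguous bases are assumptions made for this example; the embedding actually used by PAIS-NN may differ.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return a normalized k-mer frequency vector over the ACGT alphabet.

    Hypothetical helper: k-mers containing characters other than
    A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # Fixed, lexicographically ordered vocabulary of all 4**k k-mers.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


vec = kmer_frequencies("ACGTACGTNN", k=2)
print(len(vec))  # one entry per possible 2-mer
```

A fixed vocabulary order keeps the vectors of different sequences comparable, which is what a downstream classifier needs.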

Installation

The environment is managed with Conda (see paisnn_environment.yaml). Major packages include:

  • torch
  • CUDA 12.8
  • scikit-learn
  • pandas
  • seaborn
  • biopython

# First, clone the repository to your local machine
git clone https://github.com/alepfu/pais-nn.git
cd pais-nn

# Second, set up and activate the Conda environment
conda env create -f paisnn_environment.yaml
conda activate paisnn

Usage

python pais-nn.py \
	-m paisnn_min_size_10_clusters.pth \
	-s test_data/river_estuary \
	-e paisnn_ecosystem_priors.csv

Arguments

Flag                 Description
-m, --model          Path to trained .pth model (required)
-s, --sitedir        Path to site directory with sequences (required)
-e, --ecosyspriors   Path to .csv with ecosystem prior distributions (required)
-p, --plotfmt        Output plot format: png (default) or svg
-c, --confthresh     Confidence threshold for filtering predictions (default: 0.5)
-v, --verbosity      Logging verbosity: 0 = silent, 1 = info, 2 = debug (default)
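
To make the flags and defaults above concrete, the following is a hypothetical `argparse` sketch that mirrors the table. It is not taken from pais-nn.py; the function name `build_parser` and the help strings are assumptions, and the real script may parse its arguments differently.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (illustrative only)."""
    parser = argparse.ArgumentParser(prog="pais-nn.py")
    parser.add_argument("-m", "--model", required=True,
                        help="Path to trained .pth model")
    parser.add_argument("-s", "--sitedir", required=True,
                        help="Path to site directory with sequences")
    parser.add_argument("-e", "--ecosyspriors", required=True,
                        help="Path to .csv with ecosystem prior distributions")
    parser.add_argument("-p", "--plotfmt", choices=["png", "svg"], default="png",
                        help="Output plot format")
    parser.add_argument("-c", "--confthresh", type=float, default=0.5,
                        help="Confidence threshold for filtering predictions")
    parser.add_argument("-v", "--verbosity", type=int, choices=[0, 1, 2], default=2,
                        help="0 = silent, 1 = info, 2 = debug")
    return parser


args = build_parser().parse_args(
    ["-m", "paisnn_min_size_10_clusters.pth",
     "-s", "test_data/river_estuary",
     "-e", "paisnn_ecosystem_priors.csv"]
)
print(args.plotfmt, args.confthresh, args.verbosity)
```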

Inputs

  • Site directory: Should contain one subdirectory per site, each holding that site's FASTA files (which may be gzipped).
  • Model file: A serialized PyTorch model bundle (.pth) including weights and label encoder.
  • Ecosystem priors file: A CSV file with the columns ecosys_label and cluster_label, providing the ecosystem priors used in the EM step.
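
The way ecosystem priors can feed an EM estimate of ecosystem proportions is sketched below. This is a generic mixture-proportion EM, not the PAIS-NN implementation: it assumes a table of P(cluster | ecosystem) is already available (how PAIS-NN derives such probabilities from the priors file is not described here), and the function name `em_proportions` and its parameters are hypothetical.

```python
def em_proportions(cluster_counts, cluster_given_ecosys, n_iter=200, tol=1e-10):
    """Estimate ecosystem mixture proportions with EM (illustrative sketch).

    cluster_counts: list of length C, sequences predicted per cluster.
    cluster_given_ecosys: E lists of length C, each row is P(cluster | ecosystem).
    """
    n_ecosys = len(cluster_given_ecosys)
    n_clusters = len(cluster_counts)
    pi = [1.0 / n_ecosys] * n_ecosys  # uniform starting proportions
    for _ in range(n_iter):
        new_pi = [0.0] * n_ecosys
        for c in range(n_clusters):
            # E-step: responsibility of each ecosystem for cluster c.
            joint = [pi[e] * cluster_given_ecosys[e][c] for e in range(n_ecosys)]
            denom = sum(joint)
            if denom == 0:
                continue  # cluster not explained by any ecosystem
            for e in range(n_ecosys):
                new_pi[e] += cluster_counts[c] * joint[e] / denom
        norm = sum(new_pi)
        if norm == 0:
            return pi
        # M-step: proportions are the count-weighted mean responsibilities.
        new_pi = [p / norm for p in new_pi]
        converged = max(abs(a - b) for a, b in zip(new_pi, pi)) < tol
        pi = new_pi
        if converged:
            break
    return pi


# Two ecosystems with disjoint clusters: proportions follow the counts.
pi_simple = em_proportions([30, 70], [[1.0, 0.0], [0.0, 1.0]])
# Overlapping cluster distributions (made-up numbers).
pi_mixed = em_proportions([62, 38], [[0.8, 0.2], [0.1, 0.9]])
print(pi_simple, pi_mixed)
```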

Outputs

  • cluster_predictions.csv: Sequence-level predictions with confidence scores.
  • ecosystem_proportions.csv: Estimated ecosystem proportions per site/sample.
  • ecosystem_proportions.png or ecosystem_proportions.svg: Grouped bar plot of the estimated proportions across sites.
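
The confidence scores in the sequence-level predictions are what the -c/--confthresh flag filters on. A minimal sketch of such a filter follows; the tuple layout, the cluster label strings, and the use of "greater than or equal" as the comparison are assumptions for illustration, since the actual column names and comparison used by PAIS-NN are not documented here.

```python
def filter_by_confidence(predictions, threshold=0.5):
    """Keep predictions whose confidence is at least `threshold`.

    predictions: iterable of (sequence_id, cluster_label, confidence) tuples
    (a hypothetical layout, not the real cluster_predictions.csv schema).
    """
    return [p for p in predictions if p[2] >= threshold]


preds = [
    ("seq1", "cluster_12", 0.93),
    ("seq2", "cluster_7", 0.41),
    ("seq3", "cluster_12", 0.50),
]
kept = filter_by_confidence(preds)
print([seq_id for seq_id, _, _ in kept])
```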

Expected directory structure

site_dir/
├── site_A
│   ├── sample1.fasta
│   ├── sample2.fasta
│   └── ...
├── site_B
│   ├── sample1.fasta
│   ├── sample2.fasta
│   └── ...
└── ...
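
A layout like the one above can be traversed with a few lines of standard-library Python. The sketch below is hypothetical and not part of PAIS-NN: the helper names `list_site_samples` and `open_fasta` are made up, and it assumes files end in .fasta or .fasta.gz.

```python
import gzip
from pathlib import Path

# Assumed file endings; adjust if your files use other extensions.
FASTA_SUFFIXES = (".fasta", ".fasta.gz")


def list_site_samples(site_dir):
    """Map each site subdirectory name to its sorted list of FASTA paths."""
    samples = {}
    for site in sorted(Path(site_dir).iterdir()):
        if not site.is_dir():
            continue
        samples[site.name] = sorted(
            p for p in site.iterdir() if p.name.endswith(FASTA_SUFFIXES)
        )
    return samples


def open_fasta(path):
    """Open a FASTA file as text, transparently handling gzip compression."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "r")
```

For example, `list_site_samples("test_data/river_estuary")` would return one entry per site subdirectory, and each path can then be passed to `open_fasta` regardless of compression.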

Example Data

  • test_data/river_estuary contains metagenomic samples collected from three different sites (BR2, BR1, and BAY) along the Brisbane River estuary (Prabhu et al. 2024).

License

MIT
