PAIS-NN is a tool for classifying prokaryotic insertion sequences using cluster labels defined by the Prokaryotic Atlas of Insertion Sequences (PAIS). It applies a neural network model to predict PAIS cluster labels based on k-mer composition and then estimates the ecosystem proportions of a sample using an expectation-maximization (EM) approach. If you use PAIS-NN in your work, please cite the associated paper: (TODO: add citation).
Features
- Sequence embedding using k-mer frequencies
- PAIS cluster prediction via a calibrated feedforward neural network
- Expectation-Maximization (EM) estimation of ecosystem proportions
- Support for multiple input sites and configurable output formats
- Plotting of ecosystem composition results
Environment managed with Conda (see paisnn_environment.yaml), major packages include:
torch
CUDA 12.8
scikit-learn
pandas
seaborn
biopython
# Firsts, clone the repository to your local machine
git clone https://github.com/alepfu/pais-nn.git
cd pais-nn
# Second, setup and activate Conda environment
conda env create -f paisnn_environment.yaml
conda activate paisnn
python pais-nn.py \
-m paisnn_min_size_10_clusters.pth \
-s test_data/river_estuary \
-e paisnn_ecosystem_priors.csv
Arguments
Flag | Description |
---|---|
-m , --model |
Path to trained .pth model (required) |
-s , --sitedir |
Path to site directory with sequences (required) |
-e , --ecosyspriors |
Path to .csv with ecosystem prior distributions (required) |
-p , --plotfmt |
Output plot format: png (default) or svg |
-c , --confthresh |
Confidence threshold for filtering predictions (default: 0.5) |
-v , --verbosity |
Logging verbosity: 0 = silent, 1 = info, 2 = debug (default) |
Inputs
- Site directory: Should contain subdirectories of (gzipped) FASTA files per site.
- Model file: A serialized PyTorch model bundle (.pth) including weights and label encoder.
- Ecosystem priors file: Prior probabilities for ecosystems used in EM. CSV-file with columns: ecosys_label, cluster_label.
Outputs
- cluster_predictions.csv: Sequence-level predictions with confidence scores.
- ecosystem_proportions.csv: Estimated ecosystem proportions per site/sample.
- ecosystem_proportions.png/svg: Grouped barplot visualization of estimated proportions over sites.
Expected directory structure
site_dir/
├── site_A
│ ├── sample1.fasta
│ ├── sample2.fasta
| └── ...
├── site_B
│ ├── sample1.fasta
│ ├── sample2.fasta
| └── ...
└── ...
Example Data
test_data/river_estuary
contains metagenomic samples collected from three different sites (BR2, BR1, and BAY) along the Brisbane River estuary (Prabhu et al. 2024).