PaperSorter

PaperSorter is an academic paper recommendation system that helps researchers stay up to date with relevant publications. It uses machine learning to filter RSS/Atom feeds and predict which papers match your research interests, then sends Slack notifications for high-scoring articles.

Key Features

Multi-source feed aggregation: Fetches articles from RSS/Atom feeds (PubMed, bioRxiv, journal feeds, etc.)
ML-powered filtering: Uses XGBoost regression on article embeddings to predict interest levels
Flexible AI integration: Compatible with Solar LLM, Gemini, or any OpenAI-compatible generative AI API
Web-based labeling interface: Interactive UI for labeling articles and improving the model
Slack integration: Automated notifications for interesting papers with customizable thresholds
Semantic Scholar enrichment: Augments articles with citation counts and additional metadata
Multi-channel support: Different models and thresholds for different research groups or topics
AI-powered content generation: Create concise summaries and visual infographics for article collections

Installation

Clone the repository and install it in editable mode with pip:

git clone https://github.com/ChangLabSNU/PaperSorter.git
cd PaperSorter
pip install -e .

System Requirements

Python 3.8+
PostgreSQL 12+ with pgvector extension
Modern web browser (for labeling interface)

Configuration

Create a configuration file named config.yml (or point to another path with --config). See examples/config.yml for a complete example:

db:
  type: postgres
  host: localhost
  user: papersorter
  database: papersorter
  password: "your_password"

google_oauth:
  client_id: "your_google_client_id"
  client_secret: "your_google_client_secret"
  flask_secret_key: "your_flask_secret_key"  # generate with secrets.token_hex(32)

embedding_api:
  api_key: "your_api_key"
  api_url: "https://api.upstage.ai/v1"       # or your preferred provider
  model: "solar-embedding-1-large-passage"   # or your preferred model
  dimensions: 4096

summarization_api:
  api_key: "your_api_key"
  api_url: "https://generativelanguage.googleapis.com/v1beta/openai"  # For Gemini
  model: "gemini-2.5-pro"

semanticscholar:
  api_key: "your_semantic_scholar_api_key"

web:
  base_url: "https://your-domain.com"  # base URL for web interface
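
The comment on flask_secret_key above refers to Python's standard secrets module. As a minimal sketch, a value of the suggested size could be generated like this (the helper name make_flask_secret_key is made up for illustration and is not part of PaperSorter):

```python
import secrets


def make_flask_secret_key(num_bytes: int = 32) -> str:
    """Return a random hex string; 32 random bytes give 64 hex characters."""
    return secrets.token_hex(num_bytes)


# Paste the printed value into google_oauth.flask_secret_key in config.yml.
print(make_flask_secret_key())
```

Generate the key once and keep it stable across restarts, since Flask uses it to sign session cookies.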

Database Setup

1. Create PostgreSQL Database

First, create a database and user for PaperSorter:

# As PostgreSQL superuser:
sudo -u postgres psql <<EOF
CREATE USER papersorter WITH PASSWORD 'your_password';
CREATE DATABASE papersorter OWNER papersorter;
\c papersorter
CREATE EXTENSION vector;
GRANT ALL ON SCHEMA public TO papersorter;
EOF

Alternatively, if you have an existing database:

# Connect to your database and install pgvector
sudo -u postgres psql -d your_database -c "CREATE EXTENSION vector;"
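
When checking the database from your own scripts, it can be handy to turn the db section of config.yml into a libpq-style connection URI. The sketch below assumes the section has already been loaded into a plain dictionary with the keys shown in the Configuration example; the function name build_postgres_uri is hypothetical and this is not how PaperSorter itself connects.

```python
from urllib.parse import quote


def build_postgres_uri(db: dict) -> str:
    """Build a postgresql:// URI from the db settings, percent-encoding each part."""
    user = quote(db["user"], safe="")
    password = quote(db["password"], safe="")
    database = quote(db["database"], safe="")
    return f"postgresql://{user}:{password}@{db['host']}/{database}"


# Example values mirror the Configuration section; the password is a placeholder.
example_db = {
    "type": "postgres",
    "host": "localhost",
    "user": "papersorter",
    "database": "papersorter",
    "password": "p@ss/word",
}
print(build_postgres_uri(example_db))
```

Percent-encoding matters when the password contains characters such as @ or /, which would otherwise be read as URI delimiters.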

2. Initialize Database Schema

papersorter init

To reinitialize (drops existing data):

papersorter init --drop-existing

Getting Started

1. Add Feed Sources

Start the web interface and configure your feed sources:

papersorter serve

Navigate to http://localhost:5001 and:

Log in with Google OAuth
Go to Settings → Feed Sources
Add RSS/Atom feed URLs for journals, preprint servers, or PubMed searches

2. Initial Data Collection

Fetch articles from your configured feeds:

papersorter update

3. Label Training Data

Use the web interface to label articles:

Mark articles as "Interested" for papers relevant to your research
Mark articles as "Not Interested" for irrelevant papers
Aim for at least 100 "Interested" articles out of 1000+ total for initial training
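
To make the labeling target concrete, here is a small sketch that compares your current label counts with the suggested minimums. The function name and the returned fields are illustrative only; the counts themselves would come from the web interface or the database.

```python
def labeling_progress(interested: int, total: int,
                      min_interested: int = 100, min_total: int = 1000) -> dict:
    """Compare label counts with the suggested minimums for a first training run."""
    return {
        "interested_missing": max(0, min_interested - interested),
        "total_missing": max(0, min_total - total),
        "ready": interested >= min_interested and total >= min_total,
        "positive_fraction": interested / total if total else 0.0,
    }


# 40 "Interested" labels out of 600 total: more labeling is needed on both counts.
print(labeling_progress(40, 600))
```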

4. Train the Model

Once you have sufficient labeled data:

papersorter train

Training reports the model's ROC-AUC. A score above 0.8 indicates good performance; 0.5 corresponds to random ranking.
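
For intuition about the reported metric, ROC-AUC is the probability that a randomly chosen "Interested" article receives a higher score than a randomly chosen "Not Interested" one, with ties counted as half. The pure-Python sketch below computes it directly from that definition; it is an illustration only, not PaperSorter's own evaluation code, and the sample labels and scores are invented.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(round(roc_auc(labels, scores), 3))  # 5 of 6 pairs ranked correctly -> 0.833
```

The pairwise loop is quadratic in the number of labels, which is fine for a quick check but slower than the rank-based implementations found in ML libraries.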

5. Deploy Web Interface for Production

For production use, deploy the web interface with a proper WSGI server and HTTPS:

Production Deployment

# Install uWSGI
pip install uwsgi

# Run with uWSGI
uwsgi --http :5001 --module PaperSorter.web.app:app --processes 4

# Configure reverse proxy (nginx example) with SSL:
# server {
#     listen 443 ssl;
#     server_name your-domain.com;
#     ssl_certificate /path/to/cert.pem;
#     ssl_certificate_key /path/to/key.pem;
#
#     location / {
#         proxy_pass http://localhost:5001;
#         proxy_set_header Host $host;
#         proxy_set_header X-Real-IP $remote_addr;
#     }
# }

Development/Testing

For local development or testing with external services:

# Option 1: Local development
papersorter serve --port 5001 --debug

# Option 2: Testing with HTTPS (using ngrok)
papersorter serve --port 5001
ngrok http 5001  # Creates HTTPS tunnel to your local server

6. Configure Slack Notifications

In the web interface:

Go to Settings → Channels
Add a Slack webhook URL
Set the score threshold (e.g., 0.7)
Select which model to use
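
The effect of the score threshold can be sketched as follows. This is an illustration under stated assumptions, not PaperSorter's broadcast code: the article dictionaries with title, url, and score keys, the inclusive comparison, and the message layout are all invented here. The only external fact relied on is that Slack incoming webhooks accept a JSON body with a text field. The request is built but never sent.

```python
import json
import urllib.request


def select_for_broadcast(articles, threshold=0.7):
    """Keep articles whose predicted score reaches the threshold, best first."""
    chosen = [a for a in articles if a["score"] >= threshold]
    return sorted(chosen, key=lambda a: a["score"], reverse=True)


def build_webhook_request(webhook_url, article):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    text = f"{article['title']} (score {article['score']:.2f})\n{article['url']}"
    return urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Invented sample data for illustration.
articles = [
    {"title": "Paper A", "url": "https://example.org/a", "score": 0.65},
    {"title": "Paper B", "url": "https://example.org/b", "score": 0.92},
    {"title": "Paper C", "url": "https://example.org/c", "score": 0.71},
]
for article in select_for_broadcast(articles, threshold=0.7):
    print(article["title"], article["score"])
```

Raising the threshold sends fewer, more confident recommendations; lowering it trades precision for coverage.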

Optional: Enable Slack Interactivity

To add interactive buttons to Slack messages:

Create a Slack App with Interactive Components enabled
Set the Request URL in your Slack App to https://your-domain.com/slack-interactivity (must be HTTPS)

7. Regular Operation

Set up these commands to run periodically (e.g., via cron):

# Fetch new articles and generate predictions (every 3 hours)
papersorter update

# Send Slack notifications for high-scoring articles (every 3 hours, 7am-9pm)
papersorter broadcast

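Example cron configuration (see examples/ directory for complete scripts with log rotation). Note that this example runs the update at minute 30 of every third hour and broadcasts only at 09:00, 13:00, and 18:00; adjust the hour list to match the schedule you want: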

30 */3 * * * /path/to/papersorter/examples/cron-update.sh
0 9,13,18 * * * /path/to/papersorter/examples/cron-broadcast.sh
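
If you prefer a Python wrapper over shell scripts, the update-then-broadcast cycle might be sketched as below. The scripts in examples/ remain the reference. This sketch assumes the common options (--config, --log-file, --quiet) are accepted after the subcommand, which you should verify against papersorter --help; the function names and log file names are hypothetical.

```python
import subprocess


def build_command(subcommand, config="config.yml", log_file=None, quiet=False):
    """Assemble a papersorter command line from the documented common options."""
    cmd = ["papersorter", subcommand, "--config", config]
    if log_file:
        cmd += ["--log-file", log_file]
    if quiet:
        cmd.append("--quiet")
    return cmd


def run_cycle(runner=subprocess.run):
    """Run update, then broadcast; stop at the first failing command."""
    for subcommand in ("update", "broadcast"):
        runner(build_command(subcommand, log_file=f"{subcommand}.log", quiet=True),
               check=True)


# Show what would be executed without running anything.
print(" ".join(build_command("update", log_file="update.log", quiet=True)))
```

Passing check=True makes a failed update raise an exception, so a broadcast is not sent on stale data.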

Command Reference

Core Commands

papersorter init - Initialize database schema
papersorter update - Fetch new articles and generate embeddings
papersorter train - Train or retrain the prediction model
papersorter broadcast - Send Slack notifications for interesting articles
papersorter serve - Start the web interface for labeling and configuration

Common Options

All commands support:

--config PATH - Configuration file path (default: config.yml)
--log-file PATH - Log output to file
-q, --quiet - Suppress console output

Command-Specific Options

update:

--batch-size N - Processing batch size
--limit-sources N - Maximum number of feed sources to process
--check-interval-hours N - Hours between checks for the same feed
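
One plausible reading of --check-interval-hours is that a feed is skipped until at least N hours have passed since it was last checked. The sketch below illustrates that rule only; the function name is_due and the treatment of never-checked feeds are assumptions, not taken from PaperSorter's source.

```python
from datetime import datetime, timedelta, timezone


def is_due(last_checked, now, check_interval_hours):
    """Return True if a feed should be fetched again under the interval rule."""
    if last_checked is None:
        return True  # assumed: a feed that was never checked is always due
    return now - last_checked >= timedelta(hours=check_interval_hours)


last = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
now = datetime(2024, 1, 1, 5, 0, tzinfo=timezone.utc)
print(is_due(last, now, check_interval_hours=6))  # only 5 hours elapsed
```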

train:

-r, --rounds N - XGBoost training rounds (default: 100)
-o, --output PATH - Model output file (default: model.pkl)
--embeddings-table NAME - Embeddings table name (default: embeddings)

broadcast:

--limit N - Maximum items to process per channel
--max-content-length N - Maximum content length for messages
--clear-old-days N - Clear broadcasts older than N days (default: 30)

serve:

--host ADDRESS - Bind address (default: 0.0.0.0)
--port N - Port number (default: 5001)
--debug - Enable Flask debug mode

Web Interface Features

The web interface (http://localhost:5001) provides:

Main Feed View

Browse all articles with predictions
Interactive labeling (Interested/Not Interested)
Semantic article search
Shareable search URLs
Filter by date, score, or label status

Article Features

View full abstracts and metadata
Find similar articles
Direct links to paper PDFs
Semantic Scholar integration for citations

AI-Powered Tools

Generate article summaries
Create visual infographics for article collections

Admin Settings

Manage feed sources
Configure notification channels
View model performance
User management
System event logs

Improving Model Performance

Regular labeling: Continue labeling new articles through the web interface
Balanced labels: Keep a reasonable ratio of positive to negative examples (the initial-training guideline above works out to roughly 1 "Interested" label in 10)
Retrain periodically: Run papersorter train after adding new labels
Monitor performance: Check ROC-AUC scores and adjust thresholds accordingly
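
When adjusting a notification threshold, it helps to see how precision and recall move on your labeled articles. The following sketch computes both at a given threshold; the labels and scores are invented sample data, and the function is an illustration rather than a PaperSorter API.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when every score >= threshold is treated as a hit."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


sample_labels = [1, 0, 1, 0, 1]
sample_scores = [0.9, 0.8, 0.75, 0.4, 0.3]
for threshold in (0.5, 0.7, 0.85):
    print(threshold, precision_recall_at(sample_labels, sample_scores, threshold))
```

A higher threshold usually raises precision (fewer irrelevant notifications) at the cost of recall (more relevant papers missed).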

Architecture Overview

PaperSorter consists of several key components:

Feed Provider System: Modular architecture for different feed sources
Embedding Pipeline: Generates vector representations using LLM APIs
ML Predictor: XGBoost model trained on user preferences
PostgreSQL + pgvector: Efficient storage and similarity search for embeddings
Flask Web Application: Modern interface with Google OAuth authentication
Background Jobs: Asynchronous processing for heavy tasks
Notification System: Multi-channel Slack integration with queuing
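
To illustrate what similarity search over embeddings means, the sketch below ranks a few toy vectors by cosine similarity in plain Python. In the real system this work is delegated to pgvector, whose <=> operator returns cosine distance (one minus cosine similarity). Whether PaperSorter uses cosine distance or another pgvector metric is an assumption here, and the two-dimensional vectors are invented for readability.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query, candidates, top_k=3):
    """Return the names of the top_k candidates most similar to the query."""
    ranked = sorted(candidates.items(),
                    key=lambda item: cosine_similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]


toy_embeddings = {"a": [1.0, 0.1], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
print(most_similar([1.0, 0.0], toy_embeddings, top_k=2))
```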

License

MIT License - see LICENSE file for details

Author

Hyeshik Chang hyeshik@snu.ac.kr

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests on GitHub.
