PaperSorter is an intelligent academic paper recommendation system that helps researchers stay up-to-date with relevant publications. It uses machine learning to filter RSS/Atom feeds and predict which papers match your research interests, then sends notifications to Slack for high-scoring articles.

- Multi-source feed aggregation: Fetches articles from RSS/Atom feeds (PubMed, bioRxiv, journal feeds, etc.)
- ML-powered filtering: Uses XGBoost regression on article embeddings to predict interest levels
- Flexible AI integration: Compatible with Solar LLM, Gemini, or any OpenAI-compatible generative AI API
- Web-based labeling interface: Interactive UI for labeling articles and improving the model
- Slack integration: Automated notifications for interesting papers with customizable thresholds
- Semantic Scholar enrichment: Augments articles with citation counts and additional metadata
- Multi-channel support: Different models and thresholds for different research groups or topics
- AI-powered content generation: Create concise summaries and visual infographics for article collections
Install PaperSorter from source using pip:
git clone https://github.com/ChangLabSNU/PaperSorter.git
cd PaperSorter
pip install -e .
- Python 3.8+
- PostgreSQL 12+ with pgvector extension
- Modern web browser (for labeling interface)
Create a configuration file at config.yml (or specify a different path with --config). See examples/config.yml for a complete example:
db:
  type: postgres
  host: localhost
  user: papersorter
  database: papersorter
  password: "your_password"
google_oauth:
  client_id: "your_google_client_id"
  client_secret: "your_google_client_secret"
  flask_secret_key: "your_flask_secret_key" # generate with secrets.token_hex(32)
embedding_api:
  api_key: "your_api_key"
  api_url: "https://api.upstage.ai/v1" # or your preferred provider
  model: "solar-embedding-1-large-passage" # or your preferred model
  dimensions: 4096
summarization_api:
  api_key: "your_api_key"
  api_url: "https://generativelanguage.googleapis.com/v1beta/openai" # For Gemini
  model: "gemini-2.5-pro"
semanticscholar:
  api_key: "your_semantic_scholar_api_key"
web:
  base_url: "https://your-domain.com" # base URL for web interface
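The flask_secret_key comment in the example refers to Python's standard secrets module. A minimal way to produce a suitable value, assuming you then paste the printed string into config.yml by hand:

```python
import secrets

# 32 random bytes rendered as 64 hexadecimal characters
flask_secret_key = secrets.token_hex(32)
print(flask_secret_key)
```

Generate the key once and keep it stable across restarts; changing it typically invalidates existing Flask sessions.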
First, create a database and user for PaperSorter:
# As PostgreSQL superuser:
sudo -u postgres psql <<EOF
CREATE USER papersorter WITH PASSWORD 'your_password';
CREATE DATABASE papersorter OWNER papersorter;
\c papersorter
CREATE EXTENSION vector;
GRANT ALL ON SCHEMA public TO papersorter;
EOF
Alternatively, if you have an existing database:
# Connect to your database and install pgvector
sudo -u postgres psql -d your_database -c "CREATE EXTENSION vector;"
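Before going further, it can help to confirm that the new role can actually log in. The sketch below builds a libpq-style connection URI from the db values in config.yml, which you can pass to psql as its database argument. The helper name postgres_uri is hypothetical and not part of PaperSorter; the dictionary simply mirrors the example configuration.

```python
from urllib.parse import quote


def postgres_uri(db: dict) -> str:
    """Build a libpq-style connection URI from a db config mapping."""
    # Percent-encode credentials so characters such as '@' or '/' stay unambiguous
    user = quote(db["user"], safe="")
    password = quote(db["password"], safe="")
    return f"postgresql://{user}:{password}@{db['host']}/{db['database']}"


# Values mirror the example config.yml; replace them with your own
db_config = {
    "type": "postgres",
    "host": "localhost",
    "user": "papersorter",
    "database": "papersorter",
    "password": "your_password",
}

uri = postgres_uri(db_config)
print(uri)
```

Running psql with the printed URI as its only argument should open a session as the papersorter user if the role and database were created correctly.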
Then initialize the database schema:
papersorter init
To reinitialize (drops existing data):
papersorter init --drop-existing
Start the web interface and configure your feed sources:
papersorter serve
Navigate to http://localhost:5001 and:
- Log in with Google OAuth
- Go to Settings → Feed Sources
- Add RSS/Atom feed URLs for journals, preprint servers, or PubMed searches
Fetch articles from your configured feeds:
papersorter update
Use the web interface to label articles:
- Mark articles as "Interested" for papers relevant to your research
- Mark articles as "Not Interested" for irrelevant papers
- Aim for at least 100 "Interested" articles out of 1000+ total for initial training
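The labeling targets above can be expressed as a simple readiness check. This is only an illustration of the rule of thumb: the function name is hypothetical, and how PaperSorter itself counts articles toward the total is not specified here.

```python
def ready_for_initial_training(n_interested: int, n_total: int,
                               min_interested: int = 100,
                               min_total: int = 1000) -> bool:
    """Return True when the counts reach the rule-of-thumb targets."""
    return n_interested >= min_interested and n_total >= min_total


print(ready_for_initial_training(40, 600))    # False
print(ready_for_initial_training(120, 1500))  # True
```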
Once you have sufficient labeled data:
papersorter train
The model performance (ROC-AUC) will be displayed. A score above 0.8 indicates good performance.
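To make the ROC-AUC figure concrete: it is the probability that a randomly chosen "Interested" article receives a higher score than a randomly chosen "Not Interested" one. The pure-Python sketch below computes it by comparing every positive/negative pair. It illustrates the metric only and is not PaperSorter's training code; the labels and scores are made-up toy values.

```python
def roc_auc(labels, scores):
    """Fraction of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = Interested, 0 = Not Interested
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.2, 0.1]
print(round(roc_auc(labels, scores), 3))  # 0.833
```

In this toy set, 10 of the 12 positive/negative pairs are ordered correctly, giving about 0.83. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.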
For production use, deploy the web interface with a proper WSGI server and HTTPS:
# Install uWSGI
pip install uwsgi
# Run with uWSGI
uwsgi --http :5001 --module PaperSorter.web.app:app --processes 4
# Configure reverse proxy (nginx example) with SSL:
# server {
# listen 443 ssl;
# server_name your-domain.com;
# ssl_certificate /path/to/cert.pem;
# ssl_certificate_key /path/to/key.pem;
#
# location / {
# proxy_pass http://localhost:5001;
# proxy_set_header Host $host;
# proxy_set_header X-Real-IP $remote_addr;
# }
# }
For local development or testing with external services:
# Option 1: Local development
papersorter serve --port 5001 --debug
# Option 2: Testing with HTTPS (using ngrok)
papersorter serve --port 5001
ngrok http 5001 # Creates HTTPS tunnel to your local server
In the web interface:
- Go to Settings → Channels
- Add a Slack webhook URL
- Set the score threshold (e.g., 0.7)
- Select which model to use
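Conceptually, a channel's threshold acts as a cutoff on predicted scores. The sketch below filters a few toy articles at 0.7 and builds the kind of minimal JSON body that Slack incoming webhooks accept (a single text field). The function names, the article dictionaries, and the choice to treat the threshold as inclusive are assumptions for illustration, not PaperSorter's actual message format.

```python
import json


def select_for_channel(articles, threshold):
    """Keep articles whose predicted score meets the channel threshold."""
    # Assumption: a score exactly equal to the threshold is included
    return [a for a in articles if a["score"] >= threshold]


def slack_payload(article):
    """Build a minimal incoming-webhook JSON body (plain text only)."""
    text = f"{article['title']} (score {article['score']:.2f})"
    return json.dumps({"text": text})


articles = [
    {"title": "Paper A", "score": 0.91},
    {"title": "Paper B", "score": 0.55},
    {"title": "Paper C", "score": 0.70},
]

for article in select_for_channel(articles, threshold=0.7):
    print(slack_payload(article))
```

Raising the threshold trades recall for fewer notifications, so it is worth revisiting after each retraining.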
To add interactive buttons to Slack messages:
- Create a Slack App with Interactive Components enabled
- Configure the Request URL in your Slack App:
  - Set to: https://your-domain.com/slack-interactivity (must be HTTPS)
Set up these commands to run periodically (e.g., via cron):
# Fetch new articles and generate predictions (every 3 hours)
papersorter update
# Send Slack notifications for high-scoring articles (every 3 hours, 7am-9pm)
papersorter broadcast
Example cron configuration (see the examples/ directory for complete scripts with log rotation):
30 */3 * * * /path/to/papersorter/examples/cron-update.sh
0 9,13,18 * * * /path/to/papersorter/examples/cron-broadcast.sh
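To read the two schedules above: the first entry runs at minute 30 of every third hour, and the second runs on the hour at 9:00, 13:00, and 18:00. The small sketch below expands the hour fields to show this. It handles only the forms used here (a star, a star with a step, single values, and comma lists), not the full crontab syntax, and the function name is illustrative.

```python
def expand_cron_field(field: str, low: int, high: int) -> list:
    """Expand one cron field: '*', '*/step', single values, and comma lists."""
    values = set()
    for part in field.split(","):
        if part == "*":
            values.update(range(low, high + 1))
        elif part.startswith("*/"):
            values.update(range(low, high + 1, int(part[2:])))
        else:
            values.add(int(part))
    return sorted(values)


# Hour fields from the two crontab lines (hours run 0-23)
print(expand_cron_field("*/3", 0, 23))      # [0, 3, 6, 9, 12, 15, 18, 21]
print(expand_cron_field("9,13,18", 0, 23))  # [9, 13, 18]
```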
- papersorter init - Initialize database schema
- papersorter update - Fetch new articles and generate embeddings
- papersorter train - Train or retrain the prediction model
- papersorter broadcast - Send Slack notifications for interesting articles
- papersorter serve - Start the web interface for labeling and configuration
All commands support:
- --config PATH - Configuration file path (default: config.yml)
- --log-file PATH - Log output to file
- -q, --quiet - Suppress console output
update:
- --batch-size N - Processing batch size
- --limit-sources N - Maximum number of feed sources to process
- --check-interval-hours N - Hours between checks for the same feed
train:
- -r, --rounds N - XGBoost training rounds (default: 100)
- -o, --output PATH - Model output file (default: model.pkl)
- --embeddings-table NAME - Embeddings table name (default: embeddings)
broadcast:
- --limit N - Maximum items to process per channel
- --max-content-length N - Maximum content length for messages
- --clear-old-days N - Clear broadcasts older than N days (default: 30)
serve:
- --host ADDRESS - Bind address (default: 0.0.0.0)
- --port N - Port number (default: 5001)
- --debug - Enable Flask debug mode
The web interface (http://localhost:5001) provides:
- Browse all articles with predictions
- Interactive labeling (Interested/Not Interested)
- Semantic article search
- Shareable search URLs
- Filter by date, score, or label status
- View full abstracts and metadata
- Find similar articles
- Direct links to paper PDFs
- Semantic Scholar integration for citations
- Generate article summaries
- Create visual infographics for article collections
- Manage feed sources
- Configure notification channels
- View model performance
- User management
- System event logs
- Regular labeling: Continue labeling new articles through the web interface
- Balanced labels: Maintain a good ratio of positive/negative examples
- Retrain periodically: Run papersorter train after adding new labels
- Monitor performance: Check ROC-AUC scores and adjust thresholds accordingly
PaperSorter consists of several key components:
- Feed Provider System: Modular architecture for different feed sources
- Embedding Pipeline: Generates vector representations using LLM APIs
- ML Predictor: XGBoost model trained on user preferences
- PostgreSQL + pgvector: Efficient storage and similarity search for embeddings
- Flask Web Application: Modern interface with Google OAuth authentication
- Background Jobs: Asynchronous processing for heavy tasks
- Notification System: Multi-channel Slack integration with queuing
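The similarity search over embeddings mentioned above can be pictured with a small pure-Python sketch that ranks stored vectors by cosine similarity to a query. pgvector performs this kind of ranking inside the database (it offers a cosine distance operator, among others); which metric PaperSorter actually uses is an assumption here, and the three-dimensional vectors and paper names are toy stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_k=2):
    """Return the names of the top_k stored vectors closest to the query."""
    ranked = sorted(
        embeddings.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:top_k]]


# Toy embeddings; real ones have thousands of dimensions
embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
}
print(most_similar([1.0, 0.0, 0.0], embeddings))  # ['paper_a', 'paper_b']
```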
MIT License - see LICENSE file for details
Hyeshik Chang <hyeshik@snu.ac.kr>
Contributions are welcome! Please feel free to submit issues or pull requests on GitHub.