Skip to content

brian-hepler-phd/MathResearchCompass

Repository files navigation

Math Research Compass

Math Research Topic Modeling Shiny App

Overview

Math Research Compass analyzes arXiv preprints to identify trending research topics across mathematical subfields. This interactive dashboard visualizes topic modeling results from over 121,000 recent mathematics papers, helping researchers and students discover emerging areas and popular research directions.

The application uses advanced natural language processing to cluster semantically related papers and identify coherent research themes. Recent optimizations have improved performance dramatically, reducing loading times from 15-20 seconds to under 5 seconds through database architecture improvements.

Live Dashboard: Math Research Compass

Project Structure

Core Applications

  • app_v2.py - Optimized Shiny dashboard with database integration
  • optimized_data_manager.py - High-performance data layer with caching and connection pooling
  • create_database.py - Database migration script for converting CSV data to SQLite

Data Processing Pipeline

  • topic_trends_analyzer.py - Performs topic modeling analysis on arXiv papers using BERTopic
  • topic_labeling.py - Enhances topic labels using Claude AI for better readability
  • category_distribution.py - Analyzes distribution of arXiv categories across topics
  • combined_network_analysis.py - Collaboration network analysis (in development)

Configuration Files

  • Procfile - Heroku deployment configuration
  • requirements.txt - Minimal production dependencies
  • runtime.txt - Python version specification

Data Processing Workflow

1. Data Collection and Filtering

The project uses data from the Kaggle ArXiv dataset, containing approximately 2.7 million arXiv papers. We filter this to focus on mathematics papers from 2020-2025, resulting in 121,391 papers across 31 mathematical subfields.

The dataset includes standard arXiv metadata: paper IDs, titles, abstracts, author lists, publication dates, and category classifications.

2. Topic Modeling with BERTopic

The topic modeling pipeline combines several state-of-the-art techniques:

  1. Text preprocessing combines paper titles and abstracts
  2. Sentence-BERT generates semantic embeddings
  3. UMAP reduces dimensionality for efficient clustering
  4. HDBSCAN performs density-based clustering to discover topics
  5. TF-IDF extraction identifies representative keywords

This process discovered 1,938 distinct topics across the mathematics corpus, with each paper assigned to its most relevant topic.

3. AI-Enhanced Topic Labeling

Raw topic keywords are processed through Claude AI to generate human-readable topic descriptions. For example, a topic with keywords like "homotopy", "spectral", "cohomology" becomes "Algebraic Topology - Homotopy Theory and Spectral Sequences".

4. Database Architecture

The application migrated from CSV file processing to an optimized SQLite database. Key tables include:

  • topics - Topic metadata with counts and category classifications
  • papers - Paper information with pre-processed author formatting
  • topic_keywords - Ranked keywords for each topic
  • topic_category_distribution - Category breakdowns within topics
  • topic_top_authors - Author rankings by paper count per topic

This migration reduced initial loading time by 4-5x and optimized memory usage significantly.

5. Category Distribution Analysis

Each topic is analyzed to determine its primary mathematical subfield by calculating the frequency of arXiv categories within that topic's papers. This enables filtering and visualization by mathematical area.

Dashboard Features

Overview Page

The main dashboard provides a high-level view of mathematical research topics:

  • Summary statistics showing total papers and topics
  • Category filtering across 31 math subfields
  • Interactive bar chart of top research topics
  • Dynamic content that updates based on selected category

Topic Explorer

The explorer page offers detailed analysis of individual topics:

  • Topic selection filtered by mathematical category
  • Author rankings showing most prolific contributors
  • Category distribution charts showing topic spread across subfields
  • Representative paper samples with metadata and arXiv links

All interactions are optimized for sub-second response times through database indexing and intelligent caching.

Performance Optimizations

The application implements several performance improvements:

  • Database queries replace CSV file loading, reducing response times to under 0.1 seconds
  • LRU caching stores frequently accessed data in memory
  • Connection pooling manages database connections efficiently
  • Lazy loading only retrieves data when needed by users
  • Indexed queries on frequently filtered columns

These optimizations support 50+ concurrent users while using less than 1GB of memory.

Installation and Usage

Quick Start

git clone https://github.com/brian-hepler-phd/MathResearchCompass.git
cd MathResearchCompass
pip install -r requirements.txt
python app_v2.py

Database Setup (Optional)

To recreate the database from raw data:

python create_database.py
python optimized_data_manager.py  # Test performance

Reproducing the Analysis

# Topic modeling
python topic_trends_analyzer.py --custom-csv data/cleaned/math_arxiv_snapshot.csv --years 5

# AI enhancement
export ANTHROPIC_API_KEY=your_api_key
python topic_labeling.py

# Category analysis
python category_distribution.py

Deployment

The application currently runs on shinyapps.io with plans to migrate to Heroku for improved performance and reliability. The Heroku deployment will provide:

  • Professional hosting with 99.95% uptime
  • SSL certificates and custom domain support
  • Auto-scaling for traffic spikes
  • Continuous deployment from GitHub

Migration files are included (Procfile, runtime.txt) for straightforward deployment.

Future Development

Collaboration Network Analysis

Development is underway for comprehensive author collaboration analysis:

  • Network graphs showing research partnerships within topics
  • Temporal analysis of how collaborations evolve over time
  • Author influence metrics and centrality calculations
  • Cross-topic collaboration discovery

This will analyze collaboration patterns across all 1,938 topics, providing insights into mathematical research communities.

Additional Planned Features

  • Predictive modeling to forecast emerging research areas
  • Citation analysis integration for impact metrics
  • Geographic mapping of research activity
  • Real-time updates as new papers are published
  • API access for programmatic data retrieval

Technologies

The application is built with:

  • Python 3.11 with optimized dependencies
  • Shiny for Python for the interactive web interface
  • SQLite for high-performance data storage
  • BERTopic for advanced topic modeling
  • Sentence-BERT for semantic text embeddings
  • UMAP and HDBSCAN for dimensionality reduction and clustering
  • Plotly for interactive visualizations
  • NetworkX for upcoming collaboration analysis

Performance Metrics

Metric Before Optimization After Optimization Improvement
Initial Load Time 15-20 seconds 2-5 seconds 4-5x faster
Memory Usage Reduced <1 GB Optimized
Query Response N/A <0.1 seconds New capability
Concurrent Users 1-2 50+ 25x increase

Research Applications

The dashboard serves multiple research use cases:

  • Trend Discovery: Identify emerging areas within mathematical subfields
  • Literature Review: Find representative papers and related topics
  • Collaboration Planning: Discover active researchers in specific areas
  • Academic Planning: Understand research landscapes for students and early-career researchers
  • Institutional Strategy: Inform hiring and resource allocation decisions

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • ArXiv for providing access to research paper metadata
  • Kaggle for hosting the ArXiv dataset
  • Anthropic for the Claude API used in topic labeling

Links