Math Research Compass analyzes arXiv preprints to identify trending research topics across mathematical subfields. This interactive dashboard visualizes topic modeling results from over 121,000 recent mathematics papers, helping researchers and students discover emerging areas and popular research directions.
The application uses advanced natural language processing to cluster semantically related papers and identify coherent research themes. Recent optimizations have improved performance dramatically, reducing loading times from 15-20 seconds to under 5 seconds through database architecture improvements.
Live Dashboard: Math Research Compass
app_v2.py
- Optimized Shiny dashboard with database integrationoptimized_data_manager.py
- High-performance data layer with caching and connection poolingcreate_database.py
- Database migration script for converting CSV data to SQLite
topic_trends_analyzer.py
- Performs topic modeling analysis on arXiv papers using BERTopictopic_labeling.py
- Enhances topic labels using Claude AI for better readabilitycategory_distribution.py
- Analyzes distribution of arXiv categories across topicscombined_network_analysis.py
- Collaboration network analysis (in development)
Procfile
- Heroku deployment configurationrequirements.txt
- Minimal production dependenciesruntime.txt
- Python version specification
The project uses data from the Kaggle ArXiv dataset, containing approximately 2.7 million arXiv papers. We filter this to focus on mathematics papers from 2020-2025, resulting in 121,391 papers across 31 mathematical subfields.
The dataset includes standard arXiv metadata: paper IDs, titles, abstracts, author lists, publication dates, and category classifications.
The topic modeling pipeline combines several state-of-the-art techniques:
- Text preprocessing combines paper titles and abstracts
- Sentence-BERT generates semantic embeddings
- UMAP reduces dimensionality for efficient clustering
- HDBSCAN performs density-based clustering to discover topics
- TF-IDF extraction identifies representative keywords
This process discovered 1,938 distinct topics across the mathematics corpus, with each paper assigned to its most relevant topic.
Raw topic keywords are processed through Claude AI to generate human-readable topic descriptions. For example, a topic with keywords like "homotopy", "spectral", "cohomology" becomes "Algebraic Topology - Homotopy Theory and Spectral Sequences".
The application migrated from CSV file processing to an optimized SQLite database. Key tables include:
topics
- Topic metadata with counts and category classificationspapers
- Paper information with pre-processed author formattingtopic_keywords
- Ranked keywords for each topictopic_category_distribution
- Category breakdowns within topicstopic_top_authors
- Author rankings by paper count per topic
This migration reduced initial loading time by 4-5x and optimized memory usage significantly.
Each topic is analyzed to determine its primary mathematical subfield by calculating the frequency of arXiv categories within that topic's papers. This enables filtering and visualization by mathematical area.
The main dashboard provides a high-level view of mathematical research topics:
- Summary statistics showing total papers and topics
- Category filtering across 31 math subfields
- Interactive bar chart of top research topics
- Dynamic content that updates based on selected category
The explorer page offers detailed analysis of individual topics:
- Topic selection filtered by mathematical category
- Author rankings showing most prolific contributors
- Category distribution charts showing topic spread across subfields
- Representative paper samples with metadata and arXiv links
All interactions are optimized for sub-second response times through database indexing and intelligent caching.
The application implements several performance improvements:
- Database queries replace CSV file loading, reducing response times to under 0.1 seconds
- LRU caching stores frequently accessed data in memory
- Connection pooling manages database connections efficiently
- Lazy loading only retrieves data when needed by users
- Indexed queries on frequently filtered columns
These optimizations support 50+ concurrent users while using less than 1GB of memory.
git clone https://github.com/brian-hepler-phd/MathResearchCompass.git
cd MathResearchCompass
pip install -r requirements.txt
python app_v2.py
To recreate the database from raw data:
python create_database.py
python optimized_data_manager.py # Test performance
# Topic modeling
python topic_trends_analyzer.py --custom-csv data/cleaned/math_arxiv_snapshot.csv --years 5
# AI enhancement
export ANTHROPIC_API_KEY=your_api_key
python topic_labeling.py
# Category analysis
python category_distribution.py
The application currently runs on shinyapps.io with plans to migrate to Heroku for improved performance and reliability. The Heroku deployment will provide:
- Professional hosting with 99.95% uptime
- SSL certificates and custom domain support
- Auto-scaling for traffic spikes
- Continuous deployment from GitHub
Migration files are included (Procfile
, runtime.txt
) for straightforward deployment.
Development is underway for comprehensive author collaboration analysis:
- Network graphs showing research partnerships within topics
- Temporal analysis of how collaborations evolve over time
- Author influence metrics and centrality calculations
- Cross-topic collaboration discovery
This will analyze collaboration patterns across all 1,938 topics, providing insights into mathematical research communities.
- Predictive modeling to forecast emerging research areas
- Citation analysis integration for impact metrics
- Geographic mapping of research activity
- Real-time updates as new papers are published
- API access for programmatic data retrieval
The application is built with:
- Python 3.11 with optimized dependencies
- Shiny for Python for the interactive web interface
- SQLite for high-performance data storage
- BERTopic for advanced topic modeling
- Sentence-BERT for semantic text embeddings
- UMAP and HDBSCAN for dimensionality reduction and clustering
- Plotly for interactive visualizations
- NetworkX for upcoming collaboration analysis
Metric | Before Optimization | After Optimization | Improvement |
---|---|---|---|
Initial Load Time | 15-20 seconds | 2-5 seconds | 4-5x faster |
Memory Usage | Reduced | <1 GB | Optimized |
Query Response | N/A | <0.1 seconds | New capability |
Concurrent Users | 1-2 | 50+ | 25x increase |
The dashboard serves multiple research use cases:
- Trend Discovery: Identify emerging areas within mathematical subfields
- Literature Review: Find representative papers and related topics
- Collaboration Planning: Discover active researchers in specific areas
- Academic Planning: Understand research landscapes for students and early-career researchers
- Institutional Strategy: Inform hiring and resource allocation decisions
This project is licensed under the MIT License - see the LICENSE file for details.
- ArXiv for providing access to research paper metadata
- Kaggle for hosting the ArXiv dataset
- Anthropic for the Claude API used in topic labeling
- Live Dashboard: Math Research Compass
- Creator's Website: bhepler.com
- GitHub Repository: MathResearchCompass