Math Research Compass

Overview

Math Research Compass analyzes arXiv preprints to identify trending research topics across mathematical subfields. This interactive dashboard visualizes topic modeling results from over 121,000 recent mathematics papers, helping researchers and students discover emerging areas and popular research directions.

The application uses advanced natural language processing to cluster semantically related papers and identify coherent research themes. Recent optimizations have improved performance dramatically, reducing loading times from 15-20 seconds to under 5 seconds through database architecture improvements.

Live Dashboard: Math Research Compass

Project Structure

Core Applications

app_v2.py - Optimized Shiny dashboard with database integration
optimized_data_manager.py - High-performance data layer with caching and connection pooling
create_database.py - Database migration script for converting CSV data to SQLite

Data Processing Pipeline

topic_trends_analyzer.py - Performs topic modeling analysis on arXiv papers using BERTopic
topic_labeling.py - Enhances topic labels using Claude AI for better readability
category_distribution.py - Analyzes distribution of arXiv categories across topics
combined_network_analysis.py - Collaboration network analysis (in development)

Configuration Files

Procfile - Heroku deployment configuration
requirements.txt - Minimal production dependencies
runtime.txt - Python version specification

Data Processing Workflow

1. Data Collection and Filtering

The project uses data from the Kaggle ArXiv dataset, containing approximately 2.7 million arXiv papers. We filter this to focus on mathematics papers from 2020-2025, resulting in 121,391 papers across 31 mathematical subfields.

The dataset includes standard arXiv metadata: paper IDs, titles, abstracts, author lists, publication dates, and category classifications.

2. Topic Modeling with BERTopic

The topic modeling pipeline combines several state-of-the-art techniques:

Text preprocessing combines paper titles and abstracts
Sentence-BERT generates semantic embeddings
UMAP reduces dimensionality for efficient clustering
HDBSCAN performs density-based clustering to discover topics
TF-IDF extraction identifies representative keywords

This process discovered 1,938 distinct topics across the mathematics corpus, with each paper assigned to its most relevant topic.

3. AI-Enhanced Topic Labeling

Raw topic keywords are processed through Claude AI to generate human-readable topic descriptions. For example, a topic with keywords like "homotopy", "spectral", "cohomology" becomes "Algebraic Topology - Homotopy Theory and Spectral Sequences".

4. Database Architecture

The application migrated from CSV file processing to an optimized SQLite database. Key tables include:

topics - Topic metadata with counts and category classifications
papers - Paper information with pre-processed author formatting
topic_keywords - Ranked keywords for each topic
topic_category_distribution - Category breakdowns within topics
topic_top_authors - Author rankings by paper count per topic

This migration reduced initial loading time by 4-5x and optimized memory usage significantly.

5. Category Distribution Analysis

Each topic is analyzed to determine its primary mathematical subfield by calculating the frequency of arXiv categories within that topic's papers. This enables filtering and visualization by mathematical area.

Dashboard Features

Overview Page

The main dashboard provides a high-level view of mathematical research topics:

Summary statistics showing total papers and topics
Category filtering across 31 math subfields
Interactive bar chart of top research topics
Dynamic content that updates based on selected category

Topic Explorer

The explorer page offers detailed analysis of individual topics:

Topic selection filtered by mathematical category
Author rankings showing most prolific contributors
Category distribution charts showing topic spread across subfields
Representative paper samples with metadata and arXiv links

All interactions are optimized for sub-second response times through database indexing and intelligent caching.

Performance Optimizations

The application implements several performance improvements:

Database queries replace CSV file loading, reducing response times to under 0.1 seconds
LRU caching stores frequently accessed data in memory
Connection pooling manages database connections efficiently
Lazy loading only retrieves data when needed by users
Indexed queries on frequently filtered columns

These optimizations support 50+ concurrent users while using less than 1GB of memory.

Installation and Usage

Quick Start

git clone https://github.com/brian-hepler-phd/MathResearchCompass.git
cd MathResearchCompass
pip install -r requirements.txt
python app_v2.py

Database Setup (Optional)

To recreate the database from raw data:

python create_database.py
python optimized_data_manager.py  # Test performance

Reproducing the Analysis

# Topic modeling
python topic_trends_analyzer.py --custom-csv data/cleaned/math_arxiv_snapshot.csv --years 5

# AI enhancement
export ANTHROPIC_API_KEY=your_api_key
python topic_labeling.py

# Category analysis
python category_distribution.py

Deployment

The application currently runs on shinyapps.io with plans to migrate to Heroku for improved performance and reliability. The Heroku deployment will provide:

Professional hosting with 99.95% uptime
SSL certificates and custom domain support
Auto-scaling for traffic spikes
Continuous deployment from GitHub

Migration files are included (Procfile, runtime.txt) for straightforward deployment.

Future Development

Collaboration Network Analysis

Development is underway for comprehensive author collaboration analysis:

Network graphs showing research partnerships within topics
Temporal analysis of how collaborations evolve over time
Author influence metrics and centrality calculations
Cross-topic collaboration discovery

This will analyze collaboration patterns across all 1,938 topics, providing insights into mathematical research communities.

Additional Planned Features

Predictive modeling to forecast emerging research areas
Citation analysis integration for impact metrics
Geographic mapping of research activity
Real-time updates as new papers are published
API access for programmatic data retrieval

Technologies

The application is built with:

Python 3.11 with optimized dependencies
Shiny for Python for the interactive web interface
SQLite for high-performance data storage
BERTopic for advanced topic modeling
Sentence-BERT for semantic text embeddings
UMAP and HDBSCAN for dimensionality reduction and clustering
Plotly for interactive visualizations
NetworkX for upcoming collaboration analysis

Performance Metrics

Metric	Before Optimization	After Optimization	Improvement
Initial Load Time	15-20 seconds	2-5 seconds	4-5x faster
Memory Usage	Reduced	<1 GB	Optimized
Query Response	N/A	<0.1 seconds	New capability
Concurrent Users	1-2	50+	25x increase

Research Applications

The dashboard serves multiple research use cases:

Trend Discovery: Identify emerging areas within mathematical subfields
Literature Review: Find representative papers and related topics
Collaboration Planning: Discover active researchers in specific areas
Academic Planning: Understand research landscapes for students and early-career researchers
Institutional Strategy: Inform hiring and resource allocation decisions

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

ArXiv for providing access to research paper metadata
Kaggle for hosting the ArXiv dataset
Anthropic for the Claude API used in topic labeling

Links

Live Dashboard: Math Research Compass
Creator's Website: bhepler.com
GitHub Repository: MathResearchCompass

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.ipynb_checkpoints		.ipynb_checkpoints
data		data
images		images
notebooks		notebooks
plans of action		plans of action
.DS_Store		.DS_Store
.gitignore		.gitignore
LICENSE		LICENSE
Procfile		Procfile
README.md		README.md
requirements-analysis.txt		requirements-analysis.txt
requirements-full.txt		requirements-full.txt
requirements.txt		requirements.txt
runtime.txt		runtime.txt

License

brian-hepler-phd/MathResearchCompass

Folders and files

Latest commit

History

Repository files navigation

Math Research Compass

Overview

Project Structure

Core Applications

Data Processing Pipeline

Configuration Files

Data Processing Workflow

1. Data Collection and Filtering

2. Topic Modeling with BERTopic

3. AI-Enhanced Topic Labeling

4. Database Architecture

5. Category Distribution Analysis

Dashboard Features

Overview Page

Topic Explorer

Performance Optimizations

Installation and Usage

Quick Start

Database Setup (Optional)

Reproducing the Analysis

Deployment

Future Development

Collaboration Network Analysis

Additional Planned Features

Technologies

Performance Metrics

Research Applications

License

Acknowledgments

Links

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Languages

Packages