Brian Hepler
bhepler.com | Math Research Compass
This repository contains the complete, automated analysis pipeline for the paper "Modular versus Hierarchical: A Structural Signature of Topic Popularity in Mathematical Research." The workflow processes raw metadata from the arXiv preprint server, identifies research topics, builds and analyzes co-authorship networks, and performs a series of statistical and sensitivity analyses to characterize the structural differences between popular and niche fields in mathematics.
The entire pipeline is orchestrated by a `Makefile` and managed by a `config.yaml` file, ensuring full reproducibility with minimal manual intervention.
```
.
├── data/
│   ├── raw/
│   └── cleaned/
├── results/
│   ├── collaboration_analysis/
│   ├── disambiguation/
│   └── ...
├── src/               # All Python analysis scripts
├── figures/           # Final figures for the manuscript
├── config.yaml        # Manages data flow between scripts
├── Makefile           # Automates the entire pipeline
├── requirements.txt   # Python package dependencies
└── README.md          # This file
```
- Clone the Repository:

  ```bash
  git clone https://github.com/brian-hepler-phd/MRC-Network-Analysis.git
  cd MRC-Network-Analysis
  ```
- Create a Python Virtual Environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install Dependencies: All required packages and their specific versions are listed in `requirements.txt`.

  ```bash
  pip install -r requirements.txt
  ```
- Obtain Raw Data:
  - Download the Cornell arXiv dataset from Kaggle.
  - Place the `arxiv-metadata-oai-snapshot.json` file into the `data/raw/` directory.
  - Note: The pipeline assumes an initial, one-time manual filtering of this raw JSON to create `data/cleaned/math_arxiv_snapshot.csv`, containing only mathematics papers with the necessary columns (`id`, `authors`, `title`, `categories`, `abstract`, `update_date`, and `authors_parsed`); one possible filtering script is sketched below.
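The exact filtering is left to the user; the following is a minimal sketch (not the script used for the paper) of one way to produce this file from the raw Kaggle snapshot, keeping papers whose category list includes a `math.*` category and only the columns above:

```python
import json
import pandas as pd

COLUMNS = ["id", "authors", "title", "categories", "abstract", "update_date", "authors_parsed"]

records = []
# The Kaggle snapshot is newline-delimited JSON: one paper per line.
with open("data/raw/arxiv-metadata-oai-snapshot.json") as f:
    for line in f:
        paper = json.loads(line)
        # Keep papers listing at least one math category (e.g., "math.AG", "math.PR");
        # the exact inclusion rule used for the paper may differ.
        if any(cat.startswith("math.") for cat in paper.get("categories", "").split()):
            records.append({col: paper.get(col) for col in COLUMNS})

pd.DataFrame(records, columns=COLUMNS).to_csv("data/cleaned/math_arxiv_snapshot.csv", index=False)
```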
The entire pipeline can be executed using simple `make` commands from the project's root directory. The `Makefile` automatically handles the dependencies between scripts.
To run the full analysis from start to finish and generate all results and figures for the paper, use the default target:

```bash
make        # or, equivalently: make all
```
This will execute the following steps in sequence, using the `config.yaml` file to pass data between them:

- `make topics`: Identifies research topics using BERTopic (a minimal sketch of this step follows the list).
- `make network_data`: Prepares the author-topic network dataset.
- `make disambiguate`: Performs author name disambiguation.
- `make metrics`: Builds networks and calculates all structural metrics.
- `make compare`: Runs the baseline popular vs. niche statistical comparison.
- `make regression`: Runs the fixed and enhanced regression analyses to control for network size.
- `make visualize`: Generates the final figures for the manuscript.
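For orientation, here is a minimal, hypothetical sketch of the kind of BERTopic call the topic-modeling step performs; the actual configuration in `BERTopic_analyzer.py` (embedding model, vectorizer, hyperparameters) may differ:

```python
import pandas as pd
from bertopic import BERTopic

# Model topics from the abstracts of the cleaned math arXiv snapshot.
df = pd.read_csv("data/cleaned/math_arxiv_snapshot.csv")
abstracts = df["abstract"].fillna("").tolist()

topic_model = BERTopic(verbose=True)  # default embedding and clustering settings
topics, probs = topic_model.fit_transform(abstracts)

# Inspect the discovered topics and attach assignments back to the papers.
print(topic_model.get_topic_info().head())
df["topic"] = topics
```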
`make` is intelligent: if you modify a script, it will only re-run that step and all subsequent steps that depend on it, saving significant computation time.
- Clean the Workspace: To delete all generated results and start fresh:

  ```bash
  make clean
  ```

- Run Sensitivity Analyses: To run the validation scripts (cutoff sensitivity, topic model stability, COVID-19 temporal analysis):

  ```bash
  make sensitivity
  ```

- Run Only a Specific Step: To run part of the pipeline (e.g., up to the metrics calculation, illustrated in the sketch after this list):

  ```bash
  make metrics
  ```
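As a rough illustration of the kind of computation the metrics step performs, the sketch below builds a toy co-authorship graph with `networkx` and computes a few common structural measures; the graph construction and metric set actually used in `collaboration_network_analysis_v5.py` may differ:

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms import community

# Hypothetical author lists for a handful of papers within one topic.
papers = [["Alice", "Bob"], ["Bob", "Carol", "Dave"], ["Alice", "Carol"], ["Eve", "Dave"]]

# Co-authorship graph: authors are nodes, co-authored papers create edges.
G = nx.Graph()
for authors in papers:
    G.add_edges_from(combinations(authors, 2))

# A few structural measures of the sort compared across popular and niche topics.
communities = community.greedy_modularity_communities(G)
print("modularity:", community.modularity(G, communities))
print("average clustering:", nx.average_clustering(G))
print("degree assortativity:", nx.degree_assortativity_coefficient(G))
```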
The `src/` directory contains the modular scripts that perform each stage of the analysis.
- `config_manager.py`: A helper module for reading from and writing to `config.yaml` (a sketch of this pattern follows the list).
- `BERTopic_analyzer.py`: Step 1 - Topic Modeling.
- `prepare_network_data.py`: Step 2 - Data Preparation.
- `author_disambiguation_v4.py`: Step 3 - Author Name Disambiguation.
- `collaboration_network_analysis_v5.py`: Step 4 - Network Metrics Calculation.
- `analyze_popular_vs_niche.py`: Step 5a - Baseline Group Comparison.
- `bootstrap_CI_analysis.py`: Step 5b - Bootstrap Analysis for Confidence Intervals.
- `fixed_regression_analysis.py`: Step 5c - Main Regression Analysis.
- `enhanced_regression.py`: Step 5d - Continuous Popularity Regression.
- `sensitivity_analysis.py`, `bertopic_sensitivity_analysis.py`, `covid_temporal_sensitivity.py`: Step 6 - Robustness and Sensitivity Checks.
- `enhanced_network_viz.py`: Step 7 - Final Visualizations.
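The exact interface of `config_manager.py` is internal to the repository, but the idea is that each step records the paths of its outputs in `config.yaml` so downstream steps can locate them. Below is a minimal, hypothetical sketch of that pattern using PyYAML; the real helper functions and config keys are likely named differently:

```python
import os
import yaml

CONFIG_PATH = "config.yaml"

def read_config():
    # Load the shared pipeline configuration; an empty dict if it does not exist yet.
    if not os.path.exists(CONFIG_PATH):
        return {}
    with open(CONFIG_PATH) as f:
        return yaml.safe_load(f) or {}

def record_output(step, key, path):
    # Register an output path under a step's section so later steps can find it.
    config = read_config()
    config.setdefault(step, {})[key] = path
    with open(CONFIG_PATH, "w") as f:
        yaml.safe_dump(config, f)

# Hypothetical example: the topic-modeling step publishes its output,
# and the network-data step reads it back.
record_output("topics", "assignments_csv", "results/topic_assignments.csv")
assignments_path = read_config()["topics"]["assignments_csv"]
```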