Brian Hepler
bhepler.com | Math Research Compass
This repository contains the complete, automated analysis pipeline for the paper "Modular versus Hierarchical: A Structural Signature of Topic Popularity in Mathematical Research." The workflow processes raw metadata from the arXiv preprint server, identifies research topics, builds and analyzes co-authorship networks, and performs a series of statistical and sensitivity analyses to characterize the structural differences between popular and niche fields in mathematics.
The entire pipeline is orchestrated by a `Makefile` and managed by a `config.yaml` file, ensuring full reproducibility with minimal manual intervention.
```
.
├── data/
│   ├── raw/
│   └── cleaned/
├── results/
│   ├── collaboration_analysis/
│   ├── disambiguation/
│   └── ...
├── src/               # All Python analysis scripts
├── figures/           # Final figures for the manuscript
├── config.yaml        # Manages data flow between scripts
├── Makefile           # Automates the entire pipeline
├── requirements.txt   # Python package dependencies
└── README.md          # This file
```
- Clone the Repository:

  ```bash
  git clone https://github.com/brian-hepler-phd/MRC-Network-Analysis.git
  cd MRC-Network-Analysis
  ```
- Create a Python Virtual Environment:

  ```bash
  python3 -m venv venv
  source venv/bin/activate
  ```
- Install Dependencies: All required packages and their specific versions are listed in `requirements.txt`.

  ```bash
  pip install -r requirements.txt
  ```
- Obtain Raw Data:
  - Download the Cornell arXiv dataset from Kaggle.
  - Place the `arxiv-metadata-oai-snapshot.json` file into the `data/raw/` directory.
  - Note: The pipeline assumes an initial, one-time manual filtering of this raw JSON to create `data/cleaned/math_arxiv_snapshot.csv`, containing only mathematics papers with the necessary columns (`id`, `authors`, `title`, `categories`, `abstract`, `update_date`, and `authors_parsed`); one possible filtering script is sketched below.
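The exact filtering is left to the user; the following is a minimal sketch (not the script used for the paper) of one way to produce this file from the raw Kaggle snapshot, keeping papers whose category list includes a `math.*` category and only the columns above:

```python
import json
import pandas as pd

COLUMNS = ["id", "authors", "title", "categories", "abstract", "update_date", "authors_parsed"]

records = []
# The Kaggle snapshot is newline-delimited JSON: one paper per line.
with open("data/raw/arxiv-metadata-oai-snapshot.json") as f:
    for line in f:
        paper = json.loads(line)
        # Keep papers listing at least one math category (e.g., "math.AG", "math.PR");
        # the exact inclusion rule used for the paper may differ.
        if any(cat.startswith("math.") for cat in paper.get("categories", "").split()):
            records.append({col: paper.get(col) for col in COLUMNS})

pd.DataFrame(records, columns=COLUMNS).to_csv("data/cleaned/math_arxiv_snapshot.csv", index=False)
```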
The entire pipeline can be executed using simple `make` commands from the project's root directory. The `Makefile` automatically handles the dependencies between scripts.
To run the full analysis from start to finish and generate all results and figures for the paper, use the default target:

```bash
make        # or, equivalently: make all
```
This will execute the following steps in sequence, using the `config.yaml` file to pass data between them:

- `make topics`: Identifies research topics using BERTopic (a minimal sketch of this step follows the list).
- `make network_data`: Prepares the author-topic network dataset.
- `make disambiguate`: Performs author name disambiguation.
- `make metrics`: Builds networks and calculates all structural metrics.
- `make compare`: Runs the baseline popular vs. niche statistical comparison.
- `make regression`: Runs the fixed and enhanced regression analyses to control for network size.
- `make visualize`: Generates the final figures for the manuscript.
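For orientation, here is a minimal, hypothetical sketch of the kind of BERTopic call the topic-modeling step performs; the actual configuration in `BERTopic_analyzer.py` (embedding model, vectorizer, hyperparameters) may differ:

```python
import pandas as pd
from bertopic import BERTopic

# Model topics from the abstracts of the cleaned math arXiv snapshot.
df = pd.read_csv("data/cleaned/math_arxiv_snapshot.csv")
abstracts = df["abstract"].fillna("").tolist()

topic_model = BERTopic(verbose=True)  # default embedding and clustering settings
topics, probs = topic_model.fit_transform(abstracts)

# Inspect the discovered topics and attach assignments back to the papers.
print(topic_model.get_topic_info().head())
df["topic"] = topics
```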
`make` is intelligent: if you modify a script, it will only re-run that step and all subsequent steps that depend on it, saving significant computation time.
- Clean the Workspace: To delete all generated results and start fresh:

  ```bash
  make clean
  ```

- Run Sensitivity Analyses: To run the validation scripts (cutoff sensitivity, topic model stability, COVID-19 temporal analysis):

  ```bash
  make sensitivity
  ```

- Run Only a Specific Step: To run part of the pipeline (e.g., up to the metrics calculation, illustrated in the sketch after this list):

  ```bash
  make metrics
  ```
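As a rough illustration of the kind of computation the metrics step performs, the sketch below builds a toy co-authorship graph with `networkx` and computes a few common structural measures; the graph construction and metric set actually used in `collaboration_network_analysis_v5.py` may differ:

```python
from itertools import combinations

import networkx as nx
from networkx.algorithms import community

# Hypothetical author lists for a handful of papers within one topic.
papers = [["Alice", "Bob"], ["Bob", "Carol", "Dave"], ["Alice", "Carol"], ["Eve", "Dave"]]

# Co-authorship graph: authors are nodes, co-authored papers create edges.
G = nx.Graph()
for authors in papers:
    G.add_edges_from(combinations(authors, 2))

# A few structural measures of the sort compared across popular and niche topics.
communities = community.greedy_modularity_communities(G)
print("modularity:", community.modularity(G, communities))
print("average clustering:", nx.average_clustering(G))
print("degree assortativity:", nx.degree_assortativity_coefficient(G))
```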
The `src/` directory contains the modular scripts that perform each stage of the analysis.
- `config_manager.py`: A helper module for reading from and writing to `config.yaml` (a sketch of this pattern follows the list).
- `BERTopic_analyzer.py`: Step 1 - Topic Modeling.
- `prepare_network_data.py`: Step 2 - Data Preparation.
- `author_disambiguation_v4.py`: Step 3 - Author Name Disambiguation.
- `collaboration_network_analysis_v5.py`: Step 4 - Network Metrics Calculation.
- `analyze_popular_vs_niche.py`: Step 5a - Baseline Group Comparison.
- `bootstrap_CI_analysis.py`: Step 5b - Bootstrap Analysis for Confidence Intervals.
- `fixed_regression_analysis.py`: Step 5c - Main Regression Analysis.
- `enhanced_regression.py`: Step 5d - Continuous Popularity Regression.
- `sensitivity_analysis.py`, `bertopic_sensitivity_analysis.py`, `covid_temporal_sensitivity.py`: Step 6 - Robustness and Sensitivity Checks.
- `enhanced_network_viz.py`: Step 7 - Final Visualizations.
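The exact interface of `config_manager.py` is internal to the repository, but the idea is that each step records the paths of its outputs in `config.yaml` so downstream steps can locate them. Below is a minimal, hypothetical sketch of that pattern using PyYAML; the real helper functions and config keys are likely named differently:

```python
import os
import yaml

CONFIG_PATH = "config.yaml"

def read_config():
    # Load the shared pipeline configuration; an empty dict if it does not exist yet.
    if not os.path.exists(CONFIG_PATH):
        return {}
    with open(CONFIG_PATH) as f:
        return yaml.safe_load(f) or {}

def record_output(step, key, path):
    # Register an output path under a step's section so later steps can find it.
    config = read_config()
    config.setdefault(step, {})[key] = path
    with open(CONFIG_PATH, "w") as f:
        yaml.safe_dump(config, f)

# Hypothetical example: the topic-modeling step publishes its output,
# and the network-data step reads it back.
record_output("topics", "assignments_csv", "results/topic_assignments.csv")
assignments_path = read_config()["topics"]["assignments_csv"]
```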