JHU-CLSP/science-hierarchography

🎨 SCIENCE HIERARCHOGRAPHY: Hierarchical Organization of Science Literature

Python 3.8

A tool for automatically generating hierarchical structures from collections of scientific papers using:

  1. Embedding-based clustering techniques
  2. Large language models (LLMs)

The goal of this project is to develop an interpretable, hierarchical representation of scientific papers.

💡 Requirements

The requirements are listed in requirements.txt. Use the following commands to build the environment for this project:

conda create -n science python=3.8
conda activate science
pip install -r requirements.txt

🗂️ Data Preparation

We have two paper collections available:

  • The 2k paper collection SciPile
  • The 10k paper collection SciPileLarge

You can use the following command to download:

cd download/
TODO

🔬 Approaches

🔮 SciChic Hierarchy Generation

The process has two main steps:

Generate Embeddings

First, make sure you have generated all the embeddings for your papers using:

python generate.py --input_folder /path/to/your/papers --output_file ./embeddings/your_embedding_name.pkl
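
Before moving on, it can help to confirm that the embeddings file was written and can be read back. The layout of the pickle is not documented here, so the sketch below makes no assumption about it and only reports the type and size of the top-level object. The helper name `describe_pickle` is hypothetical and not part of this repository. Only unpickle files you created yourself or otherwise trust.

```python
import pickle
from pathlib import Path


def describe_pickle(path):
    """Load a pickle file and describe its top-level object.

    Hypothetical helper: it reports generic facts only, because the
    actual structure written by generate.py is not assumed here.
    """
    with Path(path).open("rb") as handle:
        obj = pickle.load(handle)
    # Not every object has a length, so fall back to None.
    size = len(obj) if hasattr(obj, "__len__") else None
    return {"type": type(obj).__name__, "size": size}


# Example call (path taken from the command above):
# describe_pickle("./embeddings/your_embedding_name.pkl")
```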

Create Hierarchy

Then you can start creating the hierarchy with:

python main.py \
  --embedding_generator qwen \
  --summary_generator llama \
  --clustering_method kmeans \
  --evaluator qwen \
  --clustering_direction top_down \
  --base_path /project/directory/ \
  --cluster_sizes 276 40 6 \
  --run_time 1 \
  --evaluate_time 1 \
  --test_count 5 \
  --pre_generated_embeddings_file ./embedding_file.pkl \
  --evaluate_type normal \
  --embedding_source all

Parameters Explanation

  • embedding_generator: Model used to generate embeddings (options: qwen, llama, etc.)
  • summary_generator: Model used to generate summaries for clusters
  • clustering_method: Algorithm for clustering (options: kmeans, hierarchical, etc.)
  • clustering_direction: Direction of hierarchy building (top_down or bottom_up)
  • cluster_sizes: Number of clusters at each level of the hierarchy
  • embedding_source: Contribution type used to create the hierarchy:
    • all: Use all paper content
    • problem: Focus on problem statements
    • solution: Focus on proposed solutions
    • results: Focus on research results
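
To make the role of cluster_sizes more concrete, here is a minimal, dependency-free sketch of multi-level k-means. It is not the implementation used by main.py. It assumes the bottom_up direction and assumes the sizes are listed from the finest level to the coarsest (as in 276 40 6): the papers are clustered first, then the resulting centroids are clustered again at each following level. The function names `kmeans` and `build_levels` are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on lists of floats; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


def build_levels(embeddings, cluster_sizes):
    """Cluster the embeddings, then cluster the centroids of each level.

    Assumption: cluster_sizes runs from the finest level to the coarsest.
    levels[i][j] is the cluster index of item j at level i, where the items
    of level i > 0 are the centroids produced at level i - 1.
    """
    levels = []
    current = embeddings
    for k in cluster_sizes:
        labels, centroids = kmeans(current, k)
        levels.append(labels)
        current = centroids
    return levels


# Toy usage: six 2-D "embeddings", three fine clusters, two coarse clusters.
points = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0], [10.1, 0.0]]
levels = build_levels(points, [3, 2])
```

Following the label of a paper at the first level, then the label of that cluster at the next level, gives the paper's path through the hierarchy.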

🧵 fLMSci Pipeline

fLMSci is an LLM-based scientific hierarchography creation pipeline that offers two approaches:

Pipeline Types

| Script | Pipeline type | Main steps |
| --- | --- | --- |
| run_par.sh | Parallel | 1. Generate topics & rationales → 2. Place topics in parallel → 3. Merge chunked taxonomy → 4. Map papers → (optional) Evaluate |
| run_incr.sh | Incremental | 1. Generate topics & rationales → 2. Incrementally place each topic → 3. Map papers → (optional) Evaluate |

Setup & Execution

Before running the pipelines, you need to:

  1. Place JSON files inside the jsons folder
  2. Give the shell scripts execute permission (one-time step):
    chmod +x run_par.sh run_incr.sh
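
A quick pre-flight check of the jsons folder can catch malformed files before a long run. The sketch below only verifies that each file parses as JSON. It makes no assumption about the fields the pipelines expect, and the helper name `check_json_folder` is hypothetical.

```python
import json
from pathlib import Path


def check_json_folder(folder="jsons"):
    """Return (valid, invalid) lists of .json file names found in `folder`."""
    valid, invalid = [], []
    for path in sorted(Path(folder).glob("*.json")):
        try:
            with path.open(encoding="utf-8") as handle:
                json.load(handle)
            valid.append(path.name)
        except (json.JSONDecodeError, UnicodeDecodeError):
            # The file exists but is not well-formed UTF-8 JSON.
            invalid.append(path.name)
    return valid, invalid


# Example call from the repository root:
# valid, invalid = check_json_folder("jsons")
```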

Running the Parallel Pipeline

bash run_par.sh                # basic run
bash run_par.sh --evaluate     # run + evaluation

Running the Incremental Pipeline

bash run_incr.sh               # basic run
bash run_incr.sh --evaluate    # run + evaluation

You can also customize the run with additional parameters:

bash run_incr.sh --batch_size 16 --max_depth 8 --evaluate
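
When launching several runs with different settings, the same commands can be assembled from Python. This is a minimal sketch that uses only the flags shown above (--batch_size, --max_depth, --evaluate). The helper names `build_command` and `run_pipeline` are hypothetical and not part of this repository.

```python
import subprocess


def build_command(script, evaluate=False, **options):
    """Build the argument list for one pipeline run.

    Keyword options become "--name value" pairs, in the order given.
    """
    command = ["bash", script]
    for name, value in options.items():
        command += [f"--{name}", str(value)]
    if evaluate:
        command.append("--evaluate")
    return command


def run_pipeline(script, evaluate=False, **options):
    """Run the script; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(build_command(script, evaluate, **options), check=True)


# Same invocation as the shell example above, without executing it:
command = build_command("run_incr.sh", evaluate=True, batch_size=16, max_depth=8)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` stops a sweep at the first failing run.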

Note: Each pipeline can also be run step by step by following its individual README file.
