Skip to content

ISE-FIZKarlsruhe/concept_extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Concept Extraction (ConExion)

Overview

This repository supports the research presented in the paper:

ConExion: Concept Extraction with Large Language Models
Ebrahim Norouzi, Sven Hertling, Harald Sack (2025)

If you use this code or ideas from this work, please consider citing our paper (see below).

This project contains code and resources for extracting concepts using unsupervised methods and large language models (LLMs). It includes setup instructions, scripts for running the models, and a brief guide on how to get started.

Setup the Environment

To set up the environment, follow these steps:

  1. Create and activate the conda environment:
    conda env create -f environment.yml
    conda activate conexion
  2. Install the required Python packages:
    pip install -r requirements.txt

Running the Scripts

To run the provided scripts, use the following command:

nohup ./run_scripts.sh > logs/master_log.log 2>&1 &

Keyword Extraction Methods

Unsupervised Methods

Unsupervised keyword extraction methods rely on statistical and linguistic features of the text. These methods do not require labeled data. Common techniques include:

  • TF-IDF: Weighs the importance of a term by comparing its frequency in a document to its frequency across all documents.
  • TextRank: An algorithm inspired by PageRank, where words are nodes, and edges represent co-occurrence within a fixed window. Key phrases are identified by their importance in the network.
  • LDA: A generative statistical model that identifies topics in a set of documents, which can then be used to extract relevant keywords.

Large Language Models (LLMs)

Large language models can understand and generate human-like text. LLMs can be fine-tuned for concept extraction on specific datasets or used in zero-shot settings.

File Structure

batch_submit.sh                  Script for submitting batch jobs
batch_test.sh                    Script for testing batch jobs
download_models.sh               Script for downloading models
main.py                          Main script for running the concept extraction
requirements.txt                 Python package dependencies
run_fs_fixed_LLM-batch.sh        Script for running fixed LLMs in batch mode
run_fs_fixed_LLM-job.sh          Script for running fixed LLM jobs
run_fs_fixed_LLM-scripts.sh      Script for running fixed LLM scripts
run_fs_fixed_LLM_all_datasets.sh Script for running fixed LLMs on all datasets
run_scripts.sh                   Main script for running all other scripts
run_zs_LLM_scripts.sh            Script for running zero-shot LLM scripts
run_zs_fixedLLMscripts.sh        Script for running zero-shot fixed LLM scripts

Citation

If you use this work in your research, please cite it as:

@misc{norouzi2025conexionconceptextractionlarge,
  title     = {ConExion: Concept Extraction with Large Language Models},
  author    = {Ebrahim Norouzi and Sven Hertling and Harald Sack},
  year      = {2025},
  eprint    = {2504.12915},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url       = {https://arxiv.org/abs/2504.12915}
}

License

This project is licensed under the MIT License – see the LICENSE file for details.

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •