This repository supports the research presented in the paper:
ConExion: Concept Extraction with Large Language Models
Ebrahim Norouzi, Sven Hertling, Harald Sack (2025)
If you use this code or ideas from this work, please consider citing our paper (see below).
This project contains code and resources for extracting concepts using unsupervised methods and large language models (LLMs). It includes setup instructions, scripts for running the models, and a brief guide on how to get started.
To set up the environment, follow these steps:
- Create and activate the conda environment:
conda env create -f environment.yml conda activate conexion
- Install the required Python packages:
pip install -r requirements.txt
To run the provided scripts, use the following command:
nohup ./run_scripts.sh > logs/master_log.log 2>&1 &
Unsupervised keyword extraction methods rely on statistical and linguistic features of the text. These methods do not require labeled data. Common techniques include:
- TF-IDF: Weighs the importance of a term by comparing its frequency in a document to its frequency across all documents.
- TextRank: An algorithm inspired by PageRank, where words are nodes, and edges represent co-occurrence within a fixed window. Key phrases are identified by their importance in the network.
- LDA: A generative statistical model that identifies topics in a set of documents, which can then be used to extract relevant keywords.
Large language models can understand and generate human-like text. LLMs can be fine-tuned for concept extraction on specific datasets or used in zero-shot settings.
batch_submit.sh Script for submitting batch jobs
batch_test.sh Script for testing batch jobs
download_models.sh Script for downloading models
main.py Main script for running the concept extraction
requirements.txt Python package dependencies
run_fs_fixed_LLM-batch.sh Script for running fixed LLMs in batch mode
run_fs_fixed_LLM-job.sh Script for running fixed LLM jobs
run_fs_fixed_LLM-scripts.sh Script for running fixed LLM scripts
run_fs_fixed_LLM_all_datasets.sh Script for running fixed LLMs on all datasets
run_scripts.sh Main script for running all other scripts
run_zs_LLM_scripts.sh Script for running zero-shot LLM scripts
run_zs_fixedLLMscripts.sh Script for running zero-shot fixed LLM scripts
If you use this work in your research, please cite it as:
@misc{norouzi2025conexionconceptextractionlarge,
title = {ConExion: Concept Extraction with Large Language Models},
author = {Ebrahim Norouzi and Sven Hertling and Harald Sack},
year = {2025},
eprint = {2504.12915},
archivePrefix = {arXiv},
primaryClass = {cs.CL},
url = {https://arxiv.org/abs/2504.12915}
}
This project is licensed under the MIT License – see the LICENSE file for details.