Extractomat: Automatic Term Extraction (ATE) for English/German and Ukrainian

Features

Works on top of spaCy
Implements the basic, combo-basic, and C-value algorithms with extra flexibility and support for single-word terms (see matcha.py)
Implements optional reranking with sentence transformers to weight the terms in the context of the document (see sbert_reranker.py)
Runs term extraction on txt/pdf/docx documents (see runner.py)
Comes with the OTRT dataset (in English, German, and Ukrainian)
Covered with tests
Includes other experimental features (keybert_extract.py, gliner_extract.py) and scripts for measuring performance on the OTRT dataset (tester.py, gt_verificator.py)
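
For readers unfamiliar with C-value, the scoring idea can be sketched in a few lines. This is a simplified illustration and not the code in matcha.py: the function names are hypothetical, raw frequencies of longer candidates are used for the nesting penalty, and the `log2(length + 1)` smoothing that keeps single-word terms from scoring zero is an assumption about one common way to support them.

```python
import math
from collections import Counter


def contains(longer, shorter):
    """True if token tuple `shorter` occurs contiguously inside the strictly longer `longer`."""
    n = len(shorter)
    return len(longer) > n and any(
        longer[i:i + n] == shorter for i in range(len(longer) - n + 1)
    )


def c_value(freqs, smoothing=1):
    """Score candidates given a dict mapping token tuples to frequencies.

    A candidate nested in longer candidates is penalised by the mean
    frequency of those longer candidates (simplified C-value).
    """
    scores = {}
    for term, freq in freqs.items():
        nesting = [f for other, f in freqs.items() if contains(other, term)]
        adjusted = freq - sum(nesting) / len(nesting) if nesting else freq
        # Assumption: +smoothing inside the log so one-word terms get a non-zero weight.
        scores[term] = math.log2(len(term) + smoothing) * adjusted
    return scores


# Made-up candidate counts, for illustration only.
freqs = Counter({
    ("term", "extraction"): 5,
    ("automatic", "term", "extraction"): 2,
    ("extraction",): 7,
})
scores = c_value(freqs)
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(" ".join(term), round(score, 3))
```

With `smoothing=0` the sketch falls back to the plain `log2(length)` weight, under which every single-word candidate scores zero.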

Installation

# clone the repo
git clone https://github.com/lang-uk/extractomat
cd extractomat

# Create and activate a virtual environment, then install dependencies
virtualenv venv
source venv/bin/activate
pip install -r requirements.txt

# Download the spaCy models for your languages of interest
spacy download uk_core_news_trf
spacy download en_core_web_trf
spacy download de_dep_news_trf

Running in batch mode:

To run extractomat on a list of files:

python runner.py "my_papers_corpus/paper_*" --method cvalue --allow-single-word --n-max 6

Run python runner.py --help to see the other options.
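
The pattern in the command above is quoted, so the shell passes it through unexpanded. How runner.py handles it internally is not shown here; the following is only a sketch of how such a pattern could be expanded and filtered to the supported document types. The `expand_pattern` helper and the `SUPPORTED` set are hypothetical names.

```python
import glob
from pathlib import Path

# Assumption: the document types named in the feature list.
SUPPORTED = {".txt", ".pdf", ".docx"}


def expand_pattern(pattern):
    """Expand a glob pattern and keep only files with a supported extension."""
    return sorted(
        Path(p) for p in glob.glob(pattern)
        if Path(p).is_file() and Path(p).suffix.lower() in SUPPORTED
    )


# Prints nothing if the folder does not exist.
for path in expand_pattern("my_papers_corpus/paper_*"):
    print(path)
```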

Running tests:

Just start python -m pytest and relax.

OTRT dataset

The OTRT dataset (first page of the ONTOLOGIES OF TIME: REVIEW AND TRENDS paper) is in the otrt folder.

gt_terms_*.csv is the list of unique terms from the paper.
gt_terms_*_full_ordered.csv is the complete list of terms in the correct order (as they occur in the text).
TimeOnto Sample *.docx is the original text of the paper.
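
A ground-truth term list like this is typically used to score extracted terms by precision, recall, and F1. The sketch below is not the logic of tester.py. It assumes one term per row in the first CSV column with no header row, which may not match the actual files, and the sample terms are invented placeholders rather than entries from the dataset.

```python
import csv
import io


def load_terms(handle):
    """Read one term per row (first column), lowercased and stripped for comparison."""
    return {
        row[0].strip().lower()
        for row in csv.reader(handle)
        if row and row[0].strip()
    }


def precision_recall_f1(predicted, gold):
    """Exact-match scores for two sets of normalised terms."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# In-memory stand-in for a gt_terms_*.csv file (placeholder terms, not real data).
gold = load_terms(io.StringIO("Ontology of Time\ntemporal logic\ntime interval\n"))
predicted = {"ontology of time", "time interval", "first page"}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```

To score against a real file, pass an open file handle instead, for example `load_terms(open(path, newline="", encoding="utf-8"))`.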

Practical application

We used extractomat in our experiments on building an unsupervised bilingual glossary. See https://github.com/lang-uk/schmezaurus for details.

Citing the paper

Extractomat has been released as a part of the research results leading to the following paper:

Chaplynskyi, D., Wittenborg, T., Kosa, V., Rabby, G., Ignatenko, O., Auer, S., Ermolayev, V. (2025) Building Multilingual Terminological Bridges between Language-Specific Knowledge Silos. In: Proc. ICTERI-2025, CCIS, Springer (to appear)
