Term Extractor

Term Extractor is a desktop application built with Python and Tkinter that allows you to extract candidate terms and their contexts from various text-based documents. It supports multiple file formats including .txt, .docx, .xliff, and .html. The tool uses spaCy's NLP capabilities to identify multi-word terms and provides a user-friendly interface to review, select, and export extracted terms to CSV.
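As a rough illustration of pattern-based term extraction, the sketch below matches part-of-speech sequences over tokens that have already been tagged. It is not the code in term_extractor.py: the `PATTERNS` list and the `candidate_terms` function are hypothetical names, and the input is assumed to be a list of `(text, pos)` pairs, which with spaCy could be built as `[(t.text, t.pos_) for t in doc]`.

```python
from collections import Counter

# Hypothetical pattern set over Universal POS tags; the tool's real
# patterns may differ.
PATTERNS = [
    ("ADJ", "NOUN"),          # adjective + noun
    ("NOUN", "NOUN"),         # noun + noun
    ("NOUN", "ADP", "NOUN"),  # noun of noun
]


def candidate_terms(tagged_tokens):
    """Count token n-grams whose POS sequence matches one of PATTERNS.

    tagged_tokens is a list of (text, pos) pairs.
    """
    counts = Counter()
    for pattern in PATTERNS:
        n = len(pattern)
        for i in range(len(tagged_tokens) - n + 1):
            window = tagged_tokens[i:i + n]
            if tuple(pos for _, pos in window) != pattern:
                continue
            # For the three-token pattern, accept only the literal word "of".
            if n == 3 and window[1][0].lower() != "of":
                continue
            term = " ".join(text.lower() for text, _ in window)
            counts[term] += 1
    return counts
```

Terms are lowercased before counting so that "Neural network" and "neural network" share one frequency.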

Features

Supports multiple input file formats: TXT, DOCX, XLIFF, HTML
Extracts candidate terms using linguistic patterns (adjective+noun, noun+noun, noun of noun, etc.)
Displays term frequency and context
Interactive GUI to browse files, adjust minimum frequency, and review terms
Select/deselect terms for export
Export selected terms and their contexts to CSV
Progress bar and multi-threading to keep UI responsive during extraction
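One common way to keep a Tkinter window responsive is to run the extraction on a worker thread and pass progress back through a queue that the UI polls. The sketch below shows that general pattern only; `run_in_background` and `drain` are hypothetical helpers, and the application's own threading code may be organized differently.

```python
import queue
import threading


def run_in_background(work, items, events):
    """Process items on a worker thread, posting progress to a queue."""
    def worker():
        total = len(items)
        for done, item in enumerate(items, start=1):
            result = work(item)
            events.put(("progress", done, total, result))
        events.put(("done", total, total, None))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(events):
    """Return every event currently queued, without blocking.

    In a Tkinter app this would be called from a function that is
    rescheduled with root.after(100, ...) and updates the progress bar.
    """
    pending = []
    while True:
        try:
            pending.append(events.get_nowait())
        except queue.Empty:
            return pending
```

Only the main thread touches widgets in this arrangement; the worker communicates exclusively through the queue.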

Requirements

Python 3.7+
spaCy (en_core_web_sm model)
pandas
python-docx
beautifulsoup4

Installation

Clone the repository:

```
git clone https://github.com/luisaschaefertrindade/term_extractor.git
cd term_extractor
```

Create a virtual environment and activate it (optional but recommended):

```
python -m venv venv
source venv/bin/activate  # Linux/macOS
venv\Scripts\activate     # Windows
```

Install dependencies:
```
pip install -r requirements.txt
```
Download the spaCy English model:
```
python -m spacy download en_core_web_sm
```

Usage

Run the application:

```
python term_extractor.py
```

Click Browse to select a supported input file.
Set the minimum frequency for terms to be extracted (default is 1).
Click Extract terms to process the file.
Review the extracted terms and toggle selection using the checkbox column.
Click on a term to view its context with the term highlighted.
Click Export Selected to CSV to save selected terms and contexts.
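The filtering and export steps above can be sketched in a few lines. This is an illustration under assumptions, not the tool's implementation: the row keys (`term`, `frequency`, `context`, `selected`), the CSV column names, and the `export_selected` function are hypothetical, and the sketch uses the standard `csv` module rather than pandas to stay dependency-free.

```python
import csv
import io


def export_selected(terms, min_freq, out):
    """Write selected terms that meet min_freq to CSV; return rows written.

    terms is a list of dicts with the hypothetical keys
    "term", "frequency", "context", and "selected".
    out is any writable text file object.
    """
    writer = csv.writer(out)
    writer.writerow(["term", "frequency", "context"])
    written = 0
    for row in terms:
        if row["selected"] and row["frequency"] >= min_freq:
            writer.writerow([row["term"], row["frequency"], row["context"]])
            written += 1
    return written


rows = [
    {"term": "neural network", "frequency": 3,
     "context": "A neural network is trained.", "selected": True},
    {"term": "loss function", "frequency": 1,
     "context": "The loss function is minimized.", "selected": True},
    {"term": "training data", "frequency": 4,
     "context": "The training data is split.", "selected": False},
]
buffer = io.StringIO()
count = export_selected(rows, min_freq=2, out=buffer)
```

When writing to a real file, open it with `newline=""` and an explicit encoding, for example `open(path, "w", newline="", encoding="utf-8")`, so the `csv` module controls line endings.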

Supported File Formats

Plain text (.txt)
Microsoft Word documents (.docx)
XLIFF translation files (.xliff)
HTML files (.html)
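Handling several formats usually comes down to choosing a reader by file extension. The following sketch shows one way to do that; the function names and the `READERS` table are hypothetical, and reading only the `<source>` elements of an XLIFF file is an assumption about which side of the translation is wanted. The python-docx and BeautifulSoup imports are deferred so that plain-text and XLIFF files can be read without them.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def read_txt(path):
    return Path(path).read_text(encoding="utf-8")


def read_docx(path):
    from docx import Document  # provided by python-docx
    return "\n".join(p.text for p in Document(str(path)).paragraphs)


def read_html(path):
    from bs4 import BeautifulSoup  # provided by beautifulsoup4
    html = Path(path).read_text(encoding="utf-8")
    return BeautifulSoup(html, "html.parser").get_text(separator=" ")


def read_xliff(path):
    # Collect the text of every <source> element, ignoring XML namespaces.
    root = ET.parse(str(path)).getroot()
    return "\n".join(
        "".join(el.itertext()).strip()
        for el in root.iter()
        if el.tag.rsplit("}", 1)[-1] == "source"
    )


READERS = {
    ".txt": read_txt,
    ".docx": read_docx,
    ".html": read_html,
    ".xliff": read_xliff,
}


def load_text(path):
    """Return the plain text of a supported file, chosen by extension."""
    suffix = Path(path).suffix.lower()
    try:
        reader = READERS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file format: {suffix}") from None
    return reader(path)
```

Adding another format in this design means writing one reader function and registering its extension in `READERS`.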

Acknowledgments

spaCy for natural language processing
python-docx for Word document reading
BeautifulSoup for HTML parsing

