Term Extractor is a desktop application built with Python and Tkinter that allows you to extract candidate terms and their contexts from various text-based documents. It supports multiple file formats including .txt, .docx, .xliff, and .html. The tool uses spaCy's NLP capabilities to identify multi-word terms and provides a user-friendly interface to review, select, and export extracted terms to CSV.

- Supports multiple input file formats: TXT, DOCX, XLIFF, HTML
- Extracts candidate terms using linguistic patterns (adjective+noun, noun+noun, noun of noun, etc.)
- Displays term frequency and context
- Interactive GUI to browse files, adjust minimum frequency, and review terms
- Select/deselect terms for export
- Export selected terms and their contexts to CSV
- Progress bar and multi-threading to keep the UI responsive during extraction
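
The pattern-based extraction and minimum-frequency filter listed above can be illustrated with a minimal sketch. It is not the tool's actual code: it works on hand-tagged `(token, POS)` pairs so that it runs without spaCy, whereas in the application the tags would come from the spaCy pipeline. The names `PATTERNS`, `matches`, and `extract_candidates`, and the `ADP:of` shorthand for the literal word "of", are hypothetical.

```python
from collections import Counter

# Hypothetical pattern table: each entry is a sequence of POS tags.
# "ADP:of" is a made-up shorthand meaning "the preposition 'of'".
PATTERNS = [
    ("ADJ", "NOUN"),             # adjective + noun
    ("NOUN", "NOUN"),            # noun + noun
    ("NOUN", "ADP:of", "NOUN"),  # noun of noun
]


def matches(slot, token, pos):
    """Return True if a (token, pos) pair satisfies one pattern slot."""
    if ":" in slot:
        tag, word = slot.split(":", 1)
        return pos == tag and token.lower() == word
    return pos == slot


def extract_candidates(tagged_tokens, min_freq=1):
    """Count multi-word candidates in a list of (token, pos) pairs."""
    counts = Counter()
    for start in range(len(tagged_tokens)):
        for pattern in PATTERNS:
            window = tagged_tokens[start:start + len(pattern)]
            if len(window) < len(pattern):
                continue  # not enough tokens left for this pattern
            if all(matches(slot, tok, pos)
                   for slot, (tok, pos) in zip(pattern, window)):
                counts[" ".join(tok.lower() for tok, _ in window)] += 1
    # Keep only terms that reach the minimum frequency.
    return {term: n for term, n in counts.items() if n >= min_freq}


# Hand-tagged sample; in practice the tags would come from spaCy.
tagged = [
    ("Neural", "ADJ"), ("network", "NOUN"), ("training", "NOUN"),
    ("needs", "VERB"), ("a", "DET"), ("neural", "ADJ"), ("network", "NOUN"),
    ("and", "CCONJ"), ("a", "DET"), ("lot", "NOUN"), ("of", "ADP"),
    ("data", "NOUN"),
]
print(extract_candidates(tagged, min_freq=2))
```

With `min_freq=1` the same sample also yields "network training" and "lot of data", which shows how raising the threshold trims one-off candidates.
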

- Python 3.7+
- spaCy (en_core_web_sm model)
- pandas
- python-docx
- beautifulsoup4

- Clone the repository:
    git clone https://github.com/luisaschaefertrindade/term_extractor.git
    cd term_extractor
- Create a virtual environment and activate it (optional but recommended):
    python -m venv venv
    source venv/bin/activate   # Linux/macOS
    venv\Scripts\activate      # Windows
- Install dependencies:
    pip install -r requirements.txt
- Download the spaCy English model:
    python -m spacy download en_core_web_sm

Run the application:
    python term_extractor.py

- Click Browse to select a supported input file.
- Set the minimum frequency for terms to be extracted (default is 1).
- Click Extract terms to process the file.
- Review the extracted terms and toggle selection using the checkbox column.
- Click on a term to view its context with the term highlighted.
- Click Export Selected to CSV to save selected terms and contexts.
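
The final export step can be sketched roughly as follows. This is an illustration under stated assumptions, not the application's code: it uses the standard-library `csv` module as a stand-in for pandas, and the row keys (`term`, `frequency`, `context`, `selected`), the column headers, and the function name `export_selected` are all hypothetical.

```python
import csv
import io


def export_selected(rows, out):
    """Write only the selected rows to a CSV stream and return the count.

    `rows` is a list of dicts with hypothetical keys:
    "term", "frequency", "context", and "selected" (bool).
    """
    writer = csv.writer(out)
    writer.writerow(["Term", "Frequency", "Context"])
    written = 0
    for row in rows:
        if row["selected"]:
            writer.writerow([row["term"], row["frequency"], row["context"]])
            written += 1
    return written


rows = [
    {"term": "neural network", "frequency": 2,
     "context": "A neural network, once trained, generalizes.",
     "selected": True},
    {"term": "lot of data", "frequency": 1,
     "context": "Training needs a lot of data.",
     "selected": False},
]

# An in-memory buffer keeps the example free of side effects.
buffer = io.StringIO()
count = export_selected(rows, buffer)
print(count)
print(buffer.getvalue())
```

To write a real file instead, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`; the `newline=""` argument prevents blank rows on Windows, and contexts containing commas are quoted automatically.
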

- Plain text (.txt)
- Microsoft Word documents (.docx)
- XLIFF translation files (.xliff)
- HTML files (.html)

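
One way to handle the formats above is to dispatch on the file extension. The sketch below is an assumption about how such a loader could look, not the tool's implementation: it parses HTML and XLIFF with the standard library rather than BeautifulSoup, it assumes that only XLIFF `<source>` segments are wanted, and the names `read_text` and `_TextCollector` are hypothetical. The `.docx` branch relies on python-docx and is not exercised in the demo.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def read_text(path):
    """Return the plain text of a supported file, chosen by extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".txt":
        with open(path, encoding="utf-8") as f:
            return f.read()
    if ext == ".html":
        collector = _TextCollector()
        with open(path, encoding="utf-8") as f:
            collector.feed(f.read())
        collector.close()
        return " ".join(collector.parts)
    if ext == ".xliff":
        # Assumption: only <source> segments are of interest.
        root = ET.parse(path).getroot()
        texts = ["".join(el.itertext()).strip()
                 for el in root.iter()
                 if el.tag.split("}")[-1] == "source"]  # ignore namespace
        return "\n".join(t for t in texts if t)
    if ext == ".docx":
        from docx import Document  # python-docx, imported lazily
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"Unsupported file type: {ext}")


# Demo with temporary files so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    html_path = os.path.join(tmp, "sample.html")
    with open(html_path, "w", encoding="utf-8") as f:
        f.write("<html><head><style>p {color: red}</style></head>"
                "<body><p>Machine <b>translation</b></p></body></html>")
    html_text = read_text(html_path)

    xliff_path = os.path.join(tmp, "sample.xliff")
    with open(xliff_path, "w", encoding="utf-8") as f:
        f.write('<xliff version="1.2" '
                'xmlns="urn:oasis:names:tc:xliff:document:1.2">'
                '<file><body><trans-unit id="1">'
                "<source>Term base</source><target>Glossar</target>"
                "</trans-unit></body></file></xliff>")
    xliff_text = read_text(xliff_path)

print(html_text)
print(xliff_text)
```

Keeping the dispatch in a single function means the extraction step only ever sees plain text, whichever format the user selected.
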
- spaCy for natural language processing
- python-docx for Word document reading
- BeautifulSoup for HTML parsing