This project compares methods for classifying non-Hungarian names by language, including large language models (LLMs) accessed via the `llm` Python package and traditional machine learning approaches. Automated classification results are evaluated against human annotations: the dataset contains 1,174 non-Hungarian names manually classified across 26 languages. The replicator should expect the code to run for approximately 30 minutes on a standard desktop machine.
- [x] I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
- [x] I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package.
The data are proprietary. Usage is subject to a licensing agreement with Opten Kft. See `input/annotated-names/README.md` for details.
- [ ] All data are publicly available.
- [x] Some data cannot be made publicly available.
- [ ] No data can be made publicly available.
- [x] Confidential data used in this paper and not provided as part of the public replication package will be preserved for 10 years after publication, in accordance with data retention policies.
Data.Name | Data.Files | Location | Provided | Citation |
---|---|---|---|---|
"Annotated Foreign Names from Cégjegyzék" | names.csv | input/annotated-names/ | TRUE | HUN-REN KRTK (2024) |
The dataset contains 1,174 names from the Hungarian company registry (Cégjegyzék) that were not automatically classified as Hungarian names. A human annotator manually classified them by language into 26 categories, the largest being German (416 names), English (100 names), and Italian (89 names). Because the records contain actual personal names, the data are classified as CAT3 and subject to licensing restrictions.
Datafile: `input/annotated-names/names.csv`
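
A minimal sketch of inspecting the dataset with `polars`; the column names `name` and `language` are assumptions rather than the documented schema (see `input/annotated-names/README.md`):

```python
import polars as pl

# Load the annotated names (1,174 rows expected).
df = pl.read_csv("input/annotated-names/names.csv")
print(df.shape)

# Tally annotations per language category, assuming a "language" column.
print(df["language"].value_counts(sort=True))
```
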
Data file | Source | Notes | Provided |
---|---|---|---|
input/annotated-names/names.csv | HUN-REN KRTK (2024) | 1,174 manually annotated non-Hungarian names across 26 languages | Yes |
temp/llm_classifications.csv | LLM processing | Generated by classification pipeline | Generated |
temp/traditional_classifications.csv | Traditional ML | Generated by classification pipeline | Generated |
output/performance_comparison.csv | Analysis | Method comparison results | Generated |
- [x] The replication package contains programs to install all dependencies and set up the necessary directory structure.
- Python 3.11+
  - `llm` (0.15.0+) for LLM-based classification
  - `polars` (0.20.0+) for data manipulation
  - `scikit-learn` (1.3.0+) for traditional ML methods
  - `pandas` (2.0.0+) for data processing
  - `numpy` (1.24.0+) for numerical computations
  - `click` (8.0.0+) for command-line interfaces
  - The file `pyproject.toml` lists these dependencies. Run `make setup` to install all requirements.
- Make (for build automation)
- Bead (for data dependency management)
- [ ] Random seed is set in configuration files for all methods requiring randomization.
- [ ] No pseudo random generator is used in the analysis described here.
Approximate time needed to reproduce the analyses on a standard (2025) desktop machine:

- [ ] <10 minutes
- [x] 10-60 minutes
- [ ] 1-2 hours
- [ ] 2-8 hours
- [ ] 8-24 hours
- [ ] 1-3 days
- [ ] 3-14 days
- [ ] > 14 days
Approximate storage space needed:

- [x] < 25 MBytes
- [ ] 25 MB - 250 MB
- [ ] 250 MB - 2 GB
- [ ] 2 GB - 25 GB
- [ ] 25 GB - 250 GB
- [ ] > 250 GB
The code was designed to run on a standard desktop machine with 8GB+ RAM. LLM API calls may introduce variable runtime depending on service availability and rate limits.
- Programs in `code/classify/` implement different classification methods:
  - `llm_classifier.py`: Uses LLMs via the `llm` package for name classification (see the sketch below)
  - `traditional_methods.py`: Implements baseline methods using scikit-learn
- Programs in `code/evaluate/` assess method performance:
  - `individual_performance.py`: Evaluates each method against human annotations
  - `method_comparison.py`: Compares methods and generates summary statistics
- Programs in `code/utils/` contain shared utilities for data processing
- The `Makefile` orchestrates the entire workflow with targets for each major step
The code is licensed under an MIT license. See `LICENSE.txt` for details.
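
For orientation, here is a minimal sketch of how a classifier along the lines of `llm_classifier.py` can call a model through the `llm` package; the model ID, prompt wording, and abbreviated label set are illustrative assumptions, not the pipeline's actual configuration:

```python
import llm

# Abbreviated label set for illustration; the real pipeline
# distinguishes 26 language categories.
LANGUAGES = ["German", "English", "Italian", "Other"]

def classify_name(name: str, model_id: str = "gpt-4o-mini") -> str:
    """Ask an LLM for a single language label for one name."""
    model = llm.get_model(model_id)  # requires an API key configured for `llm`
    response = model.prompt(
        f"Classify the personal name '{name}' by language. "
        f"Answer with exactly one of: {', '.join(LANGUAGES)}.",
        system="You are a name-origin classifier. Reply with one label only.",
    )
    return response.text().strip()

print(classify_name("Müller"))
```

In the actual pipeline, the results for all names end up in `temp/llm_classifications.csv`.
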
- Install dependencies: `make setup`
- Load input data: `make data-load` (requires bead configuration)
- Run the full analysis pipeline: `make pipeline`
- View results in the `output/` directory
- `make setup`: Installs Python dependencies using `uv` and sets up the project environment
- `make data-load`: Uses bead to load the annotated names dataset
- `make classify-all`: Runs both LLM and traditional classification methods
- `make evaluate`: Compares all methods against human annotations
- `make pipeline`: Executes the complete analysis workflow
Individual steps can be run separately:

- `make classify-llm`: LLM-based classification only
- `make classify-other`: Traditional methods only (see the baseline sketch below)
- `make evaluate-individual`: Method-specific performance metrics
- `make evaluate-comparison`: Cross-method comparison
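
As a rough illustration of the kind of scikit-learn baseline `traditional_methods.py` could implement, the sketch below fits a character n-gram model on toy data; the feature ranges and classifier choice are assumptions, not the documented configuration:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data; the real pipeline trains on labeled name data.
names = ["Müller", "Schmidt", "Smith", "Jones", "Rossi", "Bianchi"]
labels = ["German", "German", "English", "English", "Italian", "Italian"]

# Character 2- to 4-grams capture orthographic cues such as "ü" or "cch".
pipeline = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(names, labels)
print(pipeline.predict(["Ferrari"]))
```
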
The provided code reproduces:

- [ ] All numbers provided in text in the paper
- [ ] All tables and figures in the paper
- [ ] Selected tables and figures in the paper, as explained and justified below.
Figure/Table # | Program | Output file | Note |
---|---|---|---|
Performance Table | code/evaluate/method_comparison.py | output/performance_comparison.csv | A sketch of this comparison step follows below |
Classification Results | code/classify/llm_classifier.py | temp/llm_classifications.csv | |
Baseline Results | code/classify/traditional_methods.py | temp/traditional_classifications.csv | |
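
For reference, a minimal sketch of the comparison step behind `output/performance_comparison.csv`; the join key and column names are assumptions:

```python
import polars as pl
from sklearn.metrics import accuracy_score, f1_score

# Human annotations and one method's predictions; assumes both files
# share a "name" key and carry "language" / "predicted" columns.
gold = pl.read_csv("input/annotated-names/names.csv")
pred = pl.read_csv("temp/llm_classifications.csv")
merged = gold.join(pred, on="name", how="inner")

y_true = merged["language"].to_list()
y_pred = merged["predicted"].to_list()
print("accuracy:", accuracy_score(y_true, y_pred))
print("macro F1:", f1_score(y_true, y_pred, average="macro"))
```
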
HUN-REN KRTK (distributor). 2024. "Cégjegyzék LTS [data set]." Published by Opten Zrt, Budapest. Contributions by CEU MicroData.
Project no. 144193 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the KKP_22 funding scheme.