Classification of Non-Hungarian Names by Language

Overview

This project compares different methods for classifying non-Hungarian names by language, including large language models (LLMs) via the llm Python package and traditional machine learning approaches. The analysis compares automated classification results to human annotations to evaluate method performance. The dataset contains 1,174 non-Hungarian names manually classified across 26 languages. The replicator should expect the code to run for approximately 30 minutes on a standard desktop machine.

Data Availability and Provenance Statements

Statement about Rights

I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package.

License for Data

The data are proprietary. Usage is subject to a licensing agreement with Opten Kft. See input/annotated-names/README.md for details.

Summary of Availability

All data are publicly available.
Some data cannot be made publicly available.
No data can be made publicly available.
Confidential data used in this paper and not provided as part of the public replication package will be preserved for 10 years after publication, in accordance with data retention policies.

Details on each Data Source

Data.Name	Data.Files	Location	Provided	Citation
"Annotated Foreign Names from Cégjegyzék"	names.csv	input/annotated-names/	TRUE	HUN-REN KRTK (2024)

Annotated Foreign Names from Cégjegyzék

The dataset contains 1,174 names from the Hungarian company registry (Cégjegyzék) that were not automatically classified as Hungarian names. A human annotator manually classified them by language across 26 language categories including German (416 names), English (100 names), Italian (89 names), and others. Due to privacy considerations regarding actual personal names, the data is classified as CAT3 and subject to licensing restrictions.

Datafile: input/annotated-names/names.csv
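
As a quick sanity check on the annotated file, the per-language counts quoted above (for example 416 German names) can be recomputed from the CSV. The sketch below uses only the standard library; the column name `language` is an assumption, not taken from the data documentation, so check the header of names.csv before running it.

```python
import csv
from collections import Counter
from pathlib import Path


def count_labels(csv_path, label_column="language"):
    """Count how many names carry each language label.

    `label_column` is a hypothetical column name; adjust it to the real header.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[label_column] for row in reader)
```

Calling `count_labels("input/annotated-names/names.csv")` after `make data-load` should return one entry per language category, and the values should sum to the number of annotated names.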

Dataset list

Data file	Source	Notes	Provided
`input/annotated-names/names.csv`	HUN-REN KRTK (2024)	1,174 manually annotated non-Hungarian names across 26 languages	Yes
`temp/llm_classifications.csv`	LLM processing	Generated by classification pipeline	Generated
`temp/traditional_classifications.csv`	Traditional ML	Generated by classification pipeline	Generated
`output/performance_comparison.csv`	Analysis	Method comparison results	Generated

Computational requirements

Software Requirements

The replication package contains programs to install all dependencies and set up the necessary directory structure.
Python 3.11+
- llm (0.15.0+) for LLM-based classification
- polars (0.20.0+) for data manipulation
- scikit-learn (1.3.0+) for traditional ML methods
- pandas (2.0.0+) for data processing
- numpy (1.24.0+) for numerical computations
- click (8.0.0+) for command-line interfaces
- The file pyproject.toml lists these dependencies. Run make setup to install all requirements.
Make (for build automation)
Bead (for data dependency management)
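
Before running the pipeline, it can help to confirm that the installed packages meet the minimum versions listed above. The following is a minimal sketch using `importlib.metadata` from the standard library; it is not part of the replication package, and the simple version parsing ignores pre-release suffixes.

```python
import re
from importlib import metadata

# Minimum versions copied from the list above.
MINIMUM_VERSIONS = {
    "llm": "0.15.0",
    "polars": "0.20.0",
    "scikit-learn": "1.3.0",
    "pandas": "2.0.0",
    "numpy": "1.24.0",
    "click": "8.0.0",
}


def version_tuple(version):
    """Turn '1.24.0' into (1, 24, 0), using the leading digits of each part."""
    numbers = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    while len(numbers) < 3:  # pad '2.1' to (2, 1, 0) so comparisons are fair
        numbers.append(0)
    return tuple(numbers)


def check_dependencies(minimums=MINIMUM_VERSIONS):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    for package, minimum in minimums.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            problems.append(f"{package} is not installed (need {minimum}+)")
            continue
        if version_tuple(installed) < version_tuple(minimum):
            problems.append(f"{package} {installed} is older than {minimum}")
    return problems
```

Running `print("\n".join(check_dependencies()) or "all dependencies satisfied")` inside the project environment gives a one-glance report.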

Controlled Randomness

Random seed is set in configuration files for all methods requiring randomization
No pseudo-random number generator is used in the analysis described here.

Memory, Runtime, Storage Requirements

Summary

Approximate time needed to reproduce the analyses on a standard (2025) desktop machine: about 30 minutes

Approximate storage space needed:

Details

The code was designed to run on a standard desktop machine with 8GB+ RAM. LLM API calls may introduce variable runtime depending on service availability and rate limits.

Description of programs/code

Programs in code/classify/ implement different classification methods:
- llm_classifier.py: Uses LLMs via the llm package for name classification
- traditional_methods.py: Implements baseline methods using scikit-learn
Programs in code/evaluate/ assess method performance:
- individual_performance.py: Evaluates each method against human annotations
- method_comparison.py: Compares methods and generates summary statistics
Programs in code/utils/ contain shared utilities for data processing
The Makefile orchestrates the entire workflow with targets for each major step
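
To illustrate what LLM-based name classification can look like, here is a minimal sketch. It is not the contents of llm_classifier.py: the prompt wording, the three-label subset, the fallback label, and all function names are assumptions made for illustration. The only calls into the llm package are `llm.get_model(...)`, `model.prompt(...)`, and `response.text()`, and the model is passed in as a plain callable so the parsing logic can be exercised without an API key.

```python
# Hypothetical subset of labels; the annotated data uses 26 language categories.
LANGUAGES = ["German", "English", "Italian"]
FALLBACK = "Other"


def build_prompt(name, languages=LANGUAGES):
    """Ask for exactly one label from a closed list so the reply is easy to parse."""
    options = ", ".join(languages)
    return (
        f"Which language does the personal name '{name}' most likely come from? "
        f"Answer with exactly one of: {options}, {FALLBACK}."
    )


def parse_label(reply, languages=LANGUAGES):
    """Map a free-text reply onto a known label, falling back to FALLBACK."""
    cleaned = reply.strip().strip(".").lower()
    for language in languages:
        if cleaned == language.lower():
            return language
    return FALLBACK


def classify_name(name, ask):
    """Classify one name; `ask` is any callable mapping a prompt string to reply text."""
    return parse_label(ask(build_prompt(name)))


def make_llm_ask(model_id):
    """Build an `ask` callable backed by the llm package (needs a configured API key)."""
    import llm  # imported lazily so the helpers above work without the package

    model = llm.get_model(model_id)
    return lambda prompt: model.prompt(prompt).text()
```

Restricting the answer to a closed label list keeps the output comparable with the human annotations, and the fallback label catches replies that do not match any category.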

License for Code

The code is licensed under an MIT license. See LICENSE.txt for details.

Instructions to Replicators

Install dependencies: make setup
Load input data: make data-load (requires bead configuration)
Run full analysis pipeline: make pipeline
View results in output/ directory

Details

make setup: Installs Python dependencies using uv and sets up the project environment
make data-load: Uses bead to load the annotated names dataset
make classify-all: Runs both LLM and traditional classification methods
make evaluate: Compares all methods against human annotations
make pipeline: Executes the complete analysis workflow

Individual steps can be run separately:

make classify-llm: LLM-based classification only
make classify-other: Traditional methods only
make evaluate-individual: Method-specific performance metrics
make evaluate-comparison: Cross-method comparison
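
The evaluation steps compare each method's labels with the human annotations. The sketch below shows one simple way to do that with accuracy and per-language recall. It is an illustration under assumptions, not the code in code/evaluate/: the function names are hypothetical, and the actual scripts may report different metrics.

```python
from collections import defaultdict


def accuracy(human, predicted):
    """Share of names whose predicted label equals the human label."""
    if len(human) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("no labels to evaluate")
    return sum(h == p for h, p in zip(human, predicted)) / len(human)


def per_language_recall(human, predicted):
    """For each human label, the share of its names that the method recovered."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for h, p in zip(human, predicted):
        totals[h] += 1
        hits[h] += int(h == p)
    return {language: hits[language] / totals[language] for language in totals}


def compare_methods(human, predictions_by_method):
    """Return (method, accuracy) pairs sorted from best to worst."""
    scores = {m: accuracy(human, p) for m, p in predictions_by_method.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Per-language recall is worth reporting next to overall accuracy because the label distribution is uneven (German alone accounts for 416 of the 1,174 names), so a method can score well overall while missing the smaller categories.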

List of tables and programs

The provided code reproduces:

All numbers provided in text in the paper
All tables and figures in the paper
Selected tables and figures in the paper, as explained and justified below.

Figure/Table #	Program	Output file
Performance Table	code/evaluate/method_comparison.py	output/performance_comparison.csv
Classification Results	code/classify/llm_classifier.py	temp/llm_classifications.csv
Baseline Results	code/classify/traditional_methods.py	temp/traditional_classifications.csv
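
After `make pipeline` finishes, a replicator may want to confirm that every output file in the table above was produced. The following is a small sketch for that check; the helper name `missing_outputs` is hypothetical and not part of the package.

```python
from pathlib import Path

# File names copied from the table above.
EXPECTED_OUTPUTS = [
    "output/performance_comparison.csv",
    "temp/llm_classifications.csv",
    "temp/traditional_classifications.csv",
]


def missing_outputs(root=".", expected=EXPECTED_OUTPUTS):
    """Return the expected files that are absent or empty under `root`."""
    root = Path(root)
    missing = []
    for relative in expected:
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing
```

An empty list from `missing_outputs()` run in the repository root means all three files exist and are non-empty.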

References

HUN-REN KRTK (distributor). 2024. "Cégjegyzék LTS [data set]" Published by Opten Zrt, Budapest. Contributions by CEU MicroData.

Acknowledgements

Project no. 144193 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the KKP_22 funding scheme.
