ceumicrodata/foreign-names

Classification of Non-Hungarian Names by Language

Overview

This project compares methods for classifying non-Hungarian names by language: large language models (LLMs), accessed through the llm Python package, and traditional machine learning approaches. Automated classifications are evaluated against human annotations to measure each method's performance. The dataset contains 1,174 non-Hungarian names manually classified across 26 languages. Replicators should expect the code to run for approximately 30 minutes on a standard desktop machine.

Data Availability and Provenance Statements

Statement about Rights

  • I certify that the author(s) of the manuscript have legitimate access to and permission to use the data used in this manuscript.
  • I certify that the author(s) of the manuscript have documented permission to redistribute/publish the data contained within this replication package.

License for Data

The data are proprietary. Usage is subject to a licensing agreement with Opten Kft. See input/annotated-names/README.md for details.

Summary of Availability

  • All data are publicly available.

  • Some data cannot be made publicly available.

  • No data can be made publicly available.

  • Confidential data used in this paper and not provided as part of the public replication package will be preserved for 10 years after publication, in accordance with data retention policies.

Details on each Data Source

| Data.Name | Data.Files | Location | Provided | Citation |
| --- | --- | --- | --- | --- |
| "Annotated Foreign Names from Cégjegyzék" | names.csv | input/annotated-names/ | TRUE | HUN-REN KRTK (2024) |

Annotated Foreign Names from Cégjegyzék

The dataset contains 1,174 names from the Hungarian company registry (Cégjegyzék) that were not automatically classified as Hungarian names. A human annotator manually classified them by language across 26 language categories including German (416 names), English (100 names), Italian (89 names), and others. Due to privacy considerations regarding actual personal names, the data is classified as CAT3 and subject to licensing restrictions.
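
As a rough illustration of how the per-language counts quoted above could be checked, the following sketch tallies labels in a CSV file. The column names `name` and `language` are assumptions (the actual header of names.csv is not documented here), and the sample rows are made up rather than taken from the data.

```python
import csv
import io
from collections import Counter


def count_by_language(csv_file, label_column="language"):
    """Count how many annotated names fall into each language category.

    `label_column` is a hypothetical column name; adjust it to the real header.
    """
    reader = csv.DictReader(csv_file)
    return Counter(row[label_column] for row in reader)


# Made-up sample standing in for input/annotated-names/names.csv.
sample = io.StringIO(
    "name,language\n"
    "Example One,German\n"
    "Example Two,German\n"
    "Example Three,English\n"
)

counts = count_by_language(sample)
for language, n in counts.most_common():
    print(f"{language}: {n}")
```

With the real file, the same function could be called on `open("input/annotated-names/names.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.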

Datafile: input/annotated-names/names.csv

Dataset list

| Data file | Source | Notes | Provided |
| --- | --- | --- | --- |
| input/annotated-names/names.csv | HUN-REN KRTK (2024) | 1,174 manually annotated non-Hungarian names across 26 languages | Yes |
| temp/llm_classifications.csv | LLM processing | Generated by classification pipeline | Generated |
| temp/traditional_classifications.csv | Traditional ML | Generated by classification pipeline | Generated |
| output/performance_comparison.csv | Analysis | Method comparison results | Generated |

Computational requirements

Software Requirements

  • The replication package contains programs to install all dependencies and set up the necessary directory structure.

  • Python 3.11+

    • llm (0.15.0+) for LLM-based classification
    • polars (0.20.0+) for data manipulation
    • scikit-learn (1.3.0+) for traditional ML methods
    • pandas (2.0.0+) for data processing
    • numpy (1.24.0+) for numerical computations
    • click (8.0.0+) for command-line interfaces
    • The file pyproject.toml lists these dependencies. Run make setup to install all requirements.
  • Make (for build automation)

  • Bead (for data dependency management)

Controlled Randomness

  • Random seed is set in configuration files for all methods requiring randomization
  • No pseudo-random generator is used in the analysis described here.
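
For any step that does rely on randomization, a seeded generator makes the result repeatable. The sketch below is only an illustration: the `CONFIG` dictionary, its `random_seed` key, and the `split_names` helper are hypothetical and do not describe the package's actual configuration files.

```python
import random

# Hypothetical configuration; the real package keeps its seed in its own config files.
CONFIG = {"random_seed": 42}


def split_names(names, test_fraction, seed):
    """Shuffle names with a dedicated seeded generator and split off a test set."""
    rng = random.Random(seed)  # local generator, so global random state is untouched
    shuffled = list(names)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


names = [f"name_{i}" for i in range(8)]
train_a, test_a = split_names(names, 0.25, CONFIG["random_seed"])
train_b, test_b = split_names(names, 0.25, CONFIG["random_seed"])
print(test_a == test_b)  # the same seed yields the same split
```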

Memory, Runtime, Storage Requirements

Summary

Approximate time needed to reproduce the analyses on a standard (2025) desktop machine:

  • <10 minutes
  • 10-60 minutes
  • 1-2 hours
  • 2-8 hours
  • 8-24 hours
  • 1-3 days
  • 3-14 days
  • > 14 days

Approximate storage space needed:

  • < 25 MBytes
  • 25 MB - 250 MB
  • 250 MB - 2 GB
  • 2 GB - 25 GB
  • 25 GB - 250 GB
  • > 250 GB

Details

The code was designed to run on a standard desktop machine with 8GB+ RAM. LLM API calls may introduce variable runtime depending on service availability and rate limits.
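
One common way to cope with rate limits and transient service errors is to retry with exponential backoff. The sketch below is a generic pattern, not the package's own error handling; `call_with_backoff` and its parameters are hypothetical names, and the failing call is simulated.

```python
import time


def call_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); after each failure wait base_delay * 2**attempt, then retry.

    Catches every Exception for brevity; real code would catch only the
    API client's rate-limit or timeout errors.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated API call that fails twice before succeeding.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return "German"


delays = []  # record the waits instead of actually sleeping
result = call_with_backoff(flaky, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting.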

Description of programs/code

  • Programs in code/classify/ implement different classification methods:
    • llm_classifier.py: Uses LLMs via the llm package for name classification
    • traditional_methods.py: Implements baseline methods using scikit-learn
  • Programs in code/evaluate/ assess method performance:
    • individual_performance.py: Evaluates each method against human annotations
    • method_comparison.py: Compares methods and generates summary statistics
  • Programs in code/utils/ contain shared utilities for data processing
  • The Makefile orchestrates the entire workflow with targets for each major step
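
To give a sense of what LLM-based classification through the llm package might look like, here is a minimal sketch. It is not the contents of llm_classifier.py: the prompt wording, the `LANGUAGES` subset, the helper names, and the model id `gpt-4o-mini` are all assumptions. The model call is isolated in `ask_llm` so the rest can be exercised with a stand-in function and no API key.

```python
# Illustrative subset only; the dataset uses 26 language categories.
LANGUAGES = ["German", "English", "Italian"]


def build_prompt(name, languages=LANGUAGES):
    """Compose a single-answer classification prompt (hypothetical wording)."""
    options = ", ".join(languages)
    return (
        f"Which language does the personal name '{name}' most likely come from? "
        f"Answer with exactly one of: {options}."
    )


def parse_label(reply, languages=LANGUAGES):
    """Map a free-text reply to a known label, or None if it matches nothing."""
    cleaned = reply.strip().strip(".").lower()
    for language in languages:
        if cleaned == language.lower():
            return language
    return None


def classify(name, ask):
    """Classify one name; `ask` is any callable mapping a prompt to reply text."""
    return parse_label(ask(build_prompt(name)))


def ask_llm(prompt, model_id="gpt-4o-mini"):
    """Send the prompt through the llm package (needs llm installed and an API key)."""
    import llm  # imported lazily so the sketch runs without the package

    model = llm.get_model(model_id)  # model id is an assumption
    return model.prompt(prompt).text()


# Stand-in for the model so the example runs offline.
fake_ask = lambda prompt: " German.\n"
label = classify("Example Name", fake_ask)
print(label)
```

Replacing `fake_ask` with `ask_llm` would issue a real request. Returning `None` for unrecognised replies keeps malformed model output from being counted as a valid label.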

License for Code

The code is licensed under an MIT license. See LICENSE.txt for details.

Instructions to Replicators

  • Install dependencies: make setup
  • Load input data: make data-load (requires bead configuration)
  • Run full analysis pipeline: make pipeline
  • View results in output/ directory

Details

  • make setup: Installs Python dependencies using uv and sets up the project environment
  • make data-load: Uses bead to load the annotated names dataset
  • make classify-all: Runs both LLM and traditional classification methods
  • make evaluate: Compares all methods against human annotations
  • make pipeline: Executes the complete analysis workflow

Individual steps can be run separately:

  • make classify-llm: LLM-based classification only
  • make classify-other: Traditional methods only
  • make evaluate-individual: Method-specific performance metrics
  • make evaluate-comparison: Cross-method comparison

List of tables and programs

The provided code reproduces:

  • All numbers provided in text in the paper
  • All tables and figures in the paper
  • Selected tables and figures in the paper, as explained and justified below.

| Figure/Table # | Program | Output file | Note |
| --- | --- | --- | --- |
| Performance Table | code/evaluate/method_comparison.py | output/performance_comparison.csv | |
| Classification Results | code/classify/llm_classifier.py | temp/llm_classifications.csv | |
| Baseline Results | code/classify/traditional_methods.py | temp/traditional_classifications.csv | |
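
The performance comparison rests on scoring predicted labels against the human annotations. The sketch below shows one plausible set of metrics (overall accuracy plus per-language precision and recall). Which metrics method_comparison.py actually reports is not stated here, so treat the function names and the choice of metrics as assumptions; the labels are made up.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Share of names whose predicted language equals the human annotation."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def per_language_scores(gold, predicted):
    """Return {language: (precision, recall)} for every label that occurs."""
    gold_n = Counter(gold)
    pred_n = Counter(predicted)
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    scores = {}
    for language in sorted(set(gold) | set(predicted)):
        precision = true_pos[language] / pred_n[language] if pred_n[language] else 0.0
        recall = true_pos[language] / gold_n[language] if gold_n[language] else 0.0
        scores[language] = (precision, recall)
    return scores


# Made-up labels for four names.
gold = ["German", "German", "English", "Italian"]
pred = ["German", "English", "English", "Italian"]

print(accuracy(gold, pred))
for language, (p, r) in per_language_scores(gold, pred).items():
    print(f"{language}: precision={p:.2f} recall={r:.2f}")
```

Per-language scores matter here because the categories are unbalanced (German alone accounts for 416 of the 1,174 names), so overall accuracy can hide weak performance on rarer languages.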

References

HUN-REN KRTK (distributor). 2024. "Cégjegyzék LTS [data set]." Published by Opten Zrt, Budapest. Contributions by CEU MicroData.

Acknowledgements

Project no. 144193 has been implemented with the support provided by the Ministry of Culture and Innovation of Hungary from the National Research, Development and Innovation Fund, financed under the KKP_22 funding scheme.
