HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:
*Figure: Distribution of data sources in the HomoRich dataset.*

*Figure: The sources used for different parts of the HomoRich dataset.*
Persian G2P systems use two common phoneme formats:
- Repr. 1: Used in KaamelDict and SentenceBench (compatible with prior studies)
- Repr. 2: Adopted by GE2PE (state-of-the-art model enhanced in this work)
The HomoRich dataset includes both formats for broad compatibility; a short usage sketch for selecting either representation follows the example record below.
Load the dataset directly from Hugging Face:
```python
import pandas as pd
from datasets import Dataset

file_urls = [
    "https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian/resolve/main/data/part_01.parquet",
    "https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian/resolve/main/data/part_02.parquet",
    "https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian/resolve/main/data/part_03.parquet",
]

# Download the three parquet shards and combine them into one dataset
df = pd.concat([pd.read_parquet(url) for url in file_urls], ignore_index=True)
dataset = Dataset.from_pandas(df)
```
Each record has the following structure:

```python
{
    'Grapheme': 'روی دیوار ننویسید.',
    'Phoneme': 'ruye divAr nanevisid',
    'Homograph Grapheme': 'رو',
    'Homograph Phoneme': 'ru',
    'Source': 'human',
    'Source ID': 0,
    'Mapped Phoneme': 'ruye1 divar n/nevisid',
    'Mapped Homograph Phoneme': 'ru'
}
```
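If you only need one of the two phoneme representations, or want to check the per-variant balance and source mix described above, a minimal sketch over the combined dataframe could look like the following. It assumes `df` from the loading snippet, that `Phoneme`/`Homograph Phoneme` hold Repr. 1 and `Mapped Phoneme`/`Mapped Homograph Phoneme` hold Repr. 2 (consistent with the example record), and that non-homograph rows leave `Homograph Grapheme` empty; adjust if the actual column semantics differ.

```python
# Sketch: inspect the source mix, variant balance, and pick a phoneme representation.
# Column roles are inferred from the example record above.

# Breakdown of rows by data source (e.g. human, LLM-generated).
print(df["Source"].value_counts())

# Per-variant sample counts for homograph-focused rows
# (assumption: non-homograph rows have an empty 'Homograph Grapheme').
homograph_df = df[df["Homograph Grapheme"].notna() & (df["Homograph Grapheme"] != "")]
variant_counts = homograph_df.groupby(["Homograph Grapheme", "Homograph Phoneme"]).size()
print(variant_counts.describe())  # expect roughly ~500 samples per variant

# Keep only the Repr. 1 (KaamelDict/SentenceBench-style) columns ...
repr1 = df[["Grapheme", "Phoneme", "Homograph Grapheme", "Homograph Phoneme"]]
# ... or the Repr. 2 (GE2PE-style) columns.
repr2 = df[["Grapheme", "Mapped Phoneme", "Homograph Grapheme", "Mapped Homograph Phoneme"]]
```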
The dataset was used to improve two G2P systems:
- Homo-GE2PE (neural, T5-based): 76.89% homograph accuracy (a 29.72% improvement).
- HomoFast eSpeak (rule-based): 74.53% homograph accuracy with real-time performance (a 30.66% improvement).

See Table 3 of the paper for full metrics.
The `scripts` folder contains two key notebooks used in the dataset creation and processing pipeline:

- `Generate_Homograph_Sentences.ipynb`: Implements the prompt templates used to generate homograph-focused sentences, as described in the paper Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
- `Phonemize_Sentences.ipynb`: Applies the phonemization process based on the LLM-powered G2P method detailed in the paper LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study.
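The exact prompt templates are the ones in `Generate_Homograph_Sentences.ipynb` and the paper; purely as an illustration of the general idea (asking an LLM for sentences that force one specific pronunciation of a homograph), a hypothetical prompt builder might look like the sketch below. The function name, wording, and arguments are illustrative assumptions, not the templates actually used.

```python
def build_homograph_prompt(homograph: str, pronunciation: str, meaning_hint: str, n_sentences: int = 10) -> str:
    """Illustrative only: compose a generation prompt for one homograph variant.

    The real templates used for HomoRich are in Generate_Homograph_Sentences.ipynb.
    """
    return (
        f"Write {n_sentences} natural Persian sentences that contain the word '{homograph}' "
        f"used in the sense of '{meaning_hint}', so that it must be pronounced /{pronunciation}/. "
        "Use the word in varied positions and contexts, one sentence per line."
    )

# Example with hypothetical values:
print(build_homograph_prompt("رو", "ru", "surface/on", n_sentences=5))
```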
- Dataset: Released under CC0-1.0 (public domain).
- Code/Models: MIT License (where applicable).
If you use this project in your work, please cite the corresponding paper:
```bibtex
@misc{qharabagh2025fastfancyrethinkingg2p,
  title={Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models},
  author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
  year={2025},
  eprint={2505.12973},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.12973},
}
```
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.