HomoRich is the first large-scale, sentence-level Persian homograph dataset designed for grapheme-to-phoneme (G2P) conversion tasks. It addresses the scarcity of balanced, contextually annotated homograph data for low-resource languages. The dataset was created using a semi-automated pipeline combining human expertise and LLM-generated samples, as described in the paper:
Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
The dataset contains 528,891 annotated Persian sentences (327,475 homograph-focused) covering 285 homograph words with 2-4 pronunciation variants each. Variants are equally represented (~500 samples each) to mitigate bias. The composition blends multiple sources for diversity, as shown below:
*Figure: Distribution of data sources in the HomoRich dataset.*

*Figure: The sources used for different parts of the HomoRich dataset.*
Persian G2P systems use two common phoneme formats:
- Repr. 1: Used in KaamelDict and SentenceBench (compatible with prior studies)
- Repr. 2: Adopted by GE2PE (state-of-the-art model enhanced in this work)
The HomoRich dataset includes both formats for broad compatibility; a short usage sketch for selecting either representation follows the example record below.
Load the dataset directly from Hugging Face:
```python
import pandas as pd
from datasets import Dataset

file_urls = [
    "https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian/resolve/main/data/part_01.parquet",
    "https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian/resolve/main/data/part_02.parquet",
    "https://huggingface.co/datasets/MahtaFetrat/HomoRich-G2P-Persian/resolve/main/data/part_03.parquet",
]

# Download the three parquet shards and combine them into one dataset
df = pd.concat([pd.read_parquet(url) for url in file_urls], ignore_index=True)
dataset = Dataset.from_pandas(df)
```
Each record has the following structure:

```python
{
    'Grapheme': 'روی دیوار ننویسید.',
    'Phoneme': 'ruye divAr nanevisid',
    'Homograph Grapheme': 'رو',
    'Homograph Phoneme': 'ru',
    'Source': 'human',
    'Source ID': 0,
    'Mapped Phoneme': 'ruye1 divar n/nevisid',
    'Mapped Homograph Phoneme': 'ru'
}
```
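If you only need one of the two phoneme representations, or want to check the per-variant balance and source mix described above, a minimal sketch over the combined dataframe could look like the following. It assumes `df` from the loading snippet, that `Phoneme`/`Homograph Phoneme` hold Repr. 1 and `Mapped Phoneme`/`Mapped Homograph Phoneme` hold Repr. 2 (consistent with the example record), and that non-homograph rows leave `Homograph Grapheme` empty; adjust if the actual column semantics differ.

```python
# Sketch: inspect the source mix, variant balance, and pick a phoneme representation.
# Column roles are inferred from the example record above.

# Breakdown of rows by data source (e.g. human, LLM-generated).
print(df["Source"].value_counts())

# Per-variant sample counts for homograph-focused rows
# (assumption: non-homograph rows have an empty 'Homograph Grapheme').
homograph_df = df[df["Homograph Grapheme"].notna() & (df["Homograph Grapheme"] != "")]
variant_counts = homograph_df.groupby(["Homograph Grapheme", "Homograph Phoneme"]).size()
print(variant_counts.describe())  # expect roughly ~500 samples per variant

# Keep only the Repr. 1 (KaamelDict/SentenceBench-style) columns ...
repr1 = df[["Grapheme", "Phoneme", "Homograph Grapheme", "Homograph Phoneme"]]
# ... or the Repr. 2 (GE2PE-style) columns.
repr2 = df[["Grapheme", "Mapped Phoneme", "Homograph Grapheme", "Mapped Homograph Phoneme"]]
```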
The dataset was used to improve two G2P systems:
- Homo-GE2PE (neural, T5-based): 76.89% homograph accuracy (a 29.72% improvement).
- HomoFast eSpeak (rule-based): 74.53% homograph accuracy with real-time performance (a 30.66% improvement).

See Table 3 of the paper for full metrics.
The `scripts` folder contains two key notebooks used in the dataset creation and processing pipeline:

- `Generate_Homograph_Sentences.ipynb`: Implements the prompt templates used to generate homograph-focused sentences, as described in the paper Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models.
- `Phonemize_Sentences.ipynb`: Applies the phonemization process based on the LLM-powered G2P method detailed in the paper LLM-Powered Grapheme-to-Phoneme Conversion: Benchmark and Case Study.
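The exact prompt templates are the ones in `Generate_Homograph_Sentences.ipynb` and the paper; purely as an illustration of the general idea (asking an LLM for sentences that force one specific pronunciation of a homograph), a hypothetical prompt builder might look like the sketch below. The function name, wording, and arguments are illustrative assumptions, not the templates actually used.

```python
def build_homograph_prompt(homograph: str, pronunciation: str, meaning_hint: str, n_sentences: int = 10) -> str:
    """Illustrative only: compose a generation prompt for one homograph variant.

    The real templates used for HomoRich are in Generate_Homograph_Sentences.ipynb.
    """
    return (
        f"Write {n_sentences} natural Persian sentences that contain the word '{homograph}' "
        f"used in the sense of '{meaning_hint}', so that it must be pronounced /{pronunciation}/. "
        "Use the word in varied positions and contexts, one sentence per line."
    )

# Example with hypothetical values:
print(build_homograph_prompt("رو", "ru", "surface/on", n_sentences=5))
```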
- Dataset: Released under CC0-1.0 (public domain).
- Code/Models: MIT License (where applicable).
If you use this project in your work, please cite the corresponding paper:
```bibtex
@misc{qharabagh2025fastfancyrethinkingg2p,
  title={Fast, Not Fancy: Rethinking G2P with Rich Data and Rule-Based Models},
  author={Mahta Fetrat Qharabagh and Zahra Dehghanian and Hamid R. Rabiee},
  year={2025},
  eprint={2505.12973},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.12973},
}
```
Contributions and pull requests are welcome. Please open an issue to discuss the changes you intend to make.