Indonesian Word Normalization Implementation

Overview

This repo contains an implementation of word normalization using a character-level seq2seq approach. The Transformer architecture from Vaswani et al. (2017) is applied to the Colloquial Indonesian Lexicon from Salsabila et al. (2018) and the IndoCollex dataset from Wibowo et al. (2021). The data is taken from those sources and converted into JSON files containing informal-formal word pairs. A combined dataset that merges both sources is also provided. Since the Colloquial Indonesian Lexicon ships as a single file, it is split into train, validation, and test sets in an 80:10:10 proportion.
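The 80:10:10 split described above could look roughly like the sketch below. It is not the repo's actual preprocessing code: the function name split_pairs, the "informal"/"formal" field names, the fixed seed, and the shuffle step are all assumptions made for illustration, and the word pairs are placeholders rather than real lexicon entries.

```python
import json
import random


def split_pairs(pairs, seed=42):
    """Shuffle word pairs and split them 80:10:10 into train/valid/test.

    Hypothetical helper: the real repo may split differently.
    """
    shuffled = list(pairs)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 8 // 10  # 80% for training
    n_valid = n // 10      # 10% for validation, the remainder is the test set
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


# Placeholder pairs; the field names are assumed, not taken from the repo.
pairs = [{"informal": f"informal_{i}", "formal": f"formal_{i}"} for i in range(100)]
splits = split_pairs(pairs)

# Each split can then be serialized as JSON, one file per split.
train_json = json.dumps(splits["train"], ensure_ascii=False, indent=2)
```

Integer arithmetic is used for the split sizes so that rounding never drops or duplicates a pair; any leftover items fall into the test set.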

Running Script

 python run_experiment.py --dataset_name <dataset_name> --num_epoch 200 --config_name <config_name> --model_name <output_model_name>

The dataset_name parameter accepts the following values:

indocollex for IndoCollex dataset
col_id_norm for Colloquial Indonesian Lexicon dataset
combined for combined dataset from IndoCollex and Colloquial Indonesian Lexicon
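As a rough illustration of how the command-line flags and the three dataset names above fit together, here is a minimal argparse sketch. It is not the content of run_experiment.py: the build_parser function, the default of 200 epochs (taken from the example command), and the example_config and example_model values are assumptions.

```python
import argparse

# Dataset names listed in this README.
DATASET_NAMES = ("indocollex", "col_id_norm", "combined")


def build_parser():
    """Build a parser mirroring the flags shown in the example command (hypothetical)."""
    parser = argparse.ArgumentParser(description="Train a word normalization model.")
    parser.add_argument("--dataset_name", choices=DATASET_NAMES, required=True)
    # Default of 200 is an assumption based on the example command.
    parser.add_argument("--num_epoch", type=int, default=200)
    # Name of a file in the ./config folder.
    parser.add_argument("--config_name", required=True)
    # Name under which the trained model is saved.
    parser.add_argument("--model_name", required=True)
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args([
    "--dataset_name", "combined",
    "--config_name", "example_config",  # hypothetical config file name
    "--model_name", "example_model",    # hypothetical output name
])
```

Restricting dataset_name with choices makes a mistyped name fail immediately with a usage message instead of a later file-not-found error.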

Set config_name to the name of a file in the ./config folder. That folder contains configurations for the Transformer architecture from:

The original "Attention is All You Need" paper by Vaswani et al. (2017)
The 'smaller transformer' used by Wu et al. (2021) in "Applying the Transformer to Character-level Transduction", with different dropout values

The original dataset repo

Cited Research

If you use this implementation, please cite the following works:

Aliyah Salsabila, N., Ardhito Winatmoko, Y., Akbar Septiandri, A., Jamal, A., 2018. Colloquial Indonesian Lexicon, in: 2018 International Conference on Asian Language Processing (IALP). Presented at the 2018 International Conference on Asian Language Processing (IALP), pp. 226–229. https://doi.org/10.1109/IALP.2018.8629151
Wibowo, H.A., Nityasya, M.N., Akyürek, A.F., Fitriany, S., Aji, A.F., Prasojo, R.E., Wijaya, D.T., 2021. IndoCollex: A Testbed for Morphological Transformation of Indonesian Colloquial Words, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Presented at the Findings 2021, Association for Computational Linguistics, Online, pp. 3170–3183. https://doi.org/10.18653/v1/2021.findings-acl.280
Wu, S., Cotterell, R., Hulden, M., 2021. Applying the Transformer to Character-level Transduction, in: Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. Presented at the EACL 2021, Association for Computational Linguistics, Online, pp. 1901–1907. https://doi.org/10.18653/v1/2021.eacl-main.163
