Enhancing GloVe Embeddings for Low-resource Languages with Graph Knowledge

This project aims to enhance GloVe embeddings for low-resource languages by leveraging graph knowledge, and to provide a centralized repository of pre-trained static word embeddings across diverse languages. The table below lists the languages covered by the project, along with each language's dataset size, class, and ConceptNet data availability.

Language Data Details

ISO	Language Name	Dataset Size	Class	GloVe Embeddings	Vocab Size	ConceptNet Data
ss	Swati	86K	1	-----------	----------	✘
sc	Sardinian	143K	1	-----------	----------	✓
yo	Yoruba	1.1M	2	-----------	----------	✓
gn	Guarani	1.5M	1	-----------	----------	✓
qu	Quechua	1.5M	1	-----------	----------	✓
ns	Northern Sotho	1.8M	1	-----------	----------	✘
li	Limburgish	2.2M	1	-----------	----------	✓
ln	Lingala	2.3M	1	-----------	----------	✓
wo	Wolof	3.6M	2	-----------	----------	✓
zu	Zulu	4.3M	2	-----------	----------	✓
rm	Romansh	4.8M	1	-----------	----------	✓
ig	Igbo	6.6M	1	-----------	----------	✘
lg	Ganda	7.3M	1	-----------	----------	✘
as	Assamese	7.6M	1	-----------	----------	✘
tn	Tswana	8.0M	2	-----------	----------	✘
ht	Haitian	9.1M	2	-----------	----------	✓
om	Oromo	11M	1	-----------	----------	✘
su	Sundanese	15M	1	-----------	----------	✓
bs	Bosnian	18M	3	-----------	----------	✘
br	Breton	21M	1	-----------	----------	✓
gd	Scottish Gaelic	22M	1	-----------	----------	✓
xh	Xhosa	25M	2	-----------	----------	✓
mg	Malagasy	29M	1	-----------	----------	✓
jv	Javanese	37M	1	-----------	----------	✓
fy	Frisian	38M	0	-----------	----------	✓
sa	Sanskrit	44M	2	-----------	----------	✓
my	Burmese	46M	1	-----------	----------	✓
ug	Uyghur	46M	1	-----------	----------	✓
yi	Yiddish	51M	1	-----------	----------	✓
or	Oriya	56M	1	-----------	----------	✓
ha	Hausa	61M	2	-----------	----------	✓
lo	Lao	63M	2	-----------	----------	✓
sd	Sindhi	67M	1	-----------	----------	✓
ta_rom	Tamil Romanized	68M	3	-----------	----------	✘
so	Somali	78M	1	-----------	----------	✓
te_rom	Telugu Romanized	79M	1	-----------	----------	✘
ku	Kurdish	90M	0	-----------	----------	✓
pa	Punjabi	90M	2	-----------	----------	✓
ps	Pashto	107M	1	-----------	----------	✓
ga	Irish	108M	2	-----------	----------	✓
am	Amharic	133M	2	-----------	----------	✓
ur_rom	Urdu Romanized	141M	3	-----------	----------	✘
km	Khmer	153M	1	-----------	----------	✓
uz	Uzbek	155M	3	-----------	----------	✓
bn_rom	Bengali Romanized	164M	3	-----------	----------	✘
ky	Kyrgyz	173M	3	-----------	----------	✓
my_zaw	Burmese (Zawgyi)	178M	1	-----------	----------	✘
cy	Welsh	179M	1	-----------	----------	✓
gu	Gujarati	242M	1	-----------	----------	✓
eo	Esperanto	250M	1	-----------	----------	✓
af	Afrikaans	305M	3	-----------	----------	✓
sw	Swahili	332M	2	-----------	----------	✓
mr	Marathi	334M	2	-----------	----------	✓
kn	Kannada	360M	1	-----------	----------	✓
ne	Nepali	393M	1	-----------	----------	✓
mn	Mongolian	397M	1	-----------	----------	✓
si	Sinhala	452M	0	-----------	----------	✓
te	Telugu	536M	1	-----------	----------	✓
la	Latin	609M	3	-----------	----------	✓
be	Belarusian	692M	3	-----------	----------	✓
tl	Tagalog	701M	3	-----------	----------	✘
mk	Macedonian	706M	1	-----------	----------	✓
gl	Galician	708M	3	-----------	----------	✓
hy	Armenian	776M	1	-----------	----------	✓
is	Icelandic	779M	2	-----------	----------	✓
ml	Malayalam	831M	1	-----------	----------	✓
bn	Bengali	860M	3	-----------	----------	✓
ur	Urdu	884M	3	-----------	----------	✓
kk	Kazakh	889M	3	-----------	----------	✓
ka	Georgian	1.1G	3	-----------	----------	✓
az	Azerbaijani	1.3G	1	-----------	----------	✓
sq	Albanian	1.3G	1	-----------	----------	✓
ta	Tamil	1.3G	3	-----------	----------	✓
et	Estonian	1.7G	3	-----------	----------	✓
lv	Latvian	2.1G	3	-----------	----------	✓
ms	Malay	2.1G	3	-----------	----------	✓
sl	Slovenian	2.8G	3	-----------	----------	✓
lt	Lithuanian	3.4G	3	-----------	----------	✓
he	Hebrew	6.1G	3	-----------	----------	✓
sk	Slovak	6.1G	3	-----------	----------	✓
el	Greek	7.4G	3	-----------	----------	✓
th	Thai	8.7G	3	-----------	----------	✓
bg	Bulgarian	9.3G	3	-----------	----------	✓
da	Danish	12G	3	-----------	----------	✓
uk	Ukrainian	14G	3	-----------	----------	✓
ro	Romanian	16G	3	-----------	----------	✓
id	Indonesian	36G	3	-----------	----------	✘

ConceptNet Data Details

For this project, we extracted all data from the ConceptNet database. The extraction process involved several steps to clean and analyze the data from the official ConceptNet dump.

The final extracted dataset is a JSON file representing a dictionary with language codes and start and end edges for each language. Start edges represent the unique words in a target language, while end edges are the words related to the start edges through various types of relationships. The relationship types and sources are not extracted.
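As a rough illustration of the structure described above, the following sketch groups end words under each start word, per language. It assumes the tab-separated layout of the ConceptNet assertions dump (edge URI, relation, start node, end node, JSON metadata) and a nested layout of language code → start word → list of end words. The function names `parse_node` and `build_edge_dict`, the nested layout, and the sample rows are illustrative assumptions, not the project's actual extraction code or data.

```python
import json
from collections import defaultdict


def parse_node(uri):
    """Split a concept URI such as '/c/yo/omi' into (language, term)."""
    parts = uri.split("/")
    # Expected shape: ['', 'c', '<lang>', '<term>', ...]; anything else is skipped.
    if len(parts) < 4 or parts[1] != "c":
        return None
    return parts[2], parts[3].replace("_", " ")


def build_edge_dict(rows, target_langs):
    """Collect end words for every start word in the target languages.

    Relation types and sources are ignored, as in the description above.
    Assumption: end words are kept regardless of their language.
    """
    edges = defaultdict(lambda: defaultdict(set))
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # malformed line
        start, end = parse_node(fields[2]), parse_node(fields[3])
        if start is None or end is None:
            continue
        lang, word = start
        if lang in target_langs:
            edges[lang][word].add(end[1])
    # Convert sets to sorted lists so the result is JSON-serializable.
    return {
        lang: {word: sorted(ends) for word, ends in words.items()}
        for lang, words in edges.items()
    }


# Made-up rows in the assumed dump layout, for illustration only.
sample_rows = [
    "/a/[/r/Synonym/,/c/yo/omi/,/c/en/water/]\t/r/Synonym\t/c/yo/omi\t/c/en/water\t{}",
    "/a/[/r/RelatedTo/,/c/yo/omi/,/c/yo/odo/]\t/r/RelatedTo\t/c/yo/omi\t/c/yo/odo\t{}",
    "/a/[/r/IsA/,/c/en/dog/,/c/en/animal/]\t/r/IsA\t/c/en/dog\t/c/en/animal\t{}",
]

result = build_edge_dict(sample_rows, {"yo"})
print(json.dumps(result, ensure_ascii=False))
```

On the real dump, `rows` could be the lines of the decompressed assertions file, and `target_langs` the set of codes from the table above.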

The dataset, as well as details on the amount of extracted data for each language, are available on Hugging Face.

Contributing

Contributions are welcome! Please create a pull request or open an issue for any suggestions or bug reports.

License

This project is licensed under the MIT License - see the LICENSE file for details.

pyRis/retrofitting-embeddings-lrls
