See `report.pdf` for details.
- Use `env.yml` to create a conda environment identical to ours.
- Unfortunately, since we had the comfort of an Azure GPU instance, we did not try to run these notebooks on Google Colab. If you run into problems, message us and we'll give you access to our private JupyterLab instance on Azure.
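Assuming `env.yml` is a standard conda environment file, the environment can be created with:

```sh
conda env create -f env.yml
```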
Main notebooks:
`no_en_data_prep.ipynb`
- Remove all letters except Georgian ones; filter the data; separate sentences and split the data into train, test, and validation sets (sketched after this list).

`eda.ipynb`
- EDA.

`w2v.ipynb`
- word2vec training on the cleaned data (sketched below).

`word2vec_analysis.ipynb`
- This analysis was done on a bigger/earlier w2v model (from `geolm.ipynb`), before we filtered out English.

`geolm.ipynb`
- Notebook for training a next-token-prediction transformer (with or without the old w2v).
- We gave up on this model since it was not learning anything and the loss was not decreasing.
- We didn't try any evaluation for it since it was hopeless: any sentence would only generate the same tokens.

`v2bert-tokenizer-training.ipynb`
- Training the subword tokenizer (sketched below, together with `subword-w2v.ipynb`).

`subword-w2v.ipynb`
- Subword word2vec and its simple analysis. Subwording is done with the tokenizer from above.

`bert-v2-train.ipynb`
- Notebook for training BERT with or without word2vec (see the embedding-initialization sketch below).

`bert-v1-train.ipynb`
- Historical training notebook for the v1 BERT (no w2v). Trained for 10 hours (no test data).

`bert-textgen-eval.ipynb`
- Text generation and evaluation for the BERT models (a generation sketch follows).

`lm_demo.ipynb`
- A stripped-down version of the above.
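For orientation, here are a few hedged sketches of the main steps. First, the `no_en_data_prep.ipynb` pipeline; the file name, the Unicode range, and the split ratios are illustrative assumptions, not the notebook's exact code:

```python
import random
import re

# Assumption: the raw corpus lives in one UTF-8 text file.
with open("corpus.txt", encoding="utf-8") as f:
    raw = f.read()

# Keep only Georgian (Mkhedruli, U+10D0-U+10FF) letters, whitespace,
# and sentence-ending punctuation.
cleaned = re.sub(r"[^\u10D0-\u10FF\s.!?]", " ", raw)

# Naive sentence split on end-of-sentence punctuation.
sentences = [s.strip() for s in re.split(r"[.!?]+", cleaned) if s.strip()]

# 80/10/10 train/test/validation split (ratios are an assumption).
random.seed(42)
random.shuffle(sentences)
n = len(sentences)
train = sentences[: int(0.8 * n)]
test = sentences[int(0.8 * n) : int(0.9 * n)]
val = sentences[int(0.9 * n) :]
```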
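The `w2v.ipynb` step, roughly: train gensim word2vec on the cleaned sentences. Hyperparameters here are placeholders, and the query word is just an example:

```python
from gensim.models import Word2Vec

# `train` is the cleaned sentence list from the previous sketch;
# tokenization here is naive whitespace splitting.
tokenized = [s.split() for s in train]

w2v = Word2Vec(sentences=tokenized, vector_size=256, window=5, min_count=5, workers=4)
w2v.save("ka_w2v.model")

# Sanity check in the spirit of word2vec_analysis.ipynb: nearest neighbours of a word.
print(w2v.wv.most_similar("საქართველო", topn=5))
```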
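`v2bert-tokenizer-training.ipynb` and `subword-w2v.ipynb` together, roughly: train a WordPiece tokenizer with the HuggingFace `tokenizers` library, then run word2vec over the subword tokens. Vocab size and paths are assumptions:

```python
import os

from gensim.models import Word2Vec
from tokenizers import BertWordPieceTokenizer

# Assumption: train.txt holds the cleaned training sentences, one per line.
tokenizer = BertWordPieceTokenizer()
tokenizer.train(files=["train.txt"], vocab_size=30_000, min_frequency=2)
os.makedirs("ka_tokenizer", exist_ok=True)
tokenizer.save_model("ka_tokenizer")

# Subword word2vec: encode each sentence into subword tokens, then train on those.
subword_sents = [tokenizer.encode(s).tokens for s in train]
sub_w2v = Word2Vec(sentences=subword_sents, vector_size=256, window=5, min_count=5)
sub_w2v.save("ka_subword_w2v.model")
```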
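For `bert-v2-train.ipynb`'s "with word2vec" variant, the natural reading is that BERT's input embeddings get seeded from the subword word2vec vectors. A sketch of that idea, reusing `tokenizer` and `sub_w2v` from the previous sketch (model sizes are assumptions; `hidden_size` must equal the w2v `vector_size` for a direct copy):

```python
import torch
from transformers import BertConfig, BertForMaskedLM

config = BertConfig(
    vocab_size=tokenizer.get_vocab_size(),
    hidden_size=256,        # matches vector_size above
    num_attention_heads=8,  # must divide hidden_size
    num_hidden_layers=6,
)
model = BertForMaskedLM(config)

# Copy word2vec vectors into the input embedding matrix for every token w2v knows.
with torch.no_grad():
    emb = model.get_input_embeddings().weight
    for token, idx in tokenizer.get_vocab().items():
        if token in sub_w2v.wv:
            emb[idx] = torch.tensor(sub_w2v.wv[token])
```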
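Finally, the generation idea behind `bert-textgen-eval.ipynb`: BERT is a masked LM, so one common trick is to append a `[MASK]` token and fill it greedily, one token at a time. This is a generic approach under assumed paths, not necessarily the notebook's exact procedure:

```python
import torch
from transformers import BertForMaskedLM, BertTokenizerFast

model_dir = "./polished/models/v2bert"  # placeholder for the trained model directory
tokenizer = BertTokenizerFast.from_pretrained(model_dir)
model = BertForMaskedLM.from_pretrained(model_dir).eval()

text = "საქართველო"  # example prompt
for _ in range(10):
    inputs = tokenizer(text + " " + tokenizer.mask_token, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Find the [MASK] position and take its most likely replacement.
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    next_id = logits[0, mask_pos].argmax().item()
    text += " " + tokenizer.decode([next_id])  # note: "##" pieces may need merging
print(text)
```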
Training logs (loss):
- v1 BERT: `./polished/models/bert/trainer/trainer_state_part1.json`
- v2 BERT with w2v: `./polished/models/v2bert/trainer/trainer_state.json`
- v2 BERT without w2v: `./polished/models/ka_only_no_w2v_bert/trainer/trainer_state.json`
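These are standard HuggingFace `Trainer` state files; the loss curve sits in their `log_history` list and can be read out directly:

```python
import json

with open("./polished/models/v2bert/trainer/trainer_state.json") as f:
    state = json.load(f)

# Logging entries that carry a training loss look like {"step": ..., "loss": ...}.
for entry in state["log_history"]:
    if "loss" in entry:
        print(entry["step"], entry["loss"])
```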
Troubleshooting:
`ka_nse_mil` in some notebooks can be replaced with any file from `./no_en_data/`. We are keeping it like this for a clean history.