LangSAMP: Language-Script Aware Multilingual Pretraining
Updated Sep 30, 2024 · Python
A comprehensive toolkit for end-to-end continued pre-training, fine-tuning, monitoring, testing and publishing of language models with MLX-LM
Adapting LLMs to the medical domain through SFT, RAG, and multistep fine-tuning to enhance domain knowledge and performance.
Master's thesis repository
This project evaluates continued pre-training of Llama 3.2 3B for Serbian, using a custom-made cloze-style benchmark. It supports grammatical, lexical, semantic, idiomatic, and factual sentence-completion tasks. The evaluation script calculates model accuracy based on log-likelihood scoring over masked token choices.
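The log-likelihood scoring described above can be sketched roughly as follows. This is a minimal illustration assuming a Hugging Face causal LM; the checkpoint name, the "___" blank marker, the helper function names, and the example item are hypothetical and are not taken from the repository's code.

# Minimal sketch: cloze-style evaluation via log-likelihood scoring over choices.
# Assumptions (not from the repository): a Hugging Face causal LM and benchmark
# items consisting of a template with a blank plus a list of candidate fills.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.2-3B"  # illustrative checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def sentence_log_likelihood(sentence: str) -> float:
    """Sum of token log-probabilities the model assigns to the full sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Shift so that logits at position t are scored against the token at t+1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_ll.sum().item()

def pick_completion(template: str, choices: list[str]) -> str:
    """Fill the blank with each candidate and return the highest-scoring one."""
    scores = [sentence_log_likelihood(template.replace("___", c)) for c in choices]
    return choices[scores.index(max(scores))]

# Hypothetical benchmark item: a prediction counts as correct if it matches the gold choice.
item = {"template": "Pada kiša, ponesi ___.", "choices": ["kišobran", "sunce"], "gold": "kišobran"}
correct = pick_completion(item["template"], item["choices"]) == item["gold"]

In practice, summed log-likelihood favors shorter candidates, so implementations often normalize by the number of candidate tokens; the sketch keeps the raw sum for simplicity.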