This repository provides the code used to finetune a Large Language Model (LLM) for Moroccan Darija. Specifically, we finetuned an Arabic version of LLaMA-2 7B using parameter-efficient finetuning techniques, namely QLoRA with 4-bit quantization. The code also supports other base models, including LLaMA, Noon, Arabic-specific GPT models, and any other model available on Hugging Face. The finetuning was performed on a single A100-40GB GPU.
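For reference, QLoRA combines 4-bit NF4 quantization of the frozen base weights (via bitsandbytes) with small trainable LoRA adapters (via peft). The snippet below is a minimal sketch of that setup, not the exact code from this repository; the checkpoint name and LoRA hyperparameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "your-arabic-llama2-7b-checkpoint"  # placeholder, not the actual checkpoint used here

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Attach small trainable LoRA adapters; the quantized base weights stay frozen.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```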
- Clone this repository:

  ```bash
  git clone https://github.com/BounharAbdelaziz/LLM-Finetuned-Morocco-Darija.git
  cd LLM-Finetuned-Morocco-Darija
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up Hugging Face authentication:

  ```bash
  huggingface-cli login
  ```
- Adjust the `utils.py` file to define the `get_save_dir_path` and `clean_dataset` functions according to your dataset structure (see the sketch after this list).
- Redefine the hyperparameters based on your compute configuration (an example configuration is also sketched after this list).
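As a rough illustration of these last two steps, the two helpers in `utils.py` might follow signatures like the ones below. These are assumptions made for the example, not the repository's actual code, so adapt the argument names and return values to the real function definitions in the file:

```python
import os
import re

from datasets import Dataset


def get_save_dir_path(model_name: str, output_root: str = "./runs") -> str:
    """Build the directory where checkpoints for this run are saved (illustrative)."""
    path = os.path.join(output_root, model_name.replace("/", "_"))
    os.makedirs(path, exist_ok=True)
    return path


def clean_dataset(dataset: Dataset, text_column: str = "text") -> Dataset:
    """Apply basic text cleaning to every example (illustrative)."""
    def _clean(example):
        text = re.sub(r"\s+", " ", example[text_column]).strip()  # collapse whitespace
        return {text_column: text}
    return dataset.map(_clean)
```

For the hyperparameters, the compute-dependent knobs are typically the batch size, gradient accumulation, precision, and gradient checkpointing. The values below are a plausible starting point for a single A100-40GB, not the exact settings used for this project:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./runs/llama-7b-darija",   # e.g. the path returned by get_save_dir_path(...)
    per_device_train_batch_size=4,         # lower this on GPUs with less memory
    gradient_accumulation_steps=8,         # effective batch size = 4 * 8 = 32
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                             # A100 GPUs support bfloat16
    gradient_checkpointing=True,           # trades extra compute for lower memory use
    logging_steps=50,
    save_strategy="epoch",
)
```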
Run the training script with the following command:

```bash
python train.py --model_name "Llama-7B"
```

- `--model_name`: selects the model to fine-tune (e.g., `"Llama-7B"`).
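Internally, the `--model_name` argument is presumably used as a key into the `MODEL_PATHS` dictionary to resolve which base checkpoint to load. A minimal sketch of that flow, with a placeholder dictionary entry rather than the project's real paths:

```python
import argparse

# Placeholder entry; the real mapping lives in train.py.
MODEL_PATHS = {
    "Llama-7B": "hub-id-or-local-path-of-arabic-llama2-7b",
}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_name",
    type=str,
    default="Llama-7B",
    help="Key into MODEL_PATHS selecting the base model to fine-tune",
)
args = parser.parse_args()

base_checkpoint = MODEL_PATHS[args.model_name]
print(f"Fine-tuning {args.model_name} from {base_checkpoint}")
```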
The repository is organized as follows:

```
LLM-Finetuned-Morocco-Darija/
├── train.py           # Main training script
├── test.py            # Testing script
├── utils.py           # Utility functions for data cleaning and path setup
├── requirements.txt   # Python dependencies
└── README.md          # Project documentation
```
Modify the `MODEL_PATHS` dictionary in `train.py` to add or replace model configurations.
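For example, supporting an additional base model is typically just one more entry in that dictionary; the identifiers below are placeholders, not checkpoints the project necessarily uses:

```python
# In train.py (illustrative entries only)
MODEL_PATHS = {
    "Llama-7B": "hub-id-or-local-path-of-arabic-llama2-7b",
    "Noon": "hub-id-or-local-path-of-noon",
    "MyNewModel": "your-org/your-new-base-model",  # newly added entry
}
```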
Update the `clean_dataset()` function in `utils.py` to adjust the preprocessing logic, such as removing unwanted characters or handling Arabizi formats.
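As an illustration (not the repository's actual implementation), such preprocessing often reduces to a few regular expressions, for example keeping Arabic script, Latin letters, and digits so that both Arabic-script Darija and Arabizi survive while emojis and stray symbols are dropped:

```python
import re


def clean_text(text: str) -> str:
    """Illustrative cleaning step: keep Arabic letters, Latin letters, digits, and basic punctuation."""
    text = re.sub(r"[^\u0600-\u06FFa-zA-Z0-9\s.,!?'-]", " ", text)  # drop unwanted symbols
    text = re.sub(r"\s+", " ", text)                                # collapse repeated whitespace
    return text.strip()


print(clean_text("wach nta mzyan؟ 😅✨  labas"))  # -> "wach nta mzyan؟ labas"
```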
This project is part of ongoing efforts to advance Moroccan Darija NLP, leveraging state-of-the-art machine learning techniques. Thanks to the open-source AI community and Hugging Face for providing valuable resources.