This repository provides the code used to finetune a Large Language Model (LLM) for Moroccan Darija. Specifically, we finetuned an Arabic version of LLaMA-2 7B using parameter-efficient finetuning techniques, namely QLoRA with 4-bit quantization. The code also supports other base models, including LLaMA, Noon, Arabic-specific GPT models, and any other model available on Hugging Face. The finetuning was performed on a single A100-40GB GPU.
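For reference, QLoRA combines 4-bit NF4 quantization of the frozen base weights (via bitsandbytes) with small trainable LoRA adapters (via peft). The snippet below is a minimal sketch of that setup, not the exact code from this repository; the checkpoint name and LoRA hyperparameters are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE_MODEL = "your-arabic-llama2-7b-checkpoint"  # placeholder, not the actual checkpoint used here

# Load the base model with 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Attach small trainable LoRA adapters; the quantized base weights stay frozen.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```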
- Clone this repository:

  ```bash
  git clone https://github.com/BounharAbdelaziz/LLM-Finetuned-Morocco-Darija.git
  cd LLM-Finetuned-Morocco-Darija
  ```

- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Set up Hugging Face authentication:

  ```bash
  huggingface-cli login
  ```
- Adjust the `utils.py` file to define the `get_save_dir_path` and `clean_dataset` functions according to your dataset structure (see the sketch after this list).
- Redefine the hyperparameters based on your compute configuration (an example configuration is also sketched after this list).
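As a rough illustration of these last two steps, the two helpers in `utils.py` might follow signatures like the ones below. These are assumptions made for the example, not the repository's actual code, so adapt the argument names and return values to the real function definitions in the file:

```python
import os
import re

from datasets import Dataset


def get_save_dir_path(model_name: str, output_root: str = "./runs") -> str:
    """Build the directory where checkpoints for this run are saved (illustrative)."""
    path = os.path.join(output_root, model_name.replace("/", "_"))
    os.makedirs(path, exist_ok=True)
    return path


def clean_dataset(dataset: Dataset, text_column: str = "text") -> Dataset:
    """Apply basic text cleaning to every example (illustrative)."""
    def _clean(example):
        text = re.sub(r"\s+", " ", example[text_column]).strip()  # collapse whitespace
        return {text_column: text}
    return dataset.map(_clean)
```

For the hyperparameters, the compute-dependent knobs are typically the batch size, gradient accumulation, precision, and gradient checkpointing. The values below are a plausible starting point for a single A100-40GB, not the exact settings used for this project:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./runs/llama-7b-darija",   # e.g. the path returned by get_save_dir_path(...)
    per_device_train_batch_size=4,         # lower this on GPUs with less memory
    gradient_accumulation_steps=8,         # effective batch size = 4 * 8 = 32
    learning_rate=2e-4,
    num_train_epochs=3,
    bf16=True,                             # A100 GPUs support bfloat16
    gradient_checkpointing=True,           # trades extra compute for lower memory use
    logging_steps=50,
    save_strategy="epoch",
)
```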
Run the training script with the following command:

```bash
python train.py --model_name "Llama-7B"
```

- `--model_name`: selects the model to fine-tune (e.g., `"Llama-7B"`).
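Internally, the `--model_name` argument is presumably used as a key into the `MODEL_PATHS` dictionary to resolve which base checkpoint to load. A minimal sketch of that flow, with a placeholder dictionary entry rather than the project's real paths:

```python
import argparse

# Placeholder entry; the real mapping lives in train.py.
MODEL_PATHS = {
    "Llama-7B": "hub-id-or-local-path-of-arabic-llama2-7b",
}

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model_name",
    type=str,
    default="Llama-7B",
    help="Key into MODEL_PATHS selecting the base model to fine-tune",
)
args = parser.parse_args()

base_checkpoint = MODEL_PATHS[args.model_name]
print(f"Fine-tuning {args.model_name} from {base_checkpoint}")
```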
The repository is organized as follows:

```
LLM-Finetuned-Morocco-Darija/
├── train.py           # Main training script
├── test.py            # Testing script
├── utils.py           # Utility functions for data cleaning and path setup
├── requirements.txt   # Python dependencies
└── README.md          # Project documentation
```
Modify the `MODEL_PATHS` dictionary in `train.py` to add or replace model configurations.
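For example, supporting an additional base model is typically just one more entry in that dictionary; the identifiers below are placeholders, not checkpoints the project necessarily uses:

```python
# In train.py (illustrative entries only)
MODEL_PATHS = {
    "Llama-7B": "hub-id-or-local-path-of-arabic-llama2-7b",
    "Noon": "hub-id-or-local-path-of-noon",
    "MyNewModel": "your-org/your-new-base-model",  # newly added entry
}
```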
Update the `clean_dataset()` function in `utils.py` to adjust the preprocessing logic, such as removing unwanted characters or handling Arabizi formats.
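As an illustration (not the repository's actual implementation), such preprocessing often reduces to a few regular expressions, for example keeping Arabic script, Latin letters, and digits so that both Arabic-script Darija and Arabizi survive while emojis and stray symbols are dropped:

```python
import re


def clean_text(text: str) -> str:
    """Illustrative cleaning step: keep Arabic letters, Latin letters, digits, and basic punctuation."""
    text = re.sub(r"[^\u0600-\u06FFa-zA-Z0-9\s.,!?'-]", " ", text)  # drop unwanted symbols
    text = re.sub(r"\s+", " ", text)                                # collapse repeated whitespace
    return text.strip()


print(clean_text("wach nta mzyan؟ 😅✨  labas"))  # -> "wach nta mzyan؟ labas"
```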
This project is part of ongoing efforts to advance Moroccan Darija NLP, leveraging state-of-the-art machine learning techniques. Thanks to the open-source AI community and Hugging Face for providing valuable resources.