
Zoe LLM

Warning

This project is in its experimental phase.

A program for training and testing LLM models written in Python, with a beginner-friendly structure.

Table of Contents

  1. Description
  2. Requirements
  3. Installation
  4. Configuration
  5. Testing

Description

Zoe LLM is a simple and modular toolkit for:

  • Downloading and preprocessing datasets
  • Building, training, and testing GPT-like models from scratch
  • Verbose logging (DEBUG=1) or a tqdm progress bar
  • An interactive chat pipeline for testing
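The DEBUG=1 toggle mentioned above can be pictured as a simple environment-variable check; this is a hypothetical sketch of the idea, not ZoeLLM's actual implementation:

```python
import os

def use_verbose_logging() -> bool:
    """DEBUG=1 in the environment selects per-step log lines over tqdm."""
    return os.environ.get("DEBUG", "0") == "1"

def iterate(steps):
    """Yield steps with verbose prints, or behind a tqdm progress bar."""
    if use_verbose_logging():
        for i, step in enumerate(steps):
            print(f"[DEBUG] step {i}: {step}")
            yield step
    else:
        try:
            from tqdm import tqdm  # progress bar, if installed
            steps = tqdm(steps)
        except ImportError:
            pass  # fall back to a plain loop
        yield from steps
```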

Requirements

Minimum Requirements

  • CPU with bfloat16 support and six cores (12 threads)
  • Memory: ≥8 GB RAM
  • Disk: ≥64 GB free for datasets and checkpoints
  • Python: 3.11

Recommended Requirements

  • CPU with bfloat16 support and twelve cores (24 threads)*
  • SSD for fast I/O
  • 64 GB RAM (depending on model size)
  • ≥256 GB of free disk space for datasets and checkpoints

*It is possible to use a GPU for training, but doing so requires some modifications to the code.

Installation

To install the necessary dependencies, clone the repository on the machine that will be used to build the model and set up a virtual environment (a Python 3.11 interpreter must be installed on the system):

```bash
git clone https://github.com/sandroXP2022/ZoeLLM.git
cd ZoeLLM
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Configuration

First, download the datasets used to build the model by running python utils/download_datasets.py

You can also import other datasets manually or modify the script to include more datasets ;)

After downloading the datasets, combine and process them before running the tokenizer. To do this, run python utils/preprocess.py; then build the tokenizer with the following scripts:

```bash
python utils/compile_unigram.py
python utils/tokenizer.py
```

We recommend testing the tokenizer with the included testing tool: python utils/test_tokenizer.py
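The essential property a tokenizer test verifies is that encoding and decoding round-trip without loss. A minimal sketch of that kind of check, using a hypothetical character-level Tokenizer class rather than the project's actual Unigram tokenizer:

```python
class Tokenizer:
    """Hypothetical character-level tokenizer, used only to illustrate
    the round-trip property a tokenizer test should verify."""

    def __init__(self, vocab):
        self.stoi = {ch: i for i, ch in enumerate(vocab)}  # char -> id
        self.itos = {i: ch for i, ch in enumerate(vocab)}  # id -> char

    def encode(self, text):
        return [self.stoi[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = Tokenizer(sorted(set("hello world")))
assert tok.decode(tok.encode("hello world")) == "hello world"  # lossless round-trip
```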

After completing the necessary steps for the tokenizer, you can now train your model using python utils/train.py. Adjust the parameters in the train.py, model.py, and chat.py files as needed.
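For orientation, a GPT-style training configuration typically looks something like the sketch below. The names and defaults here are hypothetical; the project's real parameters live in train.py and model.py.

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Hypothetical defaults for illustration only; check train.py and
    # model.py for the project's actual parameter names and values.
    n_layer: int = 6         # transformer blocks
    n_head: int = 6          # attention heads per block
    n_embd: int = 384        # embedding width
    block_size: int = 256    # context length in tokens
    batch_size: int = 32
    learning_rate: float = 3e-4
    max_iters: int = 5000
    dtype: str = "bfloat16"  # matches the bfloat16 CPU requirement

cfg = TrainConfig()
# Rough transformer parameter count, excluding embeddings: ~12 * L * d^2
approx_params = 12 * cfg.n_layer * cfg.n_embd ** 2
```

Larger values of n_layer and n_embd grow the parameter count (and RAM needs) quadratically in the embedding width, which is why the recommended requirements above scale with model size.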

Model training can take several hours, depending on your hardware and on dataset and model size.

Testing

To test the model, you can use python utils/chat.py to access the chat program that will interact with the model.
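Conceptually, the chat program is a read-generate-print loop. A minimal sketch of the idea, where chat_loop and the generate callback are hypothetical stand-ins for what utils/chat.py actually does:

```python
def chat_loop(model, generate, turns):
    """Feed each user turn to `generate`; stop on 'exit' or 'quit'.

    `model` and `generate` are hypothetical stand-ins for ZoeLLM's
    actual model object and sampling function.
    """
    replies = []
    for user_input in turns:
        if user_input.strip().lower() in {"exit", "quit"}:
            break
        reply = generate(model, user_input)
        print(reply)
        replies.append(reply)
    return replies

# Example with a trivial echo "model" in place of a trained one:
assert chat_loop(None, lambda m, t: t.upper(), ["hi", "exit"]) == ["HI"]
```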
