A simple and lightweight language model implementation using Python and the Hugging Face Transformers library. This project helps you fine-tune pre-trained language models (such as GPT-Neo and GPT-2) with your own data, allowing you to adapt these models to your specific domain or language while leveraging their existing knowledge.
- Simple command-line interface for text generation
- Fine-tuning capabilities with custom data
- Support for multiple pre-trained models (GPT-Neo, GPT-2, etc.)
- Configurable generation parameters
- Easy to use and modify
- Support for Python 3.9 or later
- Comparison between original and fine-tuned models
- Clone the repository:
git clone https://github.com/yourusername/mini-llm.git
cd mini-llm
- Create and activate a virtual environment:
python3 -m venv .venv
source .venv/bin/activate # On Unix/macOS
# or
.venv\Scripts\activate # On Windows
- Install dependencies:
pip install -r requirements.txt
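Optionally, you can verify the installation with a quick Python sanity check (run inside the activated virtual environment):

```python
# Optional sanity check: make sure the core packages import and report their versions.
import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```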
The model is trained using the text from `temp/data.txt`. You can download the default training data (Portuguese Bible) from:
https://paulo-storage.s3.us-east-1.amazonaws.com/ai/slm/data.txt
To download the file, you can use one of these commands:
# Create temp directory if it doesn't exist
mkdir -p temp
# Using curl
curl -o temp/data.txt https://paulo-storage.s3.us-east-1.amazonaws.com/ai/slm/data.txt
# Or using wget
wget -O temp/data.txt https://paulo-storage.s3.us-east-1.amazonaws.com/ai/slm/data.txt
You can also modify this file with any text you want to train the model on. After modifying `temp/data.txt`, you need to retrain the model using the `--train` flag.
To generate text based on a prompt:
# Generate text using your fine-tuned model (saved in temp/model/)
python3 main.py --generate "your prompt here"
# Generate text using the original pre-trained model (not fine-tuned)
python3 main.py --generate "your prompt here" --use-original
By default (without the `--use-original` flag), the system will use your fine-tuned model from the `temp/model/` directory. If you haven't trained a model yet, or want to compare results with the original pre-trained model, use the `--use-original` flag.
For example:
# Using your fine-tuned model
python3 main.py --generate "jesus disse"
# Using the original pre-trained model
python3 main.py --generate "jesus disse" --use-original
To compare the output between the original model and your fine-tuned version:
# Generate with your fine-tuned model
python3 main.py --generate "your prompt here"
# Generate with the original pre-trained model
python3 main.py --generate "your prompt here" --use-original
This is useful for seeing how your training data has influenced the model's output.
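In Transformers terms, the `--use-original` switch simply changes where the weights are loaded from: the fine-tuned copy saved in `temp/model/` or the original checkpoint from the Hugging Face Hub. The sketch below illustrates that distinction only; the project's actual loading code lives in `model/model_utils.py` and may differ.

```python
# Rough sketch of what --use-original decides (illustrative; see model/model_utils.py for the real code).
from transformers import AutoModelForCausalLM, AutoTokenizer

def load_model(use_original: bool, base_model: str = "EleutherAI/gpt-neo-2.7B"):
    # Fine-tuned weights are written to temp/model/ by training; the original comes from the Hub.
    path = base_model if use_original else "temp/model"
    tokenizer = AutoTokenizer.from_pretrained(path)
    model = AutoModelForCausalLM.from_pretrained(path)
    return tokenizer, model
```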
To train the model with new data:
python3 main.py --train
For a completely fresh training (clears all cached data):
python3 main.py --train --clean
Note: You must use the `--train` flag whenever you modify the `temp/data.txt` file to ensure the model learns from the new content.
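For reference, a causal-language-model fine-tuning run with the Transformers `Trainer` has roughly the shape sketched below. This is a simplified illustration of the technique, not the code in `training/trainer.py`; the base model, batch size, and evaluation split are assumptions, while the 20-epoch default and the `temp/model/` output directory come from this README.

```python
# Simplified sketch of a causal-LM fine-tuning run (illustrative; not training/trainer.py itself).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "gpt2"  # placeholder; the project's default base model is configurable (see below)
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # GPT-style models ship without a pad token
model = AutoModelForCausalLM.from_pretrained(base)

# Tokenize temp/data.txt and hold out a slice for evaluation (this produces the eval_loss shown below).
data = load_dataset("text", data_files={"train": "temp/data.txt"})["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])
splits = data.train_test_split(test_size=0.1)

args = TrainingArguments(
    output_dir="temp/model",         # where checkpoints and the final model are written
    num_train_epochs=20,             # README default; see MINI_LLM_EPOCHS below
    per_device_train_batch_size=4,   # assumption; tune for your hardware
    eval_strategy="epoch",           # called evaluation_strategy on older transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,     # keep the checkpoint with the lowest eval_loss
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM objective
)
trainer.train()
trainer.save_model("temp/model")
```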
During model training, you'll see various numbers and metrics displayed in the terminal. Think of these as the "vital signs" of your model's learning process. Even if you're new to machine learning, these metrics can help you understand if your model is learning well.
The output will look something like this:
{'loss': 2.5481, 'grad_norm': 2.6401376724243164, 'learning_rate': 1.538135593220339e-05, 'epoch': 7.01}
The Important Training Numbers (shown frequently):
- loss: Think of this as an "error score" - the lower, the better. It shows how well the model is learning from your data. For language models, values that start high (around 4-5) and gradually decrease to 2-3 indicate good learning progress.
- epoch: Shows how many times the model has gone through your entire dataset. For example, 7.01 means "just started the 7th round of training."
Advanced Metrics (for those who want to know more):
- grad_norm: Shows how dramatically the model is changing its internal knowledge with each update. Consistent values without sudden jumps are good.
- learning_rate: Controls how big of adjustments the model makes when learning. This number typically gets smaller as training progresses.
Evaluation Metrics (shown occasionally):
{'eval_loss': 3.1315, 'eval_runtime': 11.1366, 'eval_samples_per_second': 290.663, 'eval_steps_per_second': 72.733, 'epoch': 6.97}
- eval_loss: Similar to regular loss, but measured on data the model hasn't seen during training. This helps verify if the model is truly learning useful patterns rather than just memorizing.
You're on the right track if:
- The loss numbers are getting smaller over time - this is the most important indicator that training is working.
- The eval_loss is also decreasing - if only the training loss decreases but eval_loss doesn't, the model might just be memorizing your data.
- The loss eventually stabilizes - at some point, the numbers will stop decreasing significantly, which usually indicates the model has learned what it can.
Don't worry too much about the exact values - what matters is the trend. If the loss is generally decreasing, your model is learning!
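If you want a more intuitive reading of these numbers, the cross-entropy loss that Transformers reports can be converted to perplexity, roughly "how many tokens the model is hesitating between at each step" (lower is better):

```python
# Convert logged losses into perplexity for an intuitive read (values taken from the example logs above).
import math

train_loss = 2.5481
eval_loss = 3.1315

print(f"train perplexity ~ {math.exp(train_loss):.1f}")  # ~12.8
print(f"eval perplexity  ~ {math.exp(eval_loss):.1f}")   # ~22.9
```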
The system will automatically save the best version of your model (the one with the lowest eval_loss) to the `temp/model/` directory.
The model and generation parameters can be configured in `config/settings.py`:
MODEL_CONFIG = {
"max_length": 512, # Maximum sequence length
"temperature": 0.7, # Controls randomness
"top_p": 0.9, # Nucleus sampling
"top_k": 50, # Top-k sampling
# ... other parameters
}
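These names map directly onto standard arguments of the Transformers `generate()` method. As a rough sketch of how such settings are typically applied (illustrative only; the project's actual generation code is in `model/generation.py` and may differ):

```python
# Minimal sketch of sampling-based generation with these parameters
# (illustrative; not necessarily identical to model/generation.py).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("temp/model")  # the fine-tuned model directory
model = AutoModelForCausalLM.from_pretrained("temp/model")

inputs = tokenizer("your prompt here", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=512,   # maximum sequence length
    do_sample=True,   # enable sampling so temperature/top_p/top_k take effect
    temperature=0.7,  # lower = more focused, higher = more random
    top_p=0.9,        # nucleus sampling: keep the smallest token set covering 90% probability
    top_k=50,         # also restrict choices to the 50 most likely tokens
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```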
You can change the base model in two ways:
- Using environment variables:
# Linux/macOS
export MINI_LLM_MODEL="pierreguillou/gpt2-small-portuguese"
python3 main.py --generate "your prompt"
# Windows (cmd)
set MINI_LLM_MODEL=pierreguillou/gpt2-small-portuguese
python main.py --generate "your prompt"
# Windows (PowerShell)
$env:MINI_LLM_MODEL="pierreguillou/gpt2-small-portuguese"
python main.py --generate "your prompt"
- Editing the settings file: edit `config/settings.py` directly to change the default model:
MODEL_NAME = os.environ.get("MINI_LLM_MODEL", "EleutherAI/gpt-neo-2.7B")
The default model is `EleutherAI/gpt-neo-2.7B` if no environment variable is set.
You can customize the number of training epochs using an environment variable:
# Linux/macOS
export MINI_LLM_EPOCHS=5
python3 main.py --train
# Windows (cmd)
set MINI_LLM_EPOCHS=5
python main.py --train
# Windows (PowerShell)
$env:MINI_LLM_EPOCHS=5
python main.py --train
The default is 20 epochs if not specified. For models already pre-trained in your target language, fewer epochs (1-5) may be sufficient.
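Internally this follows the same pattern as `MINI_LLM_MODEL`: the value is read from the environment with a fallback default. A small sketch of what that typically looks like (the exact variable names in the code are assumptions):

```python
# Sketch: read the epoch count from the environment, defaulting to 20
# (mirrors the MINI_LLM_MODEL pattern in config/settings.py; exact internals may differ).
import os

NUM_EPOCHS = int(os.environ.get("MINI_LLM_EPOCHS", "20"))
# ...later passed to TrainingArguments(num_train_epochs=NUM_EPOCHS, ...)
```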
.
├── config/
│ └── settings.py # Configuration settings
├── model/
│ ├── generation.py # Text generation logic
│ └── model_utils.py # Model loading utilities
├── training/
│ └── trainer.py # Training logic
├── utils/
│ ├── device.py # Device utilities
│ └── file.py # File operations utilities
├── temp/ # Temporary files directory
│ ├── data.txt # Training data
│ ├── model/ # Saved model
│ ├── logs/ # Training logs
│ ├── tokenizer_cache/ # Tokenizer cache
│ └── huggingface/ # Hugging Face cache
├── main.py # Main application file
├── backup_to_s3.py # S3 backup utility script
├── requirements.txt # Project dependencies
└── README.md # Project documentation
The project relies on the following Python packages:
- torch: PyTorch deep learning framework that provides tensor computation and neural networks
- transformers: Hugging Face library that provides pre-trained models and utilities for natural language processing
- accelerate: Library that enables easy use of distributed training on multiple GPUs/TPUs
- scipy: Scientific computing library used for various mathematical operations
- datasets: Hugging Face library for managing and processing training datasets
- black: Code formatter used for maintaining consistent code style
- packaging: Utilities for parsing and comparing version numbers
- psutil: Cross-platform library for retrieving system information (memory, CPU)
- boto3: AWS SDK for Python, used by the backup script to interact with S3
The project supports a wide range of models from Hugging Face. Here are some popular options; each model is listed with its Hugging Face identifier followed by its parameter count in parentheses (e.g., `model-name` (size)):
Small models:
- `distilgpt2` (334M) – Fastest, works well on mobile devices
- `gpt2` (124M) – Good balance of size and performance
- `microsoft/phi-1_5` (1.3B) – Microsoft's efficient model
- `TinyLlama/TinyLlama-1.1B` (1.1B) – Efficient Llama variant
- `facebook/opt-125m` (125M) – Meta's efficient model
- `stabilityai/stablelm-2-1_6b` (1.6B) – StableLM 2.0
- `Qwen/Qwen-1_5B` (1.5B) – Alibaba's efficient model
- `google/gemma-2b` (2B) – Google's lightweight model
- `microsoft/phi-3-mini` (1.8B) – Microsoft's small Phi-3 model
- `mistralai/Mistral-2B-v0.1` (2B) – Mistral's small model
Medium models:
- `microsoft/phi-2` (2.7B) – Microsoft's Phi model
- `stabilityai/stablelm-2-3b` (3B) – StableLM 2.0
- `Qwen/Qwen-4B` (4B) – Alibaba's medium model
- `mistralai/Mistral-7B-v0.1` (7B) – High performance
- `meta-llama/Llama-2-7b` (7B) – Meta's Llama 2
- `tiiuae/falcon-7b` (7B) – TII's Falcon
- `google/gemma-7b` (7B) – Google's Gemma model
- `meta-llama/Llama-3-8b` (8B) – Meta's Llama 3
- `microsoft/phi-3` (3B) – Microsoft's next-gen Phi
Large models:
- `mistralai/Mixtral-8x7B-v0.1` (47B) – Mixture of Experts
- `meta-llama/Llama-2-13b` (13B) – Meta's larger Llama 2
- `meta-llama/Llama-2-70b` (70B) – Meta's largest Llama 2
- `Qwen/Qwen-7B` (7B) – Alibaba's large model
- `Qwen/Qwen-14B` (14B) – Alibaba's larger model
- `stabilityai/stablelm-2-12b` (12B) – StableLM 2.0
- `meta-llama/Llama-3-70b` (70B) – Meta's largest Llama 3
- `Qwen/Qwen-20B` (20B) – Alibaba's extra large model
- `mistralai/Mixtral-v2` (55B) – Mistral's next-gen MoE
Multilingual, chat, and specialized models:
- `bigscience/bloom-560m` (560M) – Multilingual
- `bigscience/bloom-1b7` (1.7B) – Multilingual
- `THUDM/chatglm2-6b` (6B) – Chinese-optimized
- `fnlp/moss-moon-003-sft` (7B) – Chinese-optimized
- `Qwen/Qwen-7B-Chat` (7B) – Chat-optimized
- `stabilityai/stablelm-2-12b-chat` (12B) – Chat-optimized
- `mistralai/Mistral-8B-Instruct-v0.2` (8B) – Mistral's instruct model
- `meta-llama/Llama-3-8b-chat` (8B) – Meta's Llama 3 chat
- `google/gemma-7b-instruct` (7B) – Google's instruction model
- `microsoft/phi-2-vision` (2.7B) – Microsoft's vision-language model
- `Anthropic/Claude-Next-7B` (7B) – Anthropic's Claude Next
Recommended for mobile and low-resource devices:
- `distilgpt2` (334M) – Best for iPhone/mobile
- `microsoft/phi-1_5` (1.3B) – Excellent quality-to-size ratio
- `TinyLlama/TinyLlama-1.1B` (1.1B) – Optimized for mobile
- `google/gemma-2b` (2B) – Good performance on mobile
- `stabilityai/stablelm-2-1_6b` (1.6B) – Modern architecture
Portuguese language models:
- `pierreguillou/gpt2-small-portuguese` (124M) – GPT-2 trained specifically for Portuguese
- `unicamp-dl/gpt-neox-pt-small` (125M) – Brazilian model from Unicamp
- `pierreguillou/gpt2-small-portuguese-blog` (124M) – Optimized for blog content
- `neuralmind/bert-large-portuguese-cased` (354M) – BERT for Portuguese
- `jotape/bert-pt-br` (110M) – BERT model for Brazilian Portuguese
- `rufimelo/portuguese-gpt2-large` (774M) – Larger Portuguese model
- `nauvu/brazilian-legal-bert` (110M) – Specialized for Brazilian legal texts
Note: Model availability and compatibility may change. Check the model's documentation on Hugging Face for specific requirements and limitations.
When training on hardware with limited GPU memory (less than 2GB VRAM), you may encounter "CUDA out of memory" errors. The project provides two solutions:
The most reliable solution is to use the `--use-cpu` flag, which forces the model to run on the CPU instead of the GPU:
python main.py --train --use-cpu
This approach:
- Bypasses GPU memory limitations completely
- Uses system RAM instead of VRAM
- Works with any model size
- Is significantly slower than GPU training (the main trade-off)
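Conceptually, `--use-cpu` just overrides the usual device auto-detection. A rough sketch of that kind of logic (illustrative; the project's actual implementation lives in `utils/device.py` and may differ):

```python
# Rough sketch of device selection with a CPU override
# (illustrative; utils/device.py may implement this differently).
import torch

def pick_device(use_cpu: bool = False) -> torch.device:
    if use_cpu:
        return torch.device("cpu")   # bypass VRAM limits entirely
    if torch.cuda.is_available():
        return torch.device("cuda")  # use the GPU when one is present
    return torch.device("cpu")       # otherwise fall back to CPU

device = pick_device(use_cpu=True)   # what --train --use-cpu effectively requests
```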
Instead of using the default 2.7B parameter model, you can switch to a smaller model that fits in your GPU's memory:
# Set a smaller model (124M parameters)
export MINI_LLM_MODEL="gpt2"
# Then train as usual
python main.py --train
Recommended models for low-memory GPUs (2-4GB VRAM):
- `gpt2` (124M)
- `distilgpt2` (82M)
- `facebook/opt-125m` (125M)
- `bigscience/bloom-560m` (560M)
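A back-of-the-envelope calculation explains the recommendation: in 32-bit precision the weights alone take about 4 bytes per parameter, and training typically needs roughly four times that once gradients and Adam optimizer state are included, before counting activations. A quick sketch of the arithmetic:

```python
# Rough VRAM estimate for full fine-tuning in fp32: weights + gradients + Adam moments ~= 16 bytes/param
# (a rule-of-thumb approximation; activations and framework overhead come on top).
def rough_training_gb(params: float, bytes_per_param: int = 4, copies: int = 4) -> float:
    return params * bytes_per_param * copies / 1e9

for name, params in [("gpt2 (124M)", 124e6), ("EleutherAI/gpt-neo-2.7B", 2.7e9)]:
    print(f"{name}: ~{rough_training_gb(params):.1f} GB before activations")
# gpt2 lands around 2 GB, while the 2.7B default needs tens of gigabytes;
# hence the smaller models (or --use-cpu) on low-memory GPUs.
```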
The project includes a Python script (`backup_to_s3.py`) to easily back up your model, training data, and configuration to Amazon S3. This is useful for preserving your work and transferring it between different environments.
- Python 3
- Boto3 package installed (`pip install boto3`)
- AWS credentials configured
- Access to an S3 bucket
Install the necessary dependencies for the backup script:
pip install boto3
The backup script requires the following environment variables:
# AWS Credentials (Required)
export AWS_ACCESS_KEY_ID="your_access_key"
export AWS_SECRET_ACCESS_KEY="your_secret_key"
export AWS_DEFAULT_REGION="your_region"
export S3_BUCKET_NAME="your-bucket-name"
# Backup Configuration (Optional - shown with default values)
export S3_BUCKET_PATH="mini-llm-backup/" # Path within the bucket (defaults to "mini-llm-backup/")
export FOLDER_TO_BACKUP="./temp" # Local folder to backup (defaults to "./temp")
You only need to set these optional variables if you want to override the default values.
Once you've set the environment variables, run the backup script:
python3 backup_to_s3.py
The script will:
- Create a timestamped tar.gz archive of your specified folder
- Upload it to your S3 bucket with public read access
- Clean up temporary files
- Output a public URL that can be shared or used for downloads
Note: The backup file will be publicly accessible via the generated URL. Make sure you don't include sensitive information if you're backing up to a public-facing bucket.
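For reference, the core of such a backup flow with `boto3` and `tarfile` looks roughly like the sketch below. This is a hedged illustration of the approach, not the actual contents of `backup_to_s3.py`; the bucket, path, and folder values come from the environment variables described above.

```python
# Rough sketch of a timestamped tar.gz backup uploaded to S3 with public-read access
# (illustrative only; backup_to_s3.py may differ in its details).
import os
import tarfile
from datetime import datetime

import boto3

bucket = os.environ["S3_BUCKET_NAME"]
prefix = os.environ.get("S3_BUCKET_PATH", "mini-llm-backup/")
folder = os.environ.get("FOLDER_TO_BACKUP", "./temp")

archive = f"backup_{datetime.now():%Y-%m-%d_%H-%M-%S}.tar.gz"
with tarfile.open(archive, "w:gz") as tar:  # create the timestamped archive
    tar.add(folder, arcname=os.path.basename(folder))

key = prefix + archive
boto3.client("s3").upload_file(archive, bucket, key,
                               ExtraArgs={"ACL": "public-read"})  # public read access
os.remove(archive)  # clean up the local temporary file

print(f"https://{bucket}.s3.amazonaws.com/{key}")  # shareable download URL
```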
After running the backup script, you'll receive a public URL that looks like:
https://your-bucket-name.s3.amazonaws.com/mini-llm-backup/backup_2023-05-08_12-34-56.tar.gz
You can share this URL with others or use it to download your backup from any machine without AWS credentials.
To restore your data from an S3 backup, you have two options:
# Download the backup using the public URL
wget https://your-bucket-name.s3.amazonaws.com/mini-llm-backup/your-backup-file.tar.gz
# or
curl -O https://your-bucket-name.s3.amazonaws.com/mini-llm-backup/your-backup-file.tar.gz
# Extract the archive
rm -rf temp/
tar -xzvf your-backup-file.tar.gz -C .
# Download the backup archive using AWS CLI
aws s3 cp s3://your-bucket-name/mini-llm-backup/your-backup-file.tar.gz .
# Extract the archive
rm -rf temp/
tar -xzvf your-backup-file.tar.gz -C .
This will restore your model, training data, and other files to the temp/ directory of your project.
Copyright (c) 2025, Paulo Coutinho