This repository contains a set of Transformer-based language models fine-tuned on a dataset of Russian jokes (anecdotes). The models are designed to generate humorous and coherent Russian text. The repository includes three versions of the model: `nano`, `mini`, and `small`, each with a different architecture and training configuration. Additionally, a custom Byte-level BPE tokenizer, trained on the Russian jokes dataset, is provided.
The models are based on the Transformer architecture, enhanced with several advanced techniques:
- Positional Embeddings: ALiBi (Attention with Linear Biases) and RoPE (Rotary Positional Embeddings) are used for positional encoding.
- Attention Mechanism: Grouped-Query Attention (GQA) and Multi-Head Latent Attention (MHLA) are employed to improve efficiency and performance.
- Activation Function: SwiGLU activation is used in the feed-forward layers (a minimal sketch is shown after this list).
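
As an illustration of the feed-forward design, here is a minimal PyTorch sketch of a SwiGLU block. The class and parameter names (`SwiGLU`, `hidden_dim`, `ff_dim`) are hypothetical and may differ from the implementation in the notebook.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Feed-forward block with the SwiGLU activation (illustrative sketch)."""
    def __init__(self, hidden_dim: int, ff_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden_dim, ff_dim, bias=False)  # gating projection
        self.w_up = nn.Linear(hidden_dim, ff_dim, bias=False)    # value projection
        self.w_down = nn.Linear(ff_dim, hidden_dim, bias=False)  # projection back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU(x) = (SiLU(x W_gate) * (x W_up)) W_down
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```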
Three versions of the model are available:
- Nano: 3 layers, 4 heads, 96 hidden dimensions.
- Mini: 6 layers, 6 heads, 384 hidden dimensions. Trained with RoPE and MHLA.
- Small: 12 layers, 12 heads, 768 hidden dimensions. Trained with RoPE and MHLA.
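
For quick reference, the three configurations can be summarized as follows. The dictionary and field names below are purely illustrative and are not the exact arguments used in the notebook.

```python
# Illustrative summary of the three model sizes (field names are hypothetical).
MODEL_CONFIGS = {
    "nano":  dict(n_layers=3,  n_heads=4,  hidden_dim=96),
    "mini":  dict(n_layers=6,  n_heads=6,  hidden_dim=384),   # trained with RoPE and MHLA
    "small": dict(n_layers=12, n_heads=12, hidden_dim=768),   # trained with RoPE and MHLA
}
```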
The models were trained on the `IgorVolochay/russian_jokes` dataset.
Key training parameters include:
- Epochs: The number of full passes over the dataset was controlled by the `n_step` parameter in the Trainer initialization. The nano and mini models were trained for 1 epoch each; the small model was trained for 6 epochs.
- Batch Size: 32 for the nano and mini models, 64 for the small model.
- Learning Rate: 5e-4 with cosine decay for the small model, 3e-4 for the nano and mini models.
- Loss Function: Cross-entropy loss was used for training (a training-step sketch is shown after this list).
- Hardware: Training was conducted on an NVIDIA A100 GPU via Google Colab.
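
To make these settings concrete, the snippet below sketches one possible training step with a cosine learning-rate schedule and cross-entropy loss. The optimizer (AdamW), the scheduler class, and the names `model`, `train_loader`, `n_steps`, and `device` are assumptions for the example, not the exact code from the notebook.

```python
import torch
import torch.nn.functional as F

# Assumptions: `model`, `train_loader`, `n_steps`, and `device` come from your setup;
# AdamW is an illustrative optimizer choice, not confirmed by the repository.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)  # 5e-4 for small, 3e-4 for nano/mini
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_steps)

for batch in train_loader:
    input_ids = batch["input_ids"].to(device)        # (batch, seq_len) token ids
    logits = model(input_ids[:, :-1])                # assumed to return (batch, seq_len - 1, vocab) logits
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),         # flatten predictions
        input_ids[:, 1:].reshape(-1),                # next-token targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```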
The performance of each model is summarized below:
| Model | Training Loss (min) | Validation Loss (min) |
|---|---|---|
| Nano | 3.784 | 3.932 |
| Mini | 3.127 | 3.144 |
| Small | 2.933 | 3.025 |
Training and validation loss curves for each model are provided below:
You can load the models and tokenizer from the Hugging Face Hub using the following code:
```python
# TransformerForCausalLM and ByteLevelBPETokenizer are the custom classes from this
# repository's notebook; define or import them before running this snippet.

# Small model (use revision="main" for the nano model or revision="mini" for the mini model)
model_small = TransformerForCausalLM.from_pretrained("estnafinema0/russian-jokes-generator", revision="small")
tokenizer = ByteLevelBPETokenizer.from_pretrained("estnafinema0/russian-jokes-generator")
```
To generate text using the model, you can use the following code:
```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # assumption: use a GPU when available
model_small = model_small.to(device)  # move the model to the same device as the inputs

text = "Штирлиц пришел домой"
input_ids = torch.tensor(tokenizer.encode(text), device=device)
model_output = model_small.generate(
    input_ids[None, :], max_new_tokens=200, eos_token_id=tokenizer.eos_token_id, do_sample=True, top_k=10
)
print(tokenizer.decode(model_output[0].tolist()))
```
Here are some examples of jokes generated by the `small` model:

- Input: "Пришел Петя в баню и говорит"
  Output: "Пришел Петя в баню и говорит - Василий Иванович, вы знаете, кто я - Петя, или Петя? - Ахааха, и я - Ахаилая, я - Ахаил! - А какая Петя? - Я - Ахаилая! - Ну и я, когда я банкрот, банкротство, конечно..."
- Input: "Вышел как-то на крыльцо"
  Output: "Вышел как-то на крыльцо, а там плачет. Стукнулся: упал, выпал. Плачет – упал."
- Input: "Священник задает ребёнку вопрос"
  Output: "Священник задает ребёнку вопрос ему на ухо:- Что, братан, опять несёл?- Братан, ты что, братан, охуел?"
The repository is organized as follows:
- Models: Three versions of the model (`nano`, `mini`, `small`) are available in different branches:
  - `main`: Nano model.
  - `mini`: Mini model.
  - `small`: Small model.
- Tokenizer: A custom Byte-level BPE tokenizer trained on the Russian jokes dataset.
- Jupyter Notebook: A detailed notebook containing the implementation, training, and evaluation of the models.
The repository includes a Jupyter Notebook (`russian_jokes_generator.ipynb`) that provides a step-by-step guide to:
- Training the tokenizer (an illustrative sketch is shown below).
- Implementing and training the Transformer models.
- Evaluating the models and generating text.
You can find the notebook in the repository and run it locally or in Google Colab.
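
The repository ships its own `ByteLevelBPETokenizer` implementation, but for illustration the sketch below shows how a comparable byte-level BPE tokenizer could be trained on the same dataset with the Hugging Face `tokenizers` and `datasets` libraries. The `text` column name, the `train` split, the vocabulary size, and the special tokens are assumptions, not the values used in the notebook.

```python
import os

from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer  # Hugging Face tokenizers library, not the repo's custom class

# Load the same jokes dataset used for training (split and column name are assumptions).
dataset = load_dataset("IgorVolochay/russian_jokes", split="train")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (row["text"] for row in dataset),
    vocab_size=1024,                          # illustrative size, not the repo's setting
    min_frequency=2,
    special_tokens=["<s>", "</s>", "<pad>"],  # illustrative special tokens
)

os.makedirs("russian_jokes_tokenizer", exist_ok=True)
tokenizer.save_model("russian_jokes_tokenizer")  # writes vocab.json and merges.txt
```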
P.S. The notebook is currently available in Russian.
This project is licensed under the Apache 2.0 License. See the LICENSE file for more details.
Thank you for your time!