LeCarnet is a text dataset of 2 million children's stories in French, written with a very simple vocabulary and inspired by the TinyStories dataset. The goal of this work is to provide a reliable, high-quality resource for training and evaluating small language models (SLMs) in French, aimed at educational and experimental use. This repository contains minimal code for data generation, training, evaluation, and inference.
This dataset was created by synthetically generating French short stories using Mistral-Large-Instruct-2411.
The dataset (`MaxLSB/LeCarnet`) and the models (e.g. `MaxLSB/LeCarnet-3M`) are available on Hugging Face.
This project uses `uv` for fast and reliable dependency management.

```bash
# Basic environment setup
make env
```
That's it, you can now run any command you want!
If `uv` is not already installed, `make env` will install it for you by running:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```
The training pipeline supports Weights & Biases (WandB) for tracking training and validation losses, as well as perplexity.
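For reference, this logging roughly amounts to calling `wandb.log` at each step with the current losses and perplexity. The sketch below illustrates that pattern; the project name, metric names, and dummy values are assumptions, not the exact ones used in `src/train/train.py`:

```python
import math

import wandb

# Hypothetical project/metric names; the real ones are defined in src/train/train.py.
wandb.init(project="lecarnet", config={"model_config": "3M"})

for step, (train_loss, val_loss) in enumerate([(2.31, 2.15), (1.98, 1.90)]):
    wandb.log(
        {
            "train/loss": train_loss,
            "val/loss": val_loss,
            # Perplexity is the exponential of the per-token cross-entropy loss.
            "val/perplexity": math.exp(val_loss),
        },
        step=step,
    )

wandb.finish()
```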
| Task | Make Command | Equivalent CLI Command | Default Values |
|---|---|---|---|
| Training | `make train` | `python src/train/train.py --model_config MODEL_CONFIG` | `MODEL_CONFIG=3M` |
| Push Model to HF | `make push-model` | `python src/inference/push-model.py --repo_name HF_REPO --model_dir MODEL_DIR` | `HF_REPO=MaxLSB/LeCarnet-3M`, `MODEL_DIR=LeCarnet-3M/model_weights/` |
Edit `src/train/configs.py` for fine-grained hyperparameter tuning, and set `MODEL_CONFIG=custom` to use your own model config, as sketched below.
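A custom config is typically just a small bundle of architecture and optimizer hyperparameters. The following is a minimal sketch of the idea; the field names and the `custom` registration are assumptions, not the actual contents of `src/train/configs.py`:

```python
from dataclasses import dataclass


@dataclass
class CustomConfig:
    # Hypothetical hyperparameters for a tiny GPT-style model;
    # rename the fields to whatever src/train/configs.py actually expects.
    vocab_size: int = 8192
    hidden_size: int = 128
    num_layers: int = 4
    num_heads: int = 4
    max_position_embeddings: int = 512
    learning_rate: float = 3e-4


# Exposing the config under a "custom" key would let MODEL_CONFIG=custom resolve to it.
CONFIGS = {"custom": CustomConfig()}
```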
For generation tasks, set the API key for the provider you want to use:

```
# Linux/macOS
export MISTRAL_API_KEY=your_api_key

# Windows (PowerShell)
$env:MISTRAL_API_KEY="your_api_key"
```

```
# Linux/macOS
export OPENAI_API_KEY=your_api_key

# Windows (PowerShell)
$env:OPENAI_API_KEY="your_api_key"
```
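The generation scripts are expected to pick the key up from the environment. A minimal sketch of that pattern (the lookup is standard; the error message is illustrative):

```python
import os

# Fail early with a clear message if the key was not exported.
api_key = os.environ.get("MISTRAL_API_KEY")
if api_key is None:
    raise RuntimeError("MISTRAL_API_KEY is not set; see the export commands above.")
```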
| Task | Make Command | Equivalent CLI Command | Default Values |
|---|---|---|---|
| Generate with Mistral | `make generate-mistral` | `python src/data/mistral.py --model_name MISTRAL_MODEL --total_requests MISTRAL_REQUESTS --num_workers NUM_WORKERS` | `MISTRAL_MODEL=mistral-large-2411`, `MISTRAL_REQUESTS=100000`, `NUM_WORKERS=4` |
| Generate with OpenAI | `make generate-openai` | `python src/data/openai.py --model_name OPENAI_MODEL --total_requests OPENAI_REQUESTS` | `OPENAI_MODEL=gpt-3.5-turbo`, `OPENAI_REQUESTS=100000` |
| Push Dataset to HF | `make push-dataset` | `python src/data/push_dataset.py --folder_path FOLDER_PATH --repo_name REPO_NAME` | `FOLDER_PATH=./dataset/`, `REPO_NAME=MaxLSB/LeCarnet` |
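At its core, generating one story with the OpenAI backend boils down to a single chat-completion call. The sketch below shows that shape only; the actual prompt, batching, and retry logic live in `src/data/openai.py`, and the French instruction here is a made-up placeholder:

```python
import os

from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Placeholder prompt: "Write a short children's story in French using very simple vocabulary."
prompt = (
    "Écris une courte histoire pour enfants en français, "
    "avec un vocabulaire très simple."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    max_tokens=512,
)
print(response.choices[0].message.content)
```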
To run the evaluation, you also need to set your Mistral API key (see above).
| Task | Make Command | Equivalent CLI Command | Default Values |
|---|---|---|---|
| Evaluation | `make eval` | `python src/eval/eval.py --model_name EVAL_MODEL --judge_model_name JUDGE_MODEL` | `EVAL_MODEL=MaxLSB/LeCarnet-3M`, `JUDGE_MODEL=mistral-large-2411` |
| Inference | `make inference` | `python src/inference/inference.py --model_name MODEL_NAME --prompt PROMPT --max_new_tokens MAX_NEW_TOKENS` | `MODEL_NAME=MaxLSB/LeCarnet-3M`, `PROMPT="Il était une fois"`, `MAX_NEW_TOKENS=512` |
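Since the checkpoints are published on Hugging Face, you can also skip the script and load a model directly with `transformers`, assuming the repositories ship standard causal-LM weights and tokenizers:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaxLSB/LeCarnet-3M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Same defaults as the inference command above.
inputs = tokenizer("Il était une fois", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```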
| Model | Judge | Grammar | Creativity | Coherence | Logic |
|---|---|---|---|---|---|
| LeCarnet-3M | mistral-large-2411 | 6.12 | 6.42 | 5.94 | 5.90 |
| LeCarnet-8M | mistral-large-2411 | 7.06 | 7.20 | 7.56 | 7.28 |
| LeCarnet-21M | mistral-large-2411 | 7.72 | 7.48 | 8.32 | 7.90 |