LeCarnet: A Dataset for Tiny French Language Models

LeCarnet Logo

1. Introduction

LeCarnet is a text dataset of 2 million children's stories in French, written with a very simple vocabulary and inspired by the TinyStories dataset. The goal of this work is to provide a reliable, high-quality resource for training and evaluating small language models (SLMs) in French, aimed at educational and experimental use. This repository contains minimalist code for data generation, training, evaluation, and inference.

This dataset was created by synthetically generating French short stories using Mistral-Large-Instruct-2411.

The dataset and models are available on Hugging Face.
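As an example, the dataset can be loaded directly from the Hub with the datasets library. This is a minimal sketch that assumes the repo ID MaxLSB/LeCarnet used elsewhere in this README and a plain text column; check the dataset card for the exact splits and fields.

```python
# Minimal sketch: load LeCarnet from the Hugging Face Hub.
# Assumes the repo ID MaxLSB/LeCarnet and a "text" column; check the dataset card
# for the actual split and column names.
from datasets import load_dataset

dataset = load_dataset("MaxLSB/LeCarnet")
print(dataset)                      # splits and sizes
print(dataset["train"][0]["text"])  # first story, assuming a "train" split and "text" column
```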

2. Quick Setup

The project uses uv for fast and reliable dependency management.

```bash
# Basic environment setup
make env
```

That's it! You can now run any of the commands below.

⚠️ You might need to perform the following two steps manually before running make env:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
source $HOME/.local/bin/env
```

3. Training

The training pipeline supports Weights & Biases (WandB) for tracking training and validation losses, as well as perplexity.
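Perplexity here presumably follows the standard definition, the exponential of the mean cross-entropy loss; the snippet below is only a sanity-check illustration with a made-up loss value, not a LeCarnet result.

```python
import math

# Standard relationship between mean cross-entropy loss (in nats) and perplexity.
# The loss value below is purely illustrative, not an actual LeCarnet training number.
val_loss = 2.1
perplexity = math.exp(val_loss)
print(f"validation loss {val_loss:.2f} -> perplexity {perplexity:.2f}")  # ~8.17
```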

| Task | Make Command | Equivalent CLI Command | Default Values |
|------|--------------|------------------------|----------------|
| Training | `make train` | `python src/train/train.py --model_config MODEL_CONFIG` | `MODEL_CONFIG=3M` |
| Push Model to HF | `make push-model` | `python src/inference/push-model.py --repo_name HF_REPO --model_dir MODEL_DIR` | `HF_REPO=MaxLSB/LeCarnet-3M`, `MODEL_DIR=LeCarnet-3M/model_weights/` |

⚠️ See src/train/configs.py for fine-grained hyperparameter tuning. Set MODEL_CONFIG=custom to use your own custom model config.

4. Data Generation

For generation tasks, set the API key for the provider you plan to use (Mistral and/or OpenAI):

```bash
# Linux/macOS
export MISTRAL_API_KEY=your_api_key
export OPENAI_API_KEY=your_api_key
```

```powershell
# Windows (PowerShell)
$env:MISTRAL_API_KEY="your_api_key"
$env:OPENAI_API_KEY="your_api_key"
```
| Task | Make Command | Equivalent CLI Command | Default Values |
|------|--------------|------------------------|----------------|
| Generate with Mistral | `make generate-mistral` | `python src/data/mistral.py --model_name MISTRAL_MODEL --total_requests MISTRAL_REQUESTS --num_workers NUM_WORKERS` | `MISTRAL_MODEL=mistral-large-2411`, `MISTRAL_REQUESTS=100000`, `NUM_WORKERS=4` |
| Generate with OpenAI | `make generate-openai` | `python src/data/openai.py --model_name OPENAI_MODEL --total_requests OPENAI_REQUESTS` | `OPENAI_MODEL=gpt-3.5-turbo`, `OPENAI_REQUESTS=100000` |
| Push Dataset to HF | `make push-dataset` | `python src/data/push_dataset.py --folder_path FOLDER_PATH --repo_name REPO_NAME` | `FOLDER_PATH=./dataset/`, `REPO_NAME=MaxLSB/LeCarnet` |
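For a sense of what a single generation request looks like, here is a hypothetical minimal call using the standard openai Python client. The actual prompt, sampling parameters, and batching logic live in src/data/openai.py and are not reproduced here; the French instruction below is only an illustration.

```python
# Hypothetical minimal generation request with the standard OpenAI client.
# The real prompt, sampling parameters, and retry logic live in src/data/openai.py;
# this only illustrates the request/response pattern.
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "Écris une très courte histoire pour enfants en français, "
                       "avec un vocabulaire très simple.",
        }
    ],
)
print(response.choices[0].message.content)
```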

5. Evaluation & Inference

To run the evaluation, you also need to set your Mistral API key (see the previous section).

| Task | Make Command | Equivalent CLI Command | Default Values |
|------|--------------|------------------------|----------------|
| Evaluation | `make eval` | `python src/eval/eval.py --model_name EVAL_MODEL --judge_model_name JUDGE_MODEL` | `EVAL_MODEL=MaxLSB/LeCarnet-3M`, `JUDGE_MODEL=mistral-large-2411` |
| Inference | `make inference` | `python src/inference/inference.py --model_name MODEL_NAME --prompt PROMPT --max_new_tokens MAX_NEW_TOKENS` | `MODEL_NAME=MaxLSB/LeCarnet-3M`, `PROMPT="Il était une fois"`, `MAX_NEW_TOKENS=512` |
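To try a released checkpoint outside the Makefile, the models should load with the standard transformers Auto classes. This is a minimal sketch using the defaults from the table above (MaxLSB/LeCarnet-3M, the prompt "Il était une fois", 512 new tokens); src/inference/inference.py remains the reference implementation.

```python
# Minimal sketch: run a released LeCarnet checkpoint with the standard transformers API.
# Assumes the Hub checkpoints load via the Auto classes; see src/inference/inference.py
# for the reference implementation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "MaxLSB/LeCarnet-3M"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Il était une fois", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```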

6. Results

| Model | Judge | Grammar | Creativity | Coherence | Logic |
|-------|-------|---------|------------|-----------|-------|
| LeCarnet-3M | mistral-large-2411 | 6.12 | 6.42 | 5.94 | 5.90 |
| LeCarnet-8M | mistral-large-2411 | 7.06 | 7.20 | 7.56 | 7.28 |
| LeCarnet-21M | mistral-large-2411 | 7.72 | 7.48 | 8.32 | 7.90 |
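These scores come from an LLM-as-judge setup in which mistral-large-2411 grades model completions on grammar, creativity, coherence, and logic. The exact judging prompt, scale, and aggregation are defined in src/eval/eval.py; the sketch below only illustrates the pattern and assumes the v1 mistralai Python client.

```python
# Hypothetical sketch of the LLM-as-judge pattern; the real prompt, scoring scale,
# and aggregation live in src/eval/eval.py. Assumes the v1 `mistralai` client.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

# A completion produced by one of the LeCarnet models (made-up example).
story = "Il était une fois un petit chat qui aimait beaucoup le lait..."

judge_prompt = (
    "Évalue l'histoire suivante selon la grammaire, la créativité, la cohérence "
    "et la logique, en donnant une note pour chaque critère.\n\n" + story
)
response = client.chat.complete(
    model="mistral-large-2411",
    messages=[{"role": "user", "content": judge_prompt}],
)
print(response.choices[0].message.content)
```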

7. References
