- 📌 CAG vs RAG: Centralized Repository for NACE Revision
This repository is dedicated to the revision of the Nomenclature statistique des Activités économiques dans la Communauté Européenne (NACE).
It provides tools for automated classification and evaluation of business activity codes using Large Language Models (LLMs) and vector-based retrieval systems.
Ensure you have Python 3.12+ and uv or pip installed, then install the required dependencies:
uv sync
or
uv pip install -r pyproject.toml
Set up linting and formatting checks using pre-commit:
uv run pre-commit autoupdate
uv run pre-commit install
Before running the scripts, download the model into the local cache with the Hugging Face CLI, which downloads it faster than vLLM does:
export MODEL_NAME=Qwen/Qwen2.5-0.5B
uv run huggingface-cli download $MODEL_NAME
To create a searchable database of NACE 2025 codes:
uv run src/build_vector_db.py
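As an illustration of what a searchable database of codes does conceptually, here is a minimal sketch of similarity search over code descriptions. It does not reproduce `src/build_vector_db.py`: the bag-of-words `embed` function, the `search` helper, and the sample codes and labels are all assumptions made for this example. A real setup would presumably use a learned embedding model and a vector store.

```python
import math
from collections import Counter

# Hypothetical entries: placeholder codes and labels, not the official nomenclature.
CODES = {
    "X1": "growing of cereals and other crops",
    "X2": "retail sale of bread and pastry",
    "X3": "software development and programming",
}


def embed(text: str) -> Counter:
    """Toy embedding (assumption): token counts stand in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# The "database": one precomputed vector per code.
INDEX = {code: embed(label) for code, label in CODES.items()}


def search(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Return the k codes whose descriptions are most similar to the query."""
    q = embed(query)
    scored = [(code, cosine(q, vec)) for code, vec in INDEX.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

Calling `search("bread retail shop", k=1)` returns the entry for `X2`, since it is the only description sharing tokens with the query.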
For unambiguous classification:
uv run src/encode_unambiguous.py
For ambiguous classification using an LLM:
uv run src/encode_ambiguous.py --experiment_name NACE2025_DATASET --llm_name Qwen/Qwen3-0.6B
⚠️ TO BE UPDATED
Compare different classification models:
uv run src/evaluate_strategies.py
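A simple way to compare strategies is to score each one against a set of reference labels. The sketch below assumes accuracy as the metric; the metrics that `src/evaluate_strategies.py` actually computes are not described here, and the strategy names and labels are made up for illustration.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Share of predictions that match the reference labels."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        return 0.0
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def rank_strategies(
    results: dict[str, list[str]], gold: list[str]
) -> list[tuple[str, float]]:
    """Sort strategies from the most to the least accurate."""
    scores = [(name, accuracy(preds, gold)) for name, preds in results.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Hypothetical predictions from two strategies on three labelled cases.
gold = ["X1", "X2", "X3"]
results = {"rag": ["X1", "X2", "X3"], "cag": ["X1", "X2", "X9"]}
ranking = rank_strategies(results, gold)
```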
Once all unique ambiguous cases have been recoded using the best strategy, you can rebuild the entire dataset with NACE 2025 labels:
uv run src/build_nace2025_sirene4.py
This repository leverages Large Language Models (LLMs) to assist in classifying business activities. Any open-source model that is available on Hugging Face and compatible with vLLM can be used.
This project supports automated workflows via Argo Workflows. To trigger a workflow, execute:
argo submit argo-workflows/relabel-naf08-to-naf25.yaml
Or use the Argo Workflow UI.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.
- Improve versioning of the prompts saved in S3. Ideally, the version should combine the database collection with the version of the RAG prompt in Langfuse.
- Improve the embeddings (should the explanatory notes be kept in full for the similarity search?).
- Implement the reranker (probably the Qwen one).
- Include business rules in the prompts.
- Include code-specific rules in the CAG case (for example, for LMNP, explain what distinguishes the two codes; see @Nathan's file). Should auxiliary variables therefore be included to help decide between codes in some cases?