Natural language processing course: Automatic generation of Slovenian traffic news for RTV Slovenija
The aim of this project is to leverage various natural language processing techniques to enhance a large language model's ability to generate short traffic reports. The scope also encompasses a dataset that can be used for any supervised approach. This dataset comprises website data obtained from the national traffic news website, paired with the final news reports.
- `src/`
  - `consolidate_data.py` – Data extraction/preparation for DP2.
  - `gams.py` – Baseline inference script (non-fine-tuned variant).
  - `input.py` – DP2 input data utilities.
  - `output.py` – DP2 output data utilities.
  - `utils.py` – Helper functions used for the DP2 variant.
- `fine_tunning/`
  - `dp1_inf.py` – Inference script for the `dp1` model variant.
  - `dp1.sh` – Shell script to launch `dp1_inf.py`.
  - `dp2_inf.py` – Inference script for the `dp2` model variant.
  - `dp2.sh` – Shell script to run `dp2_inf.py`.
  - `fine_tunning.py` – Main fine-tuning pipeline for GaMS.
  - `ft.sh` – Shell script to launch `fine_tunning.py`.
- `evaluation/`
  - `evaluation.py` – General evaluation script for computing metrics (e.g., accuracy, BLEU, ROUGE).
  - `llm_evaluation.py` – Evaluation routine using an external LLM (DeepSeek).
  - `subset_preparation.py` – Extracts a small subset of input data for DP2.
- `dp1/`
  - `extract.py` – DP1 extraction of data from RTF files and processing for further use (prompting, fine-tuning).
  - `sentenceMatching.py` – Sentence matching algorithms used as helper functions for `extract.py`.
The report for first submission is available here.
The report for second submission is available here.
The report for final submission is available here.
The following sections may reference files and models that are available on the Arnes HPC cluster in a shared directory under `/d/hpc/projects/onj_fri/nmlp`.
Begin by cloning the repository and creating a virtual environment:
git clone https://github.com/UL-FRI-NLP-Course/ul-fri-nlp-course-project-2024-2025-nmlp.git
cd ul-fri-nlp-course-project-2024-2025-nmlp
python -m venv venv
. venv/bin/activate
pip install -r requirements.txt
pip install -r requirements-pytorch.txt # You may need to adjust the index-url for your CUDA version
python -m spacy download sl_core_news_lg # spaCy trained pipeline for Slovenian
We recommend running any Python scripts as modules, e.g. `python -m path.to.file` instead of `python path/to/file.py`.
The scripts will automatically download the language models to `~/.cache/huggingface`.
If you're running this on the HPC cluster, you can just create a symlink to the shared directory:
# Optionally make a backup of existing directory
# mv ~/.cache/huggingface ~/.cache/huggingface.bak
ln -s /d/hpc/projects/onj_fri/nmlp/huggingface ~/.cache/huggingface
DP1 was run locally, so the only changes needed for the program to work are path adjustments. Set `rtf_base` to the directory containing the RTF files provided by RTVSLO, and `excel_path` to the Excel file, also provided by RTVSLO. After that you should be able to run it with a simple `python -m dp1.extract`, which outputs `dp1.jsonl`; this file is later used by `fine_tunning` as well as `evaluation.py`.
This file is also available on the cluster.
Warning: the extraction is slow (around one output per minute).
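`dp1.jsonl` is a JSON Lines file: one JSON object per line. A minimal sketch for inspecting such a file with only the standard library (the field names in the sample record are hypothetical, not taken from the repository):

```python
import json

def load_jsonl(text):
    """Parse JSONL content: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical record; the real dp1.jsonl fields may be named differently.
sample = '{"input": "Zastoj na A1 pri Vranskem.", "output": "Na avtocesti A1 ..."}'
records = load_jsonl(sample)
```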
The raw input data is part of the repository.
The raw output data, however, was too large for comfort, so it is available here and on the HPC cluster as `RTVSlo.zip`.
After unzipping it and placing the `RTVSlo` directory into the project root directory, you can run `python -m src.consolidate_data` to start generating the processed data for DP2, which will be saved to `dp2.jsonl`.
This file is also available on the cluster.
For fine-tuning, we recommend an HPC node with at least one H100 or equivalent.
Before running the whole process, you can adjust some variables in `fine_tunning/fine_tunning.py`:

- `IO_PAIRS_PATH`: path to the `dp1.jsonl` or `dp2.jsonl` file (output from the data preprocessing stage).
- `MODEL_NAME`: chooses the size of the model you want to fine-tune (we used the 27B variant).
- `PEFT_DIR`: path where the checkpoints will be saved.

Now run the process using `python -m fine_tunning.fine_tunning`.
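For orientation, the variables might look like this near the top of `fine_tunning/fine_tunning.py`; every value below is an illustrative assumption, not the repository's actual defaults:

```python
from pathlib import Path

# Path to the preprocessed input/output pairs (dp1.jsonl or dp2.jsonl).
IO_PAIRS_PATH = Path("dp2.jsonl")            # illustrative value
# Model identifier; pick the size that fits your hardware (we used the 27B variant).
MODEL_NAME = "GaMS-27B"                      # illustrative placeholder, not a real model ID
# Directory where the PEFT adapter checkpoints will be saved.
PEFT_DIR = Path("checkpoints/gams-27b-dp2")  # illustrative value
```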
Disclaimer: when running inference for multiple prompts, you may need to rerun the script if you run out of VRAM.
Basic inference is demonstrated in `src/gams.py`. It takes a subset of the inputs (file `data/dp2_inputs.jsonl`, generated using `evaluation/subset_preparation.py`) and generates a report for every input.
The result is saved to `data/basic_outputs.jsonl`.
DP1 inference is demonstrated in `fine_tunning/dp1_inf.py`. It takes a subset of the inputs (file `data/dp1_inputs.jsonl`) and generates a report for every input.
The result is saved to `data/dp1_outputs.jsonl`.
DP2 inference is demonstrated in `fine_tunning/dp2_inf.py`. It takes a subset of the inputs (file `data/dp2_inputs.jsonl`, generated using `evaluation/subset_preparation.py`) and generates a report for every input.
The result is saved to `data/dp2_outputs.jsonl`.
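All three inference scripts follow the same pattern: read a JSONL file of inputs, generate a report per record, and write the results out as JSONL. A minimal sketch of that loop, with a stand-in `generate()` function in place of the actual model call (the function and field names here are assumptions, not the scripts' real API):

```python
import json

def generate(prompt):
    """Stand-in for the actual model call; replace with real inference."""
    return "POROCILO: " + prompt

def run_inference(input_jsonl):
    """Read JSONL inputs, produce one report per record, return output JSONL."""
    out_lines = []
    for line in input_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # The "input"/"report" field names are assumptions about the file format.
        record["report"] = generate(record["input"])
        out_lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(out_lines)
```

Writing each result line as soon as it is generated (rather than all at once at the end) also makes it cheap to resume after an out-of-VRAM crash, which matches the disclaimer above.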
We evaluate the inference results using an external LLM provider.
Create an account at OpenRouter and then get your API key here.
Then create a `.env` file in the project root and put `DEEPSEEK_API_KEY=[YOUR_API_KEY]` in it.
Now you can run the evaluation using `python -m evaluation.llm_evaluation`.
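Reading the key back in Python needs no extra dependency; here is a minimal `.env` parser sketch using only the standard library (whether the project actually uses a helper library such as python-dotenv for this is not specified, so this is only an illustration):

```python
import os

def parse_dotenv(text):
    """Parse simple KEY=VALUE lines (ignoring blanks and # comments) into a dict."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

# Hypothetical key value for illustration only.
env = parse_dotenv("DEEPSEEK_API_KEY=sk-example-123\n")
os.environ.update(env)
```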
This will give a score from 1 to 10 for the basic (`data/basic_outputs.jsonl`), DP1 (`data/dp1_outputs.jsonl`), and DP2 (`data/dp2_outputs.jsonl`) variants of inference.
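To summarize each variant with a single number, the per-example scores can be averaged. A hedged sketch, assuming each output record carries a numeric `score` field (the actual field name used by `llm_evaluation.py` is an assumption):

```python
import json
from statistics import mean

def average_score(jsonl_text, field="score"):
    """Average a numeric field over all records in a JSONL string."""
    scores = [json.loads(line)[field]
              for line in jsonl_text.splitlines() if line.strip()]
    return mean(scores)
```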