Natural language processing course: Automatic generation of Slovenian traffic news for RTV Slovenija

Authors: Miha Lazić, Luka Gulič, Matevž Crček

About

Traffic reporting is a crucial component of public broadcasting, providing real-time updates on road conditions, accidents, and congestion. At RTV Slovenija, traffic news is currently compiled manually by students who review traffic reports from the promet.si portal and transcribe them into structured news segments, which are then broadcast every 30 minutes. This manual approach is time-consuming, prone to human error, and inefficient given the volume of traffic data that needs to be processed.

With the advancement of large language models (LLMs), there is an opportunity to automate the generation of structured and readable traffic news. This project leverages an existing LLM, applying prompt engineering techniques and fine-tuning to automatically generate traffic reports that adhere to the broadcasting standards of RTV Slovenija. Existing solutions that we considered for this project are presented in the project report (see the Report folder).

Project Structure

/Code contains the project source code.
/Data contains the raw project data (used for fine-tuning, evaluation, ...).
/Images is the output folder for images produced by the project source code.
/Instructions contains the given project instructions and the traffic report rules.
/Report contains the work-in-progress project report, written in LaTeX.
/Models is the output folder for trained models. The models are too large to be included in the repository, so you will need to download them from OneDrive; see the section Generating Text.

Required Python Packages

The required Python packages are listed in the requirements.txt file. You can install them with pip:

pip install -r requirements.txt

Generating the Datasets

To generate the datasets, use the generate_datasets.ipynb notebook. It processes the raw traffic data from the Data folder into structured datasets and produces the training and test files in JSONL format used in the steps below (they are also available on OneDrive).
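
As an illustration, each line of a generated JSONL file is a single JSON object and can be inspected with a few lines of Python. The field names used below ("prompt", "report") are assumptions; check the notebook for the actual keys.

import json

# Load every example from the generated training dataset.
with open("train_dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} training examples")
print(examples[0])  # e.g. a {"prompt": ..., "report": ...} pair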

Training the Model

To train the model, use train.py. It takes the following arguments:

  • model_name: Name of the base model to fine-tune. Default is cjvt/GaMS-9B-Instruct, which we used in our project.
  • output_dir: Directory to save the trained model to. Required.
  • dataset: Path to the training dataset in JSONL format. Required. Use one of the training datasets generated in the previous step.
  • max_steps: Maximum number of training steps. Required.

Example commands to train the model:

python train.py --output_dir ./models/original_reports --dataset train_dataset.jsonl --max_steps 50000

python train.py --output_dir ./models/generated_reports --dataset train_dataset_generated_reports.jsonl --max_steps 50000
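
The trained models used later by eval.py are folders containing adapter_config.json and adapter_model.safetensors, which indicates PEFT-style (LoRA) adapters on top of the base model. The sketch below only illustrates that setup and is not the project's exact training code; the LoRA hyperparameters and data handling are placeholders.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "cjvt/GaMS-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16, device_map="auto")

# Wrap the base model with LoRA adapters; r and lora_alpha are placeholder values.
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Load the JSONL training data produced by generate_datasets.ipynb.
train_data = load_dataset("json", data_files="train_dataset.jsonl", split="train")

# From here, the script would tokenize the examples, run a training loop for
# --max_steps steps, and save the adapter (adapter_config.json and
# adapter_model.safetensors) to --output_dir.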

Generating Text

To generate text with the model, use eval.py. It takes the following arguments:

  • model_name: Name of the base model. Default is cjvt/GaMS-9B-Instruct, which we used in our project.
  • trained_model_path: Path to the trained model. To evaluate only the base model, set this to "None". The path must be a folder that contains adapter_config.json and adapter_model.safetensors.
  • dataset: Path to the evaluation dataset in JSONL format. Required. Use the test dataset generated in the previous step.
  • output_path: Path to the file where the evaluation results will be saved. Default is eval_results.csv.

The trained model for the original reports is available on OneDrive.

The trained model for the Gemini-generated reports is available on OneDrive.

Extract the files to the Models/original_reports and Models/generated_reports folders, respectively.

eval.py prints the prompt currently being evaluated and the generated text, and saves the results to the specified output path. Example commands to evaluate the model:

python eval.py --trained_model_path ./Models/original_reports --dataset test_dataset.jsonl --output_path eval_results_original.csv

python eval.py --trained_model_path ./Models/generated_reports --dataset test_dataset.jsonl --output_path eval_results_generated.csv

python eval.py --trained_model_path "None" --dataset test_dataset.jsonl --output_path eval_results_base_gams.csv

Evaluation

Evaluation is done in the eval_bert_score.ipynb notebook. It calculates the BERTScore for each of the generated reports and prints the scores.

The notebook assumes that all of the result files from the previous step are in the Code folder. If any of them are missing, comment out the code that loads them and rerun the notebook.
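
For reference, the core of the BERTScore computation looks roughly like the sketch below, using the bert_score package. The CSV column names are assumptions; adjust them to match the actual eval.py output.

import pandas as pd
from bert_score import score

# Load one of the result files produced by eval.py.
df = pd.read_csv("eval_results_original.csv")
candidates = df["generated"].tolist()   # assumed column name for the model output
references = df["reference"].tolist()   # assumed column name for the reference reports

# Compute BERTScore precision, recall and F1 for Slovenian text.
P, R, F1 = score(candidates, references, lang="sl")
print(f"Mean BERTScore F1: {F1.mean().item():.4f}")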

