Natural language processing course: Automatic generation of Slovenian traffic news for RTV Slovenija

Authors: Miha Lazić, Luka Gulič, Matevž Crček

About

Traffic reporting is a crucial component of public broadcasting, providing real-time updates on road conditions, accidents, and congestion. At RTV Slovenija, traffic news is currently compiled manually by students who review traffic reports from the promet.si portal and transcribe them into structured news segments, which are then broadcast every 30 minutes. This manual approach is time-consuming, prone to human error, and inefficient given the volume of traffic data that needs to be processed.

With the advancement of large language models (LLMs), there is an opportunity to automate the generation of structured and readable traffic news. This project leverages an existing LLM, applying prompt engineering techniques and fine-tuning to automatically generate traffic reports that adhere to the broadcasting standards of RTV Slovenija. Existing solutions that we considered for this project are presented in the project report (see the Report folder).

Project Structure

/Code contains the project source code.
/Data contains the raw project data (used for fine-tuning, evaluation, ...).
/Images is the output folder for images produced by the project source code.
/Instructions contains the given project instructions and the traffic report rules.
/Report contains the work-in-progress project report, written in LaTeX.
/Models is the output folder for trained models. The models are too large to be included in the repository, so you will need to download them from OneDrive; see the section Generating Text.

Required Python Packages

The required Python packages are listed in the requirements.txt file. You can install them with pip:

pip install -r requirements.txt

Generating the Datasets

To generate the datasets, use the generate_datasets.ipynb notebook. It processes the raw traffic data from the Data folder into structured datasets and produces the training and test files in JSONL format used in the steps below (they are also available on OneDrive).
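
As an illustration, each line of a generated JSONL file is a single JSON object and can be inspected with a few lines of Python. The field names used below ("prompt", "report") are assumptions; check the notebook for the actual keys.

import json

# Load every example from the generated training dataset.
with open("train_dataset.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

print(f"{len(examples)} training examples")
print(examples[0])  # e.g. a {"prompt": ..., "report": ...} pair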

Training the Model

To train the model, use train.py. It takes the following arguments:

  • model_name: Name of the base model to fine-tune. Default is cjvt/GaMS-9B-Instruct, which we used in our project.
  • output_dir: Directory to save the trained model to. Required.
  • dataset: Path to the training dataset in JSONL format. Required. Use one of the training datasets generated in the previous step.
  • max_steps: Maximum number of training steps. Required.

Example commands to train the model:

python train.py --output_dir ./models/original_reports --dataset train_dataset.jsonl --max_steps 50000

python train.py --output_dir ./models/generated_reports --dataset train_dataset_generated_reports.jsonl --max_steps 50000
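
The trained models used later by eval.py are folders containing adapter_config.json and adapter_model.safetensors, which indicates PEFT-style (LoRA) adapters on top of the base model. The sketch below only illustrates that setup and is not the project's exact training code; the LoRA hyperparameters and data handling are placeholders.

import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "cjvt/GaMS-9B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16, device_map="auto")

# Wrap the base model with LoRA adapters; r and lora_alpha are placeholder values.
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)

# Load the JSONL training data produced by generate_datasets.ipynb.
train_data = load_dataset("json", data_files="train_dataset.jsonl", split="train")

# From here, the script would tokenize the examples, run a training loop for
# --max_steps steps, and save the adapter (adapter_config.json and
# adapter_model.safetensors) to --output_dir.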

Generating Text

To generate text with the model, use eval.py. It takes the following arguments:

  • model_name: Name of the base model. Default is cjvt/GaMS-9B-Instruct, which we used in our project.
  • trained_model_path: Path to the trained model. To evaluate only the base model, set this to "None". The path must be a folder that contains adapter_config.json and adapter_model.safetensors.
  • dataset: Path to the evaluation dataset in JSONL format. Required. Use the test dataset generated in the previous step.
  • output_path: Path to the file where the evaluation results will be saved. Default is eval_results.csv.

The trained model for the original reports is available on OneDrive.

The trained model for the Gemini-generated reports is available on OneDrive.

Extract the files to the Models/original_reports and Models/generated_reports folders, respectively.

eval.py prints the prompt currently being evaluated and the generated text, and saves the results to the specified output path. Example commands to evaluate the model:

python eval.py --trained_model_path ./Models/original_reports --dataset test_dataset.jsonl --output_path eval_results_original.csv

python eval.py --trained_model_path ./Models/generated_reports --dataset test_dataset.jsonl --output_path eval_results_generated.csv

python eval.py --trained_model_path "None" --dataset test_dataset.jsonl --output_path eval_results_base_gams.csv

Evaluation

Evaluation is done in the eval_bert_score.ipynb notebook. It calculates the BERTScore for each of the generated reports and prints the scores.

The notebook assumes that all of the result files from the previous step are in the Code folder. If any of them are missing, comment out the code that loads them and rerun the notebook.
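
For reference, the core of the BERTScore computation looks roughly like the sketch below, using the bert_score package. The CSV column names are assumptions; adjust them to match the actual eval.py output.

import pandas as pd
from bert_score import score

# Load one of the result files produced by eval.py.
df = pd.read_csv("eval_results_original.csv")
candidates = df["generated"].tolist()   # assumed column name for the model output
references = df["reference"].tolist()   # assumed column name for the reference reports

# Compute BERTScore precision, recall and F1 for Slovenian text.
P, R, F1 = score(candidates, references, lang="sl")
print(f"Mean BERTScore F1: {F1.mean().item():.4f}")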

