

Pictalk_NLP

Implementation of the PrAACT method (described in PrAACT: Predictive Augmentative and Alternative Communication with Transformers) in order to help the development of Pictalk.

Dependencies and installation

This project uses Python 3.10.12. Use the package manager pip to install the dependencies:

pip install -r requirements.txt

spaCy's en_core_web_sm model is also needed. You can download it by running python -m spacy download en_core_web_sm.

Usage

The code is divided into three main files, each implementing a step of the method described in the paper:

  1. corpus_annotation.py : Annotates the aactext corpus as described in the paper and pushes it to HuggingFace. The annotation consists of lemmatizing each sentence and adding both the original and the lemmatized sentence to the dataset. Only sentences that do not contain commas are processed, and punctuation is removed from both the original and the lemmatized sentences.
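The filtering and pairing logic of this step can be sketched roughly as follows (a toy lemma table stands in for spaCy's en_core_web_sm pipeline, and the dataset upload is omitted):

```python
import string

def annotate(sentences, lemmatize):
    """Keep only comma-free sentences, strip punctuation, and pair each
    original sentence with its lemmatized form."""
    records = []
    for sent in sentences:
        if "," in sent:
            continue  # sentences containing commas are skipped entirely
        clean = sent.translate(str.maketrans("", "", string.punctuation))
        records.append({"text": clean, "lemmatized": lemmatize(clean)})
    return records

# toy lemma table standing in for spaCy's en_core_web_sm pipeline
TOY_LEMMAS = {"dogs": "dog", "barked": "bark"}

def toy_lemmatize(sentence):
    return " ".join(TOY_LEMMAS.get(w, w) for w in sentence.lower().split())

rows = annotate(["The dogs barked.", "Well, maybe."], toy_lemmatize)
# rows == [{"text": "The dogs barked", "lemmatized": "the dog bark"}]
```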

  2. fine_tuning.py : Fine-tunes a transformer model on the annotated corpus. The default model is BERT Large, but it can be changed with the --model/-m argument. The code is an adaptation of this HuggingFace tutorial. The main difference is that the preprocessed dataset is pushed to HuggingFace, because preprocessing takes too long on Google Colab due to its limited CPU power. Preprocessing is triggered with the --preprocess argument.

Warning

The preprocessing must be redone every time the corpus changes and every time a new type of model is fine-tuned, since it uses that model's tokenizer.
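Under the hood, masked-language-model fine-tuning trains the model to recover randomly hidden tokens. A simplified sketch of that masking step (this is roughly what a BERT-style data collator does; the token ids and vocabulary size below are hypothetical):

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, mlm_prob=0.15, seed=0):
    """BERT-style masking: select ~15% of positions; of those, 80% become
    [MASK], 10% a random token, 10% stay unchanged. Labels are -100
    (ignored by the loss) everywhere except the selected positions."""
    rng = random.Random(seed)
    inputs, labels = list(token_ids), [-100] * len(token_ids)
    for i, tok in enumerate(token_ids):
        if rng.random() < mlm_prob:
            labels[i] = tok  # the model must predict the original token here
            roll = rng.random()
            if roll < 0.8:
                inputs[i] = mask_id
            elif roll < 0.9:
                inputs[i] = rng.randrange(vocab_size)
    return inputs, labels

ids = list(range(100, 140))  # hypothetical token ids for one sentence
inputs, labels = mask_tokens(ids, mask_id=103, vocab_size=30522)
```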

  3. vocabulary_encoding.py : Computes the decoder layer of the final model as described in the paper. An embedding matrix is built from a dataset of pictograms and the embedding layer of a transformer. A linear layer using this matrix as its weights is then created and pushed to HuggingFace. Note that the model itself remains unchanged during this process. The vocabulary (i.e. the dataset of pictograms) used is CACE-UTAC. It has been translated into English and pushed to HuggingFace using datasets/upload_ARASAAC_CACE.py.
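The idea behind the decoder construction can be sketched like this (the tiny hand-made embeddings and pictogram labels are purely illustrative; the real script takes the vectors from the transformer's embedding layer and the labels from CACE-UTAC):

```python
from statistics import fmean

# hypothetical word-embedding table standing in for the transformer's
# input-embedding layer
WORD_EMB = {
    "eat":   [0.9, 0.1, 0.0],
    "apple": [0.2, 0.8, 0.1],
    "red":   [0.1, 0.2, 0.9],
}

# hypothetical pictogram vocabulary: each pictogram has a (multi-)word label
PICTO_LABELS = {"eat": ["eat"], "red apple": ["red", "apple"]}

def mean_vec(vectors):
    """Average a list of equally sized vectors component-wise."""
    return [fmean(dim) for dim in zip(*vectors)]

# decoder weight matrix: one row per pictogram, the mean embedding of its label
W = [mean_vec([WORD_EMB[w] for w in words]) for words in PICTO_LABELS.values()]

def decode(hidden_state):
    """Linear layer without bias: project a hidden state onto pictogram logits."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in W]

logits = decode([0.9, 0.1, 0.0])  # one logit per pictogram
```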

complete_sentences.py has been added to show how to use the final models. By default, the decoder layer used is the one computed and pushed by vocabulary_encoding.py; pass the --no-encode argument to use the model's own decoder layer instead.
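Completing a sentence then amounts to running the model on an input with a masked position and ranking pictograms by the softmax of the decoder's logits. A minimal sketch of that last ranking step, with hypothetical logits and labels:

```python
import math

def rank_pictograms(logits, pictograms, k=3):
    """Turn the decoder's logits for a masked position into the k most
    probable pictograms (softmax with the usual max-shift for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    ranked = sorted(zip(pictograms, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# hypothetical logits for the masked slot in "I want to eat [MASK]"
suggestions = rank_pictograms([2.1, 0.3, 1.4], ["apple", "red", "eat"], k=2)
# highest logit ranks first: "apple", then "eat"
```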
