This repository provides WiViCo (Wikipedia-Vikidia Corpus), a general-purpose parallel dataset of complex-simpler sentence pairs for French sentence simplification. It results from a two-step automatic filtering method that mines register-diversified comparable corpora to extract complex-simpler pairs. To do so, we sequentially address the two primary conditions that a simplified sentence must satisfy to be considered valid:
- preservation of the original meaning, which we address using n:m-aware SBERT-based cosine similarities; and
- simplicity gain with respect to the source text, which we address with a text simplicity classification model.
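The two conditions above can be pictured as a sequential filter over candidate sentence pairs. The following is only a minimal sketch, not the code used to build WiViCo: the function names, the 0.7 threshold, the bag-of-words embedder and the length-based simplicity check are all hypothetical stand-ins. The actual method relies on SBERT embeddings, n:m-aware alignment (not reproduced here) and a trained simplicity classifier.

```python
import math
from typing import Callable, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(u: Vector, v: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def two_step_filter(
    candidates: Iterable[Tuple[str, str]],
    embed: Callable[[str], Vector],
    is_simpler: Callable[[str, str], bool],
    sim_threshold: float = 0.7,  # arbitrary value chosen for illustration
) -> List[Tuple[str, str, float]]:
    """Keep (complex, simple) pairs that pass both conditions, in order."""
    kept = []
    for complex_sent, simple_sent in candidates:
        # Step 1: meaning preservation, approximated by embedding similarity.
        sim = cosine_similarity(embed(complex_sent), embed(simple_sent))
        if sim < sim_threshold:
            continue
        # Step 2: simplicity gain of the target with respect to the source.
        if is_simpler(complex_sent, simple_sent):
            kept.append((complex_sent, simple_sent, sim))
    return kept


def make_bow_embedder(sentences: Iterable[str]) -> Callable[[str], Vector]:
    """Toy stand-in for a sentence encoder: word counts over a fixed vocabulary."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})

    def embed(sentence: str) -> Vector:
        words = sentence.lower().split()
        return [float(words.count(w)) for w in vocab]

    return embed


def shorter_is_simpler(complex_sent: str, simple_sent: str) -> bool:
    """Toy stand-in for a simplicity classifier: fewer words counts as simpler."""
    return len(simple_sent.split()) < len(complex_sent.split())


# Invented candidate pairs, for illustration only.
pairs = [
    ("le chat domestique dort sur le canapé", "le chat dort sur le canapé"),
    ("le chat dort sur le canapé", "la voiture roule vite"),
    ("le chat dort", "le chat domestique dort sur le canapé"),
]
embed = make_bow_embedder(s for pair in pairs for s in pair)
kept = two_step_filter(pairs, embed, shorter_is_simpler)
```

With these toy components, the second pair is rejected at step 1 (no shared words) and the third at step 2 (the target is longer than the source), so only the first pair is kept. Swapping `embed` for a real sentence encoder and `is_simpler` for a trained classifier leaves the control flow unchanged.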
This repository currently contains two different versions:
- The `wivico_v.1` subfolder. It comprises the initial version of the dataset, in which we operationalized the aforementioned conditions using n:m-aware SBERT-based cosine similarities (as a proxy for meaning retention) and an FFNN-based simplicity gain classifier. It results from the experiments conducted in the following article:

      @inproceedings{ormaechea-2023-extracting-simplification-pairs,
        title = {Extracting Sentence Simplification Pairs from {F}rench Comparable Corpora Using a Two-Step Filtering Method},
        author = {Lucía Ormaechea and Nikos Tsourakis},
        booktitle = {Proceedings of the 8th edition of the Swiss Text Analytics Conference},
        month = {6},
        year = {2023},
        location = {Neuchâtel, Switzerland},
        publisher = {Association for Computational Linguistics},
        url = {https://aclanthology.org/2023.swisstext-1.4/},
        pages = {30--40}
      }
- The `wivico_v.2` subfolder, which includes the newest version of WiViCo. This version still relies on SBERT-based cosine similarities to assess meaning preservation, but it uses a finer-grained method to capture complex-simpler sentence pairs than the one used in the first version. It results from the experiments performed in the following paper:

      @inproceedings{ormaechea-2023-simple-simpler-beyond,
        title = {Simple, Simpler and Beyond: A Fine-Tuning BERT-Based Approach to Enhance Sentence Complexity Assessment for Text Simplification},
        author = {Lucía Ormaechea and Nikos Tsourakis and Didier Schwab and Pierrette Bouillon and Benjamin Lecouteux},
        booktitle = {Proceedings of the 6th International Conference on Natural Language and Speech Processing (ICNLSP)},
        month = {12},
        year = {2023},
        location = {Trento, Italy},
        publisher = {Association for Computational Linguistics},
        url = {https://aclanthology.org/2023.icnlsp-1.12/},
        pages = {120--133}
      }
Contact person: Lucía Ormaechea, lucia.ormaecheagrijalba@unige.ch
If you have further questions, don't hesitate to send us an email.