Data
The data used to train and evaluate the CRF model can be downloaded from here as pickled pandas DataFrames. You can download either the split sets (train.pkl, 137MB; test.pkl, 17MB; dev.pkl, 17MB) or the full dataset (szeged_fixed.pkl, 172MB).
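A minimal sketch of loading one of these files with `pandas.read_pickle`. The column names (`token`, `sentence_id`, `label`) and the helper `load_split` are hypothetical stand-ins for illustration; the real files may name their columns differently. To keep the example runnable without the download, it writes a tiny toy frame to a temporary file and reads that back.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_split(path):
    """Load one pickled DataFrame (for example train.pkl) from disk."""
    return pd.read_pickle(path)


# Stand-in for a downloaded file: a tiny frame with hypothetical column names.
toy = pd.DataFrame({
    "token": ["She", "may", "be", "asleep", "."],
    "sentence_id": [0, 0, 0, 0, 0],
    "label": ["C", "E", "C", "C", "C"],
})

with tempfile.TemporaryDirectory() as tmp:
    toy_path = Path(tmp) / "toy.pkl"
    toy.to_pickle(toy_path)
    loaded = load_split(toy_path)

print(loaded.shape)
print(loaded.head())
```

With the real data, the call would be something like `load_split("train.pkl")`. Unpickling executes code embedded in the file, so only load pickles from a source you trust.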
Each row in the DataFrame contains a token, its features (see the 'Features' wiki page), its sentence ID, and its label. The labels refer to different types of semantic uncertainty (Szarvas et al. 2012):
- Epistemic: the proposition is possible, but its truth-value cannot be decided at the moment. Example: She may be already asleep.
- Investigation: the proposition is in the process of having its truth-value determined. Example: She examined the role of NF-kappaB in protein activation.
- Doxastic: the proposition expresses beliefs and hypotheses, which may be known as true or false by others. Example: She believes that the Earth is flat.
- CoNdition: the proposition is true or false based on the truth-value of another proposition. Example: If she gets the job, she will move to Utrecht.
- Certain: the token is not an uncertainty cue.
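Because the labels are assigned per token while a CRF is trained on whole sequences, the rows typically need to be grouped by sentence ID before training. The sketch below shows one way to do that. It assumes the labels are stored as the single-letter codes E, I, D, N, and C, and that the columns are called `token`, `sentence_id`, and `label`; both the column names and the helper `to_sequences` are hypothetical.

```python
import pandas as pd

# Assumed single-letter codes for the five classes.
LABEL_NAMES = {
    "E": "Epistemic",
    "I": "Investigation",
    "D": "Doxastic",
    "N": "Condition",
    "C": "Certain",
}

# Toy token-level data with hypothetical column names.
toy = pd.DataFrame({
    "token": ["She", "may", "be", "asleep", ".",
              "If", "she", "wins", ",", "she", "moves", "."],
    "sentence_id": [0] * 5 + [1] * 7,
    "label": ["C", "E", "C", "C", "C",
              "N", "C", "C", "C", "C", "C", "C"],
})


def to_sequences(df):
    """Group token rows into one (tokens, labels) pair per sentence."""
    sequences = []
    for _, sent in df.groupby("sentence_id", sort=False):
        sequences.append((sent["token"].tolist(), sent["label"].tolist()))
    return sequences


sequences = to_sequences(toy)

# Count uncertainty cues per class, ignoring tokens labeled Certain.
cues = toy.loc[toy["label"] != "C", "label"]
cue_counts = cues.map(LABEL_NAMES).value_counts().to_dict()

print(sequences[0])
print(cue_counts)
```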
The data is a token-level version of the Szeged Uncertainty Corpus (Szarvas et al. 2012), which is originally available in a sentence-level XML format.
The basis for the token-level version is LUCI's merged_data file, downloadable from here. However, this file suffers from a few issues:
- The dataset contains 27,760 duplicated sentences.
- The labels include an additional, incorrect class, U. This results from a mix of two versions of the dataset: the binary version, with the classes Certain (C) and Uncertain (U), and the multiclass version, with C and the four uncertainty classes (E, I, D, and N).
- The features (e.g. stem, part-of-speech) are not arranged in a fixed order. Moreover, a feature that is not relevant for a specific row is omitted entirely, so the number of columns varies from row to row.
Therefore, the following procedure was applied to the merged_data file:
- Duplicated sentences were removed, keeping only the first instance of each sentence.
- U labels were re-labeled based on the original Szeged corpus.
- Features were placed in their respective columns, with clear column names.
The code for this procedure can be found in the fix_labels.ipynb notebook.
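As a rough illustration of two of these steps (not the notebook's actual code), the sketch below aligns rows with varying feature sets into fixed columns and removes duplicated sentences, keeping the first instance. The input layout, the column names, and the feature names `pos` and `stem` are assumptions made for the example. Re-labeling U tokens is not shown, since it requires the original Szeged corpus.

```python
import pandas as pd

# Hypothetical raw rows: each row lists only the features relevant to it.
rows = [
    {"sentence_id": 0, "token": "She", "label": "C", "pos": "PRP"},
    {"sentence_id": 0, "token": "may", "label": "E", "pos": "MD", "stem": "may"},
    {"sentence_id": 1, "token": "She", "label": "C", "pos": "PRP"},
    {"sentence_id": 1, "token": "may", "label": "E", "pos": "MD", "stem": "may"},
    {"sentence_id": 2, "token": "Stop", "label": "C", "stem": "stop"},
]

# Building the frame from dicts puts each feature in its own column;
# features missing from a row become NaN.
df = pd.DataFrame(rows)

# Impose a fixed, explicit column order.
df = df[["sentence_id", "token", "pos", "stem", "label"]]

# Rebuild each sentence's text, then keep the first ID for each unique text.
text = df.groupby("sentence_id", sort=False)["token"].agg(" ".join)
keep_ids = text.drop_duplicates(keep="first").index

deduped = df[df["sentence_id"].isin(keep_ids)].reset_index(drop=True)
print(deduped)
```

Comparing sentences by their joined tokens is one possible definition of a duplicate; the notebook may use a different criterion.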