Jenia Kim edited this page Jul 29, 2021 · 2 revisions

Data

The data used to train and evaluate the CRF model can be downloaded from here in the form of pickled pandas DataFrames. You can download either the split sets (train.pkl 137MB, test.pkl 17MB, dev.pkl 17MB) or the full dataset (szeged_fixed.pkl 172MB).
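
As a minimal sketch of loading the split sets, assuming the three .pkl files have been downloaded into a single local directory (the `load_splits` helper and its `data_dir` parameter are illustrative names, not part of the repository):

```python
from pathlib import Path

import pandas as pd


def load_splits(data_dir="."):
    """Load the train/dev/test DataFrames from a directory of downloaded .pkl files."""
    data_dir = Path(data_dir)
    # File names follow the download list above; the directory is an assumption.
    return {
        name: pd.read_pickle(data_dir / f"{name}.pkl")
        for name in ("train", "dev", "test")
    }
```

Calling `load_splits("data")` would return a dict with the keys `train`, `dev` and `test`. Unpickling can execute arbitrary code, so only load pickle files obtained from a source you trust.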

Each row in the DataFrame contains a token, its features (see the 'Features' wiki page), its sentence ID, and its label. The labels refer to different types of semantic uncertainty (Szarvas et al. 2012):

  • Epistemic: the proposition is possible, but its truth-value cannot be decided at the moment. Example: She may be already asleep.
  • Investigation: the proposition is in the process of having its truth-value determined. Example: She examined the role of NF-kappaB in protein activation.
  • Doxastic: the proposition expresses beliefs and hypotheses, which may be known to be true or false by others. Example: She believes that the Earth is flat.
  • CoNdition: the proposition is true or false based on the truth-value of another proposition. Example: If she gets the job, she will move to Utrecht.
  • Certain: the token is not an uncertainty cue.
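
Because a CRF is trained on whole sequences, the token rows usually need to be regrouped by sentence ID before training. The sketch below shows one way to do that. The column names (`sentence_id`, `token`, `label`) and the single-letter label codes stored in the label column are assumptions for illustration; check `df.columns` and the label values in the downloaded files for the actual ones.

```python
import pandas as pd

# Label codes for the five classes; that the label column stores
# these single letters is an assumption.
LABEL_NAMES = {
    "E": "Epistemic",
    "I": "Investigation",
    "D": "Doxastic",
    "N": "Condition",
    "C": "Certain",
}


def to_sentences(df, sent_col="sentence_id", token_col="token", label_col="label"):
    """Group token rows into (tokens, labels) pairs, one pair per sentence."""
    sentences = []
    # sort=False keeps sentences in order of first appearance.
    for _, group in df.groupby(sent_col, sort=False):
        sentences.append((group[token_col].tolist(), group[label_col].tolist()))
    return sentences


# Toy rows for illustration only; they are not taken from the corpus.
toy = pd.DataFrame({
    "sentence_id": [0, 0, 0, 0, 0, 1, 1],
    "token": ["She", "may", "be", "asleep", ".", "She", "left"],
    "label": ["C", "E", "C", "C", "C", "C", "C"],
})
sentences = to_sentences(toy)
```

Each element of `sentences` pairs a token list with a label list of the same length, which is the shape most sequence-labeling toolkits expect.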

The data is a token-level version of the Szeged Uncertainty Corpus (Szarvas et al. 2012), which was originally released in a sentence-level XML format.

Creation of the token-level version

The basis for the token-level version is LUCI's merged_data file, downloadable from here. However, this file suffers from a few issues:

  • The dataset contains 27,760 duplicated sentences.
  • The labels include an additional, incorrect class: U. This is the result of mixing two versions of the dataset: the binary version, with the classes Certain (C) and Uncertain (U), and the multiclass version, with C and the 4 uncertainty classes (E, I, D, and N).
  • The features (e.g. stem, part-of-speech, etc.) are not arranged in a fixed order. Moreover, if a feature is not relevant for a specific row, it is omitted entirely, which results in a varying number of columns per row.

Therefore, the following procedure was applied to the merged_data file:

  1. Duplicated sentences were removed, keeping only the first instance of each sentence.
  2. U labels were re-labeled based on the original Szeged corpus.
  3. Features were placed in their respective columns, with clear column names.

The code for this procedure can be found in the fix_labels.ipynb notebook.
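
The notebook is the authoritative version. As a rough illustration of the three steps only, the sketch below assumes a simplified in-memory input: each sentence is a list of `(token, label, features)` tuples, where `features` is a list of `"name=value"` strings in arbitrary order, and `gold_labels` maps a sentence (as a tuple of tokens) to its multiclass labels from the original corpus. The input layout, the `clean` function, the `FEATURE_COLUMNS` list and the toy data are all hypothetical and do not reflect the actual merged_data format.

```python
import pandas as pd

# Hypothetical fixed feature order; the real feature set is on the 'Features' page.
FEATURE_COLUMNS = ["stem", "pos"]


def clean(sentences, gold_labels):
    """Deduplicate sentences, resolve U labels, and align features into columns."""
    seen = set()
    rows = []
    for sentence in sentences:
        tokens = tuple(tok for tok, _, _ in sentence)
        # Step 1: keep only the first instance of each sentence.
        if tokens in seen:
            continue
        seen.add(tokens)
        sent_id = len(seen) - 1
        for i, (tok, label, feats) in enumerate(sentence):
            # Step 2: replace a binary "U" label with the multiclass label.
            if label == "U":
                label = gold_labels[tokens][i]
            # Step 3: parse "name=value" strings into named fields.
            row = dict(f.split("=", 1) for f in feats)
            row.update(token=tok, sentence_id=sent_id, label=label)
            rows.append(row)
    # Fixed column order; features missing from a row become NaN.
    columns = ["sentence_id", "token", *FEATURE_COLUMNS, "label"]
    return pd.DataFrame(rows).reindex(columns=columns)


# Toy input for illustration only: one sentence, duplicated once.
sent = [
    ("She", "C", ["pos=PRP", "stem=she"]),
    ("may", "U", ["stem=may", "pos=MD"]),
    ("sleep", "C", ["stem=sleep"]),
]
gold = {("She", "may", "sleep"): ["C", "E", "C"]}
df = clean([sent, sent], gold)
```

Building the frame from a list of dicts and then reindexing is what gives every row the same columns regardless of which features it originally listed.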
