Dataset Inquiries  #1

@janetlauyeung

Hi there,

I'm reaching out to ask a few questions about the SciDTB dataset that I couldn't find answers to in this repository or the paper:

Firstly, the dev and test directories clearly separate the gold annotations from the second annotations, whereas the train directory seems to have everything lumped together: it contains file names like P14-1024_anno1.edu.txt.dep, P14-1024_anno2.edu.txt.dep, and P14-1024_anno3.edu.txt.dep. Could you please clarify which of the multiple annotation files is considered the gold annotation?
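
To make the question concrete, here is a minimal sketch of how the train files could be grouped by document ID to see which documents carry more than one annotation. The filename pattern is inferred only from the three example names above and is an assumption, as are the names `ANNO_PATTERN` and `group_annotations`.

```python
import re
from collections import defaultdict

# Assumed pattern, inferred from names like P14-1024_anno1.edu.txt.dep
ANNO_PATTERN = re.compile(r"^(?P<doc>.+?)_anno(?P<n>\d+)\.edu\.txt\.dep$")


def group_annotations(filenames):
    """Map each document ID to the sorted list of annotation numbers found for it."""
    groups = defaultdict(list)
    for name in filenames:
        match = ANNO_PATTERN.match(name)
        if match:  # names that do not follow the assumed pattern are skipped
            groups[match.group("doc")].append(int(match.group("n")))
    return {doc: sorted(numbers) for doc, numbers in groups.items()}


names = [
    "P14-1024_anno1.edu.txt.dep",
    "P14-1024_anno2.edu.txt.dep",
    "P14-1024_anno3.edu.txt.dep",
]
print(group_annotations(names))  # {'P14-1024': [1, 2, 3]}
```

This only shows which documents have multiple annotations; it cannot tell which of them is the gold one.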

Secondly, just wanted to double check: are same-unit, joint, and comparison the only discourse relations that are considered multinuclear (i.e. symmetric) in SciDTB?

Thirdly, there are 4 files that contain textual EDUs whose parent head is -1, as exemplified below by eduid=1 (dev file: D14-1080) in the figure. Could this be an error from the original annotation file? The file names for these are as follows:

  • train: P16-1069_anno2
  • dev: D14-1080; D14-1099
  • test: D14_1042

[Figure: dev_D14-1080]
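
The check behind the list above can be sketched roughly as follows. This assumes each .dep file is JSON with a top-level "root" list whose entries have "id" and "parent" fields, and that the artificial root node has id 0; the function names and the `utf-8-sig` encoding are likewise assumptions for illustration, not something the repository documents.

```python
import json
from pathlib import Path


def find_orphan_edus(doc):
    """Return ids of EDUs whose parent is -1, excluding the artificial root (assumed id 0)."""
    return [
        edu["id"]
        for edu in doc["root"]
        if edu["parent"] == -1 and edu["id"] != 0
    ]


def scan_directory(directory):
    """Collect {file name: orphan EDU ids} for every .dep file under a directory."""
    results = {}
    for path in sorted(Path(directory).rglob("*.dep")):
        # utf-8-sig tolerates a leading byte-order mark (an assumption about the files)
        with open(path, encoding="utf-8-sig") as handle:
            doc = json.load(handle)
        orphans = find_orphan_edus(doc)
        if orphans:
            results[path.name] = orphans
    return results


# Hypothetical document mirroring the case described: eduid=1 has parent -1
sample = {
    "root": [
        {"id": 0, "parent": -1, "text": "ROOT", "relation": "null"},
        {"id": 1, "parent": -1, "text": "first EDU", "relation": "ROOT"},
        {"id": 2, "parent": 1, "text": "second EDU", "relation": "elab-addition"},
    ]
}
print(find_orphan_edus(sample))  # [1]
```

Calling `scan_directory` on the train, dev, and test directories would then list every file containing such an EDU.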

Lastly, is there an annotation manual and/or detailed documentation of this dataset that is publicly available for reference regarding various aspects of the data?

Looking forward to hearing from you soon!

Cheers,
Janet
