Duplicate entries in med_dialog dataset #3746

@bkj

I was looking at the med_dialog dataset in MedHELM and noticed a large number of duplicate questions, especially in the icliniq subset.

Looking at the HELM source code here:

source_url="https://worksheets.codalab.org/rest/bundles/0x82f0c47f6d3e4462ae9ef8ea39eebe64/"

I downloaded the "original" dataset from
https://worksheets.codalab.org/rest/bundles/0x82f0c47f6d3e4462ae9ef8ea39eebe64/

and confirmed the duplicate entries are present there as well:

>>> import json
... 
... for subset in ['icliniq', 'healthcaremagic']:
...     for split in ['train', 'valid', 'test']:
...         with open(f'{subset}/{split}.json') as f:
...             x = json.load(f)
... 
...         # Compare the total number of 'src' values with the number of distinct ones
...         srcs = [xx['src'] for xx in x['data']]
...         print(f'{subset} {split}: {len(srcs)} total, {len(set(srcs))} unique')

gives

icliniq train: 24851 total, 16573 unique # dups
icliniq valid: 3105 total, 2087 unique # dups
icliniq test: 3108 total, 2069 unique # dups
healthcaremagic train: 181122 total, 181112 unique # dups
healthcaremagic valid: 22641 total, 22641 unique
healthcaremagic test: 22642 total, 22642 unique
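
To see which questions repeat, rather than only how many, something like the following might help. It is a minimal sketch, not anything from HELM: `duplicate_report` and `drop_duplicates` are hypothetical helper names, and the toy records only stand in for `x['data']` (the only field assumed is `src`, as in the snippet above).

```python
from collections import Counter


def duplicate_report(records, key="src"):
    """Return (n_total, n_unique, repeated) for the values of `key`.

    `repeated` maps each value that occurs more than once to its count.
    """
    counts = Counter(r[key] for r in records)
    repeated = {value: n for value, n in counts.items() if n > 1}
    return len(records), len(counts), repeated


def drop_duplicates(records, key="src"):
    """Keep only the first record for each distinct value of `key`."""
    seen = set()
    kept = []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            kept.append(r)
    return kept


# Toy records standing in for x['data']; not taken from the real dataset.
toy = [{"src": "q1"}, {"src": "q2"}, {"src": "q1"}, {"src": "q3"}, {"src": "q1"}]

n_total, n_unique, repeated = duplicate_report(toy)
print(f"{n_total} total, {n_unique} unique, repeated: {repeated}")
print([r["src"] for r in drop_duplicates(toy)])
```

Note that deduplicating on `src` alone would also drop records that share a question but have different answers, so whether that is the right key depends on how the dataset is meant to be used.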

(I'm not sure what the data hosted on CodaLab is or where it comes from. Possibly from the BioBART preprocessing? Unfortunately, the original MedDialog link is dead: https://github.com/UCSD-AI4H/Medical-Dialogue-System)
