I was looking at the `med_dialog` dataset in MedHELM and noticed that it contains a large number of duplicate questions, especially in the `icliniq` subset.
Looking at the HELM source code here:
source_url="https://worksheets.codalab.org/rest/bundles/0x82f0c47f6d3e4462ae9ef8ea39eebe64/"
I downloaded the "original" dataset from
https://worksheets.codalab.org/rest/bundles/0x82f0c47f6d3e4462ae9ef8ea39eebe64/
and confirmed the duplicate entries are present there as well:
>>> import json
>>>
>>> for subset in ['icliniq', 'healthcaremagic']:
...     for split in ['train', 'valid', 'test']:
...         # Each file holds its records under the top-level 'data' key
...         with open(f'{subset}/{split}.json') as f:
...             x = json.load(f)
...         # Compare the raw question count with the number of distinct questions
...         n_total = len([xx['src'] for xx in x['data']])
...         n_unique = len(set([xx['src'] for xx in x['data']]))
...         print(f'{subset} {split}: {n_total} total, {n_unique} unique')
gives
icliniq train: 24851 total, 16573 unique # dups
icliniq valid: 3105 total, 2087 unique # dups
icliniq test: 3108 total, 2069 unique # dups
healthcaremagic train: 181122 total, 181112 unique # dups
healthcaremagic valid: 22641 total, 22641 unique
healthcaremagic test: 22642 total, 22642 unique
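For anyone who wants to dig further, here is a minimal sketch of how the duplicates could be listed and dropped. It assumes the same layout as the snippet above (a list of records, each with a string `src` field). The helper names `duplicate_summary` and `dedupe`, the `tgt` field, and the toy records are made up for illustration and are not part of HELM or the dataset.

```python
from collections import Counter


def duplicate_summary(records, key="src"):
    """Return (n_total, n_unique, dups), where dups maps each repeated key value to its count."""
    counts = Counter(r[key] for r in records)
    dups = {k: c for k, c in counts.items() if c > 1}
    return len(records), len(counts), dups


def dedupe(records, key="src"):
    """Keep the first record for each distinct key value, preserving the original order."""
    seen = set()
    kept = []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            kept.append(r)
    return kept


# Toy records standing in for x['data']; the contents are invented.
toy = [
    {"src": "q1", "tgt": "a1"},
    {"src": "q2", "tgt": "a2"},
    {"src": "q1", "tgt": "a3"},
]

n_total, n_unique, dups = duplicate_summary(toy)
print(f"{n_total} total, {n_unique} unique, repeated: {dups}")
print(dedupe(toy))
```

Note that keeping the first occurrence throws away any later record that repeats the question but carries a different answer, so this may or may not be the right policy for the benchmark.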
(I'm not sure what the data hosted at CodaLab is or where it comes from. Possibly it comes from the BioBART preprocessing? Unfortunately, the original MedDialog link is dead: https://github.com/UCSD-AI4H/Medical-Dialogue-System)