Duplicate entries in med_dialog dataset #3746

I was looking at the `med_dialog` dataset in MedHELM and noticed that there are a large number of duplicate questions, especially in the `icliniq` split.

Looking at the HELM source code here:
https://github.com/stanford-crfm/helm/blob/79909a2fdbb83953912d9b73b5cc1e86d25ab8f0/src/helm/benchmark/scenarios/med_dialog_scenario.py#L118

I downloaded the "original" dataset from
https://worksheets.codalab.org/rest/bundles/0x82f0c47f6d3e4462ae9ef8ea39eebe64/

and confirmed the duplicate entries are present there as well:
```
import json

for subset in ['icliniq', 'healthcaremagic']:
    for split in ['train', 'valid', 'test']:
        # Each file is a dict whose 'data' list holds the dialog entries.
        with open(f'{subset}/{split}.json') as f:
            x = json.load(f)

        # Compare the number of entries with the number of distinct 'src' values.
        srcs = [xx['src'] for xx in x['data']]
        n_total = len(srcs)
        n_unique = len(set(srcs))
        print(f'{subset} {split}: {n_total} total, {n_unique} unique')
```
This gives (I added `# dups` to the lines that contain duplicates):
```
icliniq train: 24851 total, 16573 unique # dups
icliniq valid: 3105 total, 2087 unique # dups
icliniq test: 3108 total, 2069 unique # dups
healthcaremagic train: 181122 total, 181112 unique # dups
healthcaremagic valid: 22641 total, 22641 unique
healthcaremagic test: 22642 total, 22642 unique
```
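In case it helps with triage, here is a minimal sketch of how the duplicates could be counted and dropped, keyed on `src` as in the check above. The helper names `summarize_duplicates` and `drop_duplicates` are hypothetical (they are not part of HELM), and the `tgt` field in the toy data is only an assumption for illustration; the real files may use other fields.

```python
from collections import Counter


def summarize_duplicates(entries, key="src"):
    """Return (n_total, n_unique, n_extra) for the values stored under `key`."""
    counts = Counter(entry[key] for entry in entries)
    n_total = sum(counts.values())
    n_unique = len(counts)
    return n_total, n_unique, n_total - n_unique


def drop_duplicates(entries, key="src"):
    """Keep only the first entry for each distinct `key` value, preserving order."""
    seen = set()
    kept = []
    for entry in entries:
        if entry[key] not in seen:
            seen.add(entry[key])
            kept.append(entry)
    return kept


# Toy data standing in for x['data']; the 'tgt' field is an assumption.
toy = [
    {"src": "q1", "tgt": "a1"},
    {"src": "q2", "tgt": "a2"},
    {"src": "q1", "tgt": "a1"},
]
print(summarize_duplicates(toy))
print(len(drop_duplicates(toy)))
```

Note that this only treats entries with an identical `src` string as duplicates; near-duplicates or entries that repeat across splits would need a separate check.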

(I'm not sure what the data hosted on CodaLab is or where it comes from. Possibly from the BioBART preprocessing? Unfortunately, the original MedDialog link is dead: https://github.com/UCSD-AI4H/Medical-Dialogue-System)


