Unclear on dedupe outputs for decontamination

Hi all,

Thank you for a great tool. I am currently using it to check for train/test overlap. More concretely, I am following the recommended decontamination pipeline: I first read in a shard of the training set and build a bloom filter from it, then read in the MMLU test set and run the same bloom filter over it in read-only mode.

However, I am seeing some strange output in the attributes file. I am not sure what to make of the spans where the text is tagged with a value less than 1. How am I supposed to interpret these? Previously, I would only see a value of 1 when there was a match for text that needed to be decontaminated, and an empty attributes file otherwise.

I am using paragraphs mode with the ngram config. I have set the ngram length to 8, the stride to 0, and the overlap_threshold to 0.7. Please let me know if anything is unclear!
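To make the question concrete, here is a minimal sketch of how I imagine a fractional score could arise with this config. This is my guess, not the tool's actual implementation: I assume whitespace tokenization, a plain Python set standing in for the bloom filter, a score equal to the fraction of a paragraph's n-grams already seen, and a `>=` comparison against the threshold. The function names are hypothetical and the stride setting is ignored.

```python
def build_seen(text, n=8):
    """Collect every n-gram of `text` into a set (stand-in for the bloom filter)."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(paragraph, seen, n=8):
    """Fraction of the paragraph's n-grams that are present in `seen`."""
    tokens = paragraph.split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not grams:
        # Paragraph shorter than n tokens: nothing to compare.
        return 0.0
    hits = sum(1 for gram in grams if gram in seen)
    return hits / len(grams)


def tag_paragraph(paragraph, seen, n=8, overlap_threshold=0.7):
    """Return the score if it reaches the threshold, else None (assumed behavior)."""
    score = overlap_fraction(paragraph, seen, n)
    return score if score >= overlap_threshold else None


# Toy data: the "training" text and a "test" paragraph that differs in one token.
train_text = "a b c d e f g h i j"
test_paragraph = "a b c d e f g h i x"
seen = build_seen(train_text, n=8)
score = overlap_fraction(test_paragraph, seen, n=8)  # 2 of 3 n-grams match
```

Under these assumptions, a paragraph would be tagged with any value between the threshold and 1, not only with exactly 1, which would match what I am seeing.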

```jsonl
{"attributes":{"starcoder_overlap":[[0,56,0.8125]]},"id":"cais/mmlu-all-dev-decontamination-0"}
{"attributes":{"starcoder_overlap":[]},"id":"cais/mmlu-all-dev-decontamination-1"}
{"attributes":{"starcoder_overlap":[[0,136,0.949999988079071]]},"id":"cais/mmlu-all-dev-decontamination-2"}
{"attributes":{"starcoder_overlap":[[0,138,0.9090909361839294]]},"id":"cais/mmlu-all-dev-decontamination-3"} 
```
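For reference, this is a small script I used to inspect the records above. It only parses the attributes shown in this issue and converts each score to the nearest fraction with a small denominator. The `summarize` helper and the denominator cap of 100 are my own choices, not part of the tool.

```python
import json
from fractions import Fraction

# The four attribute records pasted above, verbatim.
ATTRIBUTES_JSONL = """
{"attributes":{"starcoder_overlap":[[0,56,0.8125]]},"id":"cais/mmlu-all-dev-decontamination-0"}
{"attributes":{"starcoder_overlap":[]},"id":"cais/mmlu-all-dev-decontamination-1"}
{"attributes":{"starcoder_overlap":[[0,136,0.949999988079071]]},"id":"cais/mmlu-all-dev-decontamination-2"}
{"attributes":{"starcoder_overlap":[[0,138,0.9090909361839294]]},"id":"cais/mmlu-all-dev-decontamination-3"}
"""


def summarize(lines, key="starcoder_overlap", max_denominator=100):
    """Return (id, start, end, score, nearest small fraction) for each tagged span."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        for start, end, score in record["attributes"].get(key, []):
            # Scores look like float32 values; snap them to a nearby simple ratio.
            ratio = Fraction(score).limit_denominator(max_denominator)
            rows.append((record["id"], start, end, score, ratio))
    return rows


rows = summarize(ATTRIBUTES_JSONL.splitlines())
for doc_id, start, end, score, ratio in rows:
    print(f"{doc_id}: chars [{start}, {end}) score={score:.4f} ~ {ratio}")
```

The three scores come out as 13/16, 19/20, and 10/11. They are all ratios of small integers above my overlap_threshold of 0.7, which makes me suspect the value is a fraction of matched n-grams rather than a binary flag, but I would appreciate confirmation.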

Unclear on dedupe outputs for decontamination #264