Description
Hi all,
Thank you for a great tool. I am currently using it to check for train/test overlap. More concretely, I am following the recommended decontamination pipeline: I first read in a shard from the training set and create a bloom filter, then read in the MMLU test set and run the bloom filter over it in read-only mode.
However, I am seeing some strange outputs in the attributes file. I am not sure what to make of the spans where the text is tagged with values less than 1. How am I supposed to interpret these? Previously, I would only see a value of 1 when there was a match for text that needed to be decontaminated, and an empty attributes list otherwise.
I am using paragraphs mode with the ngram config. I have set the ngram length to 8, the stride to 0, and the overlap_threshold to 0.7. Please let me know if anything is unclear!
{"attributes":{"starcoder_overlap":[[0,56,0.8125]]},"id":"cais/mmlu-all-dev-decontamination-0"}
{"attributes":{"starcoder_overlap":[]},"id":"cais/mmlu-all-dev-decontamination-1"}
{"attributes":{"starcoder_overlap":[[0,136,0.949999988079071]]},"id":"cais/mmlu-all-dev-decontamination-2"}
{"attributes":{"starcoder_overlap":[[0,138,0.9090909361839294]]},"id":"cais/mmlu-all-dev-decontamination-3"}