A contamination-free code dataset for the evaluation and investigation of LLM behavior.
We provide the code to reproduce the dataset in the code folder.
The repositories folder contains a list of all repositories we used to generate the dataset.
- We start by scraping repositories from GitHub based on their creation date, license, and number of stars, using repo_extract.js.
- We extract all files corresponding to the selected language from each repository, using extract_files.py.
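
The extraction step roughly amounts to walking each cloned repository and keeping the files whose extensions belong to the selected language. The sketch below illustrates the idea; the extension set, error handling, and JSONL output format are illustrative assumptions, not the exact behavior of extract_files.py.

```python
# Minimal sketch of the file-extraction step (illustrative; the real logic lives in extract_files.py).
import json
from pathlib import Path

PYTHON_EXTENSIONS = {".py"}  # assumed extension set for the selected language

def extract_language_files(repo_dir: str, out_path: str) -> None:
    """Walk a cloned repository and dump all files matching the language's extensions."""
    records = []
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in PYTHON_EXTENSIONS:
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # skip binary or non-UTF-8 files
            records.append({"repo": repo_dir, "path": str(path), "content": text})
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# extract_language_files("repos/example-repo", "python_files.jsonl")
```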
To run the exact deduplication, we make use of Unix (Ubuntu) tools; their naming and availability may differ depending on the OS. Sketches of the exact and near deduplication steps appear after the list below.
- First, we run hash_entries.py to calculate all hashes belonging to our dataset and to the other datasets, and save them to a text file.
- We generate lists of the unique hashes of our dataset and of the other dataset, using exact_dedup_hashes_self.py.
- We merge the two sets of hashes and record the duplicates, using exact_dedup_hashes_other.py.
- We flag duplicates in our dataset with respect to other datasets, using exact_dedup_dataset.py.
- We generate and save the LSH object containing all the MinHashes of our exact-deduplicated dataset, using lsh_creation.py.
- Using this LSH object, we perform near deduplication against other public datasets with near_dedup.py.
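
The exact deduplication boils down to hashing each file and flagging files whose hash also occurs in another dataset. The following is a minimal sketch of that idea, assuming a SHA-256 hash over lightly normalized content; the actual normalization and file formats used by hash_entries.py and the exact_dedup_* scripts may differ.

```python
# Sketch of the exact-deduplication idea behind hash_entries.py and the exact_dedup_* scripts.
# The normalization and hash choice are assumptions; only the overall flow follows the list above.
import hashlib

def file_hash(content: str) -> str:
    """Hash the file content after light normalization (assumed: strip per-line whitespace)."""
    normalized = "\n".join(line.strip() for line in content.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def flag_exact_duplicates(our_files: list[str], other_hashes: set[str]) -> list[bool]:
    """Return a boolean mask: True where a file's hash also occurs in the other dataset."""
    return [file_hash(content) in other_hashes for content in our_files]

# ours = ["print('hello')\n", "x = 1\n"]
# other = {file_hash("x = 1\n")}
# flag_exact_duplicates(ours, other)  # -> [False, True]
```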
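
The near deduplication relies on MinHash signatures indexed in an LSH structure. The sketch below uses the datasketch library; the tokenization, number of permutations, and similarity threshold are illustrative assumptions rather than the exact settings of lsh_creation.py and near_dedup.py.

```python
# Sketch of the near-deduplication step with MinHash + LSH (cf. lsh_creation.py and near_dedup.py).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # assumed number of permutations

def minhash(content: str) -> MinHash:
    """Build a MinHash signature from whitespace-separated tokens of a file."""
    m = MinHash(num_perm=NUM_PERM)
    for token in content.split():
        m.update(token.encode("utf-8"))
    return m

# Index our exact-deduplicated files into an LSH object (lsh_creation.py step).
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # assumed threshold
our_files = {"file_0": "def add(a, b):\n    return a + b\n"}
for key, content in our_files.items():
    lsh.insert(key, minhash(content))

# Query the LSH index with files from another dataset (near_dedup.py step);
# any returned keys are near-duplicate candidates to flag in our dataset.
other_file = "def add(a, b):\n\n    return a + b\n"  # identical up to whitespace
near_duplicates = lsh.query(minhash(other_file))
print(near_duplicates)  # ['file_0']
```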
To keep as much data as possible available for each dataset, we do not remove duplicates from the dataset. Instead, we add boolean masks to The Heap that allow filtering for files that are unique with respect to each other dataset.
Using the Hugging Face Datasets API, our dataset can be used as follows:
```python
from datasets import load_dataset

dataset_name = 'redpajama'
language = 'Python'

ds = load_dataset(
    "WizzF/Heap-Forge",
    f"{language}",
    split="train",
    num_proc=16
)

ds = ds.filter(lambda x: not x[f'exact_duplicates_{dataset_name}'] and not x[f'near_duplicates_{dataset_name}'])
```
We extended the collection of programming-language extensions used for The Stack in the file langs_extension.json, adding the EJS, Raku, Starlark, and WebAssembly languages.
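
For illustration, the snippet below shows one plausible shape for such a mapping (language name to a list of file extensions); the actual structure of langs_extension.json may differ, and the extension lists are only examples, not the exact contents of the file.

```python
# Hypothetical example of a language-to-extensions mapping like the one in langs_extension.json.
import json

example_mapping = {
    "EJS": [".ejs"],
    "Raku": [".raku", ".rakumod", ".rakutest", ".p6", ".pl6", ".pm6"],
    "Starlark": [".bzl", ".star"],
    "WebAssembly": [".wat", ".wast"],
}

# with open("langs_extension.json", encoding="utf-8") as f:
#     mapping = json.load(f)
print(json.dumps(example_mapping, indent=2))
```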