A contamination-free code dataset for the evaluation and investigation of LLM behavior.
We provide the code to reproduce the dataset in the code folder.
The repositories folder contains a list of all repositories we used to generate the dataset.
- We start by scraping repositories from GitHub based on their creation date, license, and number of stars, using repo_extract.js.
- We extract all files corresponding to the selected language from each repository, using extract_files.py.
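
The extraction step roughly amounts to walking each cloned repository and keeping the files whose extensions belong to the selected language. The sketch below illustrates the idea; the extension set, error handling, and JSONL output format are illustrative assumptions, not the exact behavior of extract_files.py.

```python
# Minimal sketch of the file-extraction step (illustrative; the real logic lives in extract_files.py).
import json
from pathlib import Path

PYTHON_EXTENSIONS = {".py"}  # assumed extension set for the selected language

def extract_language_files(repo_dir: str, out_path: str) -> None:
    """Walk a cloned repository and dump all files matching the language's extensions."""
    records = []
    for path in Path(repo_dir).rglob("*"):
        if path.is_file() and path.suffix in PYTHON_EXTENSIONS:
            try:
                text = path.read_text(encoding="utf-8")
            except UnicodeDecodeError:
                continue  # skip binary or non-UTF-8 files
            records.append({"repo": repo_dir, "path": str(path), "content": text})
    with open(out_path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

# extract_language_files("repos/example-repo", "python_files.jsonl")
```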
To run the exact deduplication, we make use of Unix (Ubuntu) tools; their naming and availability may differ depending on the OS. Sketches of the exact and near deduplication steps appear after the list below.
- First, we run hash_entries.py to calculate all hashes belonging to our dataset and to the other datasets, and save them to a text file.
- We generate lists of the unique hashes of our dataset and of the other dataset, using exact_dedup_hashes_self.py.
- We merge the two sets of hashes and record the duplicates, using exact_dedup_hashes_other.py.
- We flag duplicates in our dataset with respect to other datasets, using exact_dedup_dataset.py.
- We generate and save the LSH object containing all the MinHashes of our exact-deduplicated dataset, using lsh_creation.py.
- Using this LSH object, we perform near deduplication against other public datasets with near_dedup.py.
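
The exact deduplication boils down to hashing each file and flagging files whose hash also occurs in another dataset. The following is a minimal sketch of that idea, assuming a SHA-256 hash over lightly normalized content; the actual normalization and file formats used by hash_entries.py and the exact_dedup_* scripts may differ.

```python
# Sketch of the exact-deduplication idea behind hash_entries.py and the exact_dedup_* scripts.
# The normalization and hash choice are assumptions; only the overall flow follows the list above.
import hashlib

def file_hash(content: str) -> str:
    """Hash the file content after light normalization (assumed: strip per-line whitespace)."""
    normalized = "\n".join(line.strip() for line in content.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def flag_exact_duplicates(our_files: list[str], other_hashes: set[str]) -> list[bool]:
    """Return a boolean mask: True where a file's hash also occurs in the other dataset."""
    return [file_hash(content) in other_hashes for content in our_files]

# ours = ["print('hello')\n", "x = 1\n"]
# other = {file_hash("x = 1\n")}
# flag_exact_duplicates(ours, other)  # -> [False, True]
```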
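
The near deduplication relies on MinHash signatures indexed in an LSH structure. The sketch below uses the datasketch library; the tokenization, number of permutations, and similarity threshold are illustrative assumptions rather than the exact settings of lsh_creation.py and near_dedup.py.

```python
# Sketch of the near-deduplication step with MinHash + LSH (cf. lsh_creation.py and near_dedup.py).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128  # assumed number of permutations

def minhash(content: str) -> MinHash:
    """Build a MinHash signature from whitespace-separated tokens of a file."""
    m = MinHash(num_perm=NUM_PERM)
    for token in content.split():
        m.update(token.encode("utf-8"))
    return m

# Index our exact-deduplicated files into an LSH object (lsh_creation.py step).
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # assumed threshold
our_files = {"file_0": "def add(a, b):\n    return a + b\n"}
for key, content in our_files.items():
    lsh.insert(key, minhash(content))

# Query the LSH index with files from another dataset (near_dedup.py step);
# any returned keys are near-duplicate candidates to flag in our dataset.
other_file = "def add(a, b):\n\n    return a + b\n"  # identical up to whitespace
near_duplicates = lsh.query(minhash(other_file))
print(near_duplicates)  # ['file_0']
```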
To keep as much data as possible available for each dataset, we do not remove duplicates from the dataset. Instead, we add boolean masks to The Heap that allow filtering for files that are unique with respect to each other dataset.
Using the Hugging Face Datasets API, our dataset can be used as follows:
```python
from datasets import load_dataset

dataset_name = 'redpajama'
language = 'Python'

ds = load_dataset(
    "WizzF/Heap-Forge",
    f"{language}",
    split="train",
    num_proc=16
)

ds = ds.filter(lambda x: not x[f'exact_duplicates_{dataset_name}'] and not x[f'near_duplicates_{dataset_name}'])
```
We extended the collection of programming-language extensions used for The Stack in the file langs_extension.json, adding the EJS, Raku, Starlark, and WebAssembly languages.
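
For illustration, the snippet below shows one plausible shape for such a mapping (language name to a list of file extensions); the actual structure of langs_extension.json may differ, and the extension lists are only examples, not the exact contents of the file.

```python
# Hypothetical example of a language-to-extensions mapping like the one in langs_extension.json.
import json

example_mapping = {
    "EJS": [".ejs"],
    "Raku": [".raku", ".rakumod", ".rakutest", ".p6", ".pl6", ".pm6"],
    "Starlark": [".bzl", ".star"],
    "WebAssembly": [".wat", ".wast"],
}

# with open("langs_extension.json", encoding="utf-8") as f:
#     mapping = json.load(f)
print(json.dumps(example_mapping, indent=2))
```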