AISE-TUDelft/FORGE-ds-intermediate

The Heap

A contamination-free code dataset for evaluating and investigating LLM behavior.

HuggingFace

Layout

The code folder contains the code needed to reproduce the dataset.

The repositories folder lists all repositories we used to generate the dataset.

Running the code

Code Collection

  1. We start by scraping repositories from GitHub based on their creation date, license, and number of stars, using repo_extract.js.
  2. We extract all files corresponding to the selected language from each repository, using extract_files.py.
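The file-extraction step can be pictured with a minimal sketch. It is not the code in extract_files.py: the `LANG_EXTENSIONS` mapping and the `collect_language_files` helper are hypothetical names introduced here, and the real extension list lives in langs_extension.json, whose schema is not shown in this README.

```python
from pathlib import Path

# Illustrative mapping only; the project's actual extensions are defined in
# langs_extension.json, whose structure is assumed, not reproduced, here.
LANG_EXTENSIONS = {
    "Python": {".py"},
    "Raku": {".raku", ".rakumod"},
}


def collect_language_files(repo_root, language):
    """Return all files under repo_root whose suffix belongs to the language."""
    extensions = LANG_EXTENSIONS[language]
    root = Path(repo_root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in extensions
    )
```

Matching on the lowercased suffix is a design choice of this sketch, so that a file named `b.PY` is still treated as Python.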

Exact Deduplication

Exact deduplication relies on Unix (Ubuntu) tools; their names and availability may differ depending on the OS.

  1. First, we run hash_entries.py to calculate the hashes of all files in our dataset and in the other datasets, and save them to a text file.
  2. We generate lists of unique hashes for our dataset and for the other dataset, using exact_dedup_hashes_self.py.
  3. We merge the two sets of hashes and record the duplicates, using exact_dedup_hashes_other.py.
  4. We flag duplicates in our dataset with respect to other datasets using exact_dedup_dataset.py.
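The steps above can be sketched in a few lines of Python. This is an in-memory illustration, not the repository's scripts: the choice of SHA-256 and the helper names `content_hash`, `drop_internal_duplicates`, and `flag_exact_duplicates` are assumptions made for the example.

```python
import hashlib


def content_hash(text):
    """Hash file content. SHA-256 is an assumption; the scripts may use another hash."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_internal_duplicates(files):
    """Keep the first occurrence of each distinct content hash, preserving order."""
    seen = set()
    unique = []
    for text in files:
        digest = content_hash(text)
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def flag_exact_duplicates(our_files, other_files):
    """Return one boolean per file in our_files: True if it also occurs in other_files."""
    other_hashes = {content_hash(text) for text in other_files}
    return [content_hash(text) in other_hashes for text in our_files]


ours = drop_internal_duplicates(["a = 1\n", "b = 2\n", "a = 1\n", "c = 3\n"])
flags = flag_exact_duplicates(ours, ["b = 2\n", "z = 0\n"])
```

Here `flags` marks only the second remaining file as a duplicate. Files are flagged rather than removed, which matches the boolean-mask approach described under "Using the dataset".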

Near Deduplication

  1. We generate and save the LSH object containing all the MinHashes of our exact-deduplicated dataset, using lsh_creation.py.
  2. Using the LSH object, we perform near deduplication against other public datasets, using near_dedup.py.
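To make the idea concrete, here is a small pure-Python sketch of MinHash signatures with banded LSH lookup. It is not the code in lsh_creation.py or near_dedup.py: the shingle size, number of hash functions, band count, similarity threshold, and all function names are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64     # hash functions per signature (assumed value)
BANDS = 16        # LSH bands; each band covers NUM_PERM // BANDS rows
SHINGLE_SIZE = 5  # whitespace-separated tokens per shingle (assumed value)


def shingles(text, size=SHINGLE_SIZE):
    """Split text into overlapping token windows; short texts form one shingle."""
    tokens = text.split()
    if len(tokens) < size:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


def _hash64(shingle, seed):
    """64-bit hash of a shingle, salted with the seed to emulate a hash family."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(text):
    """Signature: for each seed, the minimum hash value over all shingles."""
    items = shingles(text)
    return tuple(min(_hash64(s, seed) for s in items) for seed in range(NUM_PERM))


def band_keys(signature):
    """Cut the signature into BANDS chunks; equal chunks land in the same bucket."""
    rows = NUM_PERM // BANDS
    return [(b, signature[b * rows:(b + 1) * rows]) for b in range(BANDS)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM


def flag_near_duplicates(our_docs, other_docs, threshold=0.7):
    """Index our documents, query with the others, return {doc_id: is_near_duplicate}."""
    signatures = {doc_id: minhash(text) for doc_id, text in our_docs.items()}
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in band_keys(sig):
            index[key].add(doc_id)
    flagged = set()
    for text in other_docs:
        query = minhash(text)
        candidates = set()
        for key in band_keys(query):
            candidates |= index.get(key, set())
        flagged |= {d for d in candidates
                    if estimated_jaccard(signatures[d], query) >= threshold}
    return {doc_id: doc_id in flagged for doc_id in our_docs}


our_docs = {"a": "def f(x):\n    return x + 1", "b": "import os\nprint(os.getcwd())"}
near_flags = flag_near_duplicates(our_docs, ["def f(x):  return x + 1"])
```

In this toy run, document `a` differs from the query only in whitespace, so it is flagged, while `b` is not. The index is built once over our own files and then queried with files from another dataset, mirroring the two steps above.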

Using the dataset

To keep as much data as possible available, we do not remove duplicates from The Heap. Instead, we add boolean masks that mark, for each reference dataset, which files are exact or near duplicates, so that you can filter for the files that are unique with respect to that dataset.

Using the Datasets API, our dataset can be used as follows:

from datasets import load_dataset

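# Reference dataset to filter against, and the language subset to load.
dataset_name = 'redpajama'
language = 'Python'

ds = load_dataset(
    "WizzF/Heap-Forge",
    language,
    split="train",
    num_proc=16
)

# Keep only files that are neither exact nor near duplicates of the reference dataset.
ds = ds.filter(lambda x: not x[f'exact_duplicates_{dataset_name}'] and not x[f'near_duplicates_{dataset_name}'])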
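If you need files that are unique with respect to several datasets at once, the same mask columns can be combined. The sketch below uses plain dictionaries in place of dataset rows so it runs without downloading anything. The helper name `is_unique` and the second dataset name `other_dataset` are hypothetical; only the `exact_duplicates_<name>` and `near_duplicates_<name>` column pattern comes from the example above.

```python
def is_unique(row, dataset_names):
    """True if the row is neither an exact nor a near duplicate in any listed dataset."""
    return not any(
        row[f"exact_duplicates_{name}"] or row[f"near_duplicates_{name}"]
        for name in dataset_names
    )


# Toy rows standing in for dataset records; "other_dataset" is a made-up name.
rows = [
    {"id": 1, "exact_duplicates_redpajama": False, "near_duplicates_redpajama": False,
     "exact_duplicates_other_dataset": False, "near_duplicates_other_dataset": False},
    {"id": 2, "exact_duplicates_redpajama": False, "near_duplicates_redpajama": True,
     "exact_duplicates_other_dataset": False, "near_duplicates_other_dataset": False},
    {"id": 3, "exact_duplicates_redpajama": False, "near_duplicates_redpajama": False,
     "exact_duplicates_other_dataset": True, "near_duplicates_other_dataset": False},
]

names = ["redpajama", "other_dataset"]
unique_ids = [row["id"] for row in rows if is_unique(row, names)]
```

With a loaded dataset, the same predicate could be passed to the filter call, for example `ds.filter(lambda row: is_unique(row, names))`, provided the named mask columns exist for the chosen language subset.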

Acknowledgements

We extended the collection of programming language extensions used for The Stack, in the file langs_extension.json. We added the EJS, Raku, Starlark, and WebAssembly languages.
