This repository is still under construction.
This repository enables the community to use MGPHot in further research without redistributing restricted files.
The license of the original dataset forbids redistribution of derivative files, and the dataset itself does not provide audio.
Therefore, this repository does not include `gene_values` or any audio.
Instead, you will:
- Reconstruct the three canonical indices locally.
- Verify each index with MD5 checksums.
- Collect the audio for each track from public sources and verify the files.
What we provide:
- Get the data in two steps: `python reconstruct.py` and `python download_audio.py`.
- `data_preparation/`: scripts to collect audio and build the indices.
- `evaluation_probes/`: code to train lightweight models for evaluation.
Compliance note: do not upload reconstructed indices or audio to this repository or any online service. The goal is reproducible use of MGPHot while respecting the original license.
You will obtain three JSON index files:
- `genome_index_split.json`
  Task: regression on `gene_values` (continuous targets).
- `genome_index_split_positive.json`
  Task: positive music autotagging (binary tags from thresholds over `gene_values`).
- `genome_index_split_negative.json`
  Task: negative music autotagging (complement of the positive tags).

Each index already includes the train/validation/test split in the `split` field.
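As a rough illustration of how the `split` field might be consumed, the sketch below groups the entries of an index by split. The layout assumed here (a JSON object mapping a track ID to an entry that carries a `split` key), the track IDs, and the split names are assumptions for illustration and are not confirmed by this repository; inspect a reconstructed index and adapt the code accordingly.

```python
import json
from collections import defaultdict


def load_index(path):
    """Load an index file such as genome_index_split.json."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def group_by_split(index):
    """Group track IDs by the value of each entry's "split" field.

    Assumes `index` is a dict of {track_id: entry} and that every entry is a
    dict with a "split" key. This layout is an assumption.
    """
    groups = defaultdict(list)
    for track_id, entry in index.items():
        groups[entry["split"]].append(track_id)
    return dict(groups)


# Toy entries with hypothetical IDs and values, for illustration only.
toy_index = {
    "track_a": {"split": "train", "gene_values": [0.1, 0.9]},
    "track_b": {"split": "test", "gene_values": [0.4, 0.2]},
    "track_c": {"split": "train", "gene_values": [0.7, 0.3]},
}
splits = group_by_split(toy_index)
print({name: len(ids) for name, ids in splits.items()})
```

With a real file, `group_by_split(load_index("genome_index_split.json"))` would be the equivalent call, provided the assumed layout holds.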
MD5 files are used to guarantee that every index is canonical in content and formatting.
Run the reconstruction script.
python reconstruct.py
It will:
- download the Zenodo TSV with `gene_values`,
- rebuild the base index with `gene_values`,
- generate positive and negative indices,
- compare each output with its reference MD5,
- print a short report with dashed separators.
Outputs created (plus their `.md5` files):
- `genome_index_split.json`
- `genome_index_split_positive.json`
- `genome_index_split_negative.json`
If an MD5 does not match, the script reports the mismatch explicitly. A matching MD5 indicates that the file is byte-for-byte identical to the reference, including field order, indentation, and the trailing newline.
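To re-check an index by hand, a minimal sketch along the following lines may help. It is not the code used by `reconstruct.py`; the helper names are hypothetical, and it assumes that the first whitespace-separated token of each `.md5` file is the hex digest (which covers both a bare digest and `md5sum`-style output).

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(json_path, md5_path):
    """Compare a file's MD5 with the digest stored in a reference file.

    Assumes the first whitespace-separated token in the reference file is
    the hex digest; the actual .md5 layout in this repository may differ.
    """
    expected = Path(md5_path).read_text(encoding="utf-8").split()[0].lower()
    return md5_of_file(json_path) == expected
```

Because the digest is computed over raw bytes, re-serializing the same JSON content with a different indentation or without the final newline would produce a different MD5, so the files should be compared exactly as written by the script.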
python download_audio.py
- `data_preparation/`: Clean and reliable process to obtain YouTube links and to build the indices.
- `download_audio/`: Scripts to download and verify all audio.
- `evaluation_probes/`: Training and evaluation code for the benchmark (regression and autotagging probes).
- `reconstruct.py`: Rebuilds the three indices and verifies MD5 for each.
- `genome_positive.py` / `genome_negative.py`: Convert `gene_values` to positive and negative tags.
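Conceptually, the conversion from `gene_values` to tags could look roughly like the sketch below. The threshold of 0.5, the `>=` comparison, the function names, and the reading of "complement" as flipping each binary tag are all assumptions made for illustration; the actual rules are defined in `genome_positive.py` and `genome_negative.py`.

```python
def to_positive_tags(gene_values, threshold=0.5):
    """Binarize continuous values: 1 where value >= threshold, else 0.

    The threshold value and the comparison are hypothetical.
    """
    return [1 if value >= threshold else 0 for value in gene_values]


def to_negative_tags(positive_tags):
    """Flip each binary tag (one possible reading of "complement")."""
    return [1 - tag for tag in positive_tags]


# Toy values, for illustration only.
positive = to_positive_tags([0.1, 0.5, 0.9])
negative = to_negative_tags(positive)
print(positive, negative)
```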
Audio download is semi-automatic. If you find a wrong or broken link, please open an issue and include:
- Artist name: <Artist>
- Track title: <Title>
- Old YouTube URL: <https://www.youtube.com/watch?v=...>
- Old YouTube ID: <...>
- New YouTube URL: <https://www.youtube.com/watch?v=...>
- New YouTube ID: <...>
- Notes (optional): <...>
If you use this repository in research, please cite the paper:
and the original dataset:
Code and index definitions are released for research and non-commercial use. See the `LICENSE` file for details.