This repository enables the community to use MGPHot in further research without redistributing restricted files.

MTG/MGPHot-audio

Extended Metadata for MGPHot (audio links and more!)

This repository is still under construction.

Purpose of this repository

This repository enables the community to use MGPHot in further research without redistributing restricted files. The license of the original dataset forbids redistribution of derivative files and does not provide audio. Therefore, this repository does not include gene_values or any audio.

Instead, you will:

  1. Reconstruct the three canonical indices locally.
  2. Verify each index with MD5 checksums.
  3. Collect the audio for each track from public sources and verify the files.

What we provide:

  • A two-step workflow to get the data: run python reconstruct.py, then python download_audio.py.
  • data_preparation/: scripts to collect audio and build the indices.
  • evaluation_probes/: code to train lightweight models for evaluation.

Compliance note: do not upload reconstructed indices or audio to this repository or any online service. The goal is reproducible use of MGPHot while respecting the original license.

What you reconstruct

You will obtain three JSON index files:

  1. genome_index_split.json Task: regression on gene_values (continuous targets).

  2. genome_index_split_positive.json Task: positive music autotagging (binary tags from thresholds over gene_values).

  3. genome_index_split_negative.json Task: negative music autotagging (complement of the positive tags).

Each index already includes the train/validation/test split in the field split. MD5 files are used to guarantee that every index is canonical in content and formatting.
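As a quick sanity check after reconstruction, you can count how many entries fall into each split. The snippet below is a minimal sketch, not part of this repository: it assumes the index is either a list of entry dictionaries or a dictionary mapping track IDs to entry dictionaries, and that the split labels are the strings "train", "validation", and "test". Only the field name split comes from the description above; the helper names and the toy data are hypothetical.

```python
import json
from collections import Counter


def load_index(path):
    """Load a reconstructed index file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_splits(index):
    """Count entries per value of the `split` field.

    Assumption: the index is a list of entry dicts, or a dict that maps
    track IDs to entry dicts. Adapt this if the real layout differs.
    """
    entries = index.values() if isinstance(index, dict) else index
    return Counter(entry["split"] for entry in entries)


# Toy data standing in for a reconstructed index (not real MGPHot content).
toy_index = {
    "track_a": {"split": "train"},
    "track_b": {"split": "train"},
    "track_c": {"split": "validation"},
    "track_d": {"split": "test"},
}
split_counts = count_splits(toy_index)
print(dict(split_counts))
```

With a real file, the call would look like count_splits(load_index("genome_index_split.json")).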

How to reconstruct

Run the reconstruction script.

python reconstruct.py

It will:

  • download the Zenodo TSV with gene_values,
  • rebuild the base index with gene_values,
  • generate positive and negative indices,
  • compare each output with its reference MD5,
  • print a short report with dashed separators.

Outputs created (plus their .md5 files):

  • genome_index_split.json
  • genome_index_split_positive.json
  • genome_index_split_negative.json

If an MD5 does not match, the script reports the mismatch explicitly. A matching MD5 means the file is byte-for-byte identical to the reference, including field order, indentation, and the trailing newline policy.
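The following sketch illustrates why the check is sensitive to formatting and not only to content. It is not the code used by reconstruct.py. The helper names are hypothetical, and matches_reference assumes that a .md5 file starts with the hexadecimal digest (as in md5sum output), which may differ from the format used here.

```python
import hashlib
import json


def md5_of_bytes(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of a byte string."""
    return hashlib.md5(data).hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large indices need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_reference(path, md5_path):
    """Compare a file against a reference digest.

    Assumption: the .md5 file begins with the hex digest (md5sum style).
    """
    with open(md5_path, encoding="utf-8") as f:
        expected = f.read().split()[0].lower()
    return md5_of_file(path) == expected


# Three serializations of the same toy record: same content, different bytes.
record = {"split": "train", "id": "x"}
compact = json.dumps(record).encode("utf-8")
indented = (json.dumps(record, indent=2) + "\n").encode("utf-8")
reordered = json.dumps({"id": "x", "split": "train"}).encode("utf-8")

same_content = json.loads(compact) == json.loads(indented) == json.loads(reordered)
distinct_digests = len(
    {md5_of_bytes(compact), md5_of_bytes(indented), md5_of_bytes(reordered)}
)
print(same_content, distinct_digests)
```

All three byte strings parse to the same JSON object, yet each has a different digest, which is why a canonical writer (fixed key order, indentation, and trailing newline) is needed for the checksums to match.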

Download the audio

python download_audio.py

Repository layout

  • data_preparation/ — Clean and reliable process to obtain YouTube links and to build the indices.
  • download_audio/ — Scripts to download and verify all audio.
  • evaluation_probes/ — Training and evaluation code for the benchmark (regression and autotagging probes).
  • reconstruct.py — Rebuilds the three indices and verifies MD5 for each.
  • genome_positive.py / genome_negative.py — Convert gene_values to positive and negative tags.
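To make the relationship between the three indices concrete, here is a minimal sketch of turning continuous gene_values into binary tags. It is not the logic of genome_positive.py or genome_negative.py. The single threshold of 0.5, the >= comparison, and the reading of "complement" as flipping each positive tag are all assumptions made for illustration; the actual scripts may use different or per-gene thresholds.

```python
def to_positive_tags(gene_values, threshold=0.5):
    """Binarize continuous values: 1 where the value reaches the threshold.

    Assumption: one global threshold; the real scripts may differ.
    """
    return [1 if value >= threshold else 0 for value in gene_values]


def to_negative_tags(positive_tags):
    """Flip each positive tag (assumed meaning of 'complement')."""
    return [1 - tag for tag in positive_tags]


# Toy values, not taken from MGPHot.
gene_values = [0.1, 0.5, 0.9, 0.3]
positive = to_positive_tags(gene_values)
negative = to_negative_tags(positive)
print(positive, negative)
```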

Contribute

Audio download is semi-automatic. If you find a wrong or broken link, please open an issue.

Please include:

- Artist name: <Artist>
- Track title: <Title>
- Old YouTube URL: <https://www.youtube.com/watch?v=...>
- Old YouTube ID: <...>
- New YouTube URL: <https://www.youtube.com/watch?v=...>
- New YouTube ID: <...>
- Notes (optional): <...>
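If you want to fill in the ID fields without copying them by hand, the ID can be read from the v query parameter of a watch URL. This is a small optional helper, not part of the repository. The function name is hypothetical, and it only handles URLs of the https://www.youtube.com/watch?v=... form shown above.

```python
from urllib.parse import parse_qs, urlparse


def youtube_id_from_url(url):
    """Return the value of the `v` query parameter, or None if absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


# Placeholder ID for illustration only.
example_id = youtube_id_from_url("https://www.youtube.com/watch?v=PLACEHOLDER&t=10s")
print(example_id)
```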

Citation

If you use this repository in research, please cite the paper:

and the original dataset:

License

Code and index definitions are released for research and non‑commercial use. See the LICENSE file for details.
