DDMAL/linkedmusic-datalake

LinkedMusic Data Lake

This repository contains code, documentation, and sample data set files to:

  • Fetch data dumps from various databases in various file formats.
  • Reconcile entries in these databases against entities and properties in Wikidata.
  • Transform reconciled databases into RDF Turtle format.
  • Upload the RDF files to Virtuoso.
  • Generate visuals of the data lake ontology.
  • Test and validate the data lake through benchmark SPARQL queries.
  • Use LLMs to generate SPARQL from natural-language questions (NLQ2SPARQL), with the aid of a custom prompt engineering context.

Refer to the wiki for a general overview of our current pipeline for adding a new dataset.

Repository Structure

In this repository, you'll find a folder per database (listed below), with the following subdirectories:

  • data/: Contains the data files. Most data files are omitted from the repository because of their size.
  • doc/: Contains documentation files.
  • jsonld_approach/: Contains files related to the now-discontinued JSON-LD approach.
  • openrefine/: Contains history and export files for OpenRefine.
  • src/: Contains scripts.

There is also a shared/ folder in the repository root, which contains resources and utilities used across the different database scripts. Currently, the only shared resources are the general RDF conversion script and a Wikidata API client.
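As a rough illustration of what a Wikidata lookup involves, the sketch below builds a request URL for Wikidata's `wbsearchentities` API action and pulls the QIDs out of a response. It is not the client in shared/; the function names `build_search_url` and `extract_qids` and the default parameters are assumptions made for this example.

```python
from urllib.parse import urlencode

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(label: str, language: str = "en", limit: int = 5) -> str:
    """Build a wbsearchentities URL that searches Wikidata items by label."""
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "limit": limit,
        "format": "json",
    }
    return f"{WIKIDATA_API}?{urlencode(params)}"


def extract_qids(response: dict) -> list[str]:
    """Return the candidate QIDs from a parsed wbsearchentities response."""
    return [match["id"] for match in response.get("search", [])]


# Hypothetical usage: the URL can be fetched with any HTTP client.
url = build_search_url("Hildegard of Bingen")
```

The response handling is kept separate from the URL building so that either half can be tested without network access.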

Finally, the poetry.lock and pyproject.toml files manage the project's Python dependencies and packaging, and are located in the repository root.

Database Introductions

The following datasets are currently at least partially integrated into our data lake.

Refer to the wiki for more details on the project status, including completed work, work in progress, and future directions.

DIAMM

The Digital Image Archive of Medieval Music (DIAMM) is an archive of digital images of European medieval manuscripts. We use a web crawler to fetch metadata from the DIAMM site and use custom scripts to convert the JSON data to CSV, and then to RDF. See the DIAMM manual for more information.
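The JSON-to-CSV step can be sketched roughly as follows. This is a minimal example using only the Python standard library, not the project's actual script; the field names (`pk`, `shelfmark`, `composers`) and the choice to join list values with "; " are assumptions made for illustration.

```python
import csv
import io
import json


def records_to_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Serialize flat JSON objects as CSV text, keeping only `fieldnames`."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for record in records:
        row = {}
        for name in fieldnames:
            value = record.get(name, "")
            # CSV cells are scalar, so list values are joined into one string.
            if isinstance(value, list):
                value = "; ".join(str(item) for item in value)
            row[name] = value
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical crawled record; real DIAMM records have different fields.
sample = json.loads(
    '[{"pk": 1, "shelfmark": "MS 1", "composers": ["A", "B"], "extra": true}]'
)
csv_text = records_to_csv(sample, ["pk", "shelfmark", "composers"])
```

Fields not listed in `fieldnames` (here `extra`) are dropped, which keeps the CSV columns stable even when crawled records vary.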

Dig That Lick (DTL1000)

Dig That Lick is a project that extracts and analyses solos from jazz performances. See the Dig That Lick documentation for more information.

The Global Jukebox

The Global Jukebox focuses on traditional folk, indigenous, and popular songs from around the world. Its data can be found on The Global Jukebox GitHub. See The Global Jukebox manual for more information.

MusicBrainz

MusicBrainz is an open music encyclopedia that provides extensive music metadata and serves as a universal reference for music identification.
MusicBrainz has a public data set download site. We retrieve those data sets in JSON Lines format and process them using the RDFLib package for Python. See the MusicBrainz manual for more information.
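To show the shape of the JSON Lines to RDF step, here is a minimal sketch that reads one JSON object per line and emits N-Triples. It deliberately uses only the standard library and writes the triples by hand instead of going through RDFLib, so it is not the project's conversion script. It assumes each record has `id` and `name` keys and that `rdfs:label` is an acceptable predicate; the real mapping may differ.

```python
import json
from typing import Iterable, Iterator

MB_ARTIST = "https://musicbrainz.org/artist/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text: str) -> str:
    """Escape the characters that N-Triples forbids inside string literals."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def jsonl_to_ntriples(lines: Iterable[str]) -> Iterator[str]:
    """Yield one label triple per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the dump
        record = json.loads(line)
        subject = f"<{MB_ARTIST}{record['id']}>"
        label = escape_literal(record["name"])
        yield f'{subject} <{RDFS_LABEL}> "{label}" .'


# Hypothetical record; the id is a placeholder, not a real MBID.
triples = list(jsonl_to_ntriples(['{"id": "mbid-123", "name": "Example Artist"}']))
```

Because the function is a generator, a large dump can be streamed line by line from an open file without loading it into memory.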

The Session

The Session is a community website dedicated to Irish traditional music. The Session has a public GitHub repository that contains its public data sets. We retrieve these in CSV format and reconcile them using OpenRefine. See The Session manual for more information.

RISM

The Répertoire International des Sources Musicales (RISM) is an international collaborative database that catalogues historical musical sources. It provides detailed information on manuscripts, prints, and other music-related documents, and serves researchers, librarians, and musicologists who study and reference historical musical materials. RISM provides us with its complete data sets in RDF format. We use OpenRefine to reconcile the database against Wikidata. Refer to the RISM manual for more details.

Cantus DB

Cantus Database is a repository of Latin chants found in medieval manuscripts and early printed books.
Cantus DB provides us with sample data sets in CSV format. Refer to the Cantus DB manual for details.

SIMSSA DB

SIMSSA Database is a discovery tool for symbolic music files (MEI, Kern, MusicXML, MIDI). It evolved from an earlier database developed under Julie Cumming’s Digging into Data grant and offers improved functionality. Integration work is still in progress. Refer to the SIMSSA DB manual for further instructions.

wikidata_stamp

About

To create mapping strategies for various music databases into our data lake
