This repo supplements our survey: Are We There Yet? A Brief Survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges
- Paper: https://arxiv.org/abs/2406.08809
- Authors: Jaeyong Kang, Dorien Herremans
Abstract: Deep learning models for music have advanced drastically in recent years, but how good are machine learning models at capturing emotion, and what challenges are researchers facing? In this paper, we provide a comprehensive overview of the available music-emotion datasets and discuss evaluation standards as well as competitions in the field. We also offer a brief overview of various types of music emotion prediction models that have been built over the years, providing insights into the diverse approaches within the field. Through this examination, we highlight the challenges that persist in accurately capturing emotion in music, including issues related to dataset quality, annotation consistency, and model generalization. Additionally, we explore the impact of different modalities, such as audio, MIDI, and physiological signals, on the effectiveness of emotion prediction models. Recognizing the dynamic nature of this field, we have complemented our findings with an accompanying GitHub repository. This repository contains a comprehensive list of music emotion datasets and recent predictive models.
A curated list of Datasets and Models for Music Emotion Recognition (MER)
If you find these lists useful, please cite our original paper.
@conference {2024,
title = {Are we there yet? A brief survey of Music Emotion Prediction Datasets, Models and Outstanding Challenges},
booktitle = {arXiv:2406.08809},
year = {2024},
url = {https://arxiv.org/abs/2406.08809},
author = {J. Kang and D. Herremans}
}
Want to contribute your models or datasets? Simply submit a pull request.
Dataset | Year | # of instances | Length | Type | Categorical | Dimensional | Static/Dynamic | Perceived/Induced |
---|---|---|---|---|---|---|---|---|
MoodsMIREX | 2007 | 269 | 30s | MP3 | 5 labels | - | Static | Perceived |
CAL500 | 2007 | 500 | full | MP3 | 174 labels | - | Static | Perceived |
Yang-Dim | 2008 | 195 | 25s | WAV | - | Russell | Static | Perceived |
MoodSwings | 2008 | 240 | 15s | MP3 | - | Russell | Dynamic | Perceived |
NTWICM | 2010 | 2,648 | full | MP3 | - | Russell | Static | Perceived |
Soundtrack | 2011 | 470 | 15s-1m | MP3 | 6 labels | 3 dimensions | Static | Perceived |
MoodSwings Turk | 2011 | 240 | 15s | MP3 | - | Russell | Dynamic | Perceived |
Last.fm subset of MSD | 2011 | 505,216 | full | Metadata only | listener tags | - | Static | Perceived |
DEAP | 2012 | 120 | 60s | YouTube id | - | Russell | Static | Induced |
Panda et al.'s dataset | 2013 | 903 | 30s | MP3, MIDI | 21 labels | - | Static | Perceived |
Soleymani et al.'s dataset | 2013 | 1,000 | 45s | MP3 | - | Russell | Both | Perceived |
CAL500exp | 2014 | 3,223 | 3s-16s | MP3 | 67 labels | - | Static | Perceived |
AMG1608 | 2015 | 1,608 | 30s | WAV | - | Russell | Static | Perceived |
Emotify | 2016 | 400 | 60s | MP3 | GEMS | - | Static | Induced |
Moodo | 2016 | 200 | 15s | WAV | - | Russell | Static | Perceived |
Malheiro et al.'s dataset | 2016 | 200 | 30s | Audio, Lyrics | Quadrants | - | Static | Perceived |
CH818 | 2017 | 818 | 30s | MP3 | - | Russell | Static | Perceived |
MoodyLyrics | 2017 | 2,595 | full | Lyrics | 4 labels | - | Static | Perceived |
4Q-emotion | 2018 | 900 | 30s | MP3 | Quadrants | - | Static | Perceived |
DEAM | 2018 | 2,058 | 45s | MP3 | - | Russell | Both | Perceived |
PMEmo | 2018 | 794 | full | MP3 | - | Russell | Both | Induced |
RAVDESS | 2018 | 1,012 | full | MP3, MP4 | 5 labels | - | Static | Perceived |
DMDD | 2018 | 18,644 | full | Audio, Lyrics | - | Russell | Static | Perceived |
MTG-Jamendo | 2019 | 18,486 | full | MP3 | 56 labels | - | Static | Perceived |
VGMIDI | 2019 | 200 | full | MIDI | - | Russell | Dynamic | Perceived |
Turkish Music Emotion | 2019 | 400 | 30s | MP3 | 4 labels | - | Static | Perceived |
EMOPIA | 2021 | 1,087 | 30s-40s | Audio, MIDI | Quadrants | - | Static | Perceived |
MER500 | 2020 | 494 | 10s | WAV | 5 labels | - | Static | Perceived |
Music4all | 2020 | 109,269 | 30s | WAV | - | 3 dimensions | Static | Perceived |
CCMED-WCMED | 2020 | 800 | 8-20s | WAV | - | Russell | Static | Perceived |
MuSe | 2021 | 90,001 | full | Audio | - | Russell (V-A-D) | Static | Perceived |
HKU956 | 2022 | 956 | full | MP3 | - | Russell | Static | Induced |
MERP | 2022 | 54 | full | WAV | - | Russell | Both | Perceived |
MuVi | 2022 | 81 | full | YouTube id | GEMS | Russell | Both | Perceived |
YM2413-MDB | 2022 | 699 | full | WAV, MIDI | 19 labels | - | Static | Perceived |
MusAV | 2022 | 2,092 | full | WAV | - | Russell | Static | Perceived |
EmoMV | 2023 | 5,986 | 30s | WAV | 6 labels | - | Static | Perceived |
Indonesian Song | 2023 | 476 | full | WAV | 3 labels | - | Static | Perceived |
TROMPA-MER | 2023 | 1,161 | 30s | WAV | 11 labels | - | Static | Perceived |
Music-Mouv | 2023 | 188 | full | Spotify id | GEMS | - | Static | Induced |
ENSA | 2023 | 60 | full | MP3 | - | Russell | Dynamic | Perceived |
EMMA | 2024 | 364 | 30s-60s | WAV | GEMS | - | Static | Induced |
SiTunes | 2024 | 300 | full | WAV | - | Russell | Static | Induced |
MERGE | 2024 | 3,554 | full | Audio, Lyrics | Quadrants | - | Static | Perceived |
Popular Hooks | 2024 | 38,694 | hooks | Video, Audio, Lyrics | Quadrants | - | Static | Perceived |
Affolter and Rohrmeier's dataset | 2024 | 5,892 | full | Spotify id | 8 labels | - | Static | Perceived |
XMIDI | 2025 | 108,023 | full | MIDI | 11 labels | - | Static | Perceived |
- Author: Hu, X., Downie, J.S., Laurier, C., Bay, M., Ehmann, A.F.
- Description and music styles: A selection from the libraries of Associated Production Music (APM), “the world’s leading production music library… offering every imaginable music genre from beautiful classical music recordings to vintage rock to current indie band sounds”.
- Annotation strategy: Each piece was rated by 3 raters, and only the subset on which at least 2 out of 3 agreed was retained.
- Annotation (categorical): Cluster 1 (passionate, rousing, confident, boisterous, rowdy), Cluster 2 (rollicking, cheerful, fun, sweet, amiable/good natured), Cluster 3 (literate, poignant, wistful, bittersweet, autumnal, brooding), Cluster 4 (humorous, silly, campy, quirky, whimsical, witty, wry), Cluster 5 (aggressive, fiery, tense/anxious, intense, volatile, visceral)
- Link: Offline
- Author: Turnbull, D., Barrington, L., Torres, D., Lanckriet G.
- Description and music styles: Songs were picked from the authors' personal collection of western popular music recorded within the last 50 years.
- Annotation strategy: The authors recruited 66 undergraduate students to annotate the CAL500 corpus with semantic concepts from the vocabulary. Participants were paid $10 per one-hour annotation block spent listening to the music.
- Annotation (categorical): 18 emotions, found by Skowronek et al. (2006) to be both important and easy to identify, were rated on a scale from one to three (e.g., not happy, neutral, happy).
- Link: http://calab1.ucsd.edu/~datasets/cal500/
- Author: Yang, Y.-H., Lin, Y.-C, Su, Y.-F, Chen, H.-H.
- Description and music styles: The dataset contains 195 popular songs from Western, Chinese, and Japanese albums.
- Annotation strategy: Subjects (mostly college students) were asked to listen to a subset of the music dataset and to choose two values, each ranging from -1.0 to 1.0 in 11 levels, to indicate the perceived arousal and valence (AV) of each music sample. The ground truth is the mean of the AV values over all subjects. On average, more than ten pairs of AV values were collected for each music sample.
- Annotation (dimensional): Arousal and valence
- Link: http://mac.citi.sinica.edu.tw/~yang/MER/taslp08/#Data
- Author: Kim, Y., Schmidt, E., Emelle, L.
- Description and music styles: The authors used US pop music to collect time-varying perception of emotions.
- Annotation strategy: Two players in a collaborative game used the mouse to annotate each segment over a continuous AV space.
- Annotation (dimensional): Time-continuous arousal and valence annotation (1 Hz)
- Link: Offline
- Author: Schuller, B., Dorfner, J., Gerhard, R.
- Description and music styles: The ground-truth music database was built from the compilation “Now That’s What I Call Music!” (UK series, volumes 1–69, each a double CD). It covers most currently popular music styles, ranging from pop and rock through rap and R&B to electronic dance music such as techno and house.
- Annotation strategy: 4 raters gave static annotations for complete songs for arousal and valence in a discrete range of [−2,−1, 0, 1, 2].
- Annotation (dimensional): Arousal and valence [−2,−1, 0, 1, 2].
- Link: http://openaudio.eu/NTWICM-Mood-Annotation.arff (annotations)
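The annotation file linked above is in ARFF format. A minimal loading sketch, assuming the file keeps a standard dense ARFF layout and that SciPy and pandas are installed:

```python
# Minimal sketch for loading the NTWICM mood annotations.
# Assumes the .arff file from the link above has been downloaded locally.
from scipy.io import arff
import pandas as pd

data, meta = arff.loadarff("NTWICM-Mood-Annotation.arff")
df = pd.DataFrame(data)
print(meta)       # attribute names and types declared in the ARFF header
print(df.head())  # inspect the columns (exact layout depends on the file)
```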
- Author: Eerola, T. & Vuoskoski, J. K.
- Description and music styles: This dataset, which we refer to as "Film soundtracks", consists of film music excerpts that are not widely known (although some could be identified by film aficionados), which reduces familiarity effects.
- Annotation strategy: The excerpts were selected with both dimensional and discrete emotion models in mind and evaluated in a pilot study. The initial ratings were made by 12 expert musicologists for both the dimensional and discrete models; these ratings were then re-tested with 116 university students.
- Annotation: Categorical (6 labels) and dimensional (3 dimensions)
- Link: https://osf.io/p6vkg/
- Author: Speck, J.A., Schmidt, E.M., Morton, B.G. and Kim, Y.E.
- Description and music styles: The authors used US pop music to collect time-varying perception of emotions.
- Annotation strategy: Crowdsourced MTurk.
- Annotation (dimensional): Time-continuous arousal and valence annotation (1 Hz)
- Link: Offline
- Author: Bertin-Mahieux, T., Ellis, D.P.W., Whitman, B., & Lamere, P.
- Description and music styles: A large-scale dataset derived from user-generated tags and similarity data on Last.fm, linked to the Million Song Dataset. Covers a broad range of Western popular music genres including pop, rock, electronic, and hip-hop.
- Annotation strategy: Tags and song similarities were obtained from the Last.fm API. Tags reflect listener-generated annotations, collected via Last.fm’s crowd-based platform. Song similarities were computed by Last.fm based on user listening behavior.
- Annotation: Categorical, based on listener-provided tags. No predefined emotion taxonomy; tag frequency and co-occurrence patterns can be leveraged for emotion-related studies.
- Link: http://millionsongdataset.com/lastfm/
- Author: Koelstra, S., Muehl, C., Soleymani, M., Lee, J.-S., Yazdani, A., Ebrahimi, T., Pun, T., Nijholt, A., Patras, I.
- Description and music styles: From 120 one-minute music video excerpts from YouTube, 60 were manually selected and the remaining 60 were selected via Last.FM affective tags.
- Annotation strategy: The dataset comprises the ratings, physiological recordings, and frontal face video from an experiment in which 32 volunteers (mostly European students) watched a subset of 40 of the above music videos and rated each of them. EEG and peripheral physiological signals were recorded for all participants; frontal face video was recorded for 22 of them.
- Annotation: Dimensional (arousal, valence, and dominance)
- Link: http://www.eecs.qmul.ac.uk/mmv/datasets/deap/
- Author: Panda, R., Malheiro, R., Rocha, B., Oliveira, A., and Paiva, R. P.
- Description and music styles: A multi-modal dataset created from the AllMusic database, comprising 903 Western music excerpts across various genres. Each excerpt is paired with corresponding MIDI files and lyrics, enabling analysis from audio, symbolic, and textual perspectives.
- Annotation strategy: Emotion labels were derived from listener-assigned emotion tags on AllMusic, mapped to the 5 emotion clusters defined in the MIREX Mood Classification Task.
- Annotation: Categorical — 21 tags grouped into 5 MIREX-defined emotion clusters.
- Link: https://mir.dei.uc.pt/resources/MIREX-like_mood.zip
- Author: Soleymani, M., Caro, M.N., Schmidt, E.M., Sha, C.Y. and Yang, Y.H.
- Description and music styles: This dataset includes 1,000 songs sourced from the Free Music Archive (FMA), covering a wide range of music styles under Creative Commons licenses. It is designed for music emotion recognition research and includes 45-second audio excerpts uniformly sampled from full tracks.
- Annotation strategy: Continuous valence and arousal annotations were collected via Amazon Mechanical Turk from at least 10 annotators per song. A two-stage filtering method was used to ensure data quality. Continuous annotations were resampled at 2Hz (see the resampling sketch below), and standard deviations were also provided to indicate inter-rater variability. Additionally, static annotations were collected on a 9-point scale.
- Annotation: Dimensional — arousal and valence (time-continuous and static); annotations follow Russell’s model.
- Link: https://cvml.unige.ch/databases/emoMusic/
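For dynamic annotations like these, a typical first step is to average the per-annotator curves and resample them onto the 2 Hz grid. A minimal pandas sketch; the file name and column names (`time_s`, `annotator`, `valence`) are hypothetical placeholders, not the dataset's actual schema:

```python
# Sketch: average per-annotator continuous valence curves, then resample to
# 2 Hz (500 ms bins). Column names are placeholders; adapt to the real files.
import pandas as pd

def mean_dynamic_curve(csv_path: str, value_col: str = "valence") -> pd.Series:
    df = pd.read_csv(csv_path)
    df["t"] = pd.to_timedelta(df["time_s"], unit="s")   # elapsed-time index
    per_time = df.groupby("t")[value_col].mean()        # mean across annotators
    return per_time.resample("500ms").mean().interpolate()

# curve = mean_dynamic_curve("dynamic_valence.csv")     # hypothetical file
```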
- Author: Wang, S.Y., Wang, J.C., Yang, Y.H. and Wang, H.M.
- Description and music styles: CAL500exp is an enriched, time-fragment-level version of the CAL500 dataset. It consists of 3,223 audio segments (3–16 seconds) derived from Western popular music tracks included in the original CAL500 dataset.
- Annotation strategy: Eleven musically trained annotators were recruited to label time-varying semantic tags using a purpose-built annotation interface designed to improve label quality and reduce effort.
- Annotation: Categorical — 67 semantic labels including mood, genre, instrumentation, and other musical descriptors, with time-localized annotation.
- Link: https://slam.iis.sinica.edu.tw/demo/CAL500exp/
- Author: Chen, Y.-A, Yang, Y.-H., Wang, J.-C., Chen, H.-H.
- Description and music styles: The dataset contains contemporary Western music from AMG (All Music Guide), which defines 34 distinct mood categories curated by music editors.
- Annotation strategy: Each subject was asked to annotate 13 songs by placing a cursor on a panel to indicate the location of the perceived VA value of each song.
- Annotation (dimensional): Arousal and valence real values between [-1,1] for whole excerpt.
- Link: https://amg1608.blogspot.com/
- Author: Aljanaki, A., Wiering, F., Veltkamp, R.C.
- Description and music styles: The selected songs mainly cover four genres: rock, classical, pop, and electronic music.
- Annotation strategy: The annotations were collected with the GEMS scale (Geneva Emotional Music Scales) through a game. The annotations are spread unevenly among the songs due to the design of both the experiment and the game: participants could skip songs and switch between genres, and they were encouraged to do so, because an induced emotional response does not automatically occur on every listening occasion.
- Annotation (categorical): Nine categories (amazement, solemnity, tenderness, nostalgia, calmness, power, joyful activation, tension, sadness)
- Link: http://www2.projects.science.uu.nl/memotion/emotifydata/
- Authors: M Pesek, G Strle, A Kavčič, M Marolt
- Description and music styles: The dataset contains 200 excerpts (15 seconds each): 20 electro-acoustic, 20 ethno, 80 popular (Jamendo), and 80 film music excerpts (Eerola and Vuoskoski, 2010).
- Annotation strategy: 741 participants were each presented with 10 music excerpts and instructed to choose the color best associated with each excerpt.
- Annotation (dimensional): Arousal and valence as real values in [-1,1], which were mapped to colors.
- Link: http://moodo.musiclab.si
- Author: Malheiro, R., Panda, R., Gomes, P.J. and Paiva, R.P.
- Description and music styles: The dataset includes 200 Western songs spanning various genres and eras. Each song includes both a 30-second audio clip and its corresponding lyrics, aimed at exploring music emotion recognition from audio, lyrics, and bimodal perspectives.
- Annotation strategy: 39 annotators independently rated either the audio or lyrics of each song. They identified the predominant emotion and assigned arousal and valence values on a scale from -4 to 4. Songs with high disagreement (standard deviation > 1.2) were excluded to ensure consistency. Final sets include 162 audio clips, 180 lyrics, and a bimodal subset of 133 songs where audio and lyrics agreed on quadrant classification.
- Annotation: Categorical — mapped to the four quadrants of Russell’s emotion model, derived from averaged valence-arousal ratings.
- Link: Offline
- Author: Hu, X., Yang, Y.
- Description and music styles: Chinese Pop songs released in Taiwan, Hong Kong and Mainland China.
- Annotation strategy: Each clip was annotated by three music experts who were born and raised in Mainland China and thus had a Chinese cultural background. The annotation was done with an interface consisting of two sliding bars with continuous real values between [-10,10].
- Annotation (dimensional): Arousal and valence real values between [-10,10] for whole excerpt.
- Link: Offline
- Author: Çano, E. and Morisio, M.
- Description and music styles: The dataset consists of 2,595 song lyrics collected from various public sources including Last.fm, Million Song Subset, CAL500, and lyrics.wikia.com. It spans diverse genres such as rock, pop, and blues, and includes songs from different eras, ranging from the 1960s to recent years.
- Annotation strategy: Lyrics were automatically annotated using a lexicon-based sentiment analysis approach. A combined affective lexicon (derived from ANEW, WordNet, and WordNet-Affect) was used to assign valence and arousal scores to each song's lyrics. Songs were then classified into one of the four emotion quadrants of Russell's model based on their aggregate valence and arousal scores (see the sketch below); no human annotation or listener tagging was used.
- Annotation: 4 emotion categories (Happy, Angry, Sad, Relaxed) derived from Russell’s two-dimensional model (Valence-Arousal quadrants).
- Link: https://softeng.polito.it/erion/MoodyLyrics.zip
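A toy sketch of the lexicon-based labelling pipeline described in the annotation strategy above: average per-word valence/arousal scores from an affective lexicon, then map the aggregate to a Russell quadrant. The inline mini-lexicon is purely illustrative; MoodyLyrics itself combines ANEW, WordNet, and WordNet-Affect:

```python
# Illustrative only: a real affective lexicon has thousands of entries.
AFFECT_LEXICON = {            # word -> (valence, arousal), centred on 0
    "love": (0.8, 0.4),
    "happy": (0.9, 0.6),
    "cry": (-0.7, 0.3),
    "alone": (-0.6, -0.4),
}

def song_quadrant(lyrics: str) -> str:
    hits = [AFFECT_LEXICON[w] for w in lyrics.lower().split() if w in AFFECT_LEXICON]
    if not hits:
        return "unknown"
    valence = sum(v for v, _ in hits) / len(hits)
    arousal = sum(a for _, a in hits) / len(hits)
    if valence >= 0:
        return "Happy" if arousal >= 0 else "Relaxed"
    return "Angry" if arousal >= 0 else "Sad"

print(song_quadrant("I cry alone every night"))  # -> "Sad"
```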
- Author: Panda R., Malheiro R., Paiva R. P.
- Description and music styles: The AllMusic API served as the source of musical information, providing metadata such as artist, title, genre, and emotion information, as well as 30-second audio clips for most songs. The dataset consists mostly of popularly consumed music.
- Annotation strategy: Emotion tags were collected from the AllMusic API and selected by intersecting the original AllMusic tags with Warriner's list of affective norms. Finally, a manual blind validation was conducted by human subjects.
- Annotation (categorical): Q1 (A+V+), Q2 (A+V-), Q3(A-V-), Q4 (A-V+)
- Link: http://mir.dei.uc.pt/downloads.html
- Author: Soleymani, M., Aljanaki, A., Yang, Y.
- Description and music styles: Royalty-free music from several sources: freemusicarchive.org (FMA), jamendo.com, and the MedleyDB dataset. The annotated excerpts are available in the same package (song ids 1 to 2058). The dataset consists of the 2014 development set (744 songs), the 2014 evaluation set (1,000 songs), and the 2015 evaluation set (58 songs). Genres include rock, pop, soul, blues, electronic, classical, hip-hop, international, experimental, folk, jazz, and country.
- Annotation strategy: Crowdsourced via MTurk, with each excerpt annotated by at least 10 workers. Arousal and valence were annotated separately. Additional static annotations were collected for the whole 45-second excerpts after the dynamic annotations.
- Annotation (dimensional): Time-continuous arousal and valence annotation (1 Hz)
- Link: http://cvml.unige.ch/databases/DEAM/
- Author: Zhang, K., Zhang, H., Li, S., Yang, C., Sun, L.
- Description and music styles: The authors gathered songs popular all around the world: the Billboard Hot 100, the iTunes Top 100 Songs (USA), and the UK Top 40 Singles Chart. They obtained songs available from these charts from 2016 to 2017.
- Annotation strategy: Similar to DEAM, dynamic annotations were collected with a slider at a sampling rate of 2 Hz. Additionally, annotators made a static annotation for the whole music excerpt on a nine-point scale after finishing the dynamic labelling. A total of 457 subjects (236 females and 221 males) were recruited. Electrodermal activity was sampled continuously at 50 Hz.
- Annotation (dimensional): Time-continuous arousal and valence annotation (2 Hz)
- Link: https://github.com/HuiZhangDB/PMEmo
- Author: Livingstone, S.R. and Russo, F.A.
- Description and music styles: The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) is a multimodal, gender-balanced dataset featuring 24 professional actors. It contains emotional speech and song expressions in a neutral North American accent. Songs and speech clips include calm, happy, sad, angry, fearful (with additional emotions like surprise and disgust in speech only), each recorded at two levels of emotional intensity and in various modalities: audio-only, video-only, and audiovisual.
- Annotation strategy: Each of the 7,356 recordings was rated 10 times by a diverse pool of 247 untrained participants from North America for emotional validity, intensity, and genuineness. Test-retest reliability was collected from an additional 72 participants.
- Annotation: 5 to 7 discrete emotion categories depending on modality (e.g., calm, happy, sad, angry, fearful, surprise, disgust).
- Link: https://zenodo.org/records/1188976
- Author: Delbouys, R., Hennequin, R., Piccoli, F., Royo-Letelier, J. and Moussallam, M.
- Description and music styles: The Deezer Mood Detection Dataset (DMDD) includes 18,644 full-length tracks sourced through a mapping between the Million Song Dataset (MSD) and the Deezer catalog. The collection spans a wide range of genres available on commercial streaming platforms. Raw audio and lyrics were retrieved for matched tracks, though lyrics and audio are not temporally aligned.
- Annotation strategy: Valence and arousal values were automatically estimated by embedding Last.fm mood-related tags using affective norms from the lexicon published by Warriner et al. These embeddings were averaged when multiple tags were associated with a track.
- Annotation: Dimensional — valence and arousal based on Russell’s model.
- Link: https://github.com/deezer/deezer_mood_detection_dataset
- Author: Bogdanov, D., Porter A., Tovstogan P., & Won M.
- Description and music styles: The MTG-Jamendo Dataset is an open dataset for music auto-tagging; a subset is used in the Emotion and Theme Recognition in Music task at MediaEval 2019. The dataset contains 87 genre tags, so there is considerable stylistic diversity.
- Annotation strategy: The 56 mood/theme tags were crowdsourced from social tags on the Jamendo platform. Hence, the annotations can be single- or multi-label depending on the track (see the evaluation sketch below).
- Annotation (categorical): 56 mood/theme tags
- Link: https://multimediaeval.github.io/2019-Emotion-and-Theme-Recognition-in-Music-Task/
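Because the mood/theme annotations are multi-label, models for this dataset are typically scored with macro-averaged ROC-AUC and PR-AUC over the 56 tags. A minimal evaluation sketch with toy placeholder data:

```python
# Toy multi-label evaluation: replace the random arrays with real tag targets
# and model probabilities.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_tracks, n_tags = 100, 56
y_true = rng.integers(0, 2, size=(n_tracks, n_tags))   # binary tag matrix
y_score = rng.random((n_tracks, n_tags))                # predicted probabilities

print("macro ROC-AUC:", roc_auc_score(y_true, y_score, average="macro"))
print("macro PR-AUC :", average_precision_score(y_true, y_score, average="macro"))
```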
- Author: Ferreira, L., Whitehead, J.
- Description and music styles: VGMIDI is a dataset of 200 labelled MIDI piano pieces taken from video game soundtracks (see the preprocessing sketch below).
- Annotation strategy: Each piece was annotated by 30 human subjects according to a valence-arousal model of emotion. The authors also asked the annotators to write two to three sentences describing each short piece they listened to.
- Annotation (dimensional): Time-continuous arousal and valence annotation
- Link: https://github.com/lucasnfe/vgmidi
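For MIDI-only datasets such as VGMIDI (and EMOPIA or XMIDI below), a common preprocessing step is rendering a piano roll for downstream emotion models. A minimal sketch using the pretty_midi package; the file path is a placeholder:

```python
# Render a (128, T) piano-roll matrix from a MIDI file. Requires pretty_midi.
import pretty_midi

def midi_to_piano_roll(path: str, fs: int = 10):
    """Return a piano roll sampled at `fs` frames per second."""
    midi = pretty_midi.PrettyMIDI(path)
    return midi.get_piano_roll(fs=fs)

roll = midi_to_piano_roll("example_piece.mid")  # placeholder path
print(roll.shape)                               # (128, number_of_frames)
```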
- Author: Er, M.B. and Aydilek, I.B.
- Description and music styles: A collection of 400 audio clips from various genres of Turkish music, both verbal and non-verbal, designed to represent four basic emotions: happy, sad, angry, and relaxed. Each emotion category includes 100 music samples, each 30 seconds long.
- Annotation strategy: 13 participants were asked to label 30-second clips based on their perceived emotions. For each clip, the label chosen by the majority of participants was used as the ground truth. The annotation process was conducted over three sessions, and only the most frequently labeled pieces were included.
- Annotation: Categorical — 4 classes: happy, sad, angry, relax.
- Link: https://www.kaggle.com/datasets/blaler/turkish-music-emotion-dataset
- Author: Hung, H.T., Ching, J., Doh, S., Kim, N., Nam, J. and Yang, Y.H.
- Description and music styles: A multi-modal dataset of 1,087 clips from 387 solo piano performances of popular music, including covers of Japanese anime, Korean and Western pop songs, movie soundtracks, and original compositions. Clips are segmented to preserve emotional and musical phrase coherence.
- Annotation strategy: Emotion labels were assigned by four dedicated annotators using Russell’s Circumplex Model (four-quadrant classification). Annotations were done individually per annotator, with periodic cross-validation sessions to ensure consistency in valence/arousal standards.
- Annotation: Categorical — 4 quadrants based on Russell’s model: HVHA, HVLA, LVHA, LVLA.
- Link: https://annahung31.github.io/EMOPIA/
- Author: Velankar, M.
- Description and music styles: This dataset comprises short audio clips from Indian Hindi film songs, categorized into five popular emotional categories. It provides culturally specific content valuable for studying emotion recognition in Indian music.
- Annotation strategy: Songs were manually selected and categorized into emotional classes by the creators, with support from student contributors.
- Annotation: 5 categorical labels — Romantic, Happy, Sad, Devotional, and Party.
- Link: https://www.kaggle.com/datasets/makvel/mer500
- Author: Santana, I.A.P., Pinhelli, F., Donini, J., Catharin, L., Mangolin, R.B., Feltrim, V.D. and Domingues, M.A.
- Description and music styles: A large-scale music dataset containing 109,269 30-second audio clips covering a wide range of genres, languages, and styles. It includes user metadata, lyrics, genre tags, and Spotify-provided audio features, making it suitable for various MIR tasks.
- Annotation strategy: Emotion annotations are indirectly derived from Spotify’s valence, energy, and danceability scores—continuous values representing affective characteristics of the music.
- Annotation: Dimensional (valence, energy, danceability) from Spotify API.
- Link: https://sites.google.com/view/contact4music4all
- Author: Fan, J., Yang, Y.-H., Gong, K, Pasquier, P.
- Description and music styles: The dataset contains 400 excerpts collected from Western classical music recordings and 400 excerpts collected from Chinese classical music recordings.
- Annotation strategy: Two crowdsourcing experiments were carried out to collect emotional annotations (arousal and valence). The authors used a ranking-based method: instead of providing absolute ratings, participants made pairwise comparisons, deciding which audio excerpt has higher arousal/valence (see the sketch below for turning such comparisons into per-excerpt scores).
- Annotation (dimensional): Arousal and valence real values between [-1,1] for whole excerpt.
- Link: https://metacreation.net/ccmed_wcmed_soundscape/
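One simple way to turn such pairwise comparisons into per-excerpt scores is a plain win rate mapped to [-1, 1]; the authors' actual aggregation may differ, so treat this as an illustrative sketch:

```python
# Aggregate pairwise "which excerpt has higher arousal?" judgements into scores.
from collections import defaultdict

def win_rate_scores(comparisons):
    """comparisons: list of (winner_id, loser_id) pairs."""
    wins, total = defaultdict(int), defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        total[winner] += 1
        total[loser] += 1
    # Map win rate from [0, 1] to [-1, 1].
    return {item: 2 * wins[item] / total[item] - 1 for item in total}

print(win_rate_scores([("a", "b"), ("a", "c"), ("b", "c")]))
# {'a': 1.0, 'b': 0.0, 'c': -1.0}
```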
- Author: Akiki, C. and Burghardt, M.
- Description and music styles: A large-scale music sentiment dataset containing 90,001 songs from a variety of genres, derived from user-generated tags on Last.fm. It includes metadata such as artist, title, genre, MusicBrainz ID, and Spotify ID, enabling linkage to additional musical attributes.
- Annotation strategy: Mood-related tags from Last.fm were filtered using WordNet-Affect and mapped to valence, arousal, and dominance scores using the Warriner et al. (2013) lexicon. For each song, V-A-D values were computed as the weighted average of the associated mood tags (see the sketch below).
- Annotation: Dimensional (valence, arousal, dominance) based on Russell’s circumplex model extended with a dominance dimension.
- Link: https://www.kaggle.com/datasets/cakiki/muse-the-musical-sentiment-dataset
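A toy sketch of the weighted tag-averaging idea described in the annotation strategy above; the inline V-A-D values and tag counts are illustrative stand-ins for the Warriner et al. norms and Last.fm tag statistics:

```python
# Weighted average of per-tag valence/arousal/dominance values.
VAD_LEXICON = {  # tag -> (valence, arousal, dominance), illustrative values
    "happy": (8.5, 6.0, 7.0),
    "calm": (7.0, 2.5, 6.0),
    "dark": (3.0, 4.5, 4.0),
}

def song_vad(tag_counts):
    """tag_counts: {tag: count}; returns the weighted mean (V, A, D) or None."""
    known = {t: c for t, c in tag_counts.items() if t in VAD_LEXICON}
    total = sum(known.values())
    if total == 0:
        return None
    return tuple(
        sum(VAD_LEXICON[t][i] * c for t, c in known.items()) / total
        for i in range(3)
    )

print(song_vad({"happy": 12, "calm": 3, "obscure_tag": 5}))
```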
- Author: Hu, X., Li, F. and Liu, R.
- Description and music styles: A multimodal dataset designed to analyze music-induced emotions and physiological responses. It includes 956 listening records of 592 unique songs sourced from Jamendo (CC-BY license), accompanied by peripheral physiological signals from 30 participants.
- Annotation strategy: Participants listened to 10 or more songs in a 40-minute session while their physiological signals (heart rate, skin conductance, BVP, IBI, skin temperature) were recorded. They self-reported their arousal and valence responses on a [-10, 10] scale. Personality traits were also measured using the TIPI scale.
- Annotation: Dimensional (valence and arousal)
- Link: https://datahub.hku.hk/ndownloader/files/38149263
- Author: Koh, E.Y., Cheuk, K.W., Heung, K.Y., Agres, K.R. and Herremans, D.
- Description and music styles: A dataset of 54 full-length Creative Commons music tracks sourced from the Free Music Archive and DEAM, selected to represent diverse emotional content across the valence-arousal space.
- Annotation strategy: Collected through Amazon Mechanical Turk using a custom 2D graphical interface that recorded mouse positions to annotate valence and arousal dynamically at 10Hz. 452 participants contributed, and user profile information (e.g., musical background, listening preferences) was also collected. Four benchmark tracks from DEAM were used to filter noisy data.
- Annotation: Dimensional (dynamic valence and arousal, sampled at 10Hz)
- Link: https://www.kaggle.com/datasets/kohenyan/music-emotion-recognition-with-profile-information
- Author: Chua, P., Makris, D., Herremans, D., Roig, G. and Agres, K.
- Description and music styles: A multimodal dataset of 81 music videos selected from the LAKH MIDI dataset (aligned with the Million Song Dataset), covering a range of musical genres including pop and soundtrack music. Designed to analyze how auditory and visual modalities contribute to perceived emotion.
- Annotation strategy: 48 participants annotated the videos in one of three modalities (music-only, visual-only, audiovisual) for both dynamic (valence-arousal, 2Hz sampling) and static (overall emotion using GEMS-28 terms) emotion perception. Each clip was rated by 5–9 participants.
- Annotation: Both dynamic (valence-arousal) and static (GEMS-28 terms); perceived emotion.
- Link: https://github.com/AMAAI-Lab/MuVi
- Author: Choi, E., Chung, Y., Lee, S., Jeon, J., Kwon, T. and Nam, J.
- Description and music styles: A multi-label emotion dataset featuring 699 audio and MIDI clips of 1980s FM video game music from Sega and MSX PC games, composed using the YM2413 sound chip. The music is characterized by FM synthesis with unique composition patterns such as unison playing and simulated reverb.
- Annotation strategy: Each song was annotated with 19 emotion tags by two annotators and validated by three additional verifiers to refine label accuracy.
- Annotation: 19 discrete emotion tags, including cheerful, peaceful, creepy, dreamy, touching, and more.
- Link: https://jech2.github.io/YM2413-MDB/
- Author: Bogdanov, D., Lizarraga Seijas, X., Alonso-Jiménez, P. and Serra, X.
- Description and music styles: A public benchmark dataset for music emotion recognition featuring 2,092 audio track previews from Spotify, spanning 1,404 genres. The dataset is designed for evaluating arousal and valence regression models.
- Annotation strategy: Comparative pairwise annotations were collected from 20 annotators to simplify the annotation process and enhance consistency. Annotations are based on relative judgments of arousal and valence between track pairs (see the evaluation sketch below).
- Annotation: Dimensional (Russell’s model) – arousal and valence (relative annotations).
- Link: https://zenodo.org/records/7448344
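Since the annotations are relative (pairwise) rather than absolute, a natural benchmark metric is pairwise agreement: the fraction of annotated pairs whose ordering a regression model reproduces. A minimal sketch with toy predictions:

```python
# Pairwise agreement between relative annotations and model predictions.
def pairwise_agreement(pairs, predictions):
    """pairs: list of (higher_id, lower_id); predictions: {track_id: value}."""
    correct = sum(predictions[hi] > predictions[lo] for hi, lo in pairs)
    return correct / len(pairs)

predictions = {"t1": 0.7, "t2": 0.2, "t3": -0.1}
pairs = [("t1", "t2"), ("t2", "t3"), ("t3", "t1")]  # last pair contradicts the model
print(pairwise_agreement(pairs, predictions))       # 2/3 ≈ 0.667
```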
- Author: Thao, H.T.P., Roig, G. and Herremans, D.
- Description and music styles: A large-scale collection of 5,986 30-second music video segments from YouTube and existing datasets, designed for affective correspondence learning between music and video modalities. The music spans various moods and includes movie soundtracks and popular genres.
- Annotation strategy: For EmoMV-A, human annotations were used from the MVED dataset; EmoMV-B and EmoMV-C used emotion predictions from a modified Feature AttendAffectNet model trained on MVED. Each music-video pair is labeled as either matched or mismatched in emotion content between modalities.
- Annotation: Categorical – 6 emotion labels (exciting, fearful, tense, sad, relaxing, neutral); annotations include matched/mismatched labels and continuous valence/arousal values.
- Link: https://zenodo.org/records/7011072
- Author: Sams, A.S. and Zahra, A.
- Description and music styles: A multimodal dataset consisting of 476 full-length Indonesian pop songs and their corresponding lyrics, focusing on the classification of emotions into positive, neutral, and negative categories.
- Annotation strategy: Crowdsourced annotation via Google Forms, where participants labeled each song based on perceived emotions (positive, neutral, negative) for both audio and lyrics.
- Annotation: Categorical — 3 emotion labels (positive, neutral, negative).
- Link: Offline
- Author: Gómez-Cañón, J.S., Gutiérrez-Páez, N., Porcaro, L., Porter, A., Cano, E., Herrera-Boyer, P., Gkiokas, A., Santos, P., Hernández-Leo, D., Karreman, C. and Gómez, E.
- Description and music styles: A diverse 30-second music excerpt dataset (1,161 tracks) from the Muziekweb collection, focusing on non-Western and Global South music styles to support cross-cultural Music Emotion Recognition (MER) research.
- Annotation strategy: Citizen science-based platform where participants provided (1) free-text emotion words in their native language, (2) forced-choice labels from 11 emotion categories (including GEMS and Ekman emotions), (3) valence-arousal quadrant categorization, (4) preference/familiarity ratings, and (5) reasons for emotion perception and induction.
- Annotation: Categorical — 11 emotion labels plus valence-arousal quadrant information, collected from 181 participants through 4,721 annotations.
- Link: https://github.com/juansgomez87/vis-mtg-mer
- Author: Doumbia, M., Renard, M., Coudrat, L. and Bonnin, G.
- Description and music styles: A multimodal dataset of 188 music listening trials focused on house and other music styles, aimed at studying the impact of music-induced emotions on gait initiation, with data combining music metadata, physiological signals (heart rate, EDA, BVP, temperature), and biomechanical foot pressure recordings.
- Annotation strategy: Participants (35 individuals) provided subjective ratings using Self-Assessment Manikin (SAM) for valence and arousal, GEMS-based emotion labels, liking and familiarity judgments, and additional free-text feedback about emotional perception and walking effects.
- Annotation: Categorical — emotions based on GEMS categories, valence and arousal scores.
- Link: https://homepages.loria.fr/gbonnin/music-mouv/
- Author: Ospitia-Medina, Y., Beltrán, J.R. and Baldassarri, S.
- Description and music styles: A collection of 60 complete original songs by Colombian non-superstar artists, covering various genres with vocal and instrumental diversity, intended for studying music emotion recognition and recommendation while addressing dataset biases.
- Annotation strategy: Emotional labels provided by the artists themselves using a dimensional (Russell) model; for structured songs, annotations were separately provided for verses, choruses, and entire songs; listener questionnaires and low-level audio features were also collected.
- Annotation: Dimensional — valence and arousal (Russell model).
- Link: https://github.com/yesidospitiamedina/ENSA
- Author: Strauss, H., Vigl, J., Jacobsen, P.O., Bayer, M., Talamini, F., Vigl, W., Zangerle, E. and Zentner, M.
- Description and music styles: A curated dataset of 364 music excerpts from classical, pop, and hip-hop genres, designed to study music-evoked emotions with a focus on felt emotion rather than perceived emotion.
- Annotation strategy: Emotion ratings were collected from over 500 participants using the Geneva Emotion Music Scale (GEMS), with each excerpt rated by an average of ~29 participants to ensure stability and reliability of annotations.
- Annotation: Categorical — GEMS emotion categories (music-specific emotion scale).
- Link: https://osf.io/7ptmd/
- Author: Grigorev, V., Li, J., Ma, W., He, Z., Zhang, M., Liu, Y., Yan, M. and Zhang, J.
- Description and music styles: A situational music recommendation dataset containing over 300 popular tracks (from various genres in the Million Song Dataset) with rich physiological, psychological, and environmental information collected through a real-world user study.
- Annotation strategy: Participants provided emotional feedback using valence-arousal dimensions before and after music listening, along with physiological signals (e.g., heart rate, activity type) and environmental data (e.g., weather, location) across three experimental stages.
- Annotation: Dimensional — Valence-Arousal ratings.
- Link: https://github.com/JiayuLi-997/SiTunes_dataset/
- Author: Louro, P.L., Redinho, H., Santos, R., Malheiro, R., Panda, R. and Paiva, R.P.
- Description and music styles: A large-scale bimodal (audio and lyrics) dataset for music emotion recognition, covering a wide range of genres collected via AllMusic API and lyrics platforms, mapped into Russell’s four emotion quadrants.
- Annotation strategy: Semi-automatic mapping of expert emotion tags to Russell’s quadrants using Warriner’s affective word ratings, followed by manual validation by multiple annotators to confirm quadrant assignments and ensure quality.
- Annotation: Quadrants (Valence-Arousal based).
- Link: https://zenodo.org/records/13939205
- Author: Wu, X., Wang, J., Yu, J., Zhang, T. and Zhang, K.
- Description and music styles: A large-scale multimodal music dataset featuring 38,694 popular musical hooks from various genres, with synchronized MIDI, audio, video, and lyrics, aimed at supporting music understanding and generation tasks.
- Annotation strategy: Emotion labels were automatically predicted using a pre-trained multimodal music emotion recognition framework (mapped to Russell’s 4 quadrants) and validated through a user study; also includes detailed annotations for tonality, structure, genre, and region.
- Annotation: Quadrants (Valence-Arousal based).
- Link: https://huggingface.co/datasets/NEXTLab-ZJU/popular-hook
- Author: Affolter, J. and Rohrmeier, M.
- Description and music styles: A dataset of 5,892 Spotify tracks across various genres, built to support Music Emotion Recognition and Auto-Tagging tasks with listener-generated textual data (tags and playlist names) from multiple online sources.
- Annotation strategy: Automatic semantic matching of user-generated text to Plutchik's eight primary emotions (joy, fear, anger, sadness, disgust, surprise, anticipation, trust) using NLP techniques (Sentence-BERT and the NRC Lexicon) to construct 8-dimensional emotion vectors (see the sketch below).
- Annotation: 8 emotion labels based on Plutchik’s model.
- Link: https://github.com/joanne-affolter/PlayMood
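A simplified sketch of the semantic-matching idea described above: embed the user-generated text and the eight Plutchik emotion words with a sentence encoder, and use the cosine similarities as an 8-dimensional emotion vector. This is not the authors' exact pipeline (which also incorporates the NRC Lexicon), and the checkpoint name is just a common public model:

```python
# Requires the sentence-transformers package.
from sentence_transformers import SentenceTransformer, util

PLUTCHIK = ["joy", "trust", "fear", "surprise",
            "sadness", "disgust", "anger", "anticipation"]

model = SentenceTransformer("all-MiniLM-L6-v2")
emotion_emb = model.encode(PLUTCHIK, convert_to_tensor=True)

def emotion_vector(text: str):
    """Cosine similarity of `text` to each Plutchik emotion word."""
    text_emb = model.encode(text, convert_to_tensor=True)
    sims = util.cos_sim(text_emb, emotion_emb)[0]   # shape: (8,)
    return dict(zip(PLUTCHIK, sims.tolist()))

print(emotion_vector("late night melancholic rainy drive"))
```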
- Author: Tian, S., Zhang, C., Yuan, W., Tan, W. and Zhu, W.
- Description and music styles: XMIDI is the largest symbolic music dataset with 108,023 MIDI files covering a wide range of genres and emotions, averaging 176 seconds per piece, amounting to over 5,278 hours of music. The dataset supports high-quality symbolic music generation and emotion recognition.
- Annotation strategy: Songs were carefully labeled with 11 distinct emotion categories and genre types by ten professional annotators, involving cross-verification, random quality checks, weekly consistency meetings, and panel discussions for controversial cases.
- Annotation: 11 emotion labels (exciting, warm, happy, romantic, funny, sad, angry, lazy, quiet, fear, magnificent).
- Link: https://github.com/xmusic-project/XMIDI_Dataset
Ref. | Year | Modalities | Approach | Emotion Model | Dataset |
---|---|---|---|---|---|
A new model for emotion prediction in music | 2020 | Audio, Lyrics | Machine learning (e.g., SVM, NB) | Russell | PMEmo |
Multi-view neural networks for raw audio-based music emotion recognition | 2020 | Audio | CNN, LSTM | Russell | Soleymani et al.'s dataset |
Musical instrument emotion recognition using deep recurrent neural network | 2020 | Audio | LSTM | 4 classes | Self-built |
A multimodal music emotion classification method based on multifeature combined network classifier | 2020 | Audio | CNN-LSTM | 4 classes | Last.fm |
Attentive RNNs for Continuous-time Emotion Prediction in Music Clips | 2020 | Audio | Attentive LSTM | Russell | Soleymani et al.'s dataset |
The multiple voices of musical emotions: source separation for improving music emotion recognition models and their interpretability | 2020 | Audio | Source separation, CNN | Russell | PMEmo |
Cochleogram-based approach for detecting perceived emotions in music | 2020 | Audio | Cochleogram, CNN | Russell | Soleymani et al.'s dataset |
Recognition of emotion in music based on deep convolutional neural network | 2020 | Audio | CNN, Local Attention | Quadrants | Soundtrack, Bi-Modal |
Emotion and theme recognition in music using attention-based methods | 2020 | Audio | Attention Based Neural Networks | 56 classes | MTG-Jamendo |
Regression-based Music Emotion Prediction using Triplet Neural Networks | 2020 | Audio | Triplet Neural Networks | Russell | DEAM |
Research on music emotion classification based on CNN-LSTM network | 2021 | Audio | CNN-LSTM | Russell | DEAM |
Music emotion recognition using convolutional long short term memory deep neural networks | 2021 | Audio | CNN, LSTM+DNN | 3 classes | Self-built |
Recognizing Song Mood and Theme: Clustering-based Ensembles | 2021 | Audio | Clustering-based Ensembles | 56 classes | MTG-Jamendo |
Semi-supervised music emotion recognition using noisy student training and harmonic pitch class profiles | 2021 | Audio | Semi-Supervised, Noisy Student Training | 56 classes | MTG-Jamendo |
Frequency Dependent Convolutions for Music Tagging | 2021 | Audio | Frequency Dependent Convolutions | 56 classes | MTG-Jamendo |
SELAB-HCMUS at MediaEval 2021: Music Theme and Emotion Classification with Co-teaching Training Strategy | 2021 | Audio | CNN, Co-teaching Training Strategy | 56 classes | MTG-Jamendo |
Music emotion recognition using recurrent neural networks and pretrained models | 2021 | Audio | LSTM, Pretrained Models | Russell | Self-built |
Transformer-based approach towards music emotion recognition from lyrics | 2021 | Lyrics | Transformers | Russell | MoodyLyrics, MER |
Tracing back music emotion predictions to sound sources and intuitive perceptual qualities | 2021 | Audio | Source-separation based explainer | Russell | DEAM, Midlevel, PMEmo |
A generative adversarial network model based on intelligent data analytics for music emotion recognition under IoT | 2021 | Audio | Generative Adversarial Network | 2 classes | Self-built |
Deep learning-based late fusion of multimodal information for emotion classification of music video | 2021 | Audio, Video | CNN | 6 classes | Self-built |
A multi-genre model for music emotion recognition using linear regressors | 2021 | Audio | Linear Regressors | Russell | Self-built |
Study on music emotion recognition based on the machine learning model clustering algorithm | 2022 | Audio | Clustering, Machine Learning (e.g., SVM) | Russell | Soleymani et al.'s dataset |
A novel multi-task learning method for symbolic music emotion recognition | 2022 | MIDI | Multi-Task Learning | Quadrants | EMOPIA, VGMIDI |
Feature selection approaches for optimising music emotion recognition methods | 2022 | Audio | Feature Selection, SVR, RF | Russell | DEAM |
Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses | 2022 | Audio, Video | LSTM | Russell | Self-built |
MERGE Lyrics: Music Emotion Recognition next Generation--Lyrics Classification with Deep Learning | 2022 | Lyrics | Deep Learning, BERT | Quadrants | MIR Lyrics Emotion |
Music emotion recognition based on segment-level two-stage learning | 2022 | Audio | CNN, LSTM | Russell | PMEmo |
Emotional classification of music using neural networks with the MediaEval dataset | 2022 | Audio | SVM, Random Forest, MLP | Russell | DEAM |
Multi-Modality in Music: Predicting Emotion in Music from High-Level Audio Features and Lyrics | 2023 | Audio, Lyrics | Multi-Modality | Russell | DMDD |
Music emotion recognition based on a neural network with an inception-gru residual structure | 2023 | Audio | Inception-GRU Residual | Quadrants | Soundtrack |
Modeling emotion dynamics in song lyrics with state space models | 2023 | Lyrics | State Space Models | 6 classes | LyricsEmotions |
Tollywood Emotions: Annotation of Valence-Arousal in Telugu Song Lyrics | 2023 | Lyrics | Fine-tuned XLMRoBERTa | Russell | Self-built |
Multimodal music emotion recognition in Indonesian songs based on CNN-LSTM, XLNet transformers | 2023 | Audio, Lyrics | CNN-LSTM, XLNet Transformers | 3 classes | Self-built |
Modularized composite attention network for continuous music emotion recognition | 2023 | Audio | Attention Mechanism | Russell | DEAM, PMEmo |
Automatic music emotion classification model for movie soundtrack subtitling based on neuroscientific premises | 2023 | Audio | CNN | 4 classes | Musical Excerpts |
Transformer-based automatic music mood classification using multi-modal framework | 2023 | Audio, Lyrics | Transformers | Quadrants | MoodyLyrics |
Music Emotion Prediction Using Recurrent Neural Networks | 2024 | Audio | RNN, BRNN, LSTM | Quadrants | 4Q audio, MTG-Jamendo |
MMD-MII model: a multilayered analysis and multimodal integration interaction approach revolutionizing music emotion classification | 2024 | Audio, Lyrics | VGGish, ALBERT | 4 classes | DEAM, FMA |
Verse1-Chorus-Verse2 Structure: A Stacked Ensemble Approach for Enhanced Music Emotion Recognition | 2024 | Audio, Lyrics | Stacked Ensemble Models | 4 classes | Self-built |
A GAI-based multi-scale convolution and attention mechanism model for music emotion recognition and recommendation from physiological data | 2024 | Audio | Multi-scale Parallel Convolution | 8 classes | PMEmo, Soundtrack, RAVDESS |
Improved differential evolution algorithm based convolutional neural network for emotional analysis of music data | 2024 | Audio | CNN with Differential Evolution | 4 classes | Self-built, DEAM |
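Many of the approaches listed above share the same backbone: a CNN front-end over mel-spectrogram patches, a recurrent layer for temporal modelling, and a regression head for valence/arousal. A minimal PyTorch sketch of that generic architecture; all hyperparameters are illustrative and not taken from any specific paper in the table:

```python
# Generic CNN-LSTM valence/arousal regressor over mel-spectrograms (PyTorch).
import torch
import torch.nn as nn

class CnnLstmVARegressor(nn.Module):
    def __init__(self, n_mels: int = 64, hidden: int = 128):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                       # pool over frequency only
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(32 * (n_mels // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)                # valence, arousal

    def forward(self, mel):                             # mel: (batch, 1, n_mels, frames)
        feats = self.cnn(mel)                           # (batch, 32, n_mels//4, frames)
        b, c, f, t = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)
        _, (h_n, _) = self.lstm(feats)
        return self.head(h_n[-1])                       # (batch, 2)

model = CnnLstmVARegressor()
dummy = torch.randn(4, 1, 64, 300)                      # 4 clips, 64 mel bins, 300 frames
print(model(dummy).shape)                               # torch.Size([4, 2])
```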