The project consists of five parts:
- `\flask-song-app\`: A web application that recommends children's songs. Users select a song, and the app recommends similar songs. Deployed to Heroku: https://children-song-app.herokuapp.com/
- `\notebooks\1_*`: Data scraping and cleaning.
- `\notebooks\2_*`: Investigation of the relation between age ratings and audio features, plus NLP analysis of the lyrics.
- `\notebooks\3_*`: Age-rating models using audio features and lyrics.
- `\notebooks\4_*`: Song recommender.
- 2206 albums with age ratings were scraped from Common Sense Media, a non-profit whose mission is to ensure digital well-being for kids by providing expert reviews. See more info on how music is rated.
- The Spotify API was used to obtain the song tracks in each album. In total, 18K songs were found, along with their audio features and ISRC codes.
- The MusixMatch API was queried with the ISRC codes to obtain lyrics for 12K songs.
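As a rough illustration of how the two sources line up, the sketch below joins track records to lyrics by ISRC code. The record layout (`isrc`, `name`, `lyrics` keys), the helper name `attach_lyrics`, and the sample ISRC values are assumptions made for this example, not the project's actual schema. Tracks with no lyrics entry are simply dropped.

```python
def attach_lyrics(tracks, lyrics_by_isrc):
    """Return only the tracks that have lyrics, with the lyrics text attached.

    tracks: list of dicts with at least an "isrc" key (hypothetical layout).
    lyrics_by_isrc: dict mapping an ISRC code to its lyrics text.
    """
    joined = []
    for track in tracks:
        text = lyrics_by_isrc.get(track["isrc"])
        if text:  # skip tracks for which no lyrics were found
            joined.append({**track, "lyrics": text})
    return joined


# Made-up records, only to show the call pattern.
tracks = [
    {"isrc": "AA0000000001", "name": "Song A"},
    {"isrc": "AA0000000002", "name": "Song B"},
]
lyrics_by_isrc = {"AA0000000001": "twinkle twinkle little star"}
with_lyrics = attach_lyrics(tracks, lyrics_by_isrc)
```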
- Two age-rating models:
  - Use audio features to predict age ratings. A tree regression model uses 13 audio features (key, tempo, duration, etc., explained here) and popularity to predict age ratings. The model achieves an R^2 score of 0.50, with `popularity` and `duration` as the two most important features.
  - Use song lyrics to predict age ratings. After basic text preprocessing (tokenization, lemmatization, removing stop words), the processed lyrics are fed into a model pipeline consisting of `TfIdfVectorizer` and `RidgeRegressor`. `GridSearchCV` is used on a smaller subset to select the parameters: `min_df` and `max_df` for `TfIdfVectorizer`, and `alpha` for `RidgeRegressor`. The parameters for `TfIdfVectorizer` are reused later for lyrics-based song recommendation with the KNN model (which has no metric for tuning parameters). The model achieves an R^2 score of 0.4.
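The lyrics model described above can be sketched roughly as follows. This assumes the pipeline was built with scikit-learn, where the corresponding classes are named `TfidfVectorizer` and `Ridge`. The grid values, the helper name `fit_lyrics_model`, and the toy corpus are illustrative assumptions, not the values used in the notebooks.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline


def fit_lyrics_model(lyrics, ratings, cv=3):
    """Grid-search a TF-IDF + ridge regression pipeline on preprocessed lyrics."""
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("ridge", Ridge()),
    ])
    # Illustrative grid only; the values searched in the project are not given.
    param_grid = {
        "tfidf__min_df": [1, 2],
        "tfidf__max_df": [0.8, 1.0],
        "ridge__alpha": [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring="r2")
    search.fit(lyrics, ratings)
    return search


# Made-up toy corpus and age ratings, only to show the call pattern.
lyrics = [
    "twinkle twinkle little star", "lonely heart breaks at night",
    "the wheels on the bus go round", "dancing all night in the club",
    "old macdonald had a farm", "love and heartbreak tonight",
    "little star shines bright", "night club lights and love",
    "the bus goes round the town", "dancing with my love tonight",
    "farm animals sing a song", "lonely night without love",
]
ratings = [2, 14, 2, 15, 3, 14, 2, 15, 3, 14, 2, 13]

search = fit_lyrics_model(lyrics, ratings)
# Keep only the vectorizer settings, e.g. to reuse them in the recommender.
best_tfidf_params = {
    name: value for name, value in search.best_params_.items()
    if name.startswith("tfidf__")
}
predicted = search.predict(["little star on the farm"])
```

Extracting the `tfidf__` entries from `best_params_` mirrors the idea of carrying the tuned vectorizer settings over to the lyrics-based recommender.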
- Song recommendation with a K-Nearest Neighbors model using the following features:
  - Audio features: key, mode, time_signature, duration_ms, danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo. Explained here.
  - Song popularity: a number between 0 and 100 computed from the total number of plays the track has had and how recent those plays are. This number is provided by the Spotify API.
  - Age rating of the album that contains the track.
  - Song lyrics.
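A minimal sketch of the nearest-neighbor idea, using only numeric features. The z-score scaling, the Euclidean metric, the function names, and the sample values are all assumptions for illustration. The project's recommender also uses lyrics features, which are omitted here.

```python
import math


def standardize(rows):
    """Z-score each column so no single feature dominates the distance."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def recommend(titles, features, query_title, k=2):
    """Return the k songs closest to query_title in standardized feature space."""
    scaled = standardize(features)
    query = scaled[titles.index(query_title)]
    distances = [
        (math.dist(query, vector), title)
        for title, vector in zip(titles, scaled)
        if title != query_title
    ]
    return [title for _, title in sorted(distances)[:k]]


# Made-up values: [tempo, energy, popularity, age_rating]
titles = ["Lullaby", "Nursery Rhyme", "Pop Hit", "Rock Anthem"]
features = [
    [70.0, 0.20, 30.0, 2.0],
    [80.0, 0.30, 35.0, 3.0],
    [120.0, 0.80, 90.0, 13.0],
    [140.0, 0.90, 85.0, 14.0],
]
similar = recommend(titles, features, "Lullaby", k=2)
```

Scaling matters here because raw features such as `duration_ms` and `danceability` live on very different ranges, so an unscaled distance would be driven almost entirely by the largest-valued column.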
- What's the relation between age ratings and audio features?
  - Age rating is most correlated with `popularity` and `duration` (`popularity` measures how many users have played the track, and `duration` is the length of the song). (See the correlation plot or `2.2_Variables_Relation.ipynb`.)
  - Melodic modes (major vs. minor): in the 2-5 age group, more than 80% of the songs are in major keys, while in the 13-18 age group only 65% are. (See the plot or `2.2_Variables_Relation.ipynb`.)
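The two measurements above can be sketched as follows. Whether the notebook uses the Pearson coefficient is an assumption, as are the function names and the sample rows. The `mode` encoding follows Spotify's audio-features convention (1 = major, 0 = minor).

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def major_share(songs, min_age, max_age):
    """Fraction of songs rated within [min_age, max_age] that are in a major key."""
    group = [s for s in songs if min_age <= s["age"] <= max_age]
    return sum(s["mode"] == 1 for s in group) / len(group)


# Made-up rows, only to show the call pattern.
songs = [
    {"age": 2, "mode": 1, "duration_ms": 90_000},
    {"age": 4, "mode": 1, "duration_ms": 120_000},
    {"age": 5, "mode": 0, "duration_ms": 150_000},
    {"age": 13, "mode": 0, "duration_ms": 210_000},
    {"age": 16, "mode": 1, "duration_ms": 240_000},
]
ages = [s["age"] for s in songs]
durations = [s["duration_ms"] for s in songs]
r_duration = pearson(ages, durations)
young_major = major_share(songs, 2, 5)
```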
- What's the relation between age ratings and lyrics?
  - Visualized by plotting word polarity: divide the lyrics into two age groups (young vs. old) and use each word's conditional log-probabilities in the two groups as its (x, y) coordinates. In the plot, neutral words lie approximately on the line x = y. (See the plot or `2.5_NLP_Visualize_Lyrics_Word_Polarity.ipynb`.)
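The coordinates for such a word-polarity plot can be computed along these lines. The whitespace tokenization, the add-one smoothing, and the function names are assumptions for this sketch. The notebook may estimate the log-probabilities differently.

```python
import math
from collections import Counter


def word_log_probs(documents, vocabulary, smoothing=1.0):
    """Log P(word | group) with additive smoothing over a shared vocabulary."""
    counts = Counter(word for doc in documents for word in doc.split())
    total = sum(counts[w] for w in vocabulary) + smoothing * len(vocabulary)
    return {w: math.log((counts[w] + smoothing) / total) for w in vocabulary}


def word_polarity(young_docs, old_docs):
    """Map each word to (log P(w | young), log P(w | old))."""
    vocabulary = sorted({w for doc in young_docs + old_docs for w in doc.split()})
    young = word_log_probs(young_docs, vocabulary)
    old = word_log_probs(old_docs, vocabulary)
    return {w: (young[w], old[w]) for w in vocabulary}


# Made-up lyrics, already lowercased.
young_docs = ["twinkle little star", "sing a happy song"]
old_docs = ["lonely night", "sing a sad song"]
coords = word_polarity(young_docs, old_docs)
x, y = coords["sing"]    # same count in both groups: close to the line x = y
sx, sy = coords["star"]  # only in the young group: shifted toward the x side
```

Smoothing keeps words that are absent from one group at a finite coordinate instead of log(0), so every word in the shared vocabulary can be plotted.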
- What do the albums sing about?
  - LDA topic modeling is used to define 10 topics among all lyrics. Each topic is described by its topic keywords. (See `2.4_NLP_Topic_Modeling_Using_Song_Lyrics.ipynb`.)