This project analyzes Spotify music data using PostgreSQL. The dataset includes various attributes such as artist name, track name, album details, audio features (danceability, energy, loudness, etc.), and streaming statistics. The goal is to extract meaningful insights through SQL queries, covering exploratory data analysis (EDA), data cleaning, and advanced data analysis.
Source: https://www.kaggle.com/datasets/sanjanchaudhari/spotify-dataset/data
Table Name: spotify_data
Columns:
artist
: Name of the artisttrack
: Song titlealbum
: Album namealbum_type
: Type of album (Single, Album, etc.)danceability
,energy
,loudness
,speechiness
,acousticness
,instrumentalness
,liveness
,valence
,tempo
: Audio featuresduration_min
: Track duration in minutesviews
,likes
,comments
: YouTube statisticslicensed
,official_video
: Boolean flags for licensing and official music videosstream
: Spotify streaming countmost_playedon
: Platform where the song is most played (Spotify, YouTube, etc.)
To understand the dataset, the following analyses were performed:
- Data Cleaning: Handled null values, ensured consistent formatting, and checked for duplicate entries.
- Unique Counts: Number of unique artists, albums, and album types.
- Track Durations: Identified the longest and shortest tracks.
- Danceability & Energy Levels: Analyzed how engaging and energetic the songs are.
- Platform Insights: Checked the most played platform for each track.
- Counted the number of distinct artists and albums.
- Retrieved tracks with more than 1 billion streams.
- Found all tracks belonging to album type: Single.
- Listed all albums along with their respective artists.
- Summed the total number of comments for licensed tracks.
- Average danceability per album.
- Top 5 most energetic tracks.
- Total views of tracks per album.
- Tracks with more Spotify streams than YouTube views.
- Used window functions to find top 3 most-viewed tracks per artist.
- Identified tracks with above-average liveness scores.
- Energy-to-liveness ratio analysis to find highly energetic tracks.
- Cumulative sum of likes based on the number of views.
- Database: PostgreSQL
- Query Language: SQL
- Data Cleaning: Excel (for initial pre-processing)
- IDE: pgAdmin / DBeaver
In this project, we employed a variety of SQL techniques to analyze Spotify's music dataset, extracting meaningful insights about track popularity, artist performance, and user engagement. Below are the key SQL methods utilized:
- Used
SELECT
statements to retrieve relevant data fields. - Applied
WHERE
clauses to filter results based on conditions (e.g., filtering tracks with more than 1 billion streams). - Utilized
ORDER BY
to sort results based on numerical and categorical attributes.
SUM()
: Calculated total views and comments for different categories.AVG()
: Found the average danceability and liveness of tracks.COUNT()
: Counted the number of tracks per artist.GROUP BY
: Aggregated data to summarize artist performance, album statistics, and streaming platforms.
- Used
COALESCE()
to handleNULL
values in streaming data. - Applied
CASE WHEN
withinSUM()
to compare streams from different platforms (Spotify vs. YouTube).
- Subqueries: Used in
WHERE
clauses to filter based on computed aggregates (e.g., tracks with liveness above the dataset average). - WITH clauses (CTEs): Improved query readability by breaking down complex calculations into structured temporary tables (e.g., energy difference calculations across albums).
DENSE_RANK()
: Ranked the top 3 most-viewed tracks per artist.SUM() OVER()
: Computed cumulative likes across tracks, ordered by views.PARTITION BY
: Applied ranking within each artist’s dataset to compare their most popular tracks.
These SQL techniques allowed us to extract insights from large-scale music data, enabling deeper analysis of streaming trends and artist performance. 🚀### 📌 Key Learnings
- SQL Query Optimization: Improved query efficiency using
WITH
clauses andWINDOW FUNCTIONS
. - Data Cleaning Techniques: Identified missing and incorrect data before performing analysis.
- Trend Analysis: Discovered insights on track popularity, streaming patterns, and artist performance.
- Integrate Tableau/Power BI for data visualization.
- Apply machine learning techniques for track recommendation.
- Expand dataset to include different time periods for trend analysis.