Lyrics Finder is an automatic music genre classification system based on the textual analysis of song lyrics, using advanced Natural Language Processing (NLP) techniques and the BERT model.
The idea arose from the need to automatically categorize songs based on their lyrics, without relying on metadata or additional information. This approach has applications in areas such as personalized playlist creation, thematic analysis, and music discovery.
However, analyzing music lyrics is not a simple task: the use of metaphors, wordplay, and informal language makes it difficult for traditional models to achieve accurate results. For this reason, we chose BERT, an advanced deep learning model capable of understanding context in a bidirectional manner.
The project is organized into the following folders:
notebooks/ contains executable notebooks for Google Colab, developed to experiment with various stages of the pipeline:
- data_augmentation.ipynb: Pipeline with data augmentation techniques.
- first_pipeline.ipynb: Initial pipeline developed without data augmentation or optimization.
- optimization.ipynb: Optimized pipeline to improve model performance.
src/ contains the final, optimized version of the pipeline, developed for local execution:
- cleaning.py: Data cleaning script.
- eda.py: Exploratory data analysis of the dataset.
- preprocessing.py: Transformation and preparation of text for the model.
- modeling.py: Model creation and configuration using BERT.
- training.py: Model training.
- performance_evaluation.py: Model performance evaluation.
- pipeline.py: Main script that runs the entire pipeline sequentially, from data cleaning to final evaluation.
- requirements.txt: File with dependencies needed to run the project locally.
Users can run each stage separately or start the entire process by executing pipeline.py.
The dataset used contains the following information:
- Index
- Song Title
- Year of Release
- Artist
- Music Genre
- Song Lyrics
Dataset statistics:
- Total samples: 362,237
- Most represented genres: Rock (131,377), Pop (49,444)
- Least represented genres: Indie (5,732), R&B (5,935), Folk (3,241)
- Problem: Class imbalance
Data cleaning:
- Removal of null or missing data
- Elimination of unnecessary punctuation and symbols
- Cleaning of instrumental or corrupted texts
- Dataset balancing via undersampling
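The undersampling step can be illustrated with a minimal sketch. It assumes each song is a dictionary with a `genre` key and that every class is cut down to the size of the smallest one; the function name, the field names, and the toy rows are hypothetical and are not taken from cleaning.py.

```python
import random
from collections import Counter, defaultdict

def undersample(rows, label_key="genre", seed=42):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    # The smallest class sets the number of rows kept per class.
    target = min(len(group) for group in by_label.values())

    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced

# Toy rows for illustration only, not the real dataset.
rows = (
    [{"genre": "Rock", "lyrics": f"rock song {i}"} for i in range(6)]
    + [{"genre": "Pop", "lyrics": f"pop song {i}"} for i in range(3)]
    + [{"genre": "Folk", "lyrics": f"folk song {i}"} for i in range(2)]
)
balanced = undersample(rows)
print(Counter(row["genre"] for row in balanced))
```

Undersampling discards data from the large classes, so in practice the per-class target is often chosen as a compromise rather than strictly the minimum class size.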
Preprocessing:
- Removal of stop words
- Lemmatization
- Encoding of genres
- Dataset splitting
- Tokenization
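As a rough illustration of the genre encoding and dataset splitting steps, the sketch below maps genre names to integer ids and performs a stratified train/test split using only the standard library. The helper names, the 80/20 ratio, and the toy genre list are assumptions; the project lists Scikit-learn among its dependencies, whose `LabelEncoder` and `train_test_split` would be the more usual tools for this.

```python
import random
from collections import defaultdict

def encode_genres(genres):
    """Map each genre name to an integer id (sorted so the mapping is stable)."""
    return {genre: idx for idx, genre in enumerate(sorted(set(genres)))}

def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_indices, test_indices) with similar class proportions in both."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Keep at least one test example per class.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)

# Toy genre column for illustration only.
genres = ["Rock"] * 10 + ["Pop"] * 5 + ["Metal"] * 5
genre_to_id = encode_genres(genres)
labels = [genre_to_id[g] for g in genres]
train_idx, test_idx = stratified_split(labels)
print(genre_to_id, len(train_idx), len(test_idx))
```

Splitting per class keeps rare genres represented in the test set, which matters when the evaluation is read genre by genre.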
Modeling:
- Use of the pre-trained BERT model (BertForSequenceClassification, bert-base-uncased)
- Early Stopping (2 epochs)
Training:
- Forward pass → Loss calculation → Backpropagation → Weight update
- Stop at 4th epoch to avoid overfitting
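The early stopping logic can be sketched independently of the model. The example below assumes that "2 epochs" refers to a patience of two epochs without improvement in validation loss; that reading, the function name, and the loss values are assumptions made for illustration, not details taken from training.py.

```python
def run_with_early_stopping(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-based.

    val_losses stands in for the validation loss measured after each epoch.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Validation loss has not improved for `patience` epochs: stop.
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Made-up validation losses, for illustration only.
stop_epoch, best_epoch = run_with_early_stopping([0.95, 0.80, 0.84, 0.88, 0.91])
print(stop_epoch, best_epoch)
```

With these invented losses the best epoch is the second and training halts at the fourth, which mirrors the stopping point described above. In a real loop the weights from the best epoch would typically be saved and restored.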
Performance evaluation:
- Confusion Matrix: good overall accuracy, but the model has difficulty distinguishing between Rock and Hip-Hop
- Classification Report:
  - Pop & Metal: Good results
  - Rock & Hip-Hop: Lower performance
- Overall accuracy: 71%
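To make the evaluation metrics concrete, here is a small standard-library sketch that computes overall accuracy, confusion counts, and per-genre recall from two label lists. The function names and the toy predictions are hypothetical and do not reproduce the project's actual results.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true genre, predicted genre) pair occurs."""
    return Counter(zip(y_true, y_pred))

def recall_per_class(y_true, y_pred):
    """For each genre, the share of its songs that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {genre: hits[genre] / totals[genre] for genre in totals}

# Toy labels for illustration only.
y_true = ["Rock", "Rock", "Rock", "Hip-Hop", "Hip-Hop", "Pop", "Pop", "Metal"]
y_pred = ["Rock", "Hip-Hop", "Rock", "Rock", "Hip-Hop", "Pop", "Pop", "Metal"]

acc = accuracy(y_true, y_pred)
confusion = confusion_counts(y_true, y_pred)
recall = recall_per_class(y_true, y_pred)
print(acc, confusion[("Rock", "Hip-Hop")], recall)
```

Off-diagonal entries such as `("Rock", "Hip-Hop")` are where confusion between two genres shows up, which is why the confusion matrix is read alongside the single accuracy figure.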
Data augmentation:
- Back Translation
- Synonym Replacement
- Accuracy improved to 73%
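Synonym replacement can be illustrated with a tiny sketch. The synonym table below is a hypothetical hand-written dictionary, and the function name and parameters are assumptions; a real pipeline would more likely draw synonyms from a lexical resource. Back translation is not shown because it requires a translation model.

```python
import random

# Hypothetical mini-thesaurus, for illustration only.
SYNONYMS = {
    "love": ["affection", "devotion"],
    "night": ["evening", "dusk"],
    "sad": ["unhappy", "sorrowful"],
}

def synonym_replace(text, synonyms, n_replacements=1, seed=0):
    """Replace up to n_replacements words that have an entry in the synonym table."""
    rng = random.Random(seed)
    words = text.split()

    # Positions of words that can be replaced, visited in random order.
    candidates = [i for i, word in enumerate(words) if word.lower() in synonyms]
    rng.shuffle(candidates)

    for i in candidates[:n_replacements]:
        words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)

augmented = synonym_replace("i love the night", SYNONYMS, n_replacements=2)
print(augmented)
```

The augmented line keeps the original genre label, so each replacement yields an extra training example for the same class.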
Optimization:
- Increased maximum token length
- Lower learning rate
- Dropout
- Focal Loss
- Final accuracy: 72%
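Focal loss scales the usual cross-entropy term by (1 - p_t)^gamma, where p_t is the probability the model assigns to the true class, so well-classified examples contribute less to the loss. The sketch below shows this for a single example in plain Python; the value gamma = 2.0, the omission of the optional class-weighting factor, and the logits are assumptions, and the project itself would compute this with PyTorch tensors.

```python
import math

def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy for one example."""
    return -math.log(softmax(logits)[target])

def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Made-up logits: the true class is index 0 in both cases.
easy = focal_loss([4.0, 0.0, 0.0], target=0)  # confident and correct
hard = focal_loss([0.0, 4.0, 0.0], target=0)  # confident and wrong
print(easy, hard)
```

Setting gamma to 0 recovers plain cross-entropy, which makes it easy to check an implementation against a known baseline.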
Future developments:
- Expand the dataset
- Use more powerful hardware
- Test new data augmentation techniques
- Experiment with other BERT models
- Optimization through hyperparameter tuning
Lyrics Finder is designed to run in two modes:
- Local Execution: for users with NVIDIA GPUs (RTX 3060, 3070, 3080, or higher)
- Google Colab: for users without advanced hardware resources
Requirements:
- Python 3.x
- PyTorch
- Transformers (Hugging Face)
- Pandas, NumPy, Scikit-learn
- Google Colab (optional)
To run on Google Colab:
- Open the desired notebook
  - Upload one of the .ipynb files found in the notebooks/ folder to Google Colab.
  - For example, open optimization.ipynb to run the optimized pipeline.
- Connect the runtime to a GPU
  - Go to Runtime → Change runtime type → Select GPU.
- Run the cells sequentially
  - Follow the order of the cells to execute the various stages of the pipeline.
To run locally:
- Clone the repository
    git clone https://github.com/tuo-utente/LyricsFinder.git
    cd LyricsFinder/src
- Install the dependencies
    pip install -r requirements.txt
- Run the full pipeline
    python pipeline.py
- Run a specific phase of the pipeline (optional)
    python cleaning.py   # Runs only the data cleaning
    python training.py   # Trains the model
This project is licensed under the MIT License. See the LICENSE file for details.