This project develops and evaluates text classification models that assign each document to one of five predefined topics: business, entertainment, politics, sport, and tech. The work covers the following steps:
- Exploring text pre-processing techniques to clean the dataset before text vectorization (feature extraction)
- Applying different text vectorization techniques, such as TF-IDF, bag of words, and word embeddings from pre-trained models
- Splitting the dataset into 70% training and 30% testing
- Training machine learning classifiers, such as support vector machines (SVMs), Naïve Bayes, and neural networks, using the different text vectorization techniques
- Evaluating the classifiers on the test dataset using appropriate metrics, such as accuracy, precision, recall, and F1-score
- Comparing the classifiers' performance across the different text vectorization techniques
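
As a rough illustration of the vectorization and evaluation steps above, here is a minimal, dependency-free Python sketch. It is not the project's actual code: the function names, the whitespace tokenizer, the `tf * log(N / df)` weighting, and the toy sentences and labels are all assumptions made for illustration. In practice a library such as scikit-learn would normally handle these steps.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple pre-processing step)."""
    return text.lower().split()


def bag_of_words(docs):
    """Return one term-count Counter per document."""
    return [Counter(tokenize(doc)) for doc in docs]


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    counts = bag_of_words(docs)
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {t: (n / total) * math.log(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return vectors


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Toy sentences and labels invented for this example only.
    docs = ["shares rise", "team wins match", "shares fall"]
    for doc, vec in zip(docs, tf_idf(docs)):
        print(doc, vec)

    y_true = ["sport", "sport", "tech", "tech"]
    y_pred = ["sport", "tech", "tech", "tech"]
    print("accuracy:", accuracy(y_true, y_pred))
    print("sport (precision, recall, f1):", per_class_metrics(y_true, y_pred, "sport"))
```

Terms that occur in fewer documents receive a larger weight, which is why TF-IDF tends to emphasize topic-specific vocabulary over words shared by every category.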
The dataset was downloaded from Kaggle and extracted from its zip archive. It contains five sub-folders (business, entertainment, politics, sport, and tech), each holding documents from that domain. The number of documents per category is:
- Business - 510
- Entertainment - 386
- Politics - 417
- Sport - 511
- Tech - 401
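
The folder layout and the 70/30 split described above could be handled along the following lines. This is a sketch under stated assumptions rather than the project's code: `load_documents` and `stratified_split` are hypothetical helpers, the `.txt` file extension is assumed, and splitting per class (stratification) is a choice made here, since the text only specifies the 70/30 ratio.

```python
import random
from collections import Counter
from pathlib import Path


def load_documents(root):
    """Read every .txt file under root/<category>/ and return (texts, labels).

    The sub-folder name is used as the class label.
    """
    texts, labels = [], []
    for folder in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for path in sorted(folder.glob("*.txt")):
            texts.append(path.read_text(encoding="utf-8", errors="replace"))
            labels.append(folder.name)
    return texts, labels


def stratified_split(labels, test_fraction=0.3, seed=42):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


if __name__ == "__main__":
    # Rebuild the label list from the per-category counts listed above,
    # so the split sizes can be checked without the files on disk.
    counts = {"business": 510, "entertainment": 386, "politics": 417,
              "sport": 511, "tech": 401}
    labels = [name for name, n in counts.items() for _ in range(n)]
    train_idx, test_idx = stratified_split(labels)
    print("total:", len(labels))
    print("train:", len(train_idx), "test:", len(test_idx))
    print("test per class:", Counter(labels[i] for i in test_idx))
```

With the counts listed above (2,225 documents in total), this per-class rounding gives roughly 1,558 training and 667 test documents. Splitting within each class keeps the category proportions similar in both sets, which matters because the classes are not perfectly balanced.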