This repository contains an implementation of the following variants of the Naive Bayes classifier for detecting fake news:
- Naive Bayes raw (without TF-IDF)
- Naive Bayes with TF-IDF (Term Frequency-Inverse Document Frequency)
Several preprocessing techniques, such as tokenization and stopword removal, along with methods such as Laplace smoothing and weighting probabilities with TF-IDF scores, have been applied to maximize model accuracy.
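The repository's exact code may differ, but a minimal sketch of the raw count-based variant with these steps could look like the following (function names such as `train_nb` and `predict` are illustrative; it assumes NLTK's `punkt` and `stopwords` data have been downloaded):

```python
import math
from collections import Counter, defaultdict
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

STOP = set(stopwords.words("english"))

def preprocess(text):
    # Lowercase, tokenize, and drop stopwords / non-alphabetic tokens.
    return [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in STOP]

def train_nb(texts, labels):
    word_counts = defaultdict(Counter)   # per-class token counts
    class_counts = Counter(labels)       # documents per class
    for text, label in zip(texts, labels):
        word_counts[label].update(preprocess(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def predict(text, word_counts, class_counts, vocab):
    total_docs = sum(class_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        # Log prior + Laplace-smoothed (add-one) log likelihoods.
        score = math.log(class_counts[label] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for token in preprocess(text):
            score += math.log((counts[token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)
```

Laplace smoothing adds one to every count so that a word unseen in a class contributes a small but nonzero likelihood instead of zeroing out the whole product.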
Note: The classifier is implemented completely from scratch; libraries are used only for text preprocessing and evaluation.
The dataset used in this project is taken from here.
The columns used for training are:
- text: The content of the news article.
- label: The target label (fake or real).
You can modify the script to work with your own dataset by ensuring the column names match this expected structure, as shown below.
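For instance, a hypothetical CSV whose columns are named `content` and `target` (both names, and the file path, are illustrative) could be adapted like this:

```python
import pandas as pd

df = pd.read_csv("news.csv")                    # illustrative path
df = df.rename(columns={"content": "text",      # your article-text column
                        "target": "label"})     # your fake/real column
df = df[["text", "label"]].dropna()
```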
- Python
- pandas
- scikit-learn
- numpy
- matplotlib
- seaborn
Both models have been evaluated on the specified dataset only, and the following results were achieved (see the evaluation sketch after the list):
- Accuracy (Raw Naive Bayes): 96%
- Accuracy (TF-IDF Naive Bayes): 97%
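As a rough sketch, these numbers can be computed with scikit-learn's metrics (which this project uses for evaluation); the `y_test` / `y_pred` lists below are placeholder labels, not real predictions:

```python
from sklearn.metrics import accuracy_score, classification_report

# Placeholders standing in for the true and predicted labels of either model.
y_test = ["fake", "real", "real", "fake"]
y_pred = ["fake", "real", "fake", "fake"]

print(f"Accuracy: {accuracy_score(y_test, y_pred):.2%}")
print(classification_report(y_test, y_pred))
```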
Feel free to experiment with the models and improve their performance!
- Word Stemming
- I have assigned the TF-IDF score used when a term is not in the TF-IDF table (named `epsilon` in the code) as 1e-9, which is a comparatively large number relative to the range of probabilities in this dataset. For fairer predictions, `epsilon` can instead be set to 2e-308 (close to the smallest positive double); on re-evaluation, this change lowers accuracy from 97% to 96%. A sketch of the fallback follows this list.
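A minimal sketch of this epsilon fallback, assuming a flat per-term TF-IDF lookup (the table layout, weights, and names are illustrative, not the repository's actual structure):

```python
import math

# Hypothetical per-term TF-IDF weights for one class.
tfidf_table = {"election": 0.42, "hoax": 0.77}

EPSILON = 1e-9  # fallback for unseen terms; 2e-308 is the stricter alternative

def tfidf_log_score(tokens, table, log_prior=0.0):
    score = log_prior
    for token in tokens:
        # An unseen term contributes log(EPSILON) instead of zeroing the product.
        score += math.log(table.get(token, EPSILON))
    return score

print(tfidf_log_score(["election", "unseen_word"], tfidf_table))
```

Because the scores are combined in log space, the choice of fallback matters: a larger `EPSILON` (1e-9) penalizes unseen terms less than 2e-308 does, which explains the small accuracy difference between the two settings.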