SMS-Spam-Classification using Logistic Regression

Predicts spam messages with close to 98% precision

This dataset is the SMS Spam Collection, taken from the UCI Machine Learning Repository.

link - SPAM data (UCI machine learning repository)

Data Set Information:

This corpus was collected from sources on the Internet that are free or free for research use:

A collection of 425 SMS spam messages manually extracted from the Grumbletext Web site. This is a UK forum in which cell phone users make public claims about SMS spam messages, most of them without quoting the spam message they received. Identifying the text of spam messages in the claims is a hard and time-consuming task that involved carefully scanning hundreds of web pages.
A subset of 3,375 randomly chosen SMS ham messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the University. They were collected from volunteers who were made aware that their contributions would be made publicly available.
A list of 450 SMS ham messages collected from Caroline Tag's PhD thesis.
The SMS Spam Corpus, which has 1,002 SMS ham messages and 322 spam messages and is publicly available.

This corpus has been used in academic research.

Class	count	percentage
Spam	747	13.41 %
Ham	4825	86.59 %
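
The percentages in the table follow directly from the counts, and the same arithmetic shows why plain accuracy is a weak metric for this data. The sketch below only uses the two counts from the table; the "always predict ham" baseline is an illustration, not a model from this project.

```python
# Class counts taken from the table above.
counts = {"spam": 747, "ham": 4825}
total = sum(counts.values())

# Share of each class as a percentage, rounded to two decimals.
shares = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy of a trivial classifier that labels every message as ham.
baseline_accuracy = round(100 * counts["ham"] / total, 2)

print(total, shares, baseline_accuracy)
```

A classifier that never flags anything would already be right on about 87 out of 100 messages, which is why precision and recall on the spam class are more informative here.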

Objective:

Classify an SMS as SPAM or NOT SPAM so that developers can build an application that filters messages based on the prediction

Hurdles -

Finding an external list of spam words (spam_words) to compute a spam-word count, to mitigate out-of-vocabulary words and bias towards the 'SPAM' category
Reducing false positives at the minimum cost in false negatives (a better trade-off between precision and recall)
Imbalanced dataset
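
The precision/recall trade-off mentioned above can be seen by moving the decision threshold on the predicted spam probability. The following is a minimal sketch in plain Python; the labels and scores are made-up illustration data, and `precision_recall` is a hypothetical helper, not a function from this repository.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for the spam class (label 1) at a given threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = spam, 0 = ham) and predicted spam probabilities.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10, 0.05, 0.02]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold removes false positives (ham flagged as spam) but lets more spam through, so the threshold is chosen according to which error is more costly for the application.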

Skills Acquired-

Text processing / cleaning
Vectorization (Bag of words / TF-IDF)
Classification (Logistic regression)
Synthetic minority oversampling technique (SMOTE)
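
As a rough illustration of the first two steps in the list above, the sketch below cleans a few messages and turns them into TF-IDF weights using only the standard library. The cleaning rule, the example messages, and the smoothed IDF formula are assumptions made for illustration; the project's notebooks may use different preprocessing and a library vectorizer.

```python
import math
import re
from collections import Counter

def clean(text):
    """Lowercase, replace every run of non-letters with a space, then split."""
    return re.sub(r"[^a-z]+", " ", text.lower()).split()

def tfidf(documents):
    """Return (vocabulary, one {term: weight} dict per document)."""
    tokenized = [clean(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency (one common variant).
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}
    vectors = []
    for doc in tokenized:
        tf = Counter(doc)  # raw term counts, i.e. the bag-of-words view
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vocabulary, vectors

# Made-up example messages.
messages = [
    "WINNER!! Claim your FREE prize now",
    "Are you free for lunch today?",
    "Call now to claim your prize",
]
vocab, vectors = tfidf(messages)
print(vectors[1])
```

Words that appear in fewer messages (such as "lunch") receive a higher weight than words shared across messages (such as "free"), which is the property a downstream classifier relies on.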

Limitations of project

The semantics (exact meaning and context) of words are not taken into account.
The model may occasionally classify an important message as spam (a false positive) when it encounters out-of-vocabulary words.
The model needs to be updated continuously to keep up with out-of-vocabulary words.
The model should incorporate new slang and spam words emerging on social media.
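
The out-of-vocabulary limitation can be illustrated with a small sketch: words the vectorizer never saw during training contribute nothing to the features, while an external spam-word list can still count them. Both word sets and the `describe` helper below are hypothetical and exist only for this example.

```python
import re

# Hypothetical vocabulary learned from the training messages.
vocabulary = {"free", "prize", "claim", "call", "now", "lunch", "today"}
# Hypothetical external list of spam-indicative words.
spam_words = {"free", "prize", "claim", "winner", "crypto", "giveaway"}

def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def describe(message):
    """Split tokens into known and out-of-vocabulary, and count spam words."""
    tokens = tokenize(message)
    known = [t for t in tokens if t in vocabulary]
    oov = [t for t in tokens if t not in vocabulary]
    spam_word_count = sum(1 for t in tokens if t in spam_words)
    return {"known": known, "oov": oov, "spam_word_count": spam_word_count}

result = describe("Crypto giveaway: claim now")
print(result)
```

Here "crypto" and "giveaway" are invisible to a vectorizer fitted on the training vocabulary, yet they still raise the spam-word count, so the external list has to be kept current as new spam terms appear.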

Arvindhh931/SMS-Spam-Classification

Folders and files

Latest commit

History

Repository files navigation

SMS-Spam-Classification using Logistic Regression

predicted the spams with close to 98 % precision

Data Set Information:

Objective:

Hurdles -

Skills Aquired-

Limitations of project

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages