SMS Spam Classification Using Logistic Regression Overview This project focuses on the automatic classification of SMS spam messages using a Logistic Regression model. It utilizes the SMS Spam Collection Dataset and employs TF-IDF for feature extraction. The model is trained using Gradient Descent, and hyperparameters are optimized using 5-fold Cross-Validation and Grid Search. The final model achieves a high accuracy of 0.96, and demonstrates excellent performance in terms of Precision, Recall, and F1-Score.
Further evaluation of the model's performance is conducted through ROC curves and Precision-Recall curves, providing a thorough analysis of its effectiveness in detecting spam messages.
Key Features Dataset: The project uses the SMS Spam Collection dataset, which consists of labeled SMS messages categorized as "ham" (non-spam) and "spam".
Text Feature Extraction: The TF-IDF (Term Frequency-Inverse Document Frequency) method is employed to transform raw text into a sparse feature matrix for machine learning.
Model Training: The model is trained using Logistic Regression, with Gradient Descent as the optimization method.
Hyperparameter Optimization: Hyperparameters such as regularization strength and learning rate are optimized using Grid Search with 5-fold Cross-Validation.
Performance Metrics: The model's effectiveness is evaluated using various metrics, including:
Accuracy: 96% Precision, Recall, F1-Score ROC Curve Precision-Recall Curve Requirements Python 3.x
Libraries: nltk sklearn pandas numpy matplotlib seaborn
To install the necessary libraries, you can use pip:
pip install nltk scikit-learn pandas numpy matplotlib seaborn Additionally, the NLTK dictionary is used in this project, which can be accessed via the NLTK API.
Getting Started
- Clone the Repository git clone https://github.com/YOUR_USERNAME/BreadcrumbsThe-Research-on-Logistic-Regression-Model-for-SMS-Spam-Classification-.git cd BreadcrumbsThe-Research-on-Logistic-Regression-Model-for-SMS-Spam-Classification-
- Download the NLTK Data Before running the code, make sure the necessary NLTK data is downloaded. The NLTK dictionary can be accessed through the API by running: import nltk nltk.download('stopwords') nltk.download('punkt')
- Run the Code You can run the script to train and evaluate the logistic regression model: python sms_spam_classification.py This will automatically preprocess the dataset, extract features using TF-IDF, train the model, optimize the hyperparameters, and evaluate the performance.
Project Notes The SMS Spam Collection Dataset is available on the UCI Machine Learning Repository. If you haven't already downloaded it, make sure to place it in the correct directory or modify the path in the code.
The NLTK stopwords and punctuation list are used to clean the text data. These can be accessed using the NLTK API as shown in the code.
Grid Search with 5-fold Cross-Validation is computationally intensive, so make sure you have sufficient resources (CPU/RAM) if running on larger datasets.
Make sure to handle missing values and outliers in the dataset to ensure the model performs optimally.
Evaluation Metrics Confusion Matrix: This helps in visualizing the performance of the model, showing the true positives, false positives, true negatives, and false negatives.
ROC Curve: The Receiver Operating Characteristic curve helps in evaluating the trade-off between true positive rate (recall) and false positive rate.
Precision-Recall Curve: A valuable metric, especially for imbalanced datasets like spam classification.
Performance Results After training and optimization, the final model achieved the following performance on the test set:
Accuracy: 96% Precision: 0.95 Recall: 0.98 F1-Score: 0.96 The model was able to effectively classify spam messages, showing high performance on all evaluation metrics.
GitHub Topics Here are some relevant GitHub topics related to this project:
#MachineLearning #LogisticRegression #SMSSpamClassification #NLP #TextClassification #TF-IDF #DataScience #SpamDetection #GradientDescent Conclusion This project demonstrates the application of Logistic Regression for SMS spam classification, achieving high accuracy and strong performance metrics. The use of TF-IDF for text feature extraction and Gradient Descent for training makes this model both effective and efficient. The optimization of hyperparameters through Grid Search and Cross-Validation further enhances the model's predictive power.
Feel free to contribute to this project or open an issue if you find any bugs or have suggestions for improvement!
License This project is licensed under the MIT License - see the LICENSE file for details.
Notes on Project Usage: Ensure all dependencies are installed: Make sure to install the necessary libraries and download the required NLTK datasets as mentioned above. Dataset Path: If the dataset is located in a different path, make sure to adjust the file path in the code accordingly. Running the Code: Before running the code, ensure that you have the correct environment set up and that the necessary files (e.g., the dataset) are in place.
Contact If you have any questions or issues with this project, feel free to reach out to me via email at: Email: [ljw2556826312@gmail.com] I will be happy to assist you!