This project builds and optimizes a logistic regression model for automatic SMS spam classification based on the SMS Spam Collection dataset. TF-IDF is used for feature extraction, the model is trained with gradient descent, and hyperparameters are optimized through 5-fold cross-validation and grid search. The final model reaches an accuracy of 0.96 with strong precision, recall, and F1 scores, and the classification results are further evaluated with ROC and Precision-Recall curves. The model shows high performance and practical value for the SMS spam classification task.


SMS Spam Classification Using Logistic Regression

Overview

This project focuses on the automatic classification of SMS spam messages using a Logistic Regression model. It utilizes the SMS Spam Collection Dataset and employs TF-IDF for feature extraction. The model is trained using Gradient Descent, and hyperparameters are optimized using 5-fold Cross-Validation and Grid Search. The final model achieves a high accuracy of 0.96 and demonstrates excellent performance in terms of Precision, Recall, and F1-Score.

Further evaluation of the model's performance is conducted through ROC curves and Precision-Recall curves, providing a thorough analysis of its effectiveness in detecting spam messages.

Key Features

Dataset: The project uses the SMS Spam Collection dataset, which consists of labeled SMS messages categorized as "ham" (non-spam) and "spam".

Text Feature Extraction: The TF-IDF (Term Frequency-Inverse Document Frequency) method is employed to transform raw text into a sparse feature matrix for machine learning.
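A minimal sketch of this step, assuming scikit-learn's TfidfVectorizer (the example messages and parameter choices below are illustrative, not taken from the project's code):

```python
# Minimal sketch of the TF-IDF feature-extraction step (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer

messages = [
    "Free entry in 2 a wkly comp to win FA Cup final tkts",  # spam-like example
    "I'll call you when I get home tonight",                 # ham-like example
]

# Convert raw text into a sparse TF-IDF feature matrix.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
X = vectorizer.fit_transform(messages)

print(X.shape)                                   # (n_messages, n_terms)
print(vectorizer.get_feature_names_out()[:10])   # a few of the learned terms
```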

Model Training: The model is trained using Logistic Regression, with Gradient Descent as the optimization method.
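A rough NumPy sketch of what gradient-descent training of a logistic regression classifier can look like; the learning rate, iteration count, and regularization strength below are placeholder values, not the project's tuned settings:

```python
# Illustrative batch gradient descent for L2-regularized logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000, l2=0.01):
    """X: dense (n_samples, n_features) feature matrix; y: 0/1 labels."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)                          # predicted spam probabilities
        grad_w = X.T @ (p - y) / n_samples + l2 * w     # gradient of regularized log-loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w                                # gradient descent updates
        b -= lr * grad_b
    return w, b

def predict(X, w, b, threshold=0.5):
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```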

Hyperparameter Optimization: Hyperparameters such as regularization strength and learning rate are optimized using Grid Search with 5-fold Cross-Validation.
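A sketch of what this tuning could look like with scikit-learn's GridSearchCV; the use of SGDClassifier (logistic regression fitted by gradient descent) and the grid values are assumptions rather than the project's exact configuration:

```python
# Illustrative grid search with 5-fold cross-validation over regularization strength
# and learning rate. Grid values are placeholders, not the project's tuned settings.
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

# SGDClassifier with log loss is logistic regression trained by (stochastic) gradient
# descent; older scikit-learn versions spell the loss "log" instead of "log_loss".
base_model = SGDClassifier(loss="log_loss", learning_rate="constant", max_iter=1000)

param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2],   # L2 regularization strength
    "eta0": [0.01, 0.1, 1.0],      # learning rate
}

grid = GridSearchCV(base_model, param_grid, cv=5, scoring="f1", n_jobs=-1)
# grid.fit(X_train, y_train)       # X_train: TF-IDF features, y_train: 0/1 labels
# print(grid.best_params_, grid.best_score_)
```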

Performance Metrics: The model's effectiveness is evaluated using several metrics, including Accuracy (96%), Precision, Recall, F1-Score, the ROC Curve, and the Precision-Recall Curve.

Requirements

Python 3.x

Libraries: nltk, scikit-learn, pandas, numpy, matplotlib, seaborn

To install the necessary libraries, you can use pip:

pip install nltk scikit-learn pandas numpy matplotlib seaborn

Additionally, the project uses NLTK data (the stopwords corpus and the punkt tokenizer), which can be downloaded through the NLTK API as shown below.

Getting Started

  1. Clone the Repository

     git clone https://github.com/YOUR_USERNAME/The-Research-on-Logistic-Regression-Model-for-SMS-Spam-Classification-.git
     cd The-Research-on-Logistic-Regression-Model-for-SMS-Spam-Classification-

  2. Download the NLTK Data

     Before running the code, make sure the necessary NLTK data is downloaded. It can be fetched through the NLTK API by running:

     import nltk
     nltk.download('stopwords')
     nltk.download('punkt')

  3. Run the Code

     Run the script to train and evaluate the logistic regression model:

     python sms_spam_classification.py

     This will automatically preprocess the dataset, extract features using TF-IDF, train the model, optimize the hyperparameters, and evaluate the performance.

Project Notes

The SMS Spam Collection dataset is available on the UCI Machine Learning Repository. If you haven't already downloaded it, make sure to place it in the correct directory or modify the path in the code.

The NLTK stopwords and punctuation list are used to clean the text data. These can be accessed using the NLTK API as shown in the code.
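A minimal sketch of such a cleaning step (illustrative; the project's exact preprocessing may differ):

```python
# Illustrative text cleaning with NLTK stopwords and punctuation removal.
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)   # newer NLTK versions may also need 'punkt_tab'

STOPWORDS = set(stopwords.words("english"))

def clean_text(message: str) -> str:
    """Lowercase, tokenize, and drop stopwords and punctuation tokens."""
    tokens = word_tokenize(message.lower())
    kept = [t for t in tokens if t not in STOPWORDS and t not in string.punctuation]
    return " ".join(kept)

print(clean_text("WINNER!! You have won a free prize, call now!"))
# -> "winner won free prize call"
```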

Grid Search with 5-fold Cross-Validation is computationally intensive, so make sure you have sufficient resources (CPU/RAM) if running on larger datasets.

Make sure to handle missing values and outliers in the dataset to ensure the model performs optimally.
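For example, a basic check along these lines can be used when loading the data; the file name, encoding, and column names are assumptions and may need adjusting to match your copy of the dataset:

```python
# Illustrative missing-value check when loading the SMS Spam Collection dataset.
# File name, encoding, and column names are assumptions; adjust to your local copy.
import pandas as pd

df = pd.read_csv("spam.csv", encoding="latin-1")[["v1", "v2"]]
df.columns = ["label", "message"]

print(df.isna().sum())                  # missing values per column
df = df.dropna().drop_duplicates()      # drop rows with missing text and exact duplicates
df["label"] = df["label"].map({"ham": 0, "spam": 1})
```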

Evaluation Metrics

Confusion Matrix: This helps in visualizing the performance of the model, showing the true positives, false positives, true negatives, and false negatives.

ROC Curve: The Receiver Operating Characteristic curve helps in evaluating the trade-off between true positive rate (recall) and false positive rate.

Precision-Recall Curve: A valuable metric, especially for imbalanced datasets like spam classification.
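A sketch of how these evaluations can be produced with scikit-learn and matplotlib; model, X_test, and y_test are assumed names for a fitted classifier and held-out test data:

```python
# Illustrative evaluation: confusion matrix, ROC curve, and Precision-Recall curve.
import matplotlib.pyplot as plt
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    RocCurveDisplay,
    PrecisionRecallDisplay,
)

y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))        # [[TN, FP], [FN, TP]]
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class

# Both curves are built from predicted probabilities for the positive (spam) class.
y_scores = model.predict_proba(X_test)[:, 1]
RocCurveDisplay.from_predictions(y_test, y_scores)
PrecisionRecallDisplay.from_predictions(y_test, y_scores)
plt.show()
```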

Performance Results

After training and optimization, the final model achieved the following performance on the test set:

Accuracy: 96%
Precision: 0.95
Recall: 0.98
F1-Score: 0.96

The model was able to effectively classify spam messages, showing high performance on all evaluation metrics.

GitHub Topics

Here are some relevant GitHub topics related to this project:

#MachineLearning #LogisticRegression #SMSSpamClassification #NLP #TextClassification #TF-IDF #DataScience #SpamDetection #GradientDescent

Conclusion

This project demonstrates the application of Logistic Regression for SMS spam classification, achieving high accuracy and strong performance metrics. The use of TF-IDF for text feature extraction and Gradient Descent for training makes this model both effective and efficient. The optimization of hyperparameters through Grid Search and Cross-Validation further enhances the model's predictive power.

Feel free to contribute to this project or open an issue if you find any bugs or have suggestions for improvement!

License

This project is licensed under the MIT License - see the LICENSE file for details.

Notes on Project Usage

Dependencies: Install the necessary libraries and download the required NLTK datasets as mentioned above.

Dataset Path: If the dataset is located in a different path, adjust the file path in the code accordingly.

Running the Code: Before running the code, ensure that you have the correct environment set up and that the necessary files (e.g., the dataset) are in place.

Contact

If you have any questions or issues with this project, feel free to reach out via email: ljw2556826312@gmail.com. I will be happy to assist you!
