This project implements an end-to-end pipeline for detecting SMS spam using LLM-based embeddings (Mistral), interpretable machine learning, and risk-aware reporting.
It includes:
- Exploratory Data Analysis (EDA)
- Embedding generation using `ollama` Mistral model
- Random Forest classifier with performance evaluation
- LIME explanations for interpretability
- Executive-level HTML/PDF reporting using an LLM-generated narrative
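
The embedding step in the list above can be sketched as follows. This is a minimal illustration, not the notebooks' actual code: it assumes a local Ollama server on its default port (`http://localhost:11434`) with the `mistral` model already pulled, and it calls Ollama's `/api/embeddings` HTTP endpoint using only the standard library. The names `embed_messages`, `_post_json`, and the `post` parameter are hypothetical helpers introduced here.

```python
import json
import urllib.request
from typing import Callable, List

# Assumption: Ollama is running locally on its default port and the
# `mistral` model has been pulled. The notebooks may call Ollama differently.
OLLAMA_URL = "http://localhost:11434/api/embeddings"


def _post_json(url: str, payload: dict) -> dict:
    """Send a JSON POST request and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def embed_messages(
    messages: List[str],
    model: str = "mistral",
    post: Callable[[str, dict], dict] = _post_json,
) -> List[List[float]]:
    """Return one embedding vector per message, in input order."""
    vectors = []
    for text in messages:
        # One request per message; the reply carries the vector under "embedding".
        reply = post(OLLAMA_URL, {"model": model, "prompt": text})
        vectors.append(reply["embedding"])
    return vectors
```

The returned vectors can be stacked into the feature matrix for the Random Forest classifier. The `post` parameter exists only so the function can be exercised without a running server.
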
├── data/
│ ├── spam.csv # Raw dataset (Kaggle UCI SMS Spam)
│ ├── model_metrics.csv # Saved model evaluation metrics
├── plots/
│ ├── *.png # Visuals from EDA, LIME, Confusion Matrix
├── reports/
│ ├── fraud_detection_report.html
│ ├── fraud_detection_report.pdf
├── notebooks/
│ ├── 01_EDA.ipynb
│ ├── 02_llm_finetuning_prediction.ipynb
│ ├── 03_llm_executive_report.ipynb
├── README.md
- Clone the repository:

      git clone https://github.com/sumitdeole/LLM_text_data.git
      cd LLM_text_data
- Create the environment and install dependencies:

      conda create -n sms-fraud python=3.10 -y
      conda activate sms-fraud
      pip install -r requirements.txt
- Install additional system dependencies:
  - WeasyPrint requires the GTK3 runtime on Windows
  - Add `C:\Program Files\GTK3-Runtime Win64\bin` to your system PATH
- Loads and visualizes data
- Generates word clouds and top spam unigrams
- Generates 4096-D embeddings using `ollama`'s Mistral
- Trains a balanced Random Forest
- Saves metrics and plots
- Feeds metrics and text features to an LLM to generate the narrative
- Renders HTML/PDF executive report with visuals
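
As a rough sketch of how the saved metrics could feed the narrative step, the snippet below computes binary classification metrics, writes them to a CSV file, and turns that file into a prompt for the LLM. It is an illustration under assumptions, not the notebooks' code: the `metric,value` column layout, the prompt wording, and the function names `binary_metrics`, `save_metrics`, and `build_report_prompt` are all hypothetical. Sending the prompt to the LLM and rendering HTML/PDF are left out.

```python
import csv
from typing import Dict, List


def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for the positive class (spam = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def save_metrics(metrics: Dict[str, float], path: str) -> None:
    """Write metrics as rows of `metric,value` (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["metric", "value"])
        for name, value in metrics.items():
            writer.writerow([name, f"{value:.4f}"])


def build_report_prompt(path: str, top_terms: List[str]) -> str:
    """Read the metrics file and assemble a plain-text prompt for the LLM."""
    with open(path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))
    parts = ["Write a short executive summary of an SMS spam classifier.",
             "Evaluation metrics:"]
    parts.extend(f"- {row['metric']}: {row['value']}" for row in rows)
    parts.append("Frequent spam terms: " + ", ".join(top_terms))
    return "\n".join(parts)
```

For example, `save_metrics(binary_metrics(y_true, y_pred), "data/model_metrics.csv")` followed by `build_report_prompt("data/model_metrics.csv", ["free", "win"])` yields a prompt string that can be passed to the model of your choice.
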
MIT License
If you find this project helpful, feel free to give it a ⭐ on GitHub!