This project demonstrates a sarcasm detection model using a Long Short-Term Memory (LSTM) neural network. The model processes textual input to classify it as sarcastic or non-sarcastic. The notebook includes data preprocessing, model training, evaluation, testing, and visualization steps.
Sarcasm detection is crucial in natural language processing (NLP), especially for applications in social media sentiment analysis, chatbots, and opinion mining. Since sarcasm often conveys an opposite meaning from the literal words, it can mislead sentiment detection models. This project employs an LSTM model to learn sequential dependencies in text data for identifying sarcastic language.
The notebook uses labeled datasets containing sarcastic and non-sarcastic text samples from sources such as Reddit, including:
- DWAEF-sarc
- GEN-sarc
- HYP-sarc
- RQ-sarc
- REDDIT-sarc
The preprocessing steps include:
- Noise Removal: Removing special characters, URLs, and non-alphanumeric tokens to reduce noise.
- Tokenization and Padding: Tokenizing the text and padding sequences for uniform input length.
- Label Encoding: Converting sarcasm labels into a binary format for classification.
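The preprocessing steps above can be sketched in plain Python. The sample texts, vocabulary scheme, and `max_len` below are illustrative assumptions, not the notebook's actual values:

```python
import re

def clean_text(text):
    # Noise removal: strip URLs, then any non-alphanumeric characters
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return text.lower().split()  # simple whitespace tokenization

def pad_sequence(ids, max_len, pad_id=0):
    # Truncate or right-pad token-id sequences to a uniform length
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))

# Build a vocabulary from the corpus (index 0 reserved for padding)
corpus = ["Oh great, another Monday!", "The meeting starts at 3pm. http://example.com"]
vocab = {}
for doc in corpus:
    for tok in clean_text(doc):
        vocab.setdefault(tok, len(vocab) + 1)

encoded = [pad_sequence([vocab[t] for t in clean_text(doc)], max_len=8) for doc in corpus]
labels = [1, 0]  # binary label encoding: 1 = sarcastic, 0 = non-sarcastic
```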
The following libraries are required to run the notebook:
```bash
pip install numpy pandas torch scikit-learn matplotlib nltk seaborn
```

The notebook proceeds through the following steps:
- Load Dataset: Import and load the dataset into a Pandas DataFrame.
- Preprocess Data: Execute data cleaning, tokenization, padding, and label encoding.
- Build Model: Define an LSTM model with an embedding layer, LSTM layer, and dense output layer for binary classification.
- Train Model: Train the model on the training dataset with cross-validation.
- Evaluate Results: Calculate performance metrics and visualize the results.
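A single training step from the workflow above can be sketched with PyTorch. The stand-in linear model, batch shapes, and learning rate are illustrative assumptions so the sketch is self-contained; the notebook trains its LSTM model in place of the stand-in:

```python
import torch
import torch.nn as nn

# Stand-in model so this sketch runs on its own; the real notebook
# substitutes its embedding + LSTM + dense model here.
model = nn.Sequential(nn.Linear(10, 1))
criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy on raw logits
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

features = torch.randn(32, 10)               # one hypothetical mini-batch
targets = torch.randint(0, 2, (32,)).float()  # binary sarcasm labels

model.train()
optimizer.zero_grad()
loss = criterion(model(features).squeeze(-1), targets)
loss.backward()   # backpropagate the loss
optimizer.step()  # update model weights
```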
To run the notebook:
```bash
jupyter notebook lstm-sarcasm-detection.ipynb
```

The LSTM model architecture consists of:
- Embedding Layer: Converts tokens into dense vectors, capturing semantic meaning.
- LSTM Layer: Processes sequences, learning dependencies over time steps to detect contextual cues for sarcasm.
- Dense Layer: Outputs a binary classification for sarcastic and non-sarcastic labels.
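The three-layer architecture described above can be sketched as a PyTorch module. The dimensions (`embed_dim`, `hidden_dim`, vocabulary size) are illustrative assumptions, not the notebook's actual hyperparameters:

```python
import torch
import torch.nn as nn

class SarcasmLSTM(nn.Module):
    """Embedding -> LSTM -> dense head producing a single sarcasm logit."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)  # dense layer for binary output

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # final hidden state per sequence
        return self.fc(hidden[-1]).squeeze(-1)  # (batch,) raw logits

model = SarcasmLSTM(vocab_size=5000)
logits = model(torch.randint(1, 5000, (4, 20)))  # batch of 4 padded sequences
probs = torch.sigmoid(logits)                    # sarcasm probabilities
```

The single-logit head with a sigmoid is one common choice for binary classification; a two-unit softmax head would work equally well.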
The model undergoes 10-fold cross-validation to ensure reliable performance metrics. For each fold:
- The model is trained on a subset of data, and validation is performed on the remaining set.
- Metrics calculated per fold:
  - Accuracy: the fraction of all predictions that are correct.
  - Precision: the proportion of predicted-sarcastic samples that are truly sarcastic.
  - Recall: the proportion of truly sarcastic samples the model identifies.
  - F1 Score: the harmonic mean of precision and recall, balancing the two.
Averaged across the 10 folds, the model achieves:
- Accuracy: 70.97%
- Precision: 72%
- Recall: 69%
- F1 Score: 70%
These metrics are computed and averaged to assess the model's ability to generalize across different subsets.
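The 10-fold evaluation loop can be sketched with scikit-learn. The synthetic labels and the ~70%-accurate stand-in predictions below are assumptions that keep the sketch self-contained; in the notebook, the LSTM is trained inside each fold and its held-out predictions are scored:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Synthetic stand-in for real labels and trained-model predictions
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
preds = np.where(rng.random(200) < 0.7, y, 1 - y)  # ~70%-accurate stand-in

kf = KFold(n_splits=10, shuffle=True, random_state=0)
scores = {"accuracy": [], "precision": [], "recall": [], "f1": []}
for train_idx, val_idx in kf.split(y):
    # In the real notebook, the LSTM is trained on train_idx here;
    # this sketch only scores the held-out fold.
    yv, pv = y[val_idx], preds[val_idx]
    scores["accuracy"].append(accuracy_score(yv, pv))
    scores["precision"].append(precision_score(yv, pv, zero_division=0))
    scores["recall"].append(recall_score(yv, pv, zero_division=0))
    scores["f1"].append(f1_score(yv, pv, zero_division=0))

# Average each metric over the 10 folds
averaged = {name: float(np.mean(vals)) for name, vals in scores.items()}
```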