A multimodal sarcasm detection system utilizing image-caption generation and natural language processing, developed for the UITC2024 competition, where we achieved 1st place.
The Multimodal Sarcasm Detection System detects sarcasm in image-text pairs. It first generates a caption for each image with the pre-trained Vintern-1B-v2 model, then processes three input streams: the original text, the generated caption, and image features. A unified model combines these streams and assigns one of four categories: text sarcasm, image sarcasm, multi sarcasm, or not sarcasm.
The system was developed for the UITC2024 competition and aims to advance the understanding of sarcasm detection in multimodal contexts.
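The three input streams described above can be pictured as one record per sample. The snippet below is only a sketch in plain Python: `ModelInput`, `build_input`, `captioner`, and `image_encoder` are hypothetical names chosen for illustration and are not taken from this repository. In the real pipeline the captioner would be Vintern-1B-v2 and the encoder an image feature extractor; here both are stand-in callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ModelInput:
    """One sample, split into the three streams the classifier consumes."""
    text: str                    # original text of the post
    caption: str                 # caption generated from the image
    image_features: List[float]  # feature vector extracted from the image


def build_input(
    text: str,
    image_path: str,
    captioner: Callable[[str], str],            # hypothetical: image path -> caption
    image_encoder: Callable[[str], List[float]],  # hypothetical: image path -> features
) -> ModelInput:
    """Assemble the three input streams for a single image-text pair."""
    return ModelInput(
        text=text.strip(),
        caption=captioner(image_path).strip(),
        image_features=image_encoder(image_path),
    )


if __name__ == "__main__":
    # Stand-in callables so the sketch runs without any model weights.
    fake_captioner = lambda path: "a flooded street after heavy rain"
    fake_encoder = lambda path: [0.1, 0.2, 0.3]
    print(build_input("Great weather today!", "example.jpg", fake_captioner, fake_encoder))
```

Passing the captioner and encoder as arguments keeps the sketch independent of any particular model library.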
- Multimodal Input Handling
  - Text-based input: Handles the original text and the captions generated from images.
  - Image-based input: Generates image captions for context-based analysis.
- Sarcasm Classification
  - Classifies content into four categories: image sarcasm, text sarcasm, multi sarcasm, and not sarcasm.
- Model Architecture
  - Uses Vintern-1B-v2 for image captioning and transformer models for text analysis.
  - Uses ViT and Jina Embedding V3 for feature extraction, with training optimized using Cross Entropy and Focal Loss.
- Voting Model Integration
  - Combines the predictions of four different models trained for 2-class, 3-class, and 4-class tasks to produce the final prediction.
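For readers unfamiliar with Focal Loss, the sketch below shows how it relates to Cross Entropy for a single sample. It is written in plain Python for illustration only and is not the training code from this repository; the `gamma` and `alpha` defaults are common choices from the Focal Loss literature, not values confirmed for this project.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits: List[float], target: int) -> float:
    """Standard cross entropy: -log(p_t) for the true class."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits: List[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss: -alpha * (1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of well-classified
    samples, so harder samples contribute more to training.
    gamma and alpha here are assumed defaults, not project settings.
    """
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = [4.0, 0.0, 0.0, 0.0]  # model is confident in the true class 0
    hard = [0.0, 0.0, 0.0, 0.0]  # model is undecided across four classes
    for name, logits in (("easy", easy), ("hard", hard)):
        print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

With `gamma = 0` and `alpha = 1` the focal loss reduces to plain cross entropy, which is why the two are often used side by side on imbalanced label sets.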
Team Name | F1 | Precision | Recall |
---|---|---|---|
Faster-United | 0.4475 | 0.4403 | 0.4563 |
US1 | 0.4403 | 0.4462 | 0.5678 |
AIbou | 0.4386 | 0.4256 | 0.4935 |
BEd | 0.4328 | 0.4240 | 0.4574 |
MeowProfs | 0.4293 | 0.4185 | 0.4511 |
Our team, Faster-United, achieved 1st place in the UITC2024 competition with an F1 score of 0.4475. The table above lists the top 5 teams with their F1, Precision, and Recall scores.
- Clone the Repository
git clone https://github.com/xndien2004/Multimodal-Sarcasm-Detection-for-UITC2024.git
cd Multimodal-Sarcasm-Detection-for-UITC2024
- Install Dependencies
Make sure Python is installed, then install the necessary dependencies:
pip install -r requirements.txt
- Run the Trainer
To start training, execute:
bash run_trainer.sh
This starts training the sarcasm detection models on your data.
- Data Processing: The system processes image and text data, generating captions for images and using the original text for classification.
- Model Training: The four models (trained for 2-class, 3-class, and 4-class tasks) work together to detect sarcasm across different types of input.
- Voting Model: The predictions of individual models are aggregated using a Voting Model to produce the final classification.
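The aggregation rule used by the Voting Model is not spelled out here, so the following is only one possible scheme, sketched in plain Python under stated assumptions. It assumes each model returns a probability per label, that a 2-class model emits a coarse `sarcasm` label, and that this coarse probability is split evenly across the three sarcasm types. The label strings and the `expand` and `vote` helpers are hypothetical and not taken from this repository.

```python
from typing import Dict, List

# Hypothetical label identifiers for the four final categories.
LABELS = ["multi-sarcasm", "text-sarcasm", "image-sarcasm", "not-sarcasm"]
SARCASM_LABELS = LABELS[:3]


def expand(probs: Dict[str, float]) -> Dict[str, float]:
    """Map one model's output onto the four final labels.

    A coarse "sarcasm" probability (from a 2-class model) is split
    evenly across the three sarcasm types; this split is an assumption.
    """
    scores = {label: 0.0 for label in LABELS}
    for label, p in probs.items():
        if label == "sarcasm":
            for fine in SARCASM_LABELS:
                scores[fine] += p / len(SARCASM_LABELS)
        else:
            scores[label] += p
    return scores


def vote(model_outputs: List[Dict[str, float]]) -> str:
    """Soft voting: sum the expanded scores and return the top label."""
    total = {label: 0.0 for label in LABELS}
    for probs in model_outputs:
        for label, score in expand(probs).items():
            total[label] += score
    return max(LABELS, key=lambda label: total[label])


if __name__ == "__main__":
    outputs = [
        {"sarcasm": 0.9, "not-sarcasm": 0.1},                                   # 2-class
        {"multi-sarcasm": 0.7, "text-sarcasm": 0.2, "image-sarcasm": 0.1},      # 3-class
        {"multi-sarcasm": 0.5, "text-sarcasm": 0.1,
         "image-sarcasm": 0.1, "not-sarcasm": 0.3},                             # 4-class
    ]
    print(vote(outputs))
```

Soft voting is used in this sketch because it lets models with different label sets contribute to the same four-way decision; a hard majority vote or per-model weights would be equally plausible alternatives.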
├── Multimodal-Sarcasm-Detection-for-UITC2024/
│ ├── config/
│ │ ├── config_trainer.yaml
│ ├── pic/
│ ├── src/
│ │ ├── data_processing/
│ │ ├── multimodal_classifier/
│ │ ├── pipeline_notebook/
│ │ ├── utils.py
│ ├── requirements.txt
- Trần Xuân Diện
- Võ Trọng Nhơn
- Nguyễn Đăng Tuấn Huy