
MoRE: Mixture of Retrieval-Augmented Multimodal Experts

This repo is the official implementation of Biting Off More Than You Can Detect: Retrieval-Augmented Multimodal Experts for Short Video Hate Detection, accepted at WWW 2025. The paper can be accessed via OpenReview.

Abstract

Short Video Hate Detection (SVHD) is increasingly vital as hateful content, such as racial and gender-based discrimination, spreads rapidly across platforms like TikTok, YouTube Shorts, and Instagram Reels. Existing approaches face significant challenges: hate expressions continuously evolve, hateful signals are dispersed across multiple modalities (audio, text, and vision), and the contribution of each modality varies across different hate content. To address these issues, we introduce MoRE (Mixture of Retrieval-augmented multimodal Experts), a novel framework designed to enhance SVHD. MoRE employs specialized multimodal experts for each modality, leveraging their unique strengths to identify hateful content effectively. To ensure the model's adaptability to rapidly evolving hate content, MoRE leverages contextual knowledge extracted from relevant instances retrieved by a powerful joint multimodal video retriever for each target short video. Moreover, a dynamic sample-sensitive integration network adaptively adjusts the importance of each modality on a per-sample basis, optimizing the detection process by prioritizing the most informative modalities for each instance. MoRE adopts an end-to-end training strategy that jointly optimizes both the expert networks and the overall framework, resulting in nearly a twofold improvement in training efficiency, which in turn enhances its applicability to real-world scenarios. Extensive experiments on three benchmarks demonstrate that MoRE surpasses state-of-the-art baselines, achieving an average improvement of 6.91% in macro-F1 score across all datasets.

Framework

(Figure: overview of the MoRE framework)

Source Code Structure

data        # directory for each dataset
- HateMM
- MultiHateClip     # i.e., MHClip-B and MHClip-Y
    - en
    - zh

retrieval   # retrieval code

src         # MoRE source code
- config    # training configs
- model     # model implementation
- utils     # training utilities
- data      # dataloaders for MoRE

Dataset

We provide video IDs for each dataset in both temporal and five-fold splits. Due to copyright restrictions, the raw datasets are not included. You can obtain the datasets from their respective original project sites.

HateMM

Access the full dataset from hate-alert/HateMM.

MHClip-B and MHClip-Y

Access the full dataset from Social-AI-Studio/MultiHateClip, the official repository for the ACM Multimedia '24 paper "MultiHateClip: A Multilingual Benchmark Dataset for Hateful Video Detection on YouTube and Bilibili".

Usage

Requirements

To set up the environment, run the following commands:

conda create --name py312 python=3.12
conda activate py312
pip install torch transformers tqdm loguru pandas torchmetrics scikit-learn colorama wandb hydra-core

Data Preprocess

  1. Sample 16 frames from each video in the dataset.

  2. Extract on-screen text from the keyframes using PaddleOCR.

  3. Extract audio transcripts from the video audio using Whisper-v3.

  4. Encode visual features from each video using a pre-trained ViT model.

  5. Encode audio features as MFCCs with librosa.

  6. Encode textual features using a pre-trained BERT model.

# Refer to the preprocessing pipeline and make sure the paths to the source data files are configured correctly
cd preprocess
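
For reference, below is a minimal sketch of what the per-video feature extraction in steps 1, 4, 5, and 6 above might look like. The helper functions and checkpoints (google/vit-base-patch16-224-in21k, bert-base-uncased) are illustrative assumptions, not the repo's actual preprocessing scripts in preprocess/.

# NOTE: hypothetical sketch of per-video feature extraction, not the repo's actual code
import numpy as np
import librosa
import torch
import cv2
from transformers import ViTImageProcessor, ViTModel, BertTokenizer, BertModel

def sample_16_frames(video_path):
    # Uniformly sample 16 RGB frames from a video with OpenCV
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    idxs = np.linspace(0, max(total - 1, 0), 16).astype(int)
    frames = []
    for i in idxs:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(i))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

def encode_frames(frames):
    # Encode the sampled frames with a pre-trained ViT; returns (16, hidden_dim)
    processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
    vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k").eval()
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        out = vit(**inputs)
    return out.last_hidden_state[:, 0]  # per-frame [CLS] embeddings

def encode_audio(audio_path, n_mfcc=40):
    # Compute MFCC features with librosa; returns (n_mfcc, time)
    y, sr = librosa.load(audio_path, sr=16000)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

def encode_text(text):
    # Encode an OCR/transcript string with BERT; returns a (hidden_dim,) vector
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    bert = BertModel.from_pretrained("bert-base-uncased").eval()
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.last_hidden_state[:, 0].squeeze(0)  # [CLS] embedding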

Retrieval

  1. Encode audio transcripts using BERT to build the audio memory bank.

  2. Encode titles and descriptions using BERT to build the textual memory bank.

  3. Encode the 16 frames using ViT to build the visual memory bank.

# conduct retrieval
python retrieve/make_retrieval_result.py
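
Conceptually, the retrieval step stacks the per-video embeddings into one memory bank per modality and, for each target video, looks up its most similar instances. Below is a minimal sketch of that idea, assuming L2-normalized embeddings and cosine similarity; the actual logic lives in retrieve/make_retrieval_result.py.

# NOTE: hypothetical illustration of memory-bank retrieval, not the actual script
import numpy as np

def build_memory_bank(embeddings):
    # Stack per-video embeddings into an L2-normalized (N, D) matrix
    bank = np.stack(embeddings).astype(np.float32)
    return bank / np.linalg.norm(bank, axis=1, keepdims=True)

def retrieve_top_k(query, bank, k=10):
    # Return indices of the k most similar videos by cosine similarity
    q = query / np.linalg.norm(query)
    scores = bank @ q
    return np.argsort(-scores)[:k]

# e.g., neighbors = retrieve_top_k(bank[i], bank, k=10) for each target video i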

Run

# Run MoRE on the HateMM dataset
python src/main.py --config-name HateMM_MoRE

# Run MoRE on the MHClip-Y dataset
python src/main.py --config-name MHClipEN_MoRE

# Run MoRE on the MHClip-B dataset
python src/main.py --config-name MHClipZH_MoRE

Citation

If you find our research useful, please cite this paper:

@inproceedings{lang2025biting,
	author = {Lang, Jian and Hong, Rongpei and Xu, Jin and Li, Yili and Xu, Xovee and Zhou, Fan},
	booktitle = {The {Web} {Conference} ({WWW})},
	year = {2025},
	organization = {ACM},
	title = {Biting Off More Than You Can Detect: Retrieval-Augmented Multimodal Experts for Short Video Hate Detection},
}
