riccardogibello/HLPD

 
 

Decoding the Hierarchy: A Hybrid Approach to Hierarchical Multi-Label Text Classification

Introduction

Hierarchical multi-label text classification (HMTC) aims to predict multiple labels from a tree-like hierarchy for a given input text. Recent approaches frame HMTC as a Seq2Seq problem, where the objective is to predict the sequence of associated labels, regardless of their order or position in the hierarchy. Despite promising results, these approaches rely solely on attention over previously generated tokens. This limitation prevents them from acquiring information about the global hierarchy and may lead to the accumulation of errors as the model learns hierarchical cues among labels. We propose a novel HMTC model based on a hybrid version of the encoder-decoder architecture in which the decoder is pre-populated with the embeddings of all labels. By leveraging the decoder's cross-attention and hierarchical self-attention mechanisms, we obtain a label representation that benefits from both instance-level and global label-wise information. Empirical experiments on four HMTC benchmark datasets demonstrate the effectiveness of our approach, setting new state-of-the-art results. Code and datasets are made available to facilitate reproducibility and future work.
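To make the idea of hierarchical self-attention over pre-populated labels more concrete, here is a minimal sketch of a hierarchy-aware attention mask. It assumes that each label may attend only to itself and to its ancestors in the tree; this is one plausible reading, not necessarily the masking scheme used in this repository or in the paper. The function names (`ancestors`, `hierarchy_mask`) and the toy labels are hypothetical.

```python
from typing import Dict, List, Optional


def ancestors(label: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Return the ancestors of `label`, nearest first (root labels have parent None)."""
    chain = []
    node = parent[label]
    while node is not None:
        chain.append(node)
        node = parent[node]
    return chain


def hierarchy_mask(labels: List[str], parent: Dict[str, Optional[str]]) -> List[List[bool]]:
    """Build a square mask where mask[i][j] is True if label i may attend to label j.

    Assumption: a label attends to itself and to its ancestors only.
    """
    index = {name: i for i, name in enumerate(labels)}
    mask = [[False] * len(labels) for _ in labels]
    for name, i in index.items():
        mask[i][i] = True
        for anc in ancestors(name, parent):
            mask[i][index[anc]] = True
    return mask


# Toy hierarchy with hypothetical labels (not taken from any of the datasets).
PARENT = {"science": None, "physics": "science", "optics": "physics", "arts": None}
LABELS = list(PARENT)
MASK = hierarchy_mask(LABELS, PARENT)
```

In a full model, a matrix like `MASK` would typically be converted to a tensor and passed as the self-attention mask over the label positions in the decoder, while cross-attention to the encoded input text stays unmasked.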

Datasets

We conduct experiments on four public datasets:

  • Reuters Corpus (RCV1-V2)
  • Blurb Genre Collection (BGC)
  • Web of Science (WOS)
  • AAPD

The original Reuters corpus dataset can be obtained only by signing an agreement. You can find the other datasets in the data repository: data/hiera_multilabel_bench/data_files.

Experiments

To run experiments, use the train_DATASET_NAME.sh shell script for the corresponding dataset.
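If you prefer to launch the scripts from Python (for example, to loop over datasets), a small wrapper along these lines could work. It is only a sketch: it assumes the scripts live in the current working directory and can be run with `bash`, and the dataset name `"wos"` in the usage note is a placeholder, since the exact script file names are not listed here.

```python
import subprocess
from typing import List


def build_command(dataset_name: str) -> List[str]:
    """Build the shell command for the train_DATASET_NAME.sh script."""
    return ["bash", f"train_{dataset_name}.sh"]


def run_experiment(dataset_name: str) -> int:
    """Run the training script and return its exit code.

    Assumption: the script is located in the current working directory.
    """
    completed = subprocess.run(build_command(dataset_name), check=False)
    return completed.returncode
```

Calling `run_experiment("wos")` would then execute `bash train_wos.sh`, provided a script with that name exists; check the repository for the actual file names.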

Main Requirements

  • torch==1.12.0
  • transformers==4.20.0
  • datasets==2.6.1
  • scikit-learn==1.0.0
  • tqdm==4.62.0
  • wandb==0.12.0
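To check whether an environment matches the versions listed above, a short script using the standard-library `importlib.metadata` module may help. This is a convenience sketch, not part of the repository; the helper names (`installed_version`, `find_mismatches`) are hypothetical. Local build suffixes such as `+cu113` are ignored when comparing.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple

# Pinned versions copied from the list above.
PINNED = {
    "torch": "1.12.0",
    "transformers": "4.20.0",
    "datasets": "2.6.1",
    "scikit-learn": "1.0.0",
    "tqdm": "4.62.0",
    "wandb": "0.12.0",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version without any local suffix, or None if absent."""
    try:
        return metadata.version(package).split("+")[0]
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned: Dict[str, str]) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        found = installed_version(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

An empty result from `find_mismatches(PINNED)` means every listed package is installed at the pinned version; otherwise each entry shows the expected version next to the one found (or `None` if the package is missing).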

Citation

  • Torba, F., Gravier, C., Laclau, C., Kammoun, A., Subercaze, J. (2025). Decoding the Hierarchy: A Hybrid Approach to Hierarchical Multi-label Text Classification. In: Hauff, C., et al. Advances in Information Retrieval. ECIR 2025. Lecture Notes in Computer Science, vol 15572. Springer, Cham. https://doi.org/10.1007/978-3-031-88708-6_26
