Test the Best Hybrid Partition Generated by Hierarchical Community Detection Methods with k-NN Sparsification
This project induces a classifier from the CLUS framework or Random Forest on the best hybrid partition to enhance multilabel classification.
@misc{Gatto2025,
author = {Gatto, E. C.},
title = {Test Hybrid Partitions using Communities Detection Methods for Multilabel Classification},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/cissagatto/TcpKnnH}}
}
- Hybrid partition testing for multilabel classification
- Utilizes Hierarchical Community Detection methods
- k-NN sparsification applied to the clustering step (see the illustrative sketch after this list)
- Compatible with CLUS-FRAMEWORK and Random Forest classifiers
- Supports multiple multilabel datasets
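The snippet below is a purely illustrative sketch of the k-NN sparsification idea applied to a label-similarity matrix; it is not the repository's implementation, and the function name and toy matrix are invented for the example.

```r
# Purely illustrative sketch (not the repository's code): k-NN sparsification
# keeps, for each label, only the edges to its k most similar labels before
# community detection is applied to the resulting graph.
knn_sparsify <- function(sim, k) {
  # sim: square label-similarity matrix (e.g., Jaccard); k: neighbors to keep
  sparse <- matrix(0, nrow = nrow(sim), ncol = ncol(sim),
                   dimnames = dimnames(sim))
  diag(sim) <- 0                              # ignore self-similarity
  for (i in seq_len(nrow(sim))) {
    keep <- order(sim[i, ], decreasing = TRUE)[seq_len(k)]
    sparse[i, keep] <- sim[i, keep]           # keep the k strongest neighbors
  }
  pmax(sparse, t(sparse))                     # symmetrize the k-NN graph
}

# Example with a toy 4-label similarity matrix and k = 2
set.seed(1)
toy <- matrix(runif(16), 4, 4)
toy <- (toy + t(toy)) / 2
knn_sparsify(toy, k = 2)
```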
Coming soon...
- libraries.R → Loads required libraries
- utils.R → Helper functions and preprocessing utilities
- run.R → Main script execution
- run-rf.R → Runs the Random Forest classifier
- validateMaF1.R → Validates the hybrid partitions with the Macro-F1 criterion
- validateSilho.R → Validates the hybrid partitions with the Silhouette coefficient criterion
- testMaF1.R → CLUS: tests the best hybrid partition chosen with the Macro-F1 criterion
- testSilho.R → CLUS: tests the best hybrid partition chosen with the Silhouette coefficient criterion
- test-asoc.R → Random Forest: tests the best hybrid partition chosen with the Silhouette coefficient criterion
- tcp.R → Runs the experiment
- config-files.R → Configuration file template
A file named datasets-original.csv must be placed in the project root. This file contains metadata about 90 multilabel datasets. To use a custom dataset, include it in this file with the following structure:
Parameter | Status | Description |
---|---|---|
Id | Mandatory | Unique integer identifier for the dataset |
Name | Mandatory | Dataset name (follow benchmark naming conventions) |
Domain | Optional | Dataset domain |
Instances | Mandatory | Total number of instances |
Attributes | Mandatory | Total number of attributes |
Labels | Mandatory | Total number of labels |
Inputs | Mandatory | Number of input attributes |
Cardinality | Optional | Cardinality value |
Density | Optional | Density value |
Max.freq | Optional | Maximum frequency |
Mean.IR | Optional | Mean imbalance ratio |
AttStart | Mandatory | Column index where attributes begin |
AttEnd | Mandatory | Column index where attributes end |
LabelStart | Mandatory | Column index where labels begin |
LabelEnd | Mandatory | Column index where labels end |
xn | Mandatory | X dimension of Kohonen map |
yn | Mandatory | Y dimension of Kohonen map |
gridn | Mandatory | X * Y value (must be square) |
max.neighbors | Mandatory | Maximum number of neighbors (Labels - 1) |
📖 Click here for a detailed explanation of these properties.
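As a minimal sketch, this metadata file can be consulted from R as shown below; it assumes the file sits in the project root and uses the column names from the table above, and the emotions dataset name is used only as an example.

```r
# Minimal sketch (not part of the repository code): read datasets-original.csv
# and select one dataset's metadata.
datasets <- read.csv("datasets-original.csv", stringsAsFactors = FALSE)

# Select a dataset by its Name column (e.g., "emotions")
ds <- datasets[datasets$Name == "emotions", ]

# Column ranges delimiting the input attributes and the labels
att.cols   <- ds$AttStart:ds$AttEnd
label.cols <- ds$LabelStart:ds$LabelEnd

cat("Labels:", ds$Labels, "| max.neighbors:", ds$max.neighbors, "\n")
```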
- The experiment requires X-Fold Cross-Validation files in tar.gz format.
- Download pre-generated 10-fold cross-validation files for multiple datasets here.
- For a new dataset, add it to datasets-original.csv and generate cross-validation files using this repository.
- The tar.gz file can be stored in any directory; set its absolute path in the configuration file (see the unpacking sketch below).
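As a rough illustration of how a fold archive might be unpacked before processing (the paths below are placeholders, not repository defaults):

```r
# Hypothetical sketch: unpack a dataset's 10-fold cross-validation archive into
# the temporary processing directory. Both paths below are placeholders.
dataset.tar <- "/home/user/Datasets/CrossValidation/emotions.tar.gz"
temp.dir    <- "/dev/shm/user/emotions"

dir.create(temp.dir, recursive = TRUE, showWarnings = FALSE)
untar(dataset.tar, exdir = temp.dir)
list.files(temp.dir, recursive = TRUE)   # inspect the extracted fold files
```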
Ensure Java, Python, and R dependencies are installed manually. This project does not provide automatic installation.
- Recommended: use the Conda environment:
  `conda env create -f AmbienteTeste.yaml`
- Alternatively, use Apptainer containers for SLURM cluster execution. Tutorial (Portuguese).
Create a CSV file with the following structure:
Config | Value |
---|---|
Dataset_Path | Absolute path to dataset tar.gz |
Temporary_Path | Path for temporary processing ¹ |
Partitions_Path | Path to partition files |
Validation | "Silhouette", "Macro-F1", etc. |
Similarity | "jaccard", "rogers", etc. |
Classifier | "clus" or "random-forests" |
Dataset_Name | Name from datasets-original.csv |
Number_Dataset | ID from datasets-original.csv |
Number_Folds | Cross-validation folds |
Number_Cores | Number of CPU cores to use |
R_clone | 1 = Upload results to cloud, 0 otherwise |
Save_csv_files | 1 = Save CSV files |
📌 ¹ Use directories like /dev/shm, /tmp, or /scratch.
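As an illustration, the snippet below builds a hypothetical configuration file for the emotions dataset; every path and the dataset ID are placeholders that must be replaced with values matching your setup and the corresponding row in datasets-original.csv.

```r
# Hypothetical example: building a configuration CSV for the emotions dataset.
# All paths and the dataset ID below are placeholders.
config <- data.frame(
  Config = c("Dataset_Path", "Temporary_Path", "Partitions_Path", "Validation",
             "Similarity", "Classifier", "Dataset_Name", "Number_Dataset",
             "Number_Folds", "Number_Cores", "R_clone", "Save_csv_files"),
  Value  = c("/home/user/Datasets/emotions.tar.gz",    # placeholder path
             "/dev/shm/user",                          # fast temporary directory
             "/home/user/Partitions/emotions",         # placeholder path
             "Silhouette", "jaccard", "random-forests",
             "emotions", "12",                         # "12" is a placeholder ID
             "10", "10", "0", "1")
)
write.csv(config, "jsrf-emotions.csv", row.names = FALSE)
```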
To obtain partitions, use this repository.
📥 Download partitions here.
- RStudio Version: 1.4.1106 (Ubuntu Bionic)
- R Language Version: 4.1.0 ("Camp Pontanezen")
- Parallel execution is highly recommended.
- In our experiments, we used 10 cores.
- Tested on Ubuntu 20.04.2 LTS (Focal Fossa) with an Intel Core i7-10750H processor.
Open a terminal, navigate to ~/TcpKnnH/examples, and execute:
Rscript tcp.R [absolute_path_to_config_file]
Example:
Rscript tcp.R "~/TcpKnnH/config-files/jaccard/Silhouette/random-forests/jsrf-emotions.csv"
[Click here]
This study was funded by:
- CAPES (Finance Code 001)
- CNPQ (Process Number 200371/2022-3)
- FAPESP
📧 Elaine Cecília Gatto – elainececiliagatto@gmail.com
Website | LinkedIn | GitHub | [YouTube](https://www