Test the best hybrid partition generated by hierarchical community detection methods and k-NN sparsification

cissagatto/TcpKnnH

Test the Best Hybrid Partition Generated by Hierarchical Community Detection Methods with k-NN Sparsification

🎯 Project Goal

This project induces (trains) CLUS-FRAMEWORK or Random Forest models on the best hybrid partition, with the aim of improving multilabel classification.

📌 How to Cite

@misc{Gatto2025,
  author = {Gatto, E. C.},
  title = {Test Hybrid Partitions using Communities Detection Methods for Multilabel Classification},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/cissagatto/TcpKnnH}}
}

🔑 Main Features

  • Hybrid partition testing for multilabel classification
  • Utilizes Hierarchical Community Detection methods
  • k-NN sparsification applied in clustering
  • Compatible with CLUS-FRAMEWORK and Random Forest classifiers
  • Supports multiple multilabel datasets
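As a rough illustration of what k-NN sparsification of a label similarity graph can look like, the sketch below computes Jaccard similarity between label columns and keeps only the k strongest neighbors of each label. It is not taken from this repository: the toy matrix `Y`, the helper names `jaccard_matrix` and `knn_sparsify`, and the choice to symmetrize the result are all assumptions made for the example.

```r
# Toy binary label matrix: rows are instances, columns are labels (made-up data).
Y <- matrix(c(1, 1, 0, 0,
              1, 1, 1, 0,
              0, 1, 1, 0,
              0, 0, 1, 1,
              1, 0, 0, 1),
            ncol = 4, byrow = TRUE,
            dimnames = list(NULL, c("L1", "L2", "L3", "L4")))

# Jaccard similarity between every pair of label columns:
# (instances with both labels) / (instances with at least one of them).
jaccard_matrix <- function(Y) {
  both <- crossprod(Y)                        # co-occurrence counts
  counts <- diag(both)                        # per-label frequency
  either <- outer(counts, counts, "+") - both
  sim <- both / either
  sim[either == 0] <- 0                       # two empty labels: define as 0
  diag(sim) <- 0                              # ignore self-similarity
  sim
}

# Hypothetical helper: keep, for each label, only its k most similar
# neighbors, then symmetrize so an edge survives if either endpoint chose it.
knn_sparsify <- function(sim, k) {
  keep <- matrix(FALSE, nrow(sim), ncol(sim), dimnames = dimnames(sim))
  for (i in seq_len(nrow(sim))) {
    top <- order(sim[i, ], decreasing = TRUE)[seq_len(k)]
    keep[i, top] <- TRUE
  }
  keep <- keep | t(keep)
  diag(keep) <- FALSE
  sim * keep
}

sim <- jaccard_matrix(Y)
sparse <- knn_sparsify(sim, k = 1)
print(round(sparse, 2))
```

A sparsified matrix like `sparse` could then be handed to a community detection method as a weighted graph; how this project actually builds and partitions its graph may differ.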

📊 Flowchart

Coming soon...


⚙️ Code Structure

R Folder (Main Scripts):

  • libraries.R → Loads required libraries
  • utils.R → Helper functions & preprocessing utilities
  • run.R → Main script execution
  • run-rf.R → Runs the Random Forest classifier
  • validateMaF1.R → Validates the hybrid partitions using the Macro-F1 criterion
  • validateSilho.R → Validates the hybrid partitions using the Silhouette Coefficient criterion
  • testMaF1.R → CLUS: tests the best hybrid partition chosen by the Macro-F1 criterion
  • testSilho.R → CLUS: tests the best hybrid partition chosen by the Silhouette Coefficient criterion
  • test-asoc.R → Random Forest: tests the best hybrid partition chosen by the Silhouette Coefficient criterion
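For readers unfamiliar with the Macro-F1 criterion named above, here is a minimal sketch of how it is usually computed for multilabel predictions: F1 is calculated per label and then averaged. The function `macro_f1` and the toy matrices are illustrative assumptions, not code from validateMaF1.R.

```r
# Macro-F1: per-label F1 = 2*TP / (2*TP + FP + FN), averaged over labels.
macro_f1 <- function(y_true, y_pred) {
  tp <- colSums(y_true == 1 & y_pred == 1)
  fp <- colSums(y_true == 0 & y_pred == 1)
  fn <- colSums(y_true == 1 & y_pred == 0)
  denom <- 2 * tp + fp + fn
  f1 <- ifelse(denom == 0, 0, 2 * tp / denom)  # convention: 0 when undefined
  mean(f1)
}

# Made-up ground truth and predictions: 4 instances, 3 labels.
y_true <- matrix(c(1, 0, 1,
                   0, 1, 1,
                   1, 1, 0,
                   0, 0, 1), ncol = 3, byrow = TRUE)
y_pred <- matrix(c(1, 0, 0,
                   0, 1, 1,
                   0, 1, 0,
                   0, 1, 1), ncol = 3, byrow = TRUE)

score <- macro_f1(y_true, y_pred)
print(score)
```

Under this criterion, the candidate partition with the highest score on the validation folds would be the one selected for testing.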

Examples Folder:

  • tcp.R → Runs the experiment
  • config-files.R → Configuration file template

🛠️ Preparing Your Experiment

Step 1: Dataset Setup

A file named datasets-original.csv must be placed in the project root. This file contains metadata about 90 multilabel datasets. To use a custom dataset, include it in this file with the following structure:

| Parameter | Status | Description |
|---|---|---|
| Id | Mandatory | Unique integer identifier for the dataset |
| Name | Mandatory | Dataset name (follow benchmark naming conventions) |
| Domain | Optional | Dataset domain |
| Instances | Mandatory | Total number of instances |
| Attributes | Mandatory | Total number of attributes |
| Labels | Mandatory | Total number of labels |
| Inputs | Mandatory | Number of input attributes |
| Cardinality | Optional | Cardinality value |
| Density | Optional | Density value |
| Max.freq | Optional | Maximum frequency |
| Mean.IR | Optional | Mean imbalance ratio |
| AttStart | Mandatory | Column index where attributes begin |
| AttEnd | Mandatory | Column index where attributes end |
| LabelStart | Mandatory | Column index where labels begin |
| LabelEnd | Mandatory | Column index where labels end |
| xn | Mandatory | X dimension of Kohonen map |
| yn | Mandatory | Y dimension of Kohonen map |
| gridn | Mandatory | X * Y value (must be square) |
| max.neighbors | Mandatory | Maximum number of neighbors (Labels - 1) |

📖 Click here for a detailed explanation of these properties.
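Several of the mandatory fields can be derived from a few basic counts. The sketch below builds one metadata row under two assumptions that are not stated in this README: input attributes occupy the first columns with labels after them, and "must be square" means xn equals yn. The helper `make_dataset_row` and all example values (including the Id 91 and the dataset name) are hypothetical.

```r
# Hypothetical helper: build one metadata row for a custom dataset.
# Assumption: inputs come first (columns 1..inputs), labels fill the rest.
make_dataset_row <- function(id, name, instances, inputs, labels, xn, yn) {
  stopifnot(xn == yn)                 # assumed reading of "must be square"
  data.frame(
    Id = id, Name = name, Domain = NA,
    Instances = instances,
    Attributes = inputs + labels,
    Labels = labels, Inputs = inputs,
    Cardinality = NA, Density = NA, Max.freq = NA, Mean.IR = NA,
    AttStart = 1, AttEnd = inputs,
    LabelStart = inputs + 1, LabelEnd = inputs + labels,
    xn = xn, yn = yn, gridn = xn * yn,
    max.neighbors = labels - 1
  )
}

# Placeholder values for illustration only.
row <- make_dataset_row(id = 91, name = "my-dataset", instances = 500,
                        inputs = 20, labels = 5, xn = 2, yn = 2)
print(row)

# To append to the real file, something like this could be used:
# write.table(row, "datasets-original.csv", append = TRUE, sep = ",",
#             col.names = FALSE, row.names = FALSE)
```

Before appending, it is worth confirming that the column order matches the header of the existing datasets-original.csv.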


Step 2: Cross-Validation Files

  • The experiment requires X-Fold Cross-Validation files in tar.gz format.
  • Download pre-generated 10-fold cross-validation files for multiple datasets here.
  • For a new dataset, add it to datasets-original.csv and generate cross-validation files using this repository.
  • The tar.gz file can be stored in any directory, with its absolute path set in the configuration file.
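A quick sanity check on the archive before launching a long run can save time. The sketch below only verifies that a tar.gz file exists and lists its entries using base R; the helper `inspect_folds_archive` is hypothetical, and the demo archive and its file names are throwaway placeholders, since the expected internal layout is not described here.

```r
# Hypothetical helper: confirm the archive exists and list what it contains.
inspect_folds_archive <- function(path) {
  if (!file.exists(path)) stop("Archive not found: ", path)
  untar(path, list = TRUE)
}

# Self-contained demo with a throwaway archive (file names are made up).
demo_dir <- file.path(tempdir(), "cv-demo")
dir.create(demo_dir, showWarnings = FALSE)
writeLines("placeholder", file.path(demo_dir, "fold-1.txt"))

archive <- file.path(tempdir(), "cv-demo.tar.gz")
old_wd <- setwd(tempdir())
tar(archive, files = "cv-demo", compression = "gzip")
setwd(old_wd)

contents <- inspect_folds_archive(archive)
print(contents)
```

For a real experiment, `inspect_folds_archive()` would be called with the absolute path that is later written to the configuration file.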

Step 3: Install Dependencies

Ensure Java, Python, and R dependencies are installed manually. This project does not provide automatic installation.
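Because nothing is installed automatically, a small pre-flight check can confirm that the required runtimes are on the PATH. This sketch uses base R's `Sys.which`; the executable names `java`, `python3`, and `Rscript` are assumptions and may differ on your system, and the check does not cover R or Python packages.

```r
# Assumed executable names; adjust for your environment.
tools <- c("java", "python3", "Rscript")

# Sys.which returns the full path of each executable, or "" if not found.
found <- Sys.which(tools)
not_found <- names(found)[found == ""]

if (length(not_found) > 0) {
  message("Not found on PATH: ", paste(not_found, collapse = ", "))
} else {
  message("All listed executables were found.")
}
```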


Step 4: Configuration File Setup

Create a CSV file with the following structure:

| Config | Value |
|---|---|
| Dataset_Path | Absolute path to dataset tar.gz |
| Temporary_Path | Path for temporary processing ¹ |
| Partitions_Path | Path to partition files |
| Validation | "Silhouette", "Macro-F1", etc. |
| Similarity | "jaccard", "rogers", etc. |
| Classifier | "clus" or "random-forests" |
| Dataset_Name | Name from datasets-original.csv |
| Number_Dataset | ID from datasets-original.csv |
| Number_Folds | Cross-validation folds |
| Number_Cores | Number of CPU cores to use |
| R_clone | 1 = Upload results to cloud, 0 otherwise |
| Save_csv_files | 1 = Save CSV files |

📌 ¹ Use directories like /dev/shm, /tmp, or /scratch.
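The configuration file can also be generated from R. The following sketch writes a two-column CSV with the keys listed above. The paths are placeholders, the Number_Dataset value is a made-up ID that must be replaced with the real one from datasets-original.csv, and whether the project expects quoted or unquoted values is an assumption to verify against the config-files.R template.

```r
# Placeholder values for illustration; replace paths and the dataset ID.
config <- data.frame(
  Config = c("Dataset_Path", "Temporary_Path", "Partitions_Path",
             "Validation", "Similarity", "Classifier",
             "Dataset_Name", "Number_Dataset", "Number_Folds",
             "Number_Cores", "R_clone", "Save_csv_files"),
  Value = c("/path/to/datasets",      # hypothetical location of the tar.gz
            "/dev/shm",
            "/path/to/partitions",    # hypothetical location of partitions
            "Silhouette", "jaccard", "random-forests",
            "emotions",
            "1",                      # made-up ID, look it up in the CSV
            "10", "10", "0", "1"),
  stringsAsFactors = FALSE
)

# Written to a temporary file here so the example has no side effects.
config_file <- file.path(tempdir(), "jsrf-emotions.csv")
write.csv(config, config_file, row.names = FALSE)

print(read.csv(config_file, stringsAsFactors = FALSE))
```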


Step 5: Generate Partitions

To obtain partitions, use this repository.

📥 Download partitions here.


🖥️ Software Requirements

  • RStudio Version: 1.4.1106 (Ubuntu Bionic)
  • R Language Version: 4.1.0 ("Camp Pontanezen")

💻 Hardware Requirements

  • Parallel execution is highly recommended.
  • In our experiments, we used 10 cores.
  • Tested on Ubuntu 20.04.2 LTS (Focal Fossa) with an Intel Core i7-10750H processor.

▶️ Running the Experiment

Open a terminal, navigate to ~/TcpKnnH/examples, and execute:

Rscript tcp.R [absolute_path_to_config_file]

Example:

Rscript tcp.R "~/TcpKnnH/config-files/jaccard/Silhouette/random-forests/jsrf-emotions.csv"
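When several configuration files need to be run, the same command can be issued in a loop. The sketch below is a hypothetical wrapper, not part of this repository: `run_configs` prints the command for each file and only executes it when `dry_run = FALSE`. It expands a leading `~` with `path.expand` so that the script receives an absolute path, and it assumes the working directory is the examples folder.

```r
# Hypothetical wrapper around "Rscript tcp.R <config>".
run_configs <- function(config_files, script = "tcp.R", dry_run = TRUE) {
  for (cfg in config_files) {
    cfg <- path.expand(cfg)            # turn a leading "~" into a full path
    message("Rscript ", script, " ", shQuote(cfg))
    if (!dry_run) {
      status <- system2("Rscript", c(script, shQuote(cfg)))
      if (status != 0) warning("Run failed for ", cfg)
    }
  }
  invisible(length(config_files))
}

# Dry run: only prints the command that would be executed.
n <- run_configs(
  "~/TcpKnnH/config-files/jaccard/Silhouette/random-forests/jsrf-emotions.csv"
)
```

Running sequentially keeps each experiment's use of Number_Cores predictable; launching several runs at once would multiply the core usage.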

📥 Download Results

[Click here]


🏆 Acknowledgments

This study was funded by:

  • CAPES (Finance Code 001)
  • CNPq (Process Number 200371/2022-3)
  • FAPESP

📬 Contact

📧 Elaine Cecília Gatto: elainececiliagatto@gmail.com

🔗 Useful Links

Website | LinkedIn | GitHub | [YouTube](https://www
