FrancescoVinci/TSP_Clustering_compression

Clustering documents to compress inverted index

Project for the Information Retrieval and Web Search course for the 2023/2024 academic year.

For further information, see IRWS_Project_Report.pdf

Requirements

  • rich is a Python library for rich text and beautiful formatting in the terminal.

     pip install rich
    
  • matplotlib is a comprehensive library for creating static, animated, and interactive visualizations.

     pip install matplotlib
    

Assumptions

We assume that the medoid of each cluster is the first document inserted into it. The stream_cluster function in clustering.py contains a commented-out block that instead computes the actual medoid of each cluster using the Jaccard distance. Uncomment that block to use the actual medoid, but keep in mind that this makes the computation extremely slow.
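To illustrate why the actual medoid is expensive, here is a minimal sketch of a medoid computation under the Jaccard distance. It is not the code from clustering.py: the names jaccard_distance and medoid_index are hypothetical, and documents are assumed to be represented as sets of terms. Each call compares every document in the cluster with every other one, so the cost grows quadratically with the cluster size.

```python
from typing import List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Return 1 - |a ∩ b| / |a ∪ b| (0.0 when both sets are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def medoid_index(cluster: List[Set[str]]) -> int:
    """Return the index of the document with the smallest total
    Jaccard distance to all documents in the cluster.

    Hypothetical helper: assumes each document is a set of terms.
    Ties are resolved in favour of the earliest document.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one document")
    # One pass over all pairs: O(n^2) distance evaluations.
    totals = [
        sum(jaccard_distance(doc, other) for other in cluster)
        for doc in cluster
    ]
    return min(range(len(cluster)), key=totals.__getitem__)


if __name__ == "__main__":
    docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "b", "c", "d"}]
    print(medoid_index(docs))
```

Taking the first inserted document as the medoid avoids this pairwise pass entirely, at the price of a less representative cluster centre.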

Structure

TSP_Clustering_IRWS/
│
├── README.md
├── clustering.py
├── compression.py
├── create_dictionary.py
├── log.py
├── tsp.py
├── main.py
│
├── blocks/
├── plot/
└── collections/
    ├── lyrl2004_tokens_test_pt0.dat
    └── lyrl2004_tokens_test_pt1.dat

Before running the program, create the blocks/ folder and extract the lyrl2004_tokens_test_pt0.dat and lyrl2004_tokens_test_pt1.dat collections from their zip archive.

  • blocks/: will contain the blocks generated by SPIMI-invert.
  • plot/: will contain the output graphs.
  • report/: contains the report.pdf with information relating to the project.
  • collections/: contains the lyrl2004_tokens_test_pt0.dat and lyrl2004_tokens_test_pt1.dat collections, shipped as a zip archive that must be extracted before running.
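
The setup steps can also be scripted. The sketch below is not part of the repository: prepare_workspace is a hypothetical helper, and the archive file name is an assumption because the README does not state it, so pass whatever name the zip file in collections/ actually has.

```python
import zipfile
from pathlib import Path
from typing import List


def prepare_workspace(root: Path, archive_name: str) -> List[str]:
    """Create blocks/ and extract the collections archive.

    Hypothetical helper. `archive_name` is the zip file inside
    collections/; its real name is not given in the README.
    Returns the sorted names of the .dat files found afterwards.
    """
    # SPIMI-invert writes its blocks here; the folder must exist.
    (root / "blocks").mkdir(exist_ok=True)

    collections = root / "collections"
    with zipfile.ZipFile(collections / archive_name) as archive:
        archive.extractall(collections)

    return sorted(p.name for p in collections.glob("*.dat"))
```

Calling prepare_workspace(Path("."), "<archive name>.zip") from the project root should leave both .dat files next to the archive, after which main.py can be run.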

Usage

python3 main.py
