Project for the Information Retrieval and Web Search course for the 2023/2024 academic year.
For further information, see IRWS_Project_Report.pdf
- rich: a Python library for rich text and beautiful formatting in the terminal. Install with: pip install rich
- matplotlib: a comprehensive library for creating static, animated, and interactive visualizations. Install with: pip install matplotlib
We assume that the medoid of each cluster is the first document inserted.
The stream_cluster function in clustering.py contains a commented-out block that computes the medoid of each cluster according to the Jaccard distance. Uncomment it to compute the actual medoid, keeping in mind that this makes the computation extremely slow and inefficient.
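To illustrate why the exact medoid is expensive, here is a minimal sketch of a medoid computation under the Jaccard distance. It is not the code from clustering.py: the function names `jaccard_distance` and `medoid`, and the representation of each document as a set of term IDs, are assumptions made for this example.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections of terms: 1 - |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty documents are treated as identical (an assumption of this sketch).
        return 0.0
    return 1.0 - len(a & b) / len(union)


def medoid(docs):
    """Return the index of the document with the smallest total distance to all others."""
    if not docs:
        raise ValueError("cannot compute the medoid of an empty cluster")
    sets = [set(d) for d in docs]
    best_index, best_total = 0, float("inf")
    for i, candidate in enumerate(sets):
        # Every candidate is compared with every member: O(n^2) distance computations.
        total = sum(jaccard_distance(candidate, other) for other in sets)
        if total < best_total:
            best_index, best_total = i, total
    return best_index
```

Each candidate is compared against every other member, so the cost grows quadratically with cluster size. Repeating this whenever a cluster changes would explain the slowdown, while taking the first inserted document as the medoid avoids the comparison entirely.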
TSP_Clustering_IRWS/
│
├── README.md
├── clustering.py
├── compression.py
├── create_dictionary.py
├── log.py
├── tsp.py
├── main.py
│
├── blocks/
├── plot/
└── collections/
    ├── lyrl2004_tokens_test_pt0.dat
    └── lyrl2004_tokens_test_pt1.dat
Before running the program, create the blocks/ folder. The lyrl2004_tokens_test_pt0.dat and lyrl2004_tokens_test_pt1.dat collections are shipped compressed and must be extracted before running.
- blocks/: will contain the blocks generated by SPIMI-invert.
- plot/: will contain the output graphs.
- report/: contains the report.pdf with information relating to the project.
- collections/: the lyrl2004_tokens_test_pt0.dat and lyrl2004_tokens_test_pt1.dat collections are compressed in a zip folder; before running you need to extract them.
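The preparation steps described above can also be scripted. The following is a minimal sketch using only the Python standard library. The helper `prepare_workspace` is hypothetical, and since the README does not name the archive, its file name is passed in as a parameter rather than guessed.

```python
import zipfile
from pathlib import Path


def prepare_workspace(root, archive_name):
    """Create blocks/ and extract the collections archive, if it is present.

    `archive_name` is a placeholder: pass the actual name of the zip file
    found in collections/.
    """
    root = Path(root)
    # blocks/ must exist before the program writes the SPIMI-invert blocks.
    (root / "blocks").mkdir(parents=True, exist_ok=True)

    collections = root / "collections"
    archive = collections / archive_name
    if archive.is_file():
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(collections)

    # Report which .dat collections are now available.
    return sorted(p.name for p in collections.glob("*.dat"))


# Example call from the repository root (the archive name is hypothetical):
# prepare_workspace(".", "collections.zip")
```

If the function returns both lyrl2004_tokens_test_pt0.dat and lyrl2004_tokens_test_pt1.dat, the program can be started with the command below.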
python3 main.py