A text retrieval system implementing various information retrieval models with a Streamlit-based user interface.
- Multiple retrieval models:
  - Boolean Retrieval Model
  - Vector Space Model (VSM)
  - Latent Semantic Analysis (LSA)
  - Combined Model (Boolean + Vector)
- Interactive search interface
- Document statistics and visualizations
- Model evaluation metrics
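To illustrate how a combined Boolean + Vector model can work, here is a minimal sketch using only the standard library. It is not the code in this repository: the function names (`build_index`, `boolean_and`, `combined_search`), the whitespace tokenizer, and the `log(N / df)` weighting are all assumptions made for the example. Documents are first filtered with a Boolean AND over an inverted index, then the survivors are ranked by TF-IDF cosine similarity.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()


def tfidf_vector(tokens, index, n_docs):
    """Sparse TF-IDF vector using raw term counts and idf = log(N / df)."""
    counts = Counter(tokens)
    return {
        term: tf * math.log(n_docs / len(index[term]))
        for term, tf in counts.items()
        if term in index
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_search(docs, query):
    """Boolean AND filter first, then rank the candidates by cosine similarity."""
    index = build_index(docs)
    candidates = boolean_and(index, query)
    q_vec = tfidf_vector(tokenize(query), index, len(docs))
    scored = [
        (cosine(q_vec, tfidf_vector(tokenize(docs[i]), index, len(docs))), i)
        for i in candidates
    ]
    # Highest score first; ties broken by document id for stable output.
    return [i for _, i in sorted(scored, key=lambda pair: (-pair[0], pair[1]))]


docs = [
    "information retrieval with boolean queries",
    "vector space retrieval model",
    "latent semantic analysis",
]
print(combined_search(docs, "retrieval"))
```

The Boolean step keeps the candidate set small, so the more expensive similarity computation only runs on documents that already match every query term.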
- Clone this repository:
git clone https://github.com/dangvonguyen/IR-CS419.P21.git
cd IR-CS419.P21
- Set up the environment and install dependencies:
# Using pip
python -m venv .venv
source .venv/bin/activate
pip install -e .
# Using uv (faster installation)
uv sync
source .venv/bin/activate

Run the application using Streamlit:
streamlit run app.py

- Load Data: Use the sidebar to select the data source and parameters
- Select Model: Choose between Boolean, VSM, LSA, or Combined retrieval models
- Search: Enter queries in the search tab to retrieve relevant documents
- Analyze: View document statistics and model performance in the Statistics tab
- Browse: View loaded documents in the Documents tab
- app.py: Main Streamlit application
- src/models/: Implementation of retrieval models
  - boolean_model.py: Boolean retrieval with inverted index
  - vsm_model.py: Vector Space Model with TF-IDF
  - lsa_model.py: Latent Semantic Analysis model
  - combined_model.py: Combined Boolean and Vector model
- src/utils.py: Utility functions for text processing
- src/evaluate.py: Evaluation metrics for retrieval models
- ui/: User interface components
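
As an illustration of the kind of evaluation metrics a retrieval system typically reports, the sketch below computes precision, recall, and average precision for a single query. It is a hypothetical example and not the contents of src/evaluate.py: the function names and the choice of metrics are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against a set of relevant ids."""
    retrieved_set = set(retrieved)
    hits = len(retrieved_set & relevant)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant) if relevant else 0.0


ranked = [1, 2, 3, 4]      # hypothetical ranked result ids for one query
relevant = {2, 4, 5}       # hypothetical ground-truth relevant ids
print(precision_recall(ranked, relevant))
print(average_precision(ranked, relevant))
```

Averaging `average_precision` over a set of queries gives mean average precision (MAP), a common single-number summary for comparing ranked retrieval models.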