Python Search Engine with TF-IDF

A "from-scratch" implementation of a search engine in Python. This project indexes a corpus of French Wikipedia articles and ranks them using TF-IDF and cosine similarity. It includes a custom TF-IDF vectorizer implementation to compare against Scikit-learn's, and benchmarks different NLP preprocessing pipelines (lemmatization vs. stemming).

Table of Contents

  • About The Project
  • Key Features
  • Built With
  • Getting Started
  • Usage
  • Technical Deep Dive
  • Results & Analysis
  • Future Improvements
  • License

About The Project

This project was developed as part of an academic course on Natural Language Processing. The main objective was to build a complete information retrieval system from the ground up, covering every step from text preprocessing to performance benchmarking. The engine operates on a corpus of 2,000 French Wikipedia articles.

Key Features

  • TF-IDF Vectorization: Includes both Scikit-learn's TfidfVectorizer and a custom, from-scratch implementation for comparison.
  • Advanced NLP Preprocessing: A configurable pipeline to compare the impact of lemmatization (with SpaCy) versus stemming (with NLTK).
  • Dual-Mode Operation:
    • query mode for interactive, real-time searches via the command line.
    • test mode for automated benchmarking on a predefined set of 100 queries.
  • Cosine Similarity Ranking: Ranks documents based on the cosine similarity between query and document vectors.
  • Performance Benchmarking: Calculates Top-1 and Top-5 accuracy to evaluate the effectiveness of different configurations.

Built With

  • Python 3
  • Scikit-learn
  • NLTK (for stop-words and stemming)
  • SpaCy (for lemmatization)
  • NumPy

Getting Started

To get a local copy up and running, follow these steps.

  1. Clone the repository:
    git clone https://github.com/nico916/best_search_engine-.git
  2. Install dependencies: It is recommended to use a virtual environment.
    pip install -r requirements.txt
    # You may also need to download NLTK and SpaCy data:
    # python -m nltk.downloader stopwords
    # python -m spacy download fr_core_news_sm
    (Note: If a requirements.txt file is not provided, you will need to install the libraries listed under "Built With" manually.)

Usage

The main script search_engine.py can be run from the command line.

  • To perform an interactive query:
    python search_engine.py --mode query
  • To run the benchmark with default settings (Scikit-learn + lemmatization):
    python search_engine.py --mode test
  • To test the custom vectorizer with stemming:
    python search_engine.py --mode test --custom_vectorizer --preprocessing stemming
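  For context, the flags above could be declared with argparse roughly as follows. This is only a sketch of the command-line interface, not the exact code from search_engine.py:

    import argparse

    # Illustrative only: how the CLI flags shown above might be declared.
    parser = argparse.ArgumentParser(
        description="TF-IDF search engine over French Wikipedia articles")
    parser.add_argument("--mode", choices=["query", "test"], default="query",
                        help="Interactive search or automated benchmark")
    parser.add_argument("--custom_vectorizer", action="store_true",
                        help="Use the from-scratch vectorizer instead of scikit-learn's")
    parser.add_argument("--preprocessing", choices=["lemmatization", "stemming"],
                        default="lemmatization",
                        help="Morphological normalization strategy")
    args = parser.parse_args()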

Technical Deep Dive

The project follows a classic information retrieval pipeline.

Preprocessing Pipeline

Each document and query goes through the following normalization process (a code sketch follows the list):

  1. Lowercasing: Converts all text to lowercase.
  2. Tokenization: Splits text into individual words (tokens).
  3. Stop-word Removal: Filters out common, non-informative words.
  4. Morphological Normalization: Reduces words to their base form using either stemming or lemmatization.
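
Putting these steps together, a minimal sketch of such a pipeline could look like the code below. The function name preprocess and the regex-based tokenizer are illustrative assumptions; the actual implementation in search_engine.py may differ:

    import re
    import spacy
    from nltk.corpus import stopwords
    from nltk.stem.snowball import FrenchStemmer

    nlp = spacy.load("fr_core_news_sm", disable=["parser", "ner"])
    stemmer = FrenchStemmer()
    french_stopwords = set(stopwords.words("french"))

    def preprocess(text, mode="lemmatization"):
        """Lowercase, tokenize, drop stop words, then stem or lemmatize."""
        tokens = re.findall(r"\w+", text.lower())
        tokens = [t for t in tokens if t not in french_stopwords]
        if mode == "stemming":
            return [stemmer.stem(t) for t in tokens]
        # spaCy lemmatization: re-join tokens so the model sees running text
        return [tok.lemma_ for tok in nlp(" ".join(tokens))]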

TF-IDF Vectorization

The core of the engine is the transformation of text into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency). I implemented a custom vectorizer to understand the inner workings and compared it to Scikit-learn's highly optimized version.
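
As a rough illustration of what the from-scratch version involves, here is a minimal TF-IDF vectorizer using one common weighting scheme (relative term frequency multiplied by a smoothed IDF). The exact weighting and normalization choices in this repository may differ:

    import math
    from collections import Counter
    import numpy as np

    def fit_tfidf(tokenized_docs):
        """tokenized_docs: list of token lists. Returns (vocabulary, idf, doc-term matrix)."""
        vocab = {term: i for i, term in
                 enumerate(sorted({t for doc in tokenized_docs for t in doc}))}
        n_docs = len(tokenized_docs)
        # Document frequency: number of documents containing each term
        df = Counter(term for doc in tokenized_docs for term in set(doc))
        # Smoothed IDF (the same formula scikit-learn uses by default)
        idf = np.array([math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab])
        matrix = np.zeros((n_docs, len(vocab)))
        for row, doc in enumerate(tokenized_docs):
            for term, count in Counter(doc).items():
                matrix[row, vocab[term]] = (count / len(doc)) * idf[vocab[term]]
        return vocab, idf, matrix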

Ranking & Evaluation

The relevance of a document to a query is determined by the cosine similarity between their respective TF-IDF vectors. In test mode, the engine automatically computes the Top-1 and Top-5 accuracy against a ground-truth dataset.
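
In code, both the ranking and the Top-k metric reduce to a few lines of NumPy. The function names below are illustrative, not taken from the repository:

    import numpy as np

    def rank_documents(query_vec, doc_matrix, top_k=5):
        """Indices of the top_k documents by cosine similarity to the query vector."""
        sims = (doc_matrix @ query_vec) / (
            np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(query_vec) + 1e-12)
        return np.argsort(sims)[::-1][:top_k]

    def top_k_accuracy(ranked_results, expected_ids, k):
        """Fraction of queries whose expected document appears among the first k results."""
        hits = sum(expected in ranked[:k]
                   for ranked, expected in zip(ranked_results, expected_ids))
        return hits / len(expected_ids)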

Results & Analysis

Scenario                                Top-1 Accuracy   Top-5 Accuracy
A: Scikit-learn + Lemmatization         82%              97%
B: Scikit-learn + Stemming              85%              97%
C: Custom Vectorizer + Lemmatization    81%              97%
D: Custom Vectorizer + Stemming         85%              97%
  • Key Insight: Stemming consistently provided a slight edge in Top-1 accuracy over lemmatization in this context.
  • Robustness: All configurations achieved a high Top-5 accuracy of 97%, indicating that the correct document was almost always ranked among the most relevant results.
  • Error Analysis: Systematic failures on queries like "Elizabeth Ière" (vs. "Élisabeth Ire") highlight the model's sensitivity to exact lexical matches and the limits of a purely keyword-based approach.

Future Improvements

  • Semantic Search: Integrate word embeddings (e.g., SBERT) to understand query intent beyond keywords.
  • Query Expansion & Correction: Add spell-checking and synonym expansion.
  • Scalable Indexing: Replace the in-memory matrix with a persistent inverted index (e.g., Whoosh, FAISS) for larger corpora.
  • RAG Prototype: Use the retrieval system as a foundation for a Retrieval-Augmented Generation (RAG) pipeline with a Large Language Model.

License

Distributed under the MIT License. See the LICENSE file for more information.
