Wiki-Search-Engine

A 200k-articles-based Wikipedia Search Engine using multiple NLP models, ranging from basic SVM models to high-level techniques.

The Current Best Performing Model in this project

Streamline

document_preprocessor: RegexTokenizer('\w+')
l2r: True
VectorRanker
psedofeedback: {pseudofeedback_num_docs=10, pseudofeedback_alpha=0.5, pseudofeedback_beta=0.5}

Performance

mymodel outperformed baseline(L2R+BM25) in terms of MAP 64% and comparable at ndcg (-3%).

Dataset	MAP	NDCG
mymodel	0.064(0.018,0.110)	0.331(0.296,0.366)
L2R_BM25	0.039(0.018,0.060)	0.344(0.308,0.379)

In parentheses is interval range given 95% significance level

Pipeline Overview

Pipeline Details

`document_preprocessor.py`

`Tokenizer` class

A generic class for objects that turn strings into sequences of tokens. It is defaulted to lowcase all words and skip mulwiword expression match.

Child classes include: SplitTokenizer, RegexTokenizer , SpacyTokenizer.

Note that: RegexTokenizer with pattern \w+ is the defaulted tokenizer for this project.

`indexing.py`

`IndexType` class

Currently two types are supported by the system, PositionalIndex and BasicInveredIndex

`InvertedIndex` class

attributes include:

statistics: vocab corpus vocabulary tokens count, total_token_count # of total tokens including those filtered , number_of_documents , unique_token_count, stored_total_token_count # of total tokens excluding those filtered
vocabulary: corpus vocabulary
document_metadata: {docid:dict}, where dict includes length,unique_tokens,tokens_count
index:{token:[(docid1,freq),(docid2,freq)]}

main functions include:

.remove_doc(*)
.add_doc(*)
.get_postings(*)
.get_doc_metadata(*)
.get_term_metadata(*)
.get_statistics(*)
.save()
.load()

Note that: BasicInveredIndex is the defaulted indexing method for this project.

`Indexer` class

main functions include:

.create_index(*)

`network_features.py`

A class to generate network features, such as PageRank scores,HITS hub and authority scores. ad-hoc network features files required

main functions include:

.load_network(*)
.calaulate_page_rank(*)
.calculate_hits(*)
.get_all_network_statistics(*): return dataframe with columns ['docid', 'pagerank', 'authority_score', 'hubscore']

`l2r.py`

`L2RRanker` class

A class to generate L2RRanker model.

main attributes include:

dindex: InvertedIndex, document body index
tindex: InvertedIndex, document title index
tokenizer:Tokenizer, document preprocessor
stopwords: set[str], stopwords set
ranker: Ranker
ftextractor:L2RFeatureExtractor

main functions include:

.prepare_training_data(*)
.train(*)
.predict(*)
.query(*)

`L2RFeatureExtractor` class

A class to generate document features.

main attributes include:

dindex: InvertedIndex, document body index
tindex: InvertedIndex, document title index
tokenizer:Tokenizer, document preprocessor
stopwords: set[str], stopwords set
doc_category_info:external imported doc_category_info dictionary that map document id to a list of categories.
docid_to_network_features:dict, generated by netword_feature.py
ranker: Ranker
ftextractor:L2RFeatureExtractor
dindex_bm25: BM25 ranker for document body index
dindex_piv: PivotedNormalization ranker for document body index
ce_scorer: cross encoder

main functions include:

.generate_features(*)

`LambdaMART` class

Initialize and train L2R model

`ranker.py`

`Ranker` class

This class is responsible for generating a list of documents for a given query, ordered by scores.

main attributes include:

index
document_preprocessor
stopwords
scorer

main functions include:

.query(*)

`RelevanceScorer` class

The parent class for scorers. The child classes include: WordCountCosineSimilarity,DirichletLM, BM25, PivotedNormalization,TF_IDF

main functions include:

.score(*)

`CrossEncoderScorer` class

An object that uses cross-encoder to compute relevance of a document and query

`VectorRanker.py`

`VectorRanker` class

Initialize a vectorranker and set up query function.

main attributes include:

biencoder_model:the name of huggingface model to initialize a 'SentencTransformer'
encoded_docs: A matrix where each row is an already-encoded document
row_to_docid:A list mapping each row in the matrix to document id

main functions include:

.query(*)

`relevance.py`

main functions include:

map_score()
ndcg_score()
run_relevance_test()

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
.DS_Store		.DS_Store
.gitignore		.gitignore
.zshrc		.zshrc
README.md		README.md
app.py		app.py
document_preprocessor.py		document_preprocessor.py
image-1.png		image-1.png
image.png		image.png
indexing.py		indexing.py
l2r.py		l2r.py
models.py		models.py
network_features.py		network_features.py
pipeline.py		pipeline.py
ranker.py		ranker.py
relevance.py		relevance.py
vector_ranker.py		vector_ranker.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Wiki-Search-Engine

The Current Best Performing Model in this project

Streamline

Performance

Pipeline Overview

Pipeline Details

`document_preprocessor.py`

`Tokenizer` class

`indexing.py`

`IndexType` class

`InvertedIndex` class

`Indexer` class

`network_features.py`

`l2r.py`

`L2RRanker` class

`L2RFeatureExtractor` class

`LambdaMART` class

`ranker.py`

`Ranker` class

`RelevanceScorer` class

`CrossEncoderScorer` class

`VectorRanker.py`

`VectorRanker` class

`relevance.py`

About

Uh oh!

Releases

Packages

Languages

engmargo/Wiki-Search-Engine

Folders and files

Latest commit

History

Repository files navigation

Wiki-Search-Engine

The Current Best Performing Model in this project

Streamline

Performance

Pipeline Overview

Pipeline Details

document_preprocessor.py

Tokenizer class

indexing.py

IndexType class

InvertedIndex class

Indexer class

network_features.py

l2r.py

L2RRanker class

L2RFeatureExtractor class

LambdaMART class

ranker.py

Ranker class

RelevanceScorer class

CrossEncoderScorer class

VectorRanker.py

VectorRanker class

relevance.py

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`document_preprocessor.py`

`Tokenizer` class

`indexing.py`

`IndexType` class

`InvertedIndex` class

`Indexer` class

`network_features.py`

`l2r.py`

`L2RRanker` class

`L2RFeatureExtractor` class

`LambdaMART` class

`ranker.py`

`Ranker` class

`RelevanceScorer` class

`CrossEncoderScorer` class

`VectorRanker.py`

`VectorRanker` class

`relevance.py`

Packages