A Wikipedia search engine built over 200k articles, using multiple NLP ranking models that range from basic SVM-style learners to higher-level neural techniques.
- document_preprocessor: RegexTokenizer('\w+')
- l2r: True
- ranker: VectorRanker
- pseudo-feedback: {pseudofeedback_num_docs=10, pseudofeedback_alpha=0.5, pseudofeedback_beta=0.5}
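The pseudo-feedback parameters above (`pseudofeedback_num_docs`, `pseudofeedback_alpha`, `pseudofeedback_beta`) match a Rocchio-style update in embedding space. A minimal sketch, assuming dense document vectors and a dot-product score; `pseudo_feedback_query` is a hypothetical helper, not a function from this project:

```python
import numpy as np

def pseudo_feedback_query(query_vec, doc_vecs, num_docs=10, alpha=0.5, beta=0.5):
    """Rocchio-style pseudo-relevance feedback on dense vectors
    (hypothetical helper): score documents by dot product, take the
    top `num_docs`, and mix their centroid into the query vector."""
    scores = doc_vecs @ query_vec
    top = np.argsort(-scores)[:num_docs]
    centroid = doc_vecs[top].mean(axis=0)
    return alpha * query_vec + beta * centroid
```

The updated vector is then used to re-score the full collection.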
mymodel outperformed the baseline (L2R + BM25) by 64% relative in MAP and was comparable in NDCG (-3%).
| Model | MAP | NDCG |
|---|---|---|
| mymodel | 0.064 (0.018, 0.110) | 0.331 (0.296, 0.366) |
| L2R_BM25 | 0.039 (0.018, 0.060) | 0.344 (0.308, 0.379) |

Parentheses give the 95% confidence interval.
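Intervals like these are commonly obtained by bootstrap resampling of per-query metric scores; the README does not state the exact method used, so this is a generic percentile-bootstrap sketch:

```python
import random

def bootstrap_ci(per_query_scores, n_resamples=10000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a
    per-query metric (e.g. MAP over queries)."""
    rng = random.Random(seed)
    n = len(per_query_scores)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute the mean.
        sample = [per_query_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((1 - level) / 2 * n_resamples)]
    hi = means[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi
```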
A generic class for objects that turn strings into sequences of tokens. By default it lowercases all words and skips multiword expression matching.

Child classes include: `SplitTokenizer`, `RegexTokenizer`, `SpacyTokenizer`.

Note: `RegexTokenizer` with pattern `\w+` is the default tokenizer for this project.
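A minimal sketch of what the default tokenizer does (regex matching plus lowercasing); the real class also supports multiword expressions, which are skipped here:

```python
import re

class RegexTokenizer:
    """Sketch of the default tokenizer: find all regex matches,
    optionally lowercasing each token."""
    def __init__(self, pattern=r"\w+", lowercase=True):
        self.pattern = re.compile(pattern)
        self.lowercase = lowercase

    def tokenize(self, text):
        tokens = self.pattern.findall(text)
        return [t.lower() for t in tokens] if self.lowercase else tokens
```

For example, `RegexTokenizer().tokenize("Ann Arbor, MI")` yields `['ann', 'arbor', 'mi']`.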
Currently two index types are supported by the system: `PositionalIndex` and `BasicInveredIndex`.

Attributes include:
- `statistics`: dict of corpus statistics
  - `vocab`: corpus vocabulary token counts
  - `total_token_count`: number of total tokens, including those filtered
  - `number_of_documents`
  - `unique_token_count`
  - `stored_total_token_count`: number of total tokens, excluding those filtered
- `vocabulary`: corpus vocabulary
- `document_metadata`: `{docid: dict}`, where each dict includes `length`, `unique_tokens`, `tokens_count`
- `index`: `{token: [(docid1, freq), (docid2, freq), ...]}`
Main functions include:
- `.remove_doc(*)`
- `.add_doc(*)`
- `.get_postings(*)`
- `.get_doc_metadata(*)`
- `.get_term_metadata(*)`
- `.get_statistics(*)`
- `.save()`
- `.load()`
Note: `BasicInveredIndex` is the default indexing method for this project. Its additional main function is:
- `.create_index(*)`
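A toy version of the index layout described above, showing how `add_doc` and `get_postings` interact with the `index` and `document_metadata` attributes; this is an illustrative sketch, not the project's implementation:

```python
from collections import Counter

class InvertedIndexSketch:
    """Toy inverted index: `index` maps token -> [(docid, freq), ...],
    `document_metadata` maps docid -> per-document stats."""
    def __init__(self):
        self.index = {}
        self.document_metadata = {}

    def add_doc(self, docid, tokens):
        counts = Counter(tokens)
        for token, freq in counts.items():
            self.index.setdefault(token, []).append((docid, freq))
        self.document_metadata[docid] = {
            "length": len(tokens),
            "unique_tokens": len(counts),
        }

    def get_postings(self, term):
        return self.index.get(term, [])
```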
A class to generate network features, such as PageRank scores and HITS hub and authority scores. Pre-computed network feature files are required.

Main functions include:
- `.load_network(*)`
- `.calaulate_page_rank(*)`
- `.calculate_hits(*)`
- `.get_all_network_statistics(*)`: returns a DataFrame with columns `['docid', 'pagerank', 'authority_score', 'hubscore']`
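For intuition, PageRank can be computed by power iteration over the link graph. A self-contained sketch (the project loads a pre-computed network instead; dangling nodes are ignored here for brevity):

```python
def page_rank(links, damping=0.85, iters=50):
    """Power-iteration PageRank over {node: [outlinks]} adjacency lists.
    Illustrative only; mass from dangling nodes is not redistributed."""
    nodes = set(links) | {d for outs in links.values() for d in outs}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        new_rank = {node: (1 - damping) / n for node in nodes}
        for node, outs in links.items():
            if not outs:
                continue
            share = damping * rank[node] / len(outs)
            for dst in outs:
                new_rank[dst] += share
        rank = new_rank
    return rank
```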
A class to generate the L2RRanker model.

Main attributes include:
- `dindex`: InvertedIndex, document body index
- `tindex`: InvertedIndex, document title index
- `tokenizer`: Tokenizer, document preprocessor
- `stopwords`: set[str], stopword set
- `ranker`: Ranker
- `ftextractor`: L2RFeatureExtractor

Main functions include:
- `.prepare_training_data(*)`
- `.train(*)`
- `.predict(*)`
- `.query(*)`
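The README does not show the learner behind `.train(*)`, but learning-to-rank libraries (e.g. LightGBM's `LGBMRanker`) expect training data grouped per query. A sketch of what `.prepare_training_data(*)` plausibly produces; `extract_features` stands in for the L2RFeatureExtractor and is an assumed callable:

```python
def prepare_training_data(query_to_docs, extract_features):
    """Flatten {query: [(docid, relevance), ...]} into feature matrix X,
    label vector y, and per-query group sizes -- the shape expected by
    grouped learning-to-rank trainers."""
    X, y, groups = [], [], []
    for query, judged in query_to_docs.items():
        for docid, rel in judged:
            X.append(extract_features(query, docid))
            y.append(rel)
        groups.append(len(judged))
    return X, y, groups
```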
A class to generate document features.

Main attributes include:
- `dindex`: InvertedIndex, document body index
- `tindex`: InvertedIndex, document title index
- `tokenizer`: Tokenizer, document preprocessor
- `stopwords`: set[str], stopword set
- `doc_category_info`: externally imported dictionary mapping each document id to a list of categories
- `docid_to_network_features`: dict, generated by network_feature.py
- `ranker`: Ranker
- `ftextractor`: L2RFeatureExtractor
- `dindex_bm25`: BM25 ranker for the document body index
- `dindex_piv`: PivotedNormalization ranker for the document body index
- `ce_scorer`: cross-encoder scorer

Main functions include:
- `.generate_features(*)`
Initialize and train the L2R model.
This class is responsible for generating a list of documents for a given query, ordered by score.

Main attributes include:
- `index`
- `document_preprocessor`
- `stopwords`
- `scorer`

Main functions include:
- `.query(*)`
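The attributes above suggest a simple pipeline inside `.query(*)`: tokenize the query, drop stopwords, score each candidate document, and sort. A hedged sketch with the collaborators passed in as plain callables (names here are illustrative, not the project's API):

```python
def query_pipeline(query, tokenize, stopwords, score_doc, candidate_docids):
    """Sketch of a Ranker-style query: preprocess the query, score every
    candidate with the configured scorer, and sort by descending score."""
    tokens = [t for t in tokenize(query) if t not in stopwords]
    scored = [(docid, score_doc(docid, tokens)) for docid in candidate_docids]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```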
The parent class for scorers. Child classes include: `WordCountCosineSimilarity`, `DirichletLM`, `BM25`, `PivotedNormalization`, `TF_IDF`.

Main functions include:
- `.score(*)`
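As an example of what one of these scorers computes, here is a single term's BM25 contribution. This uses a common idf variant and the usual default parameters, which are not necessarily the values this project uses:

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """One query term's BM25 contribution for one document, using the
    idf form log((N - df + 0.5) / (df + 0.5) + 1)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
    # Saturating term-frequency component with document-length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm
```

The full document score is the sum of this quantity over the query terms.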
An object that uses a cross-encoder to compute the relevance of a document to a query.
Initializes a VectorRanker and sets up the query function.

Main attributes include:
- `biencoder_model`: the name of the HuggingFace model used to initialize a `SentenceTransformer`
- `encoded_docs`: a matrix where each row is an already-encoded document
- `row_to_docid`: a list mapping each row in the matrix to a document id

Main functions include:
- `.query(*)`
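Given the attributes above, the core of a bi-encoder query reduces to a matrix-vector product and a sort. A sketch over pre-encoded vectors (the query embedding would come from the `SentenceTransformer`; it is passed in directly here to keep the example self-contained):

```python
import numpy as np

def vector_rank(query_vec, encoded_docs, row_to_docid):
    """Sketch of VectorRanker scoring: dot product of the query
    embedding with every pre-encoded document row, sorted descending."""
    scores = encoded_docs @ query_vec
    order = np.argsort(-scores)
    return [(row_to_docid[i], float(scores[i])) for i in order]
```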
Main functions include:
- `map_score()`
- `ndcg_score()`
- `run_relevance_test()`
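The two metrics reported in the results table can be sketched for a single ranked list as follows (simplified: average precision is taken over retrieved relevant documents only, and NDCG uses the standard log2 discount; the project's exact definitions may differ):

```python
import math

def map_score(rel_flags):
    """Average precision for one ranked list of 0/1 relevance flags."""
    hits, precisions = 0, []
    for i, rel in enumerate(rel_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / hits if hits else 0.0

def ndcg_score(gains):
    """NDCG for one ranked list of graded relevance gains."""
    def dcg(g):
        return sum(x / math.log2(i + 1) for i, x in enumerate(g, start=1))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal else 0.0
```

A `run_relevance_test`-style driver would average these over all evaluation queries.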