Skip to content

Arjungambhir/Wiki_search_engine

Repository files navigation

Problem Statement ####

Task:​ ​ ​ Construct​ ​ the​ ​ Inverted​ ​ Index​ ​ from​ ​ the​ ​ given​ ​ small​ ​ snapshot​ ​ of​ ​ Wikipedia​ ​ dump.

Basic​ ​ Stages​ ​ (in​ ​ order):

● XML parsing [Prefer SAX parser over DOM parser. If you use DOM parser, you can’t scale​ ​ it​ ​ up​ ​ for​ ​ the​ ​ full​ ​ Wikipedia​ ​ dump​ ​ later​ ​ on.] ● Tokenization ● Case​ ​ folding ● Stop​ ​ words​ ​ removal ● Stemming
● Posting​ ​ List​ ​ /​ ​ Inverted​ ​ Index​ ​ Creation ● Optimize

Desirable​ ​ Features: ● Support for Field Queries ​ . Fields include Title, Infobox, Body, Category, Links, and References of a Wikipedia page. This helps when a user is interested in searching for the movie ‘Up’ where he would like to see the page containing the word ‘Up’ in the title and the word ‘Pixar’ in the Infobox. You can store field type along with the word when you​ ​ index. ● Index​ ​ size​ ​ should​ ​ be​ ​ less​ ​ than​ ​ 1⁄4​ ​ of​ ​ dump​ ​ size.​ ​ ● Scalable​ ​ index​ ​ construction​ ​ [See​ ​ Chapter​ ​ 4​ ​ in​ ​ the​ ​ ‘Intro​ ​ to​ ​ IR’​ ​ book.]

Project details ####

Title : Wikipedia search engine (IIITH) Platform : J2EE Description : A Wikipedia search engine built using Java, XML Parsing using SAX Parser,Ranking Algorithms High level of indexing which reduces the search time ,The index terms are hashed to characters 'a' - 'z', Index is compressed at bitlevel, Special infobox parsing to provide direct answers if possible, Index compression to make index half of its size, Special search fields provided so that user can directly search Special Features : Index compression to make index half of its size. ( bit level compression ) Special search fields provided so that user can directly search info infobox

Statistics :

For 100 GB of data Wiki XML Dump
Size of index (primary + secondary ) : 25 GB
Time to index : 6hr (average)
Time to search : 0.251 sec (average on 100 searches)

#############################################################################################################

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published