Problem Statement ####
Task: Construct an inverted index from the given small snapshot of the Wikipedia dump.
Basic Stages (in order; a SAX parsing skeleton and a text-pipeline sketch follow this list):
● XML parsing [Prefer a SAX parser over a DOM parser: a DOM parser loads the whole document into memory, so it cannot scale up to the full Wikipedia dump later on.]
● Tokenization
● Case folding
● Stop-word removal
● Stemming
● Posting List / Inverted Index Creation
● Optimization (e.g., index compression; see 'Project details' below)
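A minimal SAX skeleton for the parsing stage, assuming the standard dump layout where each article sits under <page><title>...</title><revision><text>...</text></revision></page>; the indexPage hook is a placeholder for the later stages:

```java
import java.io.File;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class WikiDumpParser extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    private String title;
    private boolean inTitle, inText;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("title".equals(qName)) { inTitle = true; buffer.setLength(0); }
        else if ("text".equals(qName)) { inText = true; buffer.setLength(0); }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // SAX may deliver a text node in several chunks, so accumulate
        if (inTitle || inText) buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) { title = buffer.toString(); inTitle = false; }
        else if ("text".equals(qName)) {
            inText = false;
            indexPage(title, buffer.toString());  // hand the page to stages 2-5
        }
    }

    // placeholder: tokenization, case folding, stop-word removal, stemming, indexing
    private void indexPage(String pageTitle, String text) { }

    public static void main(String[] args) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new File(args[0]), new WikiDumpParser());
    }
}
```

Because SAX streams the document and only ever holds one page's text in memory, the same code works unchanged on the full dump.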
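Stages 2 through 5 then run per page. A minimal sketch, with an assumed toy stop list and a crude suffix-stripping stemmer standing in for a real one such as Porter's:

```java
import java.util.*;

public class IndexPipeline {
    // tiny stop list for illustration only; a real build loads a full list
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "of", "and", "in", "is", "to"));

    // term -> sorted map of docId -> term frequency (the posting list)
    private final Map<String, TreeMap<Integer, Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        // tokenization: split on any run of non-alphanumeric characters
        for (String token : text.split("[^A-Za-z0-9]+")) {
            if (token.isEmpty()) continue;
            String term = token.toLowerCase();          // case folding
            if (STOP_WORDS.contains(term)) continue;    // stop-word removal
            term = stem(term);                          // stemming
            index.computeIfAbsent(term, k -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);        // posting-list update
        }
    }

    // placeholder stemmer: strips two common suffixes; swap in Porter's algorithm
    private String stem(String term) {
        if (term.endsWith("ing") && term.length() > 5) return term.substring(0, term.length() - 3);
        if (term.endsWith("s") && term.length() > 3) return term.substring(0, term.length() - 1);
        return term;
    }
}
```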
Desirable Features:
● Support for field queries. Fields include the Title, Infobox, Body, Category, Links, and References of a Wikipedia page. This helps when a user searching for the movie 'Up' wants the page containing the word 'Up' in the title and the word 'Pixar' in the infobox. You can store the field type along with the word when you index (a field-tagging sketch follows this list).
● Index size should be less than 1/4 of the dump size.
● Scalable index construction [see Chapter 4 in the 'Intro to IR' book.]
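One way to store the field type along with the word is to prefix every indexed term with a one-letter field code. The codes below are an assumed convention for illustration, not part of the assignment:

```java
// Hypothetical field codes: t = title, i = infobox, b = body,
// c = category, l = links, r = references (an assumed convention)
public static String fieldTerm(char fieldCode, String term) {
    return fieldCode + ":" + term;   // e.g. "t:up" for title, "i:pixar" for infobox
}
```

A query such as title:Up AND infobox:Pixar then reduces to intersecting the posting lists of "t:up" and "i:pixar".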
Project details ####
Title: Wikipedia search engine (IIITH)
Platform: J2EE
Description: A Wikipedia search engine built using Java, with XML parsing done by a SAX parser and ranking algorithms for result ordering. A high level of indexing reduces search time: index terms are hashed to the characters 'a'-'z', the index is compressed at the bit level, and special infobox parsing provides direct answers where possible.
Special Features:
● Index compression that halves the index size (bit-level compression; a sketch follows).
● Special search fields so that the user can directly search the infobox.
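The write-up does not name the bit-level scheme. One standard choice from the 'Intro to IR' book is Elias gamma coding applied to docID gaps; the encoder below is a minimal sketch under that assumption:

```java
import java.util.BitSet;

public class GammaEncoder {
    // Elias gamma code for n >= 1: (floor(log2 n)) zero bits,
    // followed by the binary form of n (which always starts with 1).
    private final BitSet bits = new BitSet();
    private int pos = 0;

    public void encode(int n) {                          // encode docId gaps, not raw ids
        int len = 32 - Integer.numberOfLeadingZeros(n);  // bits in binary form of n
        pos += len - 1;                                  // len-1 leading zeros (BitSet defaults to 0)
        for (int i = len - 1; i >= 0; i--)               // then n itself, MSB first
            bits.set(pos++, ((n >> i) & 1) == 1);
    }

    public int sizeInBits() { return pos; }
}
```

Storing gaps between consecutive sorted docIDs rather than the raw IDs keeps the encoded numbers small, which is what makes the gamma codes short.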
Statistics:
For a 100 GB Wikipedia XML dump:
Size of index (primary + secondary): 25 GB
Time to index: 6 hr (average)
Time to search: 0.251 sec (average over 100 searches)