WcbTop
The Wikipedia Corpus Builder (WCB) is a toolkit for extracting relevant linguistic content from Wikipedia. It was used to create the 2012 versions of WeScience and WikiWoods, as part of the MSc thesis of Lars Jørgen Solberg at the Department of Informatics at the University of Oslo.
Make sure that the following prerequisites are installed:
- mwlib - http://pediapress.com/code/
- mwlib.cdb (only needed if your mwlib version is newer than 0.13.11) - http://pypi.python.org/pypi/mwlib.cdb/0.1.1
- tokenizer - http://www.cis.uni-muenchen.de/~wastl/misc/
If the command python -c 'from mwlib import cdb' runs without printing an error message, and your shell can find both tokenizer and ngram, you should be in good shape.
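The check described above can be scripted. The following is a minimal sketch, not part of WCB: the function names (missing_tools, import_check_passes, report) are hypothetical, and it assumes that an exit status of 0 from the import command is an adequate stand-in for "no error message". It runs the python found on your PATH, which may differ from the interpreter running the script itself.

```python
import shutil
import subprocess

# Command quoted in the text above; "python" is whichever interpreter is on PATH.
IMPORT_CHECK = ["python", "-c", "from mwlib import cdb"]
# Executables the text says the shell must be able to find.
REQUIRED_TOOLS = ["tokenizer", "ngram"]


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on PATH (hypothetical helper)."""
    return [name for name in names if which(name) is None]


def import_check_passes(command=IMPORT_CHECK, run=subprocess.run):
    """Return True if the import command exits with status 0.

    Assumption: a zero exit status means no error message was produced.
    """
    try:
        result = run(command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    except OSError:
        # The interpreter itself could not be started.
        return False
    return result.returncode == 0


def report():
    """Print one line per check and return True if everything passed."""
    ok = import_check_passes()
    print("mwlib cdb import:", "ok" if ok else "FAILED")
    missing = missing_tools(REQUIRED_TOOLS)
    for name in REQUIRED_TOOLS:
        print(name + ":", "MISSING" if name in missing else "found")
    return ok and not missing


if __name__ == "__main__":
    report()
```

Run the script from the same shell you intend to use for WCB, so that it sees the same PATH.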
Download WCB: git clone https://github.com/larsjsol/wcb.git