LarsJørgenSolberg edited this page Nov 29, 2012 · 20 revisions

Background

The Wikipedia Corpus Builder (WCB) is a toolkit for extracting relevant linguistic content from Wikipedia. It was developed as part of the MSc thesis of Lars Jørgen Solberg at the Department of Informatics at the University of Oslo, and was used to create the 2012 versions of WeScience and WikiWoods.

Installation

Make sure that the following prerequisites are installed:

If the command python -c 'from mwlib import cdb' does not give any error message, and your shell is able to find the tokenizer and ngram executables, you should be in good shape.
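The check described above can also be scripted. The following is a minimal sketch, not part of WCB itself: the function names are hypothetical, and it assumes that the interpreter WCB will run under is available as python on your PATH and that the script itself is run with Python 3.3 or later (for shutil.which).

```python
import shutil
import subprocess


def mwlib_cdb_importable(python="python"):
    """Run the same import check as the command quoted above."""
    try:
        proc = subprocess.run(
            [python, "-c", "from mwlib import cdb"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # The interpreter itself could not be started.
        return False
    return proc.returncode == 0


def check_prerequisites(python="python"):
    """Return a dict mapping each prerequisite to True (found) or False."""
    results = {"mwlib.cdb": mwlib_cdb_importable(python)}
    # The shell must be able to find these two executables on PATH.
    for tool in ("tokenizer", "ngram"):
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print("{}: {}".format(name, "ok" if ok else "MISSING"))
```

If every line reports ok, the prerequisites named above appear to be in place. Pass a different interpreter name to check_prerequisites if mwlib is installed for an interpreter other than python.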

Download WCB: git clone https://github.com/larsjsol/wcb.git

Running on the English Wikipedia

Adaptations to Other Languages

Construction of WeScience 2.0
