WCB
The Wikipedia Corpus Builder (WCB) is a toolkit for extracting relevant linguistic content from Wikipedia. It was used to create the 2012 versions of WeScience and WikiWoods, as part of the MSc thesis of Lars Jørgen Solberg at the Department of Informatics, University of Oslo.
Make sure that the following prerequisites are installed:
- mwlib - http://pediapress.com/code/
- mwlib.cdb - http://pypi.python.org/pypi/mwlib.cdb/0.1.1
- tokenizer - http://www.cis.uni-muenchen.de/~wastl/misc/
If the command `python -c 'from mwlib.cdb import cdbwiki'` exits without an error message and your shell is able to find `tokenizer` and `ngram`, you should be in good shape.
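These checks can be combined into a quick sanity test (a sketch that assumes the binaries are installed under exactly these names):

```sh
# Import the cdb bindings and confirm the external binaries are on $PATH.
python -c 'from mwlib.cdb import cdbwiki' && echo 'mwlib.cdb OK'
which tokenizer ngram
```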
WCB itself can be downloaded from https://github.com/larsjsol/wcb/archive/master.tar.gz.
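For example (a sketch assuming a GitHub archive of the master branch, which unpacks to a directory named wcb-master):

```sh
# Download and unpack WCB; rename the directory to match the paths used below.
wget https://github.com/larsjsol/wcb/archive/master.tar.gz
tar -xzf master.tar.gz
mv wcb-master wcb
```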
The setup used in the creation of WikiWoods 2.0 is included in the wcb/enwiki-20080727 directory. It should be usable on newer snapshots as well.
First prepare a database snapshot (a combined example follows the list):
- Download a snapshot, either from the WikiWoods page or from http://dumps.wikimedia.org/.
- Decompress the snapshot: `bunzip2 enwiki-20080727-pages-articles.xml.bz2`
- Create a Constant Database: `mw-buildcdb --input enwiki-20080727-pages-articles.xml --output OUTDIR`
- Change the wikiconf entry in wcb/enwiki-20080727/paths.txt so that it points to the wikiconf.txt file created in the previous step.
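Put together, the preparation looks roughly like this (a sketch assuming the 2008-07-27 English snapshot and an output directory named cdb-dir; whether mw-buildcdb places wikiconf.txt inside the output directory is an assumption here):

```sh
# Decompress the snapshot and build the Constant Database from it.
bunzip2 enwiki-20080727-pages-articles.xml.bz2
mkdir cdb-dir
mw-buildcdb --input enwiki-20080727-pages-articles.xml --output cdb-dir
# Then edit the wikiconf entry in wcb/enwiki-20080727/paths.txt so that
# it points to the wikiconf.txt created by this step.
```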
Most of the modules in WCB need access to a paths.txt file and determine its location by examining the environment variable PATHSFILE. This variable can be set with something like `export PATHSFILE=./wcb/enwiki-20080727/paths.txt`.
As a test, run `./wcb/scripts/gml.py --senseg 'Context-free language'`, which should print some GML to stdout. The first invocation of this command will take some time, as it examines all templates in the snapshot.
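Putting the two steps together:

```sh
# Point WCB at the configuration, then extract one article as GML.
export PATHSFILE=./wcb/enwiki-20080727/paths.txt
./wcb/scripts/gml.py --senseg 'Context-free language'
```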
WCB can create corpora directly from a snapshot, using the script build_corpus.py, or from files containing wiki markup, using build_corpus_files.py. The following example shows the creation of a corpus containing all articles in a snapshot, using 20 parallel processes:
```sh
mkdir out-dir
./wcb/scripts/build_corpus.py -p 20 out-dir
```
Details on the command line parameters for these scripts can be found by using the --help switch.
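For instance:

```sh
# Print the full option listing for either driver script.
./wcb/scripts/build_corpus.py --help
./wcb/scripts/build_corpus_files.py --help
```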
In addition to the preparations described in the section above, a few extra steps are necessary to run WCB on a snapshot from a non-English Wikipedia or from a different wiki.
A "siteinfo file" describes the configuration of a MediaWiki instance, e.g. its active namespaces and its links to other localized versions of the wiki. mwlib bundles siteinfo files for several of the translations of Wikipedia (see https://github.com/pediapress/mwlib/tree/master/mwlib/siteinfo for a list).
If a fitting siteinfo file is not bundled, it can be downloaded directly from the wiki in question:
- ...
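The elided steps are not reproduced here, but one common approach is to query the MediaWiki API directly (a sketch; the host xx.wikipedia.org and the choice of siprop properties are assumptions, not something this guide specifies):

```sh
# Hedged sketch: fetch the site configuration as JSON from the wiki's API.
curl 'https://xx.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general|namespaces|namespacealiases|magicwords&format=json' \
    > siteinfo.json
```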
The "siteinfo" entry in paths.txt should point to the location of this file.