WikiWoods

StephanOepen edited this page May 19, 2010 · 20 revisions

Background

WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full [http://en.wikipedia.org English Wikipedia]. A high-level discussion of the WikiWoods motivation and general methodology is provided by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]. The first public release will be available for download from this site on or before June 1, 2010.

Corpus Organization

The WikiWoods corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot] of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises around 1.2 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of [http://www.delph-in.net/wikiwoods/1004/raw.tgz raw articles] (3.5 gigabytes compressed), prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented [http://www.delph-in.net/wikiwoods/1004/txt.tgz text files] (2.2 gigabytes compressed). Both sets of files are organized into segments, each comprising 100 articles.
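
To give a feel for the scale these figures imply, the following Python sketch derives a few rough estimates from them. It assumes the rounded counts quoted above are taken at face value and that every segment is full; the function name `corpus_estimates` is purely illustrative and not part of any WikiWoods tooling.

```python
# Back-of-the-envelope estimates from the rounded counts quoted on this page.
# These are approximations, not exact corpus statistics.
ARTICLES = 1_200_000        # "around 1.2 million content articles"
SENTENCES = 55_000_000      # "around 55 million 'sentences'"
ARTICLES_PER_SEGMENT = 100  # each segment comprises 100 articles


def corpus_estimates(articles=ARTICLES, sentences=SENTENCES,
                     per_segment=ARTICLES_PER_SEGMENT):
    """Return approximate segment and sentence counts (hypothetical helper)."""
    segments = -(-articles // per_segment)  # ceiling division
    sentences_per_article = sentences / articles
    return {
        "segments": segments,
        "sentences_per_article": sentences_per_article,
        "sentences_per_segment": sentences_per_article * per_segment,
    }


if __name__ == "__main__":
    for name, value in corpus_estimates().items():
        print(f"{name}: {value:,.1f}")
```

On these assumptions, the corpus would span on the order of 12,000 segments, with an average of roughly 46 sentences per article.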

First Release (1004)

Acknowledgements
