
WikiWoods


Background

WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full English Wikipedia. A high-level discussion of the WikiWoods motivation and general methodology is provided by Flickinger et al. (2010). The first public release, comprising the corpus itself (in two different formats) and a preliminary set of parsing results (as [incr tsdb()] profiles and plain-text exports), is now available for download from this site (see below); please consult the related WeScience page for (emerging) instructions on how to utilize this data.

Corpus Organization

The WikiWoods corpus is extracted from a Wikipedia snapshot of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises close to 1.3 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of raw articles (4.4 gigabytes compressed), prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented text files (2.2 gigabytes compressed). Both sets of files are organized into segments of 100 articles each. Please see Flickinger et al. (2010) for details.

First Release (1004)

As of May 2010, parsing the WikiWoods corpus is complete, and [incr tsdb()] profiles are available for download (typically, one would extract the HPSG derivation from the 'result' relation, i.e. field #11 of the underlying tsdb(1) data files). Each archive contains [incr tsdb()] data files for about 1300 WikiWoods segments, and the files are designed to 'plug into' the directory structure of the so-called LOGON distribution.
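For illustration, the sketch below pulls the derivation field out of one profile's 'result' file. It is a minimal sketch, not part of the release: it assumes the records live in a gzip-compressed file named result.gz, that fields are separated by the default '@' character of tsdb(1), and that the usual escape codes (\s for '@', \n for newline, \\ for a backslash) apply; the path used in the example is hypothetical.

```python
# Minimal sketch: extract HPSG derivations (field #11) from an [incr tsdb()]
# 'result' file. Assumptions: records are stored in a gzip-compressed file,
# fields are separated by '@', and the tsdb(1) escape codes \s, \n, and \\
# stand for '@', newline, and backslash, respectively.
import gzip
import re

_UNESCAPE = {r"\s": "@", r"\n": "\n", "\\\\": "\\"}

def _unescape(value):
    return re.sub(r"\\[sn\\]", lambda match: _UNESCAPE[match.group(0)], value)

def derivations(result_path):
    """Yield (parse id, derivation) pairs from one profile's 'result' file."""
    with gzip.open(result_path, mode="rt", encoding="utf-8",
                   errors="replace") as stream:
        for record in stream:
            fields = record.rstrip("\n").split("@")
            if len(fields) < 11:
                continue  # skip incomplete records
            # field #1 is (assumed to be) the parse id; field #11 the derivation
            yield _unescape(fields[0]), _unescape(fields[10])

if __name__ == "__main__":
    # hypothetical path to one unpacked WikiWoods segment profile
    for parse_id, derivation in derivations("wikiwoods/00001/result.gz"):
        print(parse_id, derivation[:80])
```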

To simplify access to the derivation trees, and to make other views on the HPSG analyses readily available, as described by Flickinger et al. (2010), we also provide a set of plain text files exported from [incr tsdb()]. As of early June 2010, export files are available for download as ten archives, each containing compressed export files for about 1300 segments. Due to technical issues in a few corner cases, some 30 segments are currently still missing from these exports.
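For orientation, the similarly hedged sketch below streams the decompressed text of each export file found under a directory of unpacked archives. The directory layout, the .gz suffix, and the assumption of one export file per segment are not confirmed by this page; the internal structure of each file (which views it contains and how items are delimited) is as described by Flickinger et al. (2010).

```python
# Minimal sketch: iterate over the plain-text export files of unpacked
# WikiWoods archives. Assumptions: one gzip-compressed export file per
# segment, found anywhere below the given directory.
import gzip
from pathlib import Path

def export_texts(export_dir):
    """Yield (segment name, decompressed text) for every export file found."""
    for path in sorted(Path(export_dir).rglob("*.gz")):
        with gzip.open(path, mode="rt", encoding="utf-8",
                       errors="replace") as stream:
            yield path.stem, stream.read()

if __name__ == "__main__":
    # hypothetical location of the unpacked export archives
    for segment, text in export_texts("wikiwoods/export"):
        print(segment, len(text.splitlines()), "lines")
```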

Acknowledgements

This work is in part funded by the University of Oslo, through its research partnership with the Center for the Study of Language and Information at Stanford University. Experimentation and engineering on the scale of Wikipedia are made possible through access to the TITAN high-performance computing facilities at the University of Oslo, and we are grateful to the Scientific Computation staff at UiO, as well as to the Norwegian Metacenter for Computational Science. Distribution of the WikiWoods data is supported by the national NorStore Storage Infrastructure and the UiO on-line Language Technology Resources collection.

Related Projects

The following is an attempt at listing related initiatives. If you know of additional pointers that should be included, please email StephanOepen.
