WikiWoods
WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full [http://en.wikipedia.org English Wikipedia]. A high-level discussion of the WikiWoods motivation and general methodology is provided by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]. The first public release, comprising the corpus itself and a preliminary set of parsing results (in itsdb form only, for the time being), is now available for download from this site in two different formats (see below); please consult the related [wiki:WeScience WeScience] page for (emerging) instructions on how to utilize this data.
The WikiWoods corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot] of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises close to 1.3 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of [http://www.delph-in.net/wikiwoods/1004/raw.tgz raw articles] (3.5 gigabytes compressed), prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented [http://www.delph-in.net/wikiwoods/1004/txt.tgz text files] (2.2 gigabytes compressed). Both sets of files are organized into segments, each comprising 100 articles. Please see [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)] for details.
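For orientation, here is a minimal Python sketch of streaming segments straight out of the preprocessed archive without unpacking it to disk. The archive path, and the guess that the per-segment files inside may themselves be gzip-compressed, are our assumptions rather than part of the distribution; adjust after inspecting your download.
{{{
import gzip
import tarfile

# Stream per-segment members out of txt.tgz and count the
# sentence-segmented lines in each, without unpacking to disk.
with tarfile.open("txt.tgz", "r:gz") as archive:
    for member in archive:
        if not member.isfile():
            continue
        stream = archive.extractfile(member)
        # Assumption: segment files ending in .gz are themselves
        # gzip-compressed and need a second layer of decompression.
        if member.name.endswith(".gz"):
            stream = gzip.GzipFile(fileobj=stream)
        n_lines = sum(1 for _ in stream)
        print(member.name, n_lines)
}}}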
As of May 2010, parsing the WikiWoods corpus is complete, and itsdb profiles are available for [http://www.delph-in.net/wikiwoods/1004 download] (typically, one would extract the HPSG derivation from the 'result' relation, i.e. field #11 of the underlying tsdb(1) data files). Each archive contains itsdb data files for about 1300 WikiWoods segments, and the files are designed to 'plug into' the directory structure of the so-called [wiki:LogonTop LOGON] distribution.
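As a concrete illustration of the note above, the following Python sketch pulls field #11 from the 'result' relation of one extracted profile. The profile path is a placeholder of our invention; the '@' field separator and the '\s'/'\n'/'\\' escape conventions are the standard ones for itsdb data files, and we guess that the data files may be distributed gzip-compressed (as 'result.gz'), falling back to an uncompressed 'result' otherwise.
{{{
import gzip
import os
import re

# Standard itsdb escapes inside a field: '\s' encodes a literal '@',
# '\n' a newline, and '\\' a backslash.
_ESCAPES = {"\\s": "@", "\\n": "\n", "\\\\": "\\"}

def unescape(field):
    return re.sub(r"\\[sn\\]", lambda m: _ESCAPES[m.group(0)], field)

def derivations(profile):
    # Prefer the gzip-compressed data file if present (an assumption).
    path = os.path.join(profile, "result.gz")
    if os.path.exists(path):
        stream = gzip.open(path, "rt", encoding="utf-8")
    else:
        stream = open(os.path.join(profile, "result"), encoding="utf-8")
    with stream as f:
        for line in f:
            fields = line.rstrip("\n").split("@")
            if len(fields) >= 11:
                yield unescape(fields[10])  # field #11, counting from one

# Hypothetical segment profile directory, for illustration only.
for derivation in derivations("wikiwoods/00001"):
    print(derivation)
    break
}}}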
To simplify access to the derivation trees, and to make other views on the HPSG analyses readily available (as described by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]), we also provide a set of plain text files, exported from itsdb. As of early June 2010, export files are available for [http://www.delph-in.net/wikiwoods/1004 download] as ten archives, each containing compressed export files for about 1300 segments. Due to technical issues in a few corner cases, some 57 segments are currently still missing from these exports.
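To give a flavour of working with the exports, here is a small Python sketch that walks the segment files of one unpacked archive. The directory layout is a placeholder, and the guess that individual items within a file are separated by form-feed characters should be verified against the actual files before relying on it.
{{{
import glob
import gzip

# Assumption: one unpacked archive yields a directory of
# gzip-compressed plain-text export files, one per segment.
for name in sorted(glob.glob("export/*.gz")):
    with gzip.open(name, "rt", encoding="utf-8") as f:
        # Assumption: items are separated by form feeds ('\f').
        items = f.read().split("\f")
    print(name, len(items), "items")
}}}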
This work is funded in part by the University of Oslo, through its research partnership with the Center for the Study of Language and Information at Stanford University. Experimentation and engineering on the scale of Wikipedia are made possible through access to the [http://hpc.uio.no TITAN] high-performance computing facilities at the University of Oslo, and we are grateful to the [http://www.usit.uio.no/suf/vd Scientific Computation staff at UiO], as well as to the [http://www.notur.no Norwegian Metacenter for Computational Science].