WikiWoods
WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full [http://en.wikipedia.org English Wikipedia]. A high-level discussion of the WikiWoods motivation and general methodology is provided by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]. The corpus itself and a preliminary set of parsing results (in itsdb form only, for the time being) are available for download (see below); please consult the related WeScience page for (emerging) instructions on how to utilize this data. The first public release will be finalized for download from this site on or before June 1, 2010. The itsdb treebanks are already available for download (see below), but we have yet to complete the exports to textual form.
The WikiWoods corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot] of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises close to 1.3 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of [http://www.delph-in.net/wikiwoods/1004/raw.tgz raw articles] (3.5 gigabytes compressed), prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented [http://www.delph-in.net/wikiwoods/1004/txt.tgz text files] (2.2 gigabytes compressed). Both sets of files are organized by segments, each comprising 100 articles. Please see [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)] for details.
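For orientation, the sketch below walks an extracted copy of the text archive and tallies per-segment files and non-empty lines. The directory name, the handling of possibly gzip-compressed members, and the one-utterance-per-line reading are assumptions on our part, not guaranteed by the distribution; adjust them to whatever layout you find after unpacking txt.tgz.

```python
import gzip
import os

# Hypothetical location of the extracted txt.tgz archive; adjust to your setup.
TXT_ROOT = "wikiwoods/txt"

def iter_segment_files(root):
    """Yield paths to the per-segment files found under the extracted archive."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)

def count_lines(path):
    """Count non-empty lines, assuming one root-level utterance per line."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())

if __name__ == "__main__":
    files = 0
    lines = 0
    for path in iter_segment_files(TXT_ROOT):
        files += 1
        lines += count_lines(path)
    print(f"{files} segment files, {lines} non-empty lines")
```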
As of May 2010, parsing the WikiWoods corpus is complete, and itsdb profiles are available for [http://www.delph-in.net/wikiwoods/1004 download] (typically, one would extract the HPSG derivation from the 'result' relation, i.e. field #11). Each archive contains itsdb data files for 1000 WikiWoods segments, and the files are designed to 'plug into' the directory structure of the so-called [wiki:LogonTop LOGON] distribution.
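The sketch below illustrates one way to pull the derivation strings out of a profile's 'result' relation, assuming the standard [incr tsdb()] text format: one record per line, fields separated by '@', with the derivation in field #11 (index 10 when counting from zero). The profile path is a placeholder, the relation file may or may not be stored gzip-compressed, and escape sequences inside field values are left undecoded.

```python
import gzip
import os

# Hypothetical path to one extracted itsdb profile directory; adjust as needed.
PROFILE_DIR = "wikiwoods/1004/some-segment"

def open_relation(profile_dir, relation):
    """Open an itsdb relation file, which may be stored gzip-compressed."""
    path = os.path.join(profile_dir, relation)
    if os.path.exists(path + ".gz"):
        return gzip.open(path + ".gz", "rt", encoding="utf-8", errors="replace")
    return open(path, "rt", encoding="utf-8", errors="replace")

def iter_derivations(profile_dir):
    """Yield raw HPSG derivation strings from the 'result' relation.

    Assumes one record per line, fields separated by '@', and the
    derivation in field #11; [incr tsdb()] escape sequences inside
    field values are not decoded here.
    """
    with open_relation(profile_dir, "result") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("@")
            if len(fields) > 10:
                yield fields[10]

if __name__ == "__main__":
    for i, derivation in enumerate(iter_derivations(PROFILE_DIR)):
        print(derivation[:80])
        if i >= 4:  # show only the first few records for illustration
            break
```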
To simplify access to the derivation trees, and to make other views on the HPSG analyses readily available, as described by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)], we are currently working to provide a set of plain text files exported from itsdb. This process has been delayed by a few days: it turns out that our traditional approach of creating one (compressed) file per sentence does not scale well, as 55 million files are hard on the file system and near-lethal on the back-up system. We have therefore adapted the software to export a single, internally structured file per profile and are re-running the export (as of May 27, 2010). Over the next few days, we will populate the [http://www.delph-in.net/wikiwoods/1004 download site] with export files.
This work is in part funded by the University of Oslo, through its research partnership with the Center for the Study of Language and Information at Stanford University. Experimentation and engineering on the scale of Wikipedia is made possible through access to the [http://hpc.uio.no TITAN] high-performance computing facilities at the University of Oslo, and we are grateful to the [http://www.usit.uio.no/suf/vd Scientific Computation staff at UiO], as well as to the [http://www.notur.no Norwegian Metacenter for Computational Science].