WikiWoods
WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full [http://en.wikipedia.org English Wikipedia]. A high-level discussion of the WikiWoods motivation and general methodology is provided by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]. The corpus itself and a preliminary set of parsing results (in itsdb form only, for the time being) are available for download (see below); please consult the related WeScience page for (emerging) instructions on how to utilize this data. The first public release will be finalized for download from this site on or before June 1, 2010. At present, final itsdb treebanks are available for download (see below), but we have yet to complete the exports to textual form (which, in an uninteresting technical sense, turns out to be a harder problem than the parsing proper :-).
The WikiWoods corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot] of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises close to 1.3 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of [http://www.delph-in.net/wikiwoods/1004/raw.tgz raw articles] (3.5 gigabytes compressed), prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented [http://www.delph-in.net/wikiwoods/1004/txt.tgz text files] (2.2 gigabytes compressed). Both sets of files are organized by segments, each comprised of 100 articles. Please see [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)] for details.
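As a small, hedged sketch of iterating over the preprocessed corpus, the following Python fragment walks an unpacked copy of the txt.tgz archive and counts utterances. It assumes only that each segment is a plain or gzip-compressed text file with one utterance per line; the directory name 'txt' and the one-utterance-per-line assumption are illustrative, so please check the WeScience instructions for the authoritative file layout and markup.

{{{#!python
import glob
import gzip
import os

def iterate_segments(root):
    """Yield (segment file, lines) for every segment below `root`.

    Assumption: the txt.tgz archive has been unpacked into `root`, and each
    segment is a plain or gzip-compressed text file with one utterance per
    line; see the WeScience page for the exact naming and markup.
    """
    for path in sorted(glob.glob(os.path.join(root, '**', '*'), recursive=True)):
        if not os.path.isfile(path):
            continue
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, mode='rt', encoding='utf-8', errors='replace') as stream:
            yield path, stream.read().splitlines()

total = 0
for path, lines in iterate_segments('txt'):  # 'txt' is a hypothetical unpack directory
    total += len(lines)
print('utterances (lines) seen:', total)
}}}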
As of May 2010, parsing the WikiWoods corpus is complete, and itsdb profiles are available for [http://www.delph-in.net/wikiwoods/1004 download] (typically, one would extract the HPSG derivation from the 'result' relation, i.e. field #11 of the underlying tsdb(1) data files). Each archive contains itsdb data files for 1000 WikiWoods segments, and the files are designed to 'plug into' the directory structure of the so-called [wiki:LogonTop LOGON] distribution.
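As a minimal sketch of extracting those derivations, assuming the standard itsdb on-disk format in which each relation is a text file (optionally gzip-compressed) with one record per line and fields separated by '@', the following Python fragment pulls field #11 out of the 'result' file of a profile directory; the profile path used here is hypothetical.

{{{#!python
import gzip
import os

def read_derivations(profile):
    """Yield (parse-id, result-id, derivation) from the 'result' relation.

    Assumption: the profile uses the plain itsdb text format, i.e. one
    record per line with fields separated by '@'; the HPSG derivation is
    field #11 (1-indexed) of the 'result' relation.
    """
    path = os.path.join(profile, 'result')
    opener = open
    if not os.path.exists(path) and os.path.exists(path + '.gz'):
        path, opener = path + '.gz', gzip.open
    with opener(path, mode='rt', encoding='utf-8', errors='replace') as stream:
        for record in stream:
            fields = record.rstrip('\n').split('@')
            if len(fields) >= 11:
                yield fields[0], fields[1], fields[10]

# hypothetical segment profile, as unpacked from one of the download archives
for parse_id, result_id, derivation in read_derivations('wikiwoods/00001'):
    print(parse_id, result_id, derivation[:60])
    break
}}}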
To simplify access to the derivation trees, and to make other views on the HPSG analyses readily available (as described by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]), we are also working to provide a set of plain text files exported from itsdb. This process is delayed by a few more days: it turns out that our traditional approach of creating one (compressed) file per sentence does not scale well, as 55 million files are hard on the file system and near-lethal on the back-up system. Thus, we need to adapt the software to export a single, internally structured file per profile, and then to re-run the export (as of May 27, 2010). However, a non-trivial subset of export jobs causes kernel panics on some of the cluster nodes, a mystery that is currently being investigated. Over the next few days, we will populate the [http://www.delph-in.net/wikiwoods/1004 download site] with export files.
This work is in part funded by the University of Oslo, through its research partnership with the Center for the Study of Language and Information at Stanford University. Experimentation and engineering on the scale of Wikipedia is made possible through access to the [http://hpc.uio.no TITAN] high-performance computing facilities at the University of Oslo, and we are grateful to the [http://www.usit.uio.no/suf/vd Scientific Computation staff at UiO], as well as to the [http://www.notur.no Norwegian Metacenter for Computational Science].