WikiWoods
WikiWoods is an ongoing initiative to provide rich syntacto-semantic annotations for the full [http://en.wikipedia.org English Wikipedia]. A high-level discussion of the WikiWoods motivation and general methodology is provided by [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)]. The corpus itself and a preliminary set of parsing results (in itsdb form only, for the time being) are available for download (see below); please consult the related WeScience page for (emerging) instructions on how to utilize this data. The first public release will be finalized for download from this site on or before June 1, 2010. We are currently investigating batch processing failures for a handful of WikiWoods segments, and have yet to complete the exports (from itsdb treebanks) to textual form.
The WikiWoods corpus is extracted from a [http://www.delph-in.net/wescience/enwiki-20080727-pages-articles.xml.bz2 Wikipedia snapshot] of July 2008 (the version originally used for the manually treebanked WeScience sub-corpus). As of mid-2010, the corpus comprises close to 1.3 million content articles, for a total of around 55 million 'sentences' (or other types of root-level utterances). The corpus is available in two forms: (a) as a collection of [http://www.delph-in.net/wikiwoods/1004/raw.tgz raw articles] (3.5 gigabytes compressed), prior to preprocessing; and (b) as a set of preprocessed and sentence-segmented [http://www.delph-in.net/wikiwoods/1004/txt.tgz text files] (2.2 gigabytes compressed). Both sets of files are organized into segments, each comprising 100 articles. Please see [http://www.delph-in.net/wikiwoods/lrec10.pdf Flickinger et al. (2010)] for details.
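For orientation, the following minimal sketch streams through the preprocessed text archive without unpacking it to disk. Note that the per-segment member naming inside txt.tgz and the one-sentence-per-line layout are assumptions made for illustration, not part of the distribution's documentation; adjust to the actual archive contents.

```python
# Sketch (not part of the WikiWoods distribution): stream through the
# preprocessed, sentence-segmented text archive and count sentences.
# Assumes one file per segment and one sentence per non-empty line.
import tarfile

def count_sentences(archive="txt.tgz"):
    """Iterate over every regular file in the archive and count non-empty lines."""
    sentences = 0
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            handle = tar.extractfile(member)
            if handle is None:
                continue
            for line in handle:
                if line.strip():  # one non-empty line per sentence (assumed)
                    sentences += 1
    return sentences

if __name__ == "__main__":
    print(count_sentences())
```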
As of May 2010, parsing the WikiWoods corpus is complete, and a preliminary set of itsdb profiles is available for [http://www.delph-in.net/wikiwoods/1004 download]. Each archive contains itsdb data files for 1000 WikiWoods segments, and the files are designed to 'plug into' the directory structure of the so-called [wiki:LogonTop LOGON] distribution. If you are in a hurry to use this data outside of itsdb, please see the instructions on exporting into textual form available for the WeScience sub-corpus; over the next few days, we will populate the download site with export files.
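As a rough aid for inspecting a downloaded archive, the sketch below locates itsdb profile directories after unpacking, using the presence of a 'relations' file as the marker of a profile. The top-level directory name used here is a placeholder, not the actual layout of the distributed archives or of the LOGON tree.

```python
# Sketch: find directories that look like itsdb profiles in an unpacked archive.
# An itsdb profile is a directory of plain-text relation files; the 'relations'
# file is used here as the marker.  'wikiwoods-profiles' is a placeholder path.
import os

def find_profiles(root="wikiwoods-profiles"):
    """Return sorted paths of directories that contain a 'relations' file."""
    profiles = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if "relations" in filenames:
            profiles.append(dirpath)
    return sorted(profiles)

if __name__ == "__main__":
    for path in find_profiles():
        print(path)
```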
This work is funded in part by the University of Oslo, through its research partnership with the Center for the Study of Language and Information at Stanford University. Experimentation and engineering on the scale of Wikipedia are made possible through access to the [http://hpc.uio.no TITAN] high-performance computing facilities at the University of Oslo, and we are grateful to the [http://www.usit.uio.no/suf/vd Scientific Computation staff at UiO], as well as to the [http://www.notur.no Norwegian Metacenter for Computational Science].