WeSearch_DataCollection
We are seeking to collect user-generated text to support the evaluation of parser adaptation across domain/genre. We are interested in a variety of registers: Open Access Research Literature, Wikipedia, Technology Blogs, Product Reviews and User Forums. Second, we collect text from sources that discuss the Linux operating system or natural language processing. The choice of these domains is motivated by our assumption that the users of the corpus will be more familiar with the language used in connection with these topics.
NLP blogs were obtained in mid-April from the following sites:
Linux blogs were also downloaded in mid-April, from:
Linux forums were extracted from the Unix & Linux subset of the April 2011 Stack Exchange Creative Commons Dump. In this set, a text corresponds to a post (be it a question or an answer). If necessary, threads can be reconstructed using the primary/new id xref file.
Linux reviews are from http://www.softpedia.com/reviews/linux/. They may require some manual cleaning: each review typically ends with a sentence like 'Check out these screenshots'.
The Linux wiki set was created following the method used for WikiWoods.
All data and scripts are in /ltg/jread/workspace/wesearch/data-collection. The content has been extracted by finding the most specific element that contains all the relevant text (for example, blog posts typically contain some element with an attribute indicating that it is the content element). All mark-up related to rendering has been retained for now. Sentences were obtained using the tokenizer (as used in creating WikiWoods).
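The "most specific element" heuristic can be sketched as follows. This is a minimal illustration with hypothetical names, using the standard library on well-formed markup rather than the project's actual extraction scripts: descend from the root for as long as exactly one child still contains every required text fragment.

```python
import xml.etree.ElementTree as ET

def text_of(el):
    """All text rendered inside an element, including descendants."""
    return "".join(el.itertext())

def most_specific_container(root, fragments):
    """Descend while exactly one child still contains every fragment."""
    best = root
    while True:
        hits = [c for c in best if all(f in text_of(c) for f in fragments)]
        if len(hits) != 1:
            return best
        best = hits[0]

page = ET.fromstring(
    "<html><body>"
    "<div id='sidebar'>links</div>"
    "<div class='post'><h1>Title</h1><p>Body of the post.</p></div>"
    "</body></html>"
)

el = most_specific_container(page, ["Title", "Body of the post"])
print(el.get("class"))   # the post container, not the whole page
```

Stopping as soon as no single child covers all fragments is what keeps both the title and the body inside the selected element.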
| Section | Source | Documents | Total Items | Avg. Items |
|---|---|---|---|---|
| NLP, blog | http://blog.cyberling.org | 51 | 609 | 11.9 |
| | http://gameswithwords.fieldofscience.com | 457 | 10,571 | 23.1 |
| | http://lingpipe-blog.com | 343 | 12,716 | 37.1 |
| | http://nlpers.blogspot.com | 249 | 7,368 | 29.6 |
| | http://thelousylinguist.blogspot.com | 536 | 7,234 | 13.5 |
| Linux, blog | http://embraceubuntu.com | 220 | 2,970 | 13.5 |
| | http://www.linuxscrew.com | 312 | 3,883 | 12.5 |
| | http://www.markshuttleworth.com | 271 | 6,728 | 24.8 |
| | http://www.ubuntugeek.com | 1,631 | 43,265 | 26.5 |
| | http://ubuntu.philipcasey.com | 105 | 1,475 | 14.0 |
| | http://www.ubuntu-unleashed.com | 312 | 6,201 | 19.9 |
| Linux, forums | stack exchange | 9,945 | 54,249 | 5.5 |
| Linux, reviews | softpedia | 249 | 13,430 | 53.9 |
| Section | Items | Coverage | Length | Ambiguity | Time | Tokens | Types |
|---|---|---|---|---|---|---|---|
| NLP, wiki | 11558 | 86.4% | 18.0 | 10859 | 8.2 | 238059 | 19396 |
| NLP, blog | 46106 | 81.9% | 15.5 | 8158 | 6.1 | 838592 | 41771 |
| Linux, wiki | 40738 | 85.0% | 18.5 | 12407 | 9.6 | 843082 | 45783 |
| Linux, blog | 92280 | 83.7% | 11.1 | 5151 | 3.9 | 1000683 | 48511 |
| Linux, review | 14761 | 84.6% | 18.1 | 10610 | 7.5 | 304672 | 13158 |
| Linux, forum | 85743 | 74.8% | 11.0 | 4885 | 3.1 | 1115412 | 56673 |
Corpus statistics for each section. Coverage shows what percentage of items received an analysis (using the unadapted parser 'out of the box'); ambiguity and time give an indication of average parsing complexity (for the 'vanilla' parser configuration). Tokens is the token count of each section, and types is the number of unique, non-punctuation tokens seen per section.
- Given an HTML document, extract elements specified by a set of XPaths.
- Sentence-segment using the tokenizer, adapted to handle HTML tags (P, LI, PRE and DIV force line breaks).
- Simplify by:
  - removing automatically generated text
  - removing superfluous whitespace
  - removing comments
  - removing some attributes (e.g. HREF)
  - ersatzing CODE and IMG
- Filter CODE and IMG if they occur in isolation. Filter OL, UL and TABLE.
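Some of the simplification steps above might look roughly like the following. This is an illustrative standard-library sketch, not the project's code: it drops HREF attributes, collapses superfluous whitespace, and "ersatzes" CODE and IMG elements, i.e. replaces them with placeholder tokens.

```python
import re
import xml.etree.ElementTree as ET

def simplify(root):
    # Remove some attributes and collapse superfluous whitespace.
    for el in root.iter():
        el.attrib.pop("href", None)
        if el.text:
            el.text = re.sub(r"\s+", " ", el.text)
        if el.tail:
            el.tail = re.sub(r"\s+", " ", el.tail)
    # Ersatz CODE and IMG: swap each one for a placeholder token.
    for parent in root.iter():
        for i, child in enumerate(list(parent)):
            if child.tag in ("code", "img"):
                ersatz = ET.Element("span")
                ersatz.text = "«code»" if child.tag == "code" else "«img»"
                ersatz.tail = child.tail
                parent[i] = ersatz
    return root

doc = ET.fromstring("<p>Run   <code>ls -l</code> to list\n files. <a href='x'>see</a></p>")
print(ET.tostring(simplify(doc), encoding="unicode"))
# → <p>Run <span>«code»</span> to list files. <a>see</a></p>
```

The real pipeline works on full HTML (and also removes comments and auto-generated text); ElementTree is used here only to keep the sketch self-contained.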
- Create line-oriented itsdb import files with only one source and up to 1,000 items per profile. Do not split documents across profiles.
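The packing constraint (one source per profile, at most 1,000 items, documents kept whole) can be sketched as follows; function and variable names are mine, and the input is assumed to be ordered by source and then by document.

```python
from itertools import groupby

def pack_profiles(items, limit=1000):
    """Pack (source, doc_id, item) tuples into profiles.

    Each profile holds items from a single source, at most `limit` items,
    and never splits a document. (A document larger than `limit` gets an
    oversized profile of its own.)
    """
    profiles = []
    for source, src_items in groupby(items, key=lambda t: t[0]):
        current = []
        for doc_id, doc_items in groupby(src_items, key=lambda t: t[1]):
            doc_items = list(doc_items)
            if current and len(current) + len(doc_items) > limit:
                profiles.append((source, current))
                current = []
            current.extend(doc_items)
        if current:
            profiles.append((source, current))
    return profiles

# Tiny demonstration with limit=4: two 3-item documents from one source
# cannot share a profile, so they are split document-by-document.
items = (
    [("embraceubuntu.com", d, i) for d in (1, 2) for i in range(3)]
    + [("lingpipe-blog.com", 1, 0)]
)
profiles = pack_profiles(items, limit=4)
print(len(profiles))  # → 3
```

Flushing the current profile *before* adding a document, rather than after, is what guarantees a document's items always land together.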
Item identifiers have the form DGSAAAAIII0 (Domain-Genre-Source-Article-Item-0):
Domain: 1##=linux, 2##=nlp
Genre: #1#=academic, #2#=blog, #3#=forum, #4#=reviews, #5#=wiki
Source:
- 121=embraceubuntu.com
- 122=ubuntu.philipcasey.com
- 123=www.linuxscrew.com
- 124=www.markshuttleworth.com
- 125=www.ubuntugeek.com
- 126=www.ubuntu-unleashed.com
- 131=unix.stackexchange.com
- 141=www.softpedia.com/reviews/linux
- 221=blog.cyberling.org
- 222=gameswithwords.fieldofscience.com
- 223=lingpipe-blog.com
- 224=nlpers.blogspot.com
- 225=thelousylinguist.blogspot.com
Article: 4 digits --- maximum from any source is 9,945
Item: 3 digits --- maximum from any article is 731
Trailing digit: an extra zero (BASIC-style line numbering), leaving room to insert items if segmentation is later corrected.
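An illustrative encoder/decoder for this identifier layout (field names are mine; digit widths follow the description above):

```python
def make_id(domain, genre, source, article, item):
    """Build an 11-digit DGSAAAAIII0 identifier."""
    assert domain in (1, 2)
    assert 1 <= genre <= 5
    assert 0 <= article <= 9999 and 0 <= item <= 999
    return f"{domain}{genre}{source}{article:04d}{item:03d}0"

def parse_id(ident):
    """Split a DGSAAAAIII0 identifier back into its fields."""
    return {
        "domain":  int(ident[0]),
        "genre":   int(ident[1]),
        "source":  int(ident[2]),
        "article": int(ident[3:7]),
        "item":    int(ident[7:10]),
    }

# e.g. linux (1), blog (2), embraceubuntu.com (121 → source digit 1),
# article 17, item 3:
print(make_id(1, 2, 1, 17, 3))  # → 12100170030
```

Note that the Source digit only identifies a site within its Domain/Genre pair; the full three-digit codes (121, 225, ...) in the list above are the Domain, Genre and Source digits read together.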
- Profile import files as lists of IDs and items.
- A pointer file for each profile, with one line per item giving the item's start index in the source document and a list of (start, length) pairs indicating deletions.
- A cross-reference file with document ids CSDDDDDD and source URLs.
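Assuming the pointer file's (start, length) pairs are offsets into the source document, recovering an item's cleaned text might look like this (a hypothetical reader, not necessarily the project's exact format):

```python
def apply_deletions(source, item_start, item_end, deletions):
    """Cut the recorded (start, length) spans out of source[item_start:item_end]."""
    out, pos = [], item_start
    for start, length in sorted(deletions):
        out.append(source[pos:start])   # keep text up to the deletion
        pos = start + length            # skip the deleted span
    out.append(source[pos:item_end])
    return "".join(out)

src = "xxHello <b>noisy</b> world.yy"
# The item spans src[2:27]; the two deletions remove the <b> and </b> tags.
print(apply_deletions(src, 2, 27, [(8, 3), (16, 4)]))  # → Hello noisy world.
```

Storing deletions rather than the cleaned text keeps the pointer files small and lets every item be traced back to its exact position in the original document.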
(wiki results just copied from above)
| Section | Items | Coverage | Length | Time | Tokens |
|---|---|---|---|---|---|
| NLP, wiki | 11558 | 86.4% | 18.0 | 8.2 | 238059 |
| NLP, blog | 38498 | 80.8% | 17.6 | 8.3 | 676080 |
| Linux, wiki | 40738 | 85.0% | 18.5 | 9.6 | 843082 |
| Linux, blog | 64520 | 82.3% | 13.2 | 5.7 | 854157 |
| Linux, review | 13430 | 80.4% | 19.8 | 9.2 | 266063 |
| Linux, forum | 54249 | 82.0% | 14.8 | 5.6 | 802736 |