
WeSearch_DataCollection


Background

We are seeking to collect user-generated text to support the evaluation of parser adaptation across domains and genres. We are interested in a variety of registers: Open Access Research Literature, Wikipedia, Technology Blogs, Product Reviews and User Forums. Secondly, we restrict collection to sources that discuss the Linux operating system or natural language processing. The choice of these domains is motivated by our assumption that users of the corpus will be more familiar with the language used in connection with these topics.

Collected Data

NLP and Linux blogs were obtained in mid-April, from the sites listed in the table below.

Linux forums were extracted from the Unix & Linux subset of the April 2011 Stack Exchange Creative Commons Dump. In this set a text corresponds to a post (be it a question or an answer). If necessary, threads can be reconstructed using the primary/new ID xref file, as sketched below.
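
For illustration, a minimal sketch of thread reconstruction. It assumes the xref file holds two whitespace-separated columns (corpus item ID, original post ID), which may not match the actual layout; the posts.xml attributes (Id, PostTypeId, ParentId) are those of the public Stack Exchange dumps.

```python
# Sketch: group forum items back into threads. The two-column xref
# layout assumed here is illustrative; the actual file may differ.
import xml.etree.ElementTree as ET
from collections import defaultdict

def load_xref(path):
    """Map original Stack Exchange post IDs to corpus item IDs."""
    xref = {}
    with open(path) as f:
        for line in f:
            corpus_id, post_id = line.split()
            xref[post_id] = corpus_id
    return xref

def reconstruct_threads(posts_xml, xref):
    """Group answers under their question, keyed by question post ID.

    Rows in the dump's posts.xml carry Id, PostTypeId (1 = question,
    2 = answer) and, for answers, ParentId pointing at the question.
    """
    threads = defaultdict(list)
    for _, row in ET.iterparse(posts_xml):
        if row.tag != 'row' or row.get('Id') not in xref:
            continue
        if row.get('PostTypeId') == '1':        # question opens the thread
            threads[row.get('Id')].insert(0, xref[row.get('Id')])
        elif row.get('PostTypeId') == '2':      # answer joins its thread
            threads[row.get('ParentId')].append(xref[row.get('Id')])
        row.clear()                             # keep memory use flat
    return threads
```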

Linux reviews are from http://www.softpedia.com/reviews/linux/. They may require some manual cleaning; each review typically ends with a sentence like 'Check out these screenshots'.

The Linux wiki set was created following the method used for WikiWoods.

All data and scripts are in /ltg/jread/workspace/wesearch/data-collection. The content has been extracted by finding the most specific element that contains all the relevant text (for example, blog posts typically contain some element with an attribute indicating that it is the content element). All mark-up related to rendering has been retained for now. Sentences were obtained using the tokenizer (as used in creating WikiWoods).
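
As a rough illustration of the "most specific element" heuristic, here is a sketch using lxml; the function name and choice of library are ours, not taken from the actual scripts.

```python
# Sketch of the "most specific containing element" heuristic.
from lxml import html

def most_specific_container(doc_path, snippets):
    """Return the deepest element whose text contains every snippet.

    In pre-order traversal, only the elements on the path down to the
    minimal container match (assuming each snippet occurs once), so
    the last matching element is the most specific one.
    """
    root = html.parse(doc_path).getroot()
    best = root
    for elem in root.iter():
        if not isinstance(elem.tag, str):
            continue                      # skip comments and PIs
        if all(s in elem.text_content() for s in snippets):
            best = elem                   # a later match is deeper
    return best
```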

| Section | Source | Documents | Total Items | Avg. Items |
|---------|--------|----------:|------------:|-----------:|
| NLP, blog | http://blog.cyberling.org | 51 | 609 | 11.9 |
| | http://gameswithwords.fieldofscience.com | 457 | 10,571 | 23.1 |
| | http://lingpipe-blog.com | 343 | 12,716 | 37.1 |
| | http://nlpers.blogspot.com | 249 | 7,368 | 29.6 |
| | http://thelousylinguist.blogspot.com | 536 | 7,234 | 13.5 |
| Linux, blog | http://embraceubuntu.com | 220 | 2,970 | 13.5 |
| | http://www.linuxscrew.com | 312 | 3,883 | 12.5 |
| | http://www.markshuttleworth.com | 271 | 6,728 | 24.8 |
| | http://www.ubuntugeek.com | 1,631 | 43,265 | 26.5 |
| | http://ubuntu.philipcasey.com | 105 | 1,475 | 14.0 |
| | http://www.ubuntu-unleashed.com | 312 | 6,201 | 19.9 |
| Linux, forums | stack exchange | 9,945 | 54,249 | 5.5 |
| Linux, reviews | softpedia | 249 | 13,430 | 53.9 |

Initial Parsing Results

| Section | Items | Coverage | Length | Ambiguity | Time | Tokens | Types |
|---------|------:|---------:|-------:|----------:|-----:|-------:|------:|
| NLP, wiki | 11,558 | 86.4% | 18.0 | 10,859 | 8.2 | 238,059 | 19,396 |
| NLP, blog | 46,106 | 81.9% | 15.5 | 8,158 | 6.1 | 838,592 | 41,771 |
| Linux, wiki | 40,738 | 85.0% | 18.5 | 12,407 | 9.6 | 843,082 | 45,783 |
| Linux, blog | 92,280 | 83.7% | 11.1 | 5,151 | 3.9 | 1,000,683 | 48,511 |
| Linux, review | 14,761 | 84.6% | 18.1 | 10,610 | 7.5 | 304,672 | 13,158 |
| Linux, forum | 85,743 | 74.8% | 11.0 | 4,885 | 3.1 | 1,115,412 | 56,673 |

Corpus statistics for each section. Coverage shows what percentage of items received an analysis (using the unadapted parser 'out of the box'); ambiguity and time give an indication of average parsing complexity (for the 'vanilla' parser configuration). Tokens shows the token count of each section, and types is the number of unique, non-punctuation tokens seen per section.

Data Preparation

  • Given an HTML document, extract elements specified by a set of XPaths.

  • Sentence-segment using the tokenizer, adapted to handle HTML tags: P, LI, PRE and DIV force line breaks.

  • Simplify by (see the sketch after this list):

    • removing automatically generated text

    • removing superfluous whitespace

    • removing comments

    • removing some attributes (e.g. HREF)

    • ersatzing CODE and IMG

  • Filter CODE and IMG if they occur in isolation. Filter OL, UL, TABLE.

  • Create line-oriented itsdb import files with only one source and up to 1,000 items. Do not split documents across profiles.
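
The sketch below condenses the extraction and simplification steps, using lxml; the XPaths, ersatz tokens, attribute list and segmentation heuristic are all illustrative rather than copied from the actual scripts.

```python
# Condensed sketch of extraction and simplification.
import re
from lxml import etree, html

BLOCK_TAGS = ('p', 'li', 'pre', 'div')   # these force line breaks

def extract_items(doc_path, xpaths):
    tree = html.parse(doc_path)
    items = []
    for xp in xpaths:
        for elem in tree.xpath(xp):
            # Remove comments but keep their tail text.
            etree.strip_elements(elem, etree.Comment, with_tail=False)
            for e in elem.iter():
                if isinstance(e.tag, str):
                    e.attrib.pop('href', None)        # drop e.g. HREF
            # Replace CODE and IMG content with ersatz tokens.
            for tag, ersatz in (('code', 'CodeErsatz'), ('img', 'ImgErsatz')):
                for e in list(elem.iter(tag)):
                    e[:] = []
                    e.text = ersatz
            markup = html.tostring(elem, encoding='unicode')
            items.append(re.sub(r'[ \t]+', ' ', markup).strip())
    return items

def rough_segments(item):
    """Crude stand-in for the adapted tokenizer: split where P, LI,
    PRE or DIV boundaries force line breaks."""
    pattern = r'(?i)</?(?:%s)\b[^>]*>' % '|'.join(BLOCK_TAGS)
    return [s.strip() for s in re.split(pattern, item) if s.strip()]
```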

Identifier Format

Identifiers take the form Domain-Genre-Source-Article-Item-0, i.e. DGSAAAAIII0.

Domain: 1##=linux, 2##=nlp

Genre: #1#=academic, #2#=blog, #3#=forum, #4#=reviews, #5#=wiki

Source:

  • 121=embraceubuntu.com

  • 122=ubuntu.philipcasey.com

  • 123=www.linuxscrew.com

  • 124=www.markshuttleworth.com

  • 125=www.ubuntugeek.com

  • 126=www.ubuntu-unleashed.com

  • 131=unix.stackexchange.com

  • 141=www.softpedia.com/reviews/linux

  • 221=blog.cyberling.org

  • 222=gameswithwords.fieldofscience.com

  • 223=lingpipe-blog.com

  • 224=nlpers.blogspot.com

  • 225=thelousylinguist.blogspot.com

Article: 4 digits (the maximum from any source is 9,945)

Item: 3 digits (the maximum from any article is 731)

Trailing zero: BASIC-style line numbering, leaving room to insert items when correcting segmentation.
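
Under this reading of the scheme, identifiers can be composed mechanically; the zero-padding below is our interpretation of the pattern.

```python
# Composing a DGSAAAAIII0 identifier from the codes listed above.
SOURCES = {
    'embraceubuntu.com': 121,
    'unix.stackexchange.com': 131,
    'lingpipe-blog.com': 223,
    # ... remaining codes as listed above
}

def make_id(source, article, item):
    """Domain+genre+source code (3 digits), article (4 digits),
    item (3 digits), and the trailing zero reserved for later
    re-segmentation."""
    assert 0 < article <= 9999 and 0 < item <= 999
    return '%03d%04d%03d0' % (SOURCES[source], article, item)

# e.g. make_id('unix.stackexchange.com', 9945, 5) -> '13199450050'
```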

Output

  • Profile import files as lists of IDs and items.

  • Pointer file for each profile, with one line per item giving the item's start index in the source document and a list of (start, length) pairs indicating deletions (a sketch of how these would be applied follows this list).

  • Cross-reference file with document ids CSDDDDDD and source URL.
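
To show how a pointer line would be used, here is a sketch under the assumption that each line records the item's start offset plus a span length, with deletion offsets relative to the span; the real file layout may differ.

```python
# Sketch: recovering a cleaned item from its source document using a
# pointer line. The span-length field and span-relative deletion
# offsets are assumptions, not the documented format.
def apply_pointer(source_text, start, length, deletions):
    span = source_text[start:start + length]
    # Apply deletions right to left so earlier offsets stay valid.
    for del_start, del_len in sorted(deletions, reverse=True):
        span = span[:del_start] + span[del_start + del_len:]
    return span

# e.g. apply_pointer('abcXYZdef', 0, 9, [(3, 3)]) -> 'abcdef'
```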

Parsing Results on Clean Data

(Wiki results are copied unchanged from the table above.)

| Section | Items | Coverage | Length | Time | Tokens |
|---------|------:|---------:|-------:|-----:|-------:|
| NLP, wiki | 11,558 | 86.4% | 18.0 | 8.2 | 238,059 |
| NLP, blog | 38,498 | 80.8% | 17.6 | 8.3 | 676,080 |
| Linux, wiki | 40,738 | 85.0% | 18.5 | 9.6 | 843,082 |
| Linux, blog | 64,520 | 82.3% | 13.2 | 5.7 | 854,157 |
| Linux, review | 13,430 | 80.4% | 19.8 | 9.2 | 266,063 |
| Linux, forum | 54,249 | 82.0% | 14.8 | 5.6 | 802,736 |
