
Report Extraction

Simon Bedford edited this page Apr 27, 2017 · 7 revisions

Overview

Report Extraction is based on a mixture of hand-crafted rules as well as machine learning models.

The rest of this document contains:

  1. A short description of the hand-crafted rule approach
  2. Specific modifications or decisions taken to adapt the rules-based approach to the competition guidelines
  3. A short description of the optimized approach for extracting each element (Reporting Unit, Reporting Term, Quantity and Location).
  4. More detail on the rules-based approach

1. Hand Crafted Rules

Based on numerous reviewed examples from the URLs provided for the competition, a number of rules were constructed to extract all possible reports of displacement events, capturing the different ways of referencing displaced persons encountered in those articles.

In general, a given article is split into sentences and each sentence is reviewed in turn in order to find a relevant reporting term and unit along with the right relationship between them.

The reporting terms typically are identified as being verbs within the sentence, and reporting units can either be subjects or objects of these verbs, although numerous 'special cases' exist.

The reporting terms and units that are used for this approach are provided as arguments to the Interpreter (*_reporting_terms and *_reporting_units).

A report is defined to exist within a sentence if the Interpreter is able to find a term and unit along with a relationship between them based on the parse tree and parts-of-speech tags (this is the minimum requirement for extracting a report; the quantity, location and date elements are optional).

The Quantity is extracted (where possible) by identifying all number-like entities within the sentence and then choosing the most likely one based upon its relationship to the underlying reporting term.

The Locations are extracted by looking for location-like entities within the sentence.

Similarly, all date-like entities are identified within the whole article, and then are parsed relative to the article publication date into datetimes, in order to identify the complete time-frame of the events mentioned.
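The sentence-by-sentence pass described above can be sketched as follows. This is a deliberately simplified, hypothetical illustration: the real Interpreter matches terms and units through a dependency parse and part-of-speech tags, not keyword sets, and the term/unit lists here are placeholders for the `*_reporting_terms` and `*_reporting_units` arguments.

```python
import re

# Illustrative stand-ins for the *_reporting_terms / *_reporting_units
# arguments passed to the Interpreter (not the real lists).
REPORTING_TERMS = {"displaced", "evacuated", "fled"}
REPORTING_UNITS = {"people", "persons", "families", "households"}

def extract_reports(article):
    """Split an article into sentences and keep each sentence containing
    both a reporting term and a reporting unit (the minimum requirement
    for a report); the quantity is optional."""
    reports = []
    for sentence in re.split(r"(?<=[.!?])\s+", article):
        words = {w.strip(".,").lower() for w in sentence.split()}
        terms = words & REPORTING_TERMS
        units = words & REPORTING_UNITS
        numbers = [w for w in sentence.split()
                   if w.replace(",", "").isdigit()]
        if terms and units:
            reports.append({
                "term": terms.pop(),
                "unit": units.pop(),
                "quantity": numbers[0] if numbers else None,
            })
    return reports

reports = extract_reports(
    "Floods hit the region. 2,000 people were displaced on Monday.")
```

The key structural point the sketch preserves is that a sentence with no term/unit pair yields no report at all, rather than a report with empty fields.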

2. Specific modifications or decisions taken to adapt the rules-based approach to the competition guidelines

Part of the competition evaluation process requires the processing of a file containing ~320 text excerpts, and for each excerpt identifying:

  • Reporting Unit (People vs. Households)
  • Reporting Term (chosen from a fixed list of 10 possibilities)
  • Displacement Quantity (a single number)
  • Specific location
  • Country code

However, the broader set of rules described above is a more general solution that differs in the following ways:

  1. Works best on complete articles and sentences rather than fragments
  2. No assumption that every sentence contains a report
  3. Multiple reports can be extracted for a given sentence
  4. Multiple locations and countries can be extracted for a given sentence
  5. The rules look for broader references to displacement events and units, outside of the fixed categories

Therefore, to maximize performance on the provided text excerpts, a number of choices were made to adapt the rules:

  • After reports have been extracted, the reporting Units are mapped to the pre-defined categories:
    • Units relating to structures or households -> 'Households'
    • Units relating to people or other individuals -> 'People'
  • Similarly, the identified reporting Terms are typically mapped to the category with the same lemma (e.g. terms containing 'displaced' -> 'Displaced'), with the following exceptions:
    • If 'camp' is in the term -> 'In Relief Camp'
    • If 'shelter' or 'accommodate' is in the term -> 'Sheltered'
    • If 'damage' is in the term -> 'Partially Destroyed Housing'
  • If multiple reports are encountered for a given excerpt, the following rules are used in order to choose the 'most relevant':
    • Events affecting People > Destroyed Housing > Damaged Housing
  • Similarly, if multiple countries are found for a given excerpt, the rules are:
    • The country that occurs most often or,
    • The first country mentioned
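The unit and term mappings above can be sketched as two small lookup functions. Function names and keyword sets here are illustrative, not the project's actual identifiers:

```python
# Hypothetical sketch of the post-extraction mapping rules; the word
# lists are illustrative stand-ins for the real matching logic.
def map_unit(unit):
    """Map a free-text reporting unit onto the two competition
    categories: structures -> 'Households', everything else -> 'People'."""
    structures = {"house", "houses", "home", "homes",
                  "household", "households", "huts", "shelters"}
    return "Households" if unit.lower() in structures else "People"

def map_term(term):
    """Map a reporting term to a fixed category, applying the special
    cases for camps, shelter/accommodation and damage."""
    t = term.lower()
    if "camp" in t:
        return "In Relief Camp"
    if "shelter" in t or "accommodate" in t:
        return "Sheltered"
    if "damage" in t:
        return "Partially Destroyed Housing"
    return t.capitalize()  # default: category sharing the term's lemma
```

Note that the substring checks mean morphological variants ('damaged', 'encamped') fall into the right category without explicit enumeration.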

3. Optimized Approach for Each Report Element

Reporting Unit

The rules-based approach on its own has average results (across classes) of:

  • Precision - 0.97
  • Recall - 0.60
  • F1 - 0.73

Separately, a Multinomial Naive Bayes classifier was trained using training data provided by the IDMC (133 text excerpts of which 117 are in English). Prior to training, the text excerpts were processed in order to:

  • Remove text in brackets
  • Remove named entities
  • Remove stop words
  • Lemmatize other words

Features were extracted using scikit-learn's word vectorizer over single words (unigrams).

The classifier on its own gives test results of:

  • Precision - 0.89
  • Recall - 0.90
  • F1 - 0.88

An optimized approach was created by combining the results of the hand-crafted rules with the classifier in the following way:

  1. If classifier and rules output match => Done
  2. If the rules do not find anything => Use the classifier output
  3. Otherwise => Use the rules output

This gives results of:

  • Precision - 0.93
  • Recall - 0.94
  • F1 - 0.93
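The three-step combination heuristic above can be sketched as a single function (the function name and the use of `None` for "rules found nothing" are illustrative assumptions):

```python
# Sketch of the rules/classifier merging heuristic described above.
def combine(rules_output, classifier_output):
    """Prefer agreement; fall back to the classifier only when the
    rules find nothing; otherwise trust the rules."""
    if rules_output == classifier_output:
        return rules_output        # 1. classifier and rules match
    if rules_output is None:
        return classifier_output   # 2. rules found nothing
    return rules_output            # 3. disagreement: use the rules
```

This ordering reflects the precision/recall trade-off in the numbers above: the rules are very precise (0.97) but miss cases (recall 0.60), so the classifier is used chiefly to fill in the gaps.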

Reporting Term

The rules-based approach on its own has average results (across classes) of:

  • Precision - 0.85
  • Recall - 0.58
  • F1 - 0.68

Separately, two machine-learning based models were trained:

Firstly, a Multinomial Naive Bayes classifier was trained using training data provided by the IDMC (the added complication here is that, due to the small size of the training data, many categories have few or zero examples).

Prior to training, the text excerpts were processed in order to:

  • Remove text in brackets
  • Remove named entities
  • Remove stop words
  • Lemmatize other words

Features were extracted using scikit-learn's word vectorizer over bi-grams.

This classifier on its own gives test results of:

  • Precision - 0.60
  • Recall - 0.55
  • F1 - 0.54

Secondly, features were extracted using a pre-trained Word2Vec model (from Google), and a Linear SVC classifier was trained using these features. (The feature vector for an excerpt was calculated by averaging the feature vector for each word in the excerpt).
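The excerpt-vector averaging can be sketched as below. The toy two-dimensional `embeddings` dict stands in for the pre-trained Google News Word2Vec lookup (which is 300-dimensional); the function name is illustrative:

```python
import numpy as np

# Sketch of averaging per-word embeddings into one excerpt vector.
# `embeddings` is a toy stand-in for a pre-trained Word2Vec model.
embeddings = {
    "people": np.array([1.0, 0.0]),
    "displaced": np.array([0.0, 1.0]),
}

def excerpt_vector(text, dim=2):
    """Average the embedding of every in-vocabulary word; an excerpt
    with no known words falls back to the zero vector."""
    vectors = [embeddings[w] for w in text.lower().split()
               if w in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)
```

Out-of-vocabulary words are simply skipped, so short excerpts dominated by named entities can collapse toward the zero vector; this is one reason the SVC's standalone scores stay modest.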

This classifier on its own gives test results of:

  • Precision - 0.52
  • Recall - 0.59
  • F1 - 0.55

The results of these two classifiers are then combined by averaging their predicted class probabilities and choosing the class with the highest probability, with results of:

  • Precision - 0.55
  • Recall - 0.65
  • F1 - 0.58
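The soft-voting step can be sketched as below; the class names and probability values are purely illustrative, and in practice the Linear SVC's scores would need calibrating into probabilities before they can be averaged:

```python
import numpy as np

# Sketch of the ensemble: average the two classifiers' predicted class
# probabilities and take the argmax (values are illustrative).
classes = ["Displaced", "Evacuated", "Sheltered"]
nb_probs = np.array([0.5, 0.3, 0.2])    # Naive Bayes probabilities
svc_probs = np.array([0.2, 0.6, 0.2])   # (calibrated) SVC probabilities

avg = (nb_probs + svc_probs) / 2
predicted_term = classes[int(np.argmax(avg))]
```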

Finally, an optimized approach was created by combining the results of the hand-crafted rules with the combined classifier, using similar heuristics as for the Unit above, giving results of:

  • Precision - 0.71
  • Recall - 0.75
  • F1 - 0.71

Quantity

The Displacement Quantity is extracted using the following algorithm:

  1. Use the quantity obtained from the most likely report found using the rules-based approach
  2. Use a second rule-based function that looks at the text excerpt, taking into account the predicted Unit, and attempts to find the number in the text that is closest to the word most closely associated with the Unit
  3. If neither of the above approaches yields a result then return the largest number found in the text

This approach gives results:

  • Precision - 0.83
  • Recall - 0.73
  • F1 - 0.76
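The three-step fallback cascade can be sketched as below. The function signature is an illustrative assumption: the first two arguments stand in for the outputs of the rules-based report and the unit-proximity function, with `None` meaning "no result":

```python
import re

# Sketch of the quantity fallback cascade (helper names hypothetical).
def extract_quantity(rules_quantity, unit_based_quantity, text):
    """Try the rules-based report's quantity, then the unit-proximity
    result, then the largest number found anywhere in the text."""
    if rules_quantity is not None:
        return rules_quantity
    if unit_based_quantity is not None:
        return unit_based_quantity
    numbers = [int(n.replace(",", ""))
               for n in re.findall(r"\d[\d,]*", text)]
    return max(numbers) if numbers else None
```

The largest-number fallback is a heuristic of last resort: displacement figures in these excerpts tend to be the biggest number mentioned, but it can clearly mis-fire on, say, years or monetary amounts.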

Location

The location and associated country are extracted using the following algorithm:

  1. Identify all possible location entities in the text (note that due to the short / incomplete nature of some of the fragments, some locations can be mis-tagged as, for example, Organizations or People; these entities are also included)
  2. For each extracted location, attempt to identify the relevant country*:
    • Note: the algorithm uses all of the extracted locations as context when trying to determine the country
  3. Choose the most likely location / country pair based upon the rules mentioned above:
    • The country that occurs the most times or,
    • The location / country that first appears in the text

Locations are mapped to countries using the following procedure:

  1. See if the location is a country name, using the Pycountry library
  2. See if the location is the name of a country subdivision, using the Pycountry library
  3. See if the location is a city name (based on a list of world cities with population > 3000)
  4. Otherwise, use the Mapzen web service
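The country-lookup cascade can be sketched as below. The stub dictionaries stand in for the Pycountry database, the world-cities list and the Mapzen geocoder; none of the names here are the project's real identifiers:

```python
# Hypothetical sketch of the location-to-country cascade; the lookup
# tables are toy stand-ins for Pycountry / the world-cities list.
COUNTRIES = {"kenya": "KE", "nepal": "NP"}
SUBDIVISIONS = {"rift valley": "KE"}
CITIES = {"kathmandu": "NP"}

def mapzen_lookup(name):
    """Placeholder for the Mapzen web-service call (step 4)."""
    return None

def country_code(location):
    """Try country name, then subdivision, then city (steps 1-3),
    falling back to the geocoding web service (step 4)."""
    name = location.lower()
    for table in (COUNTRIES, SUBDIVISIONS, CITIES):
        if name in table:
            return table[name]
    return mapzen_lookup(name)
```

Ordering the local lookups before the web-service call keeps the common cases fast and avoids unnecessary network requests.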

More Detail on the Rules-based Approach

Functionality implemented in the Interpreter class:
