
WeSearch_SentenceSegmentation

JonathonRead edited this page Jun 7, 2012 · 23 revisions

Desiderata

Some useful features of a sentence segmentation tool (not necessarily important for Lars Jørgen's thesis):

  • Domain/genre independent
  • Identification of non-linguistic segments
  • Mark-up aware
  • Mark-up normalisation
  • Handling unpunctuated text
  • Stand-off annotation

Existing Approaches

Mikheev

Mikheev, A. 2002. Periods, Capitalized Words, etc. Computational Linguistics 28(3).

Treats three related aspects of text normalisation: sentence boundary detection, disambiguation of capitalised words and identification of abbreviations. Sentence boundary detection uses four simple rules (where the concept of 'following' disregards brackets, quotation marks etc.):

  • If a period follows a nonabbreviation it is a sentence terminal
  • If a period follows an abbreviation and is the last token in a paragraph it is a sentence terminal
  • If a period follows an abbreviation and is not followed by a capitalised word it is not a sentence terminal
  • If a period follows an abbreviation and is followed by a capitalised word which is not a proper name, it is a sentence terminal

This yields very precise results, but leaves one ambiguous case uncovered: a period that follows an abbreviation and is itself followed by a proper name. Here Mikheev falls back on the majority baseline and assumes there is no sentence boundary.
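The four rules and the majority-baseline fallback can be sketched roughly as follows. This is only an illustration, not Mikheev's implementation: the function names, the token representation (periods as separate tokens, one paragraph per call) and the hand-supplied `abbreviations` and `proper_names` sets are all assumptions. In the paper those two decisions come from the other components of the system.

```python
# Tokens that 'following' skips over (brackets, quotation marks etc.).
SKIPPABLE = {'"', "'", "(", ")", "[", "]", "``", "''"}


def next_word(tokens, i):
    """Return the first token after position i that is not a bracket or
    quotation mark, or None if the paragraph ends first."""
    for tok in tokens[i + 1:]:
        if tok not in SKIPPABLE:
            return tok
    return None


def is_sentence_terminal(tokens, i, abbreviations, proper_names):
    """Decide whether the period at tokens[i] ends a sentence.

    tokens        -- the tokens of a single paragraph (hypothetical format)
    abbreviations -- lower-cased abbreviations without their period
    proper_names  -- capitalised words known to be proper names
    """
    previous = tokens[i - 1].lower() if i > 0 else ""
    following = next_word(tokens, i)

    if previous not in abbreviations:
        return True    # rule 1: period after a nonabbreviation
    if following is None:
        return True    # rule 2: abbreviation, last token in the paragraph
    if not following[0].isupper():
        return False   # rule 3: abbreviation, no capitalised word follows
    if following not in proper_names:
        return True    # rule 4: abbreviation, capitalised common word follows
    return False       # ambiguous case: majority baseline, no boundary


tokens = ["He", "saw", "Dr", ".", "Smith", "."]
print(is_sentence_terminal(tokens, 3, {"dr"}, {"Smith"}))  # period after "Dr"
print(is_sentence_terminal(tokens, 5, {"dr"}, {"Smith"}))  # final period
```

The last `return False` is the only place where the sketch guesses. Everything above it corresponds to one of the four rules.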

MxTerminator

Reynar, J. and Ratnaparkhi, A. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing.

Punkt

Kiss, T. and Strunk, J. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32(4).

Implemented in the NLTK.

RASP

Briscoe, T., Carroll, J. and Watson, R. 2006. The Second Release of the RASP System. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions.

Uses deterministic finite-state rules based on the immediate context (capitals, other punctuation etc.) to distinguish between periods used to end sentences and those used to end abbreviations (including titles and initials). The program assumes there is a sentence boundary wherever there is a blank line, or whitespace preceded by valid sentence final punctuation and followed by a capital letter. Jonathon has the source code... anyone know Flex!?
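As a rough illustration of the boundary assumption only (blank line, or whitespace between sentence-final punctuation and a capital letter), here is a minimal Python sketch. It is not a port of the RASP Flex rules: the regular expression, the choice of `.`, `!` and `?` as valid sentence-final punctuation, and the function name are assumptions, and the abbreviation handling is left out entirely.

```python
import re

# A boundary is either a blank line, or whitespace that is preceded by
# sentence-final punctuation and followed by a capital letter.
BOUNDARY = re.compile(r"\n\s*\n|(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text at every assumed boundary and drop empty segments."""
    return [s.strip() for s in BOUNDARY.split(text) if s.strip()]


print(split_sentences("It rained. We left.\n\nnext paragraph"))
print(split_sentences("Dr. Smith arrived."))
```

The second call splits after "Dr.", which shows why the abbreviation, title and initial rules are needed on top of this assumption.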

Satz

Palmer, D. and Hearst, M. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics 23(2).

Splitta

Gillick, D. 2009. Sentence Boundary Detection and the Problem with the U.S. In Proceedings of NAACL HLT 2009: Short Papers.

Stanford CoreNLP

Only the usage is documented, but the splitter seems to rely on sets of (1) acceptable sentence boundary tokens; (2) tokens commonly following sentence boundaries; and (3) sentence boundary tokens to ignore. A major advantage is that it returns sentences with character offsets pointing back to the source text.
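A minimal sketch of how a splitter driven by three such token sets could return character offsets is given below. This is not the CoreNLP API or its code: the set contents, the regular-expression tokeniser and all names are assumptions, and set (3) is read here as tokens that mark a boundary but are left out of the returned sentence.

```python
import re

BOUNDARY_TOKENS = {".", "!", "?"}            # (1) tokens that may end a sentence
BOUNDARY_FOLLOWERS = {")", "]", '"', "'"}    # (2) tokens kept with the sentence just ended
DISCARDED_BOUNDARIES = {"\n"}                # (3) boundaries left out of the output

# Hypothetical tokeniser: words, newlines, and single punctuation characters.
TOKEN = re.compile(r"\w+|\n|[^\w\s]")


def split_with_offsets(text):
    """Return (start, end) character offsets of each sentence in text."""
    spans = []
    start = end = None   # offsets of the sentence being built
    closed = False       # True once a boundary token has been seen
    for match in TOKEN.finditer(text):
        tok = match.group()
        if tok in DISCARDED_BOUNDARIES:
            if start is not None:
                spans.append((start, end))
            start, closed = None, False
            continue
        # After a boundary token, only followers or further boundary
        # tokens are attached; anything else opens a new sentence.
        if closed and tok not in BOUNDARY_FOLLOWERS and tok not in BOUNDARY_TOKENS:
            spans.append((start, end))
            start, closed = None, False
        if start is None:
            start = match.start()
        end = match.end()
        if tok in BOUNDARY_TOKENS:
            closed = True
    if start is not None:
        spans.append((start, end))
    return spans


text = 'He said "Stop." Then he left.\nDone'
for begin, finish in split_with_offsets(text):
    print(begin, finish, repr(text[begin:finish]))
```

Because only offsets are returned, the source text is never modified, which fits the stand-off annotation item in the desiderata. The sketch cannot tell opening from closing quotation marks, so a quotation mark that opens the next sentence is wrongly attached to the previous one.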

Related Work

Baldwin, T. and Joseph, M. P. A. K. 2009. Restoring Punctuation and Casing in English Text. In Lecture Notes in Computer Science, Volume 5866/2009.
