WeSearch_SentenceSegmentation
Some useful features of a sentence segmentation tool (not necessarily important for Lars Jørgen's thesis):
- Domain/genre independent
- Identification of non-linguistic segments
- Mark-up aware
- Mark-up normalisation
- Handling unpunctuated text
- Stand-off annotation
Mikheev, A. 2002. Periods, Capitalized Words, etc. Computational Linguistics 28(3).
Reynar, J. and Ratnaparkhi, A. 1997. A Maximum Entropy Approach to Identifying Sentence Boundaries. In Proceedings of the Fifth ACL Conference on Applied Natural Language Processing.
Kiss, T. and Strunk, J. 2006. Unsupervised Multilingual Sentence Boundary Detection. Computational Linguistics 32(4).
Implemented in NLTK as the Punkt sentence tokenizer.
Briscoe, T., Carroll, J. and Watson, R. 2006. The Second Release of the RASP System. In Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions.
Uses deterministic finite-state rules over the immediate context (capitalisation, other punctuation, etc.) to distinguish periods that end sentences from periods that end abbreviations (including titles and initials). The program assumes a sentence boundary wherever there is a blank line, or whitespace preceded by valid sentence-final punctuation and followed by a capital letter. Jonathon has the source code... anyone know Flex!?
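As a rough illustration of the boundary assumption described above (not the RASP Flex rules themselves, which also handle abbreviations, titles and initials), a minimal Python sketch could look like the following. The function name `split_sentences` and the punctuation set `.!?` are assumptions made for this example only.

```python
import re

# Assumed boundary conditions, taken from the prose description only:
#   1. a blank line, or
#   2. whitespace preceded by sentence-final punctuation and followed by a capital.
# The punctuation set ".!?" is an assumption; the real rule set is richer.
_BOUNDARY = re.compile(
    r"\n[ \t]*\n\s*"              # a blank line (possibly containing spaces or tabs)
    r"|(?<=[.!?])\s+(?=[A-Z])"    # whitespace between final punctuation and a capital
)


def split_sentences(text):
    """Split text at the assumed boundaries and drop empty pieces."""
    pieces = (piece.strip() for piece in _BOUNDARY.split(text))
    return [piece for piece in pieces if piece]


if __name__ == "__main__":
    sample = "It works. So does this.\n\nnew paragraph"
    for sentence in split_sentences(sample):
        print(repr(sentence))
```

Without abbreviation handling this naive version splits "Dr. Smith arrived." after "Dr.", which is exactly the ambiguity the context rules are meant to resolve.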
Palmer, D. and Hearst, M. 1997. Adaptive Multilingual Sentence Boundary Disambiguation. Computational Linguistics 23(2).
Gillick, D. 2009. Sentence Boundary Detection and the Problem with the U.S. In Proceedings of NAACL HLT 2009: Short Papers.
Only the usage is documented, but the tool seems to rely on three sets of tokens: (1) acceptable sentence boundary tokens; (2) tokens that commonly follow sentence boundaries; and (3) sentence boundary tokens to ignore. A major advantage is that it returns sentences with character offsets pointing back into the source text.
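Since the internals are undocumented, the following is only a guess at how three such token sets could combine to produce stand-off output. The set contents, the name `segment_with_offsets`, and the decision rule are all hypothetical and are not taken from the tool.

```python
import re

# Hypothetical token sets, invented for illustration only.
BOUNDARY_CHARS = {".", "!", "?"}               # (1) acceptable boundary tokens
FOLLOWERS = {"The", "He", "She", "It", "We"}   # (2) tokens that often follow a boundary
IGNORE = {"Dr.", "Mr.", "e.g.", "i.e."}        # (3) boundary-like tokens to ignore


def segment_with_offsets(text):
    """Return (start, end) character spans such that text[start:end] is a sentence.

    The source text is never modified, so the spans are stand-off annotations.
    """
    spans = []
    start = None
    tokens = list(re.finditer(r"\S+", text))
    for i, match in enumerate(tokens):
        if start is None:
            start = match.start()
        token = match.group()
        following = tokens[i + 1].group() if i + 1 < len(tokens) else None
        is_candidate = token[-1] in BOUNDARY_CHARS and token not in IGNORE
        # Assumed rule: accept a candidate at end of text or before a known follower.
        if is_candidate and (following is None or following in FOLLOWERS):
            spans.append((start, match.end()))
            start = None
    if start is not None:
        # Flush trailing text that has no final punctuation.
        spans.append((start, tokens[-1].end()))
    return spans


if __name__ == "__main__":
    source = "Dr. Smith arrived. He sat down. It rained"
    for begin, end in segment_with_offsets(source):
        print(begin, end, repr(source[begin:end]))
```

Returning spans rather than copied strings keeps any mark-up or whitespace in the source recoverable, which is the stand-off property listed among the useful features at the top of the page.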
Baldwin, T. and Joseph, M. P. A. K. 2009. Restoring Punctuation and Casing in English Text. In Lecture Notes in Computer Science, Volume 5866/2009.