Description
This is a follow-up of #953
Add a new feature to the Token to represent the "form" of a token. Normally, this is the underlying text. However, a tokenizer may choose to set this feature differently to establish a basic normalization without having to materialize this normalization in the underlying text. Tokenizers that do context-sensitive normalization in particular might profit from this, e.g. PTB quote normalization, where the left/right context of a quote needs to be taken into account to identify it as an opening or closing quote.
In order to save space, it would be conceivable to implement a custom getter that returns getCoveredText() if the feature is not explicitly set. Likewise, the setter would set the feature internally to null if the form corresponds to getCoveredText().
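The space-saving getter/setter idea could look roughly like the sketch below. It is only an illustration: `SketchToken` is a hypothetical stand-in that holds the document text and offsets itself, not the actual Token type, and `isFormStored()` exists only to make the internal null visible.

```java
// Hypothetical stand-in for a token annotation; NOT the real Token class.
class SketchToken {
    private final String documentText; // full text of the document
    private final int begin;           // start offset of the token
    private final int end;             // end offset of the token (exclusive)
    private String form;               // null means "same as the covered text"

    SketchToken(String documentText, int begin, int end) {
        this.documentText = documentText;
        this.begin = begin;
        this.end = end;
    }

    // Text of the token as it appears in the underlying document.
    String getCoveredText() {
        return documentText.substring(begin, end);
    }

    // Falls back to the covered text if no form was explicitly set.
    String getForm() {
        return form != null ? form : getCoveredText();
    }

    // Stores null internally when the form equals the covered text.
    void setForm(String newForm) {
        if (newForm == null || newForm.equals(getCoveredText())) {
            form = null;
        } else {
            form = newForm;
        }
    }

    // Assumed helper for illustration: true only if a differing form is stored.
    boolean isFormStored() {
        return form != null;
    }
}
```

With this sketch, a tokenizer doing PTB quote normalization would call `setForm` with the normalized opening quote on the token covering `"`, while tokens whose form equals their covered text would store nothing extra.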
- switch segmenters to provide token form when generating token
- add option to SegmenterBase to suppress setting of form
- add the new feature to the UML diagrams in the type system documentation
Open questions:
- How to deal with cases where we currently call e.g. Sentence.getCoveredText() or NamedEntity.getCoveredText() which bypass the tokens and go directly to the CAS text? -- Probably the covered text should be used... not really sure yet.
- When using writers such as the CoNLL writers, should the CAS text be written or the text from the tokens? -- That should be configurable via a `PARAM_WRITE_COVERED_TEXT` which should be off by default. If the user uses a normalizing segmenter, that should by default be respected.
- What to pass to the ML algorithm in trainer components? -- That should be configurable via a `PARAM_USE_COVERED_TEXT` which should be off by default. If the user uses a normalizing segmenter, that should by default be respected.
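
To make the proposed default concrete, here is a minimal sketch of how a writer or trainer might choose between the two texts. `TokenTextChooser` and `textToUse` are hypothetical names, and the boolean stands for whatever value `PARAM_WRITE_COVERED_TEXT` or `PARAM_USE_COVERED_TEXT` would carry; none of this is existing API.

```java
// Hypothetical helper; illustrates only the proposed default behavior.
class TokenTextChooser {
    /**
     * @param coveredText    token text as found in the CAS document text
     * @param form           normalized form set by the segmenter, or null if none
     * @param useCoveredText value of the proposed parameter (off by default)
     */
    static String textToUse(String coveredText, String form, boolean useCoveredText) {
        // Flag on, or no form available: use the original document text.
        if (useCoveredText || form == null) {
            return coveredText;
        }
        // Default: respect the normalization done by the segmenter.
        return form;
    }
}
```

With the flag off, a token covering `"` with the form set to the PTB opening quote would be written in its normalized form; with the flag on, the original `"` would be written.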