Introduce "form" feature on tokens (1.12.0)

This is a follow-up to #953.

Add a new feature to Token that represents the "form" of a token. A tokenizer may set this feature to something other than the text the token covers in order to establish a basic normalization without having to materialize that normalization in the underlying text. Tokenizers that perform context-sensitive normalization would profit from this in particular, e.g. PTB quote normalization, where the left/right context of a quote needs to be taken into account to identify it as an opening or closing quote.
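To illustrate the quote case, here is a minimal sketch of a toy tokenizer that leaves the text untouched and records the PTB-style quote only in the token form. The `Tok` class, the `tokenize` method and the left-context heuristic are hypothetical stand-ins for illustration, not the DKPro Core type system or any existing segmenter.

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteFormSketch {
    /** Hypothetical stand-in for a Token annotation: offsets into the text plus a form. */
    static final class Tok {
        final int begin;
        final int end;
        final String form;

        Tok(int begin, int end, String form) {
            this.begin = begin;
            this.end = end;
            this.form = form;
        }
    }

    /** Assumed heuristic: a quote opens at text start, after whitespace or after an opening bracket. */
    static String ptbQuoteForm(String text, int begin) {
        boolean opening = begin == 0
                || Character.isWhitespace(text.charAt(begin - 1))
                || "([{".indexOf(text.charAt(begin - 1)) >= 0;
        return opening ? "``" : "''";
    }

    /** Toy tokenizer: splits on whitespace and treats every double quote as its own token. */
    static List<Tok> tokenize(String text) {
        List<Tok> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '"') {
                // The text keeps the plain quote; only the form is normalized.
                tokens.add(new Tok(i, i + 1, ptbQuoteForm(text, i)));
                i++;
            } else {
                int start = i;
                while (i < text.length()
                        && !Character.isWhitespace(text.charAt(i))
                        && text.charAt(i) != '"') {
                    i++;
                }
                tokens.add(new Tok(start, i, text.substring(start, i)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "She said \"hello\" twice.";
        for (Tok t : tokenize(text)) {
            System.out.println(text.substring(t.begin, t.end) + " -> " + t.form);
        }
    }
}
```

Both quote tokens cover the same character in the text, but their forms differ depending on the left context.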

Since the form usually equals the underlying text, it would be conceivable, in order to save space, to implement a custom getter that returns getCoveredText() if the feature is not explicitly set. Likewise, the setter would store null internally if the form is equal to getCoveredText().
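The getter/setter idea could look roughly like the sketch below. It assumes the fallback is the text covered by the token. `FormToken` and `hasExplicitForm` are hypothetical names used only for this illustration and do not reflect the generated JCas class.

```java
public class FormToken {
    private final String documentText;
    private final int begin;
    private final int end;
    // null means "same as the covered text", so nothing extra is stored
    private String form;

    public FormToken(String documentText, int begin, int end) {
        this.documentText = documentText;
        this.begin = begin;
        this.end = end;
    }

    public String getCoveredText() {
        return documentText.substring(begin, end);
    }

    /** Returns the explicit form if one is set, otherwise falls back to the covered text. */
    public String getForm() {
        return form != null ? form : getCoveredText();
    }

    /** Stores the form only if it differs from the covered text. */
    public void setForm(String newForm) {
        if (newForm == null || newForm.equals(getCoveredText())) {
            form = null;
        } else {
            form = newForm;
        }
    }

    /** Hypothetical helper that exposes whether a distinct form is stored. */
    public boolean hasExplicitForm() {
        return form != null;
    }

    public static void main(String[] args) {
        FormToken quote = new FormToken("Hello \"world\"", 6, 7);
        quote.setForm("``");
        System.out.println(quote.getForm() + " explicit=" + quote.hasExplicitForm());
        quote.setForm("\"");
        System.out.println(quote.getForm() + " explicit=" + quote.hasExplicitForm());
    }
}
```

With this design, callers always go through `getForm()` and never need to know whether a value is physically stored.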

- [ ] switch segmenters to set the token form when generating tokens
- [ ] add an option to SegmenterBase to suppress setting the form
- [ ] add the feature to the UML diagrams in the type system documentation

----

Open questions:

- How should we deal with cases where we currently call e.g. Sentence.getCoveredText() or NamedEntity.getCoveredText(), which bypass the tokens and go directly to the CAS text? -- Probably the covered text should be used, but this is not settled yet.
- When using writers such as the CoNLL writers, should the CAS text or the token form be written? -- That should be configurable via a `PARAM_WRITE_COVERED_TEXT` parameter which should be off by default. If the user uses a normalizing segmenter, that should be respected by default.
- What should be passed to the ML algorithm in trainer components? -- That should be configurable via a `PARAM_USE_COVERED_TEXT` parameter which should be off by default. If the user uses a normalizing segmenter, that should be respected by default.
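The proposed switch for writers and trainers could be sketched as follows. The boolean argument stands in for the value of `PARAM_WRITE_COVERED_TEXT` or `PARAM_USE_COVERED_TEXT`. The class and method names are hypothetical, and the null fallback assumes that an unset form means "same as the covered text".

```java
public class TokenTextChoice {
    /**
     * Picks the string a writer or trainer would use for a token.
     *
     * @param coveredText the text the token covers in the CAS
     * @param form the token form, or null if none is set (assumed to mean "same as covered text")
     * @param useCoveredText value of the proposed parameter; false by default
     */
    static String textFor(String coveredText, String form, boolean useCoveredText) {
        if (useCoveredText || form == null) {
            return coveredText;
        }
        return form;
    }

    public static void main(String[] args) {
        // Default (parameter off): the normalized form produced by the segmenter is respected.
        System.out.println(textFor("\"", "``", false));
        // Parameter on: the original CAS text is used instead.
        System.out.println(textFor("\"", "``", true));
    }
}
```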


Introduce "form" feature on tokens (1.12.0) #1168
