Description
Following the discussion in #62, we concluded that it would make sense to have a generic way to validate the content of ALTO files in more detail. Several ideas were discussed at board meetings:
- Use XSD 1.1 and asserts to implement consistency checks, such as requiring a TextBlock box to be fully included in the page box (no negative coordinates and no coordinates larger than the page width/height). There are two main concerns in this case: first, far fewer open source validation tools support XSD 1.1 than 1.0; second, if we add these checks to the XSD validation, the level of restriction would be too high and the checks would become mandatory, creating a lot of trouble for both ALTO creators and consumers.
- Use a separate Schematron schema (https://en.wikipedia.org/wiki/Schematron) as an add-on to the default XSD validation. This new schema can optionally be used in a validation pipeline for ALTO files by users who would like more restrictive checks (closer to quality checks than to structural checks).
Based on the board discussions, we should continue with option 2 (the separate Schematron schema).
On this topic we would like to collect as many ideas as possible for Schematron validation in order to create a list of checks to be implemented. For each proposed test, please also propose a severity level. For the moment I would propose ERROR, WARNING and INFO as possible levels, just as a starting point.
Currently the following tests and categories of tests have been proposed:
- Coordinate checks, starting from "Restrict float attribute values where possible to allow for better xml-validation" (#62) and extended to all boundaries: not only positive coordinates, but also all values or combinations of values (such as VPOS + HEIGHT < page HEIGHT) should lie inside the page/PrintSpace/margin boundaries
- Overlap checks: even though ALTO does not require zero overlaps, overlapping elements might indicate issues
- Parent elements without children (for example a TextLine without any String inside)
- Any string encoding issues
- Meaningful usage of optional information: for example, even though VPOS and HPOS are optional in the schema, it might be a good idea to flag cases where any of them are missing, either as errors or at least as warnings
- Language-specific checks (for example, in Chinese each glyph should usually be encoded as its own word, and two Chinese glyphs in the same word are considered incorrect by some ALTO processors)
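To make a few of the proposed checks more concrete, here is a minimal Python sketch of the logic only. It is not Schematron and not a proposed implementation. It covers the coordinate check, the missing HPOS/VPOS check and the empty TextLine check. The ALTO v4 namespace, the sample document, the function name `check_page` and the severity assigned to each check are all assumptions made for illustration. The sketch also assumes that Page and TextBlock coordinates use the same measurement unit.

```python
import xml.etree.ElementTree as ET

# Assumption: ALTO v4 namespace; adjust for the ALTO version being checked.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
NS = {"alto": ALTO_NS}

# Hypothetical sample: TB1 exceeds the page height, TB2 has no HPOS/VPOS,
# TL1 has no String child.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout>
    <Page ID="P1" WIDTH="1000" HEIGHT="1500">
      <PrintSpace HPOS="0" VPOS="0" WIDTH="1000" HEIGHT="1500">
        <TextBlock ID="TB1" HPOS="100" VPOS="1400" WIDTH="300" HEIGHT="200">
          <TextLine ID="TL1" HPOS="100" VPOS="1400" WIDTH="300" HEIGHT="50"/>
        </TextBlock>
        <TextBlock ID="TB2" WIDTH="300" HEIGHT="200">
          <TextLine ID="TL2" HPOS="10" VPOS="10" WIDTH="300" HEIGHT="50">
            <String CONTENT="ok" HPOS="10" VPOS="10" WIDTH="40" HEIGHT="50"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def check_page(page):
    """Return a list of (severity, element ID, message) tuples for one Page."""
    findings = []
    page_w = float(page.get("WIDTH"))
    page_h = float(page.get("HEIGHT"))

    for block in page.iter(f"{{{ALTO_NS}}}TextBlock"):
        block_id = block.get("ID")
        hpos, vpos = block.get("HPOS"), block.get("VPOS")
        if hpos is None or vpos is None:
            # Optional in the schema, but flagged here (severity is a placeholder).
            findings.append(("WARNING", block_id, "HPOS or VPOS is missing"))
            continue
        x, y = float(hpos), float(vpos)
        w = float(block.get("WIDTH", 0))
        h = float(block.get("HEIGHT", 0))
        # The box must lie fully inside the page box.
        if x < 0 or y < 0 or x + w > page_w or y + h > page_h:
            findings.append(("ERROR", block_id, "box extends outside the page"))

    for line in page.iter(f"{{{ALTO_NS}}}TextLine"):
        if line.find("alto:String", NS) is None:
            findings.append(("WARNING", line.get("ID"), "TextLine has no String"))

    return findings


root = ET.fromstring(SAMPLE)
findings = check_page(root.find(".//alto:Page", NS))
for severity, element_id, message in findings:
    print(f"{severity}: {element_id}: {message}")
```

In a Schematron schema each of these conditions would presumably become an assertion with its own severity. The Python version is only meant to help discuss which conditions and levels we actually want.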
Please add your own ideas and add detail to the test categories listed above, so that in the end we can create a list of tests to be implemented together with their severity level. The Schematron schema would be optional, but should serve as a guideline of good practices for creating ALTO files.