Make it possible to distinguish hard and soft hyphens #86

@urieli

Currently there is no way to distinguish hard and soft hyphens in HYP elements.

Example of a hard hyphen:

I separated the words by a non-
breaking space.

Example of a soft hyphen:

I separated the words by a non-break-
ing space.

Since an OCR system can often distinguish the two (e.g. by checking a lexicon of known words), it should be able to pass this information to downstream systems in the ALTO file. The distinction can affect both OCR-to-text conversion and OCR-layer indexing strategies.

I suggest changing the HYP element to include a new HARD_HYPHEN attribute, as follows:

<xsd:element name="HYP" minOccurs="0">
  <xsd:annotation>
    <xsd:documentation>A hyphenation char. Can appear only at the end of a line.</xsd:documentation>
  </xsd:annotation>
  <xsd:complexType>
    <xsd:attribute name="HEIGHT" type="xsd:float" use="optional"/>
    <xsd:attribute name="WIDTH" type="xsd:float" use="optional"/>
    <xsd:attribute name="HPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="VPOS" type="xsd:float" use="optional"/>
    <xsd:attribute name="CONTENT" type="xsd:string" use="required"/>
    <xsd:attribute name="HARD_HYPHEN" type="xsd:boolean" use="optional">
      <xsd:annotation>
        <xsd:documentation>True if this is a hard hyphen (it would appear in the word regardless of print location), false if this is a soft hyphen (it only appears because the word is split at the end of a line).</xsd:documentation>
      </xsd:annotation>
    </xsd:attribute>
  </xsd:complexType>
</xsd:element>
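
To illustrate how a downstream OCR-to-text step might use the proposed attribute, here is a minimal Python sketch. It is only an illustration under stated assumptions: the sample fragments are hypothetical and omit the ALTO namespace, the function name `block_to_text` is made up for this example, and treating a HYP without HARD_HYPHEN as a soft hyphen is a choice of the sketch rather than something the proposal specifies.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO fragments (namespace omitted for brevity) that use the
# proposed HARD_HYPHEN attribute on HYP.
HARD_SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="a"/><SP/><String CONTENT="non"/><HYP CONTENT="-" HARD_HYPHEN="true"/>
  </TextLine>
  <TextLine>
    <String CONTENT="breaking"/><SP/><String CONTENT="space."/>
  </TextLine>
</TextBlock>
"""

SOFT_SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="a"/><SP/><String CONTENT="non-break"/><HYP CONTENT="-" HARD_HYPHEN="false"/>
  </TextLine>
  <TextLine>
    <String CONTENT="ing"/><SP/><String CONTENT="space."/>
  </TextLine>
</TextBlock>
"""


def block_to_text(block_xml: str) -> str:
    """Join the lines of one TextBlock into running text.

    Hard hyphens are kept; soft hyphens are dropped so the two word
    halves are rejoined. Assumption of this sketch: a HYP without the
    HARD_HYPHEN attribute is treated as soft.
    """
    block = ET.fromstring(block_xml.strip())
    parts = []
    for line in block.iter("TextLine"):
        children = list(line)
        for el in children:
            if el.tag == "String":
                parts.append(el.get("CONTENT", ""))
            elif el.tag == "SP":
                parts.append(" ")
            elif el.tag == "HYP":
                # xsd:boolean accepts "true" and "1" as true values.
                if el.get("HARD_HYPHEN") in ("true", "1"):
                    parts.append(el.get("CONTENT", "-"))
        # A line ending in HYP continues the same word on the next line,
        # so no separating space is added in that case.
        if not (children and children[-1].tag == "HYP"):
            parts.append(" ")
    return "".join(parts).strip()
```

With these samples, both fragments yield the same running text, because the hard hyphen is preserved and the soft one is removed before the word halves are joined. Without the attribute, a consumer would have to guess, for example by consulting its own lexicon.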
