OCR-D/gt_structure_text

gt_structure_text

The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. Since 2017, the corpus has been further curated and supplemented with metadata where appropriate. It consists of PAGE XML files that contain annotations of the text and structure. The data is based on transcription data stored in the German Text Archive (DTA) (https://www.deutschestextarchiv.de/).

Metadata

Language:
eng, fra, deu, heb, lat
Format:
Page-XML
Time:
1500-1900
GT Type:
data_structure_and_text
License:
CC-BY-SA-4.0
Transcription Guidelines:
OCR-D Ground Truth Guidelines https://ocr-d.de/en/gt-guidelines/trans/
Project:
OCR-D
Project-URL:
https://ocr-d.de/

Sources

The volume of transcriptions:

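| TextLine | Page | TxtRegion | ImgRegion | GraphRegion | TabRegion | SepRegion | MathRegion | MusicRegion | NoiseRegion |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6609 | 217 | 1648 | 1 | 74 | 3 | 141 | 1 | 4 | 17 |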
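
Counts like the ones above can be recomputed from the PAGE XML files themselves. The following is a minimal sketch, not the tooling used to produce the published figures. It assumes that the column abbreviations correspond to PAGE element names (for example, TxtRegion to `TextRegion` and SepRegion to `SeparatorRegion`), and it uses a small inline sample document rather than a real file from the corpus. The function name `count_page_elements` and the sample content are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical, heavily shortened PAGE XML document used only for illustration.
SAMPLE_PAGE_XML = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="sample.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>first line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <SeparatorRegion id="r2"/>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def count_page_elements(xml_text):
    """Count Page, TextLine and all *Region elements in one PAGE XML document."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        name = local_name(element.tag)
        # Assumption: every structural region element name ends in 'Region'.
        if name in ("Page", "TextLine") or name.endswith("Region"):
            counts[name] += 1
    return counts


counts = count_page_elements(SAMPLE_PAGE_XML)
print(dict(counts))
```

Matching on the local tag name rather than a fixed namespace keeps the sketch independent of the PAGE schema version used by a given file. To process a file on disk, the content can be read and passed to the same function, and the resulting counters can be summed across files.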

List of transcriptions

document TxtRegion ImgRegion LineDrawRegion GraphRegion TabRegion ChartRegion SepRegion MathRegion ChemRegion MusicRegion AdRegion NoiseRegion UnknownRegion CustomRegion TextLine Page
nn_mirabilia_1500 10 2 58 3
herder_geschichte03_1787 5 3 14 1
lohenstein_agrippina_1665 56 3 1 109 3
blumenbach_anatomie_1805 20 84 3
nn_besuch_1780 5 3 1 76 4
justi_abhandlung01_1758 37 1 1 131 4
reinkingk_policey_1653_teil2 21 1 108 2
wecker_kochbuch_1598 35 156 4
aventinus_grammatica_1515 29 19 1 129 3
dannhauer_catechismus10_1673 18 151 4
bohse_helicon_1696 35 3 2 121 5
brenz_abentmal_1550 22 89 4
trota_mordtbrenner_1540 20 2 44 2
nn_vertrag_1525 5 35 2
basilius_legendi_1515 12 2 82 3
schiller_raeuber_1781 15 2 54 2
hohberg_georgica01_1682_teil1 14 3 66 2
glauber_opera01_1658 127 3 2 376 6
petrarca_psalmi_1506 13 2 64 3
ruempler_gartenbau_1882 105 2 3 9 1 6
bebel_frau_1879 20 3 164 4
oesterreicher_sachsen_1548 8 2 48 2
praetorius_syntagma02_1619_teil1 72 1 4 168 4
rhegius_artzney_1529 12 1 80 3
karlstadt_sermon_1523 5 1 1 65 2
aepinus_bekentnis_1548 20 3 101 4
nn_historia_1500 5 1 35 2
arnimb_goethe03_1835 5 1 22 1
witzstat_buchszbaum_1540 13 47 2
gerstner_mechaniktafeln01_1831 2 1 2 1
vespucci_insule_1506 7 62 2
buerger_gedichte_1778 14 6 52 2
ballenstedt_delatio_1777 26 3 98 3
nn_lied_1520 5 1 1 22 1
vischer_aesthetikregister_1858 1 1
kistler_kraeuter_1500 14 58 2
luther_auszlegunge_1520 10 59 2
bernd_lebensbeschreibung_1738 15 4 1 71 3
pistoris_regiment_1506 12 90 3
alberti_pictura_1540 22 1 94 3
luz_blitz_1784 17 1 4 110 4
meyfart_rhetorica_1634 27 4 113 4
loeber_heuschrecken_1693 15 1 3 87 3
weigel_gnothi02_1618 22 1 128 4
kant_aufklaerung_1784 15 4 55 2
rollenhagen_reysen_1603 22 1 81 3
heyden_paedono_1548 19 72 3
reinkingk_policey_1653_teil1 20 1 146 3
benner_herrnhuterey04_1748 37 6 144 4
lessing_menschengeschlecht_1780 8 1 15 1
praetorius_syntagma02_1619_teil2 30 1 5 136 4
sachs_drey_1553 7 54 2
euler_rechenkunst01_1738 94 8 31 234 6
silesius_seelenlust01_1657 38 1 7 4 137 5
hohberg_georgica01_1682_teil2 27 159 2
osiander_predigt_1553 7 57 2
pinder_epiphanie_1506 31 1 5 169 4
boeschenstain_gedicht_1520 9 1 45 1
hilbert_zahlkoerper_1897 46 4 5
praetorius_verrichtung_1668 38 2 197 5
huebner_handbuch_1696 26 4 4 78 3
laube_europa0202_1837 15 2 7 43 5
nn_lied_1515 6 25 1
luther_babstum_1526 7 2 51 2
valentinus_occulta_1603 22 1 1 164 6
arnold_ketzerhistorie01_1699 43 6 378 4
clauren_mimil_1815 44 1 206 9
calvi_beutelschneider01_1627 21 3 87 3
estor_rechtsgelehrsamkeit02_1758 44 1 3 153 4

Extent

This section is reserved for additional information, instructions, or notes.
