PyPXML

A Python library for parsing, creating, and modifying PageXML files.

Setup

Note: PyPXML requires Python >= 3.11.

Install from PyPI

pip install pypxml

Install upstream from source

Clone repository: git clone https://github.com/jahtz/pypxml
Install package: cd pypxml && pip install .

API

PyPXML provides a feature-rich Python API for working with PageXML files.

Full documentation

CLI

$ pypxml --help
Usage: pypxml [OPTIONS] COMMAND [ARGS]...

  A python library for parsing, converting and modifying PageXML files.

Options:
  --help     Show this message and exit.
  --version  Show the version and exit.

Commands:
  get-codec           Extract the character set from PageXML files.
  get-custom          List all custom region attributes in PageXML files.
  get-regions         List all regions in PageXML files.
  get-text            Extract text from PageXML files.
  regularize-codec    Regularize character encodings in PageXML files.
  regularize-regions  Regularize region types in PageXML files.

analytics

get-codec

$ pypxml get-codec --help
Usage: pypxml get-codec [OPTIONS] FILES...

  This tool analyzes the text content of PageXML files and extracts the set of
  characters used.

  It can optionally normalize unicode, remove whitespace, and output character
  frequencies. Results are printed to the console or saved as a CSV file.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output FILE               Path to a CSV file to save the results. If
                                  omitted, results are printed to stdout. If a
                                  directory is given, the file 'codec.csv'
                                  will be created inside it.
  -l, --level [TextRegion|TextLine|Word|Glyph]
                                  PageXML level from which to extract text.
                                  [default: TextLine]
  -i, --index INTEGER             Only consider TextEquiv elements with the
                                  specified index.
  -w, --remove-whitespace         Remove all whitespace characters before
                                  analyzing text.
  -f, --frequencies               Also output character frequencies.
  -n, --normalize [NFC|NFD|NFKC|NFKD]
                                  Normalize unicode before analyzing text.
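
To make the options above concrete, the following is a minimal standard-library sketch of what a character-set analysis with optional normalization, whitespace removal, and frequencies could look like. It is not PyPXML's implementation or API; the helper name `character_frequencies` and the sample strings are assumptions made for illustration only.

```python
import unicodedata
from collections import Counter


def character_frequencies(lines, normalize=None, remove_whitespace=False):
    """Count characters across text lines (hypothetical helper, not part of PyPXML)."""
    counts = Counter()
    for line in lines:
        if normalize:
            # normalize is one of "NFC", "NFD", "NFKC", "NFKD"
            line = unicodedata.normalize(normalize, line)
        if remove_whitespace:
            line = "".join(ch for ch in line if not ch.isspace())
        counts.update(line)  # a string is counted character by character
    return counts


# The same word, once precomposed and once with a combining diaeresis.
sample = ["W\u00fcrzburg", "Wu\u0308rzburg "]

freq = character_frequencies(sample, normalize="NFC", remove_whitespace=True)
codec = sorted(freq)  # the character set, without frequencies

raw = character_frequencies(sample)  # no normalization, whitespace kept
```

Without normalization, the combining mark and the trailing space appear as separate entries in `raw`; with `NFC` and whitespace removal, both spellings collapse into the same seven characters.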

get-regions

$ pypxml get-regions --help
Usage: pypxml get-regions [OPTIONS] FILES...

  Analyzes PageXML files and lists the region types found.

  Optionally includes subtypes, outputs frequencies, and group by file,
  directory, or globally.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output PATH               CSV file or directory where the results are
                                  saved. If a directory is given, the file
                                  'regions.csv' will be created inside it. If
                                  omitted, results are printed to stdout.
  -l, --level [total|directory|file]
                                  Set the aggregation level for the output.
                                  'total' combines all files, 'directory'
                                  aggregates by parent directory, and 'file'
                                  lists results per individual file.
                                  [default: total]
  -f, --frequencies               Also output the frequency (count) of each
                                  region type.
  -t, --types                     Include subtypes by printing them as
                                  'PageType.type' if available.
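
As a rough illustration of the listing described above, the sketch below counts region elements in an inline PageXML fragment with `xml.etree.ElementTree`. It does not use PyPXML's API; the function `region_counts`, the rule "any element whose name ends in `Region`", and the abbreviated sample document (which omits required elements such as `Coords`) are assumptions for demonstration.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Abbreviated, hypothetical PageXML fragment (not schema-complete).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="paragraph"/>
    <TextRegion id="r2" type="heading"/>
    <TextRegion id="r3" type="paragraph"/>
    <ImageRegion id="r4"/>
  </Page>
</PcGts>"""


def region_counts(xml_text, with_subtypes=False):
    """Count region elements, optionally as 'PageType.type' (illustrative only)."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        name = element.tag.rsplit("}", 1)[-1]  # drop the "{namespace}" prefix
        if not name.endswith("Region"):
            continue
        subtype = element.get("type")
        if with_subtypes and subtype:
            name = f"{name}.{subtype}"
        counts[name] += 1
    return counts


totals = region_counts(SAMPLE)
detailed = region_counts(SAMPLE, with_subtypes=True)
```

Here `totals` corresponds to a `--frequencies` style listing, and `detailed` to the same listing with subtypes included.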

get-custom

$ pypxml get-custom --help
Usage: pypxml get-custom [OPTIONS] FILES...

  Analyzes PageXML files and lists the custom region types found.

  Optionally outputs frequencies and group by file, directory, or globally.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output PATH               CSV file or directory where the results are
                                  saved. If a directory is given, the file
                                  'customs.csv' will be created inside it. If
                                  omitted, results are printed to stdout.
  -l, --level [total|directory|file]
                                  Set the aggregation level for the output.
                                  'total' combines all files, 'directory'
                                  aggregates by parent directory, and 'file'
                                  lists results per individual file.
                                  [default: total]
  -f, --frequencies               Also output the frequency (count) of each
                                  custom attribute.

get-text

$ pypxml get-text --help
Usage: pypxml get-text [OPTIONS] FILES...

  Extract text from PageXML files at the TextLine level.

  Outputs to individual text files, a single file, or prints to the console,
  with optional separators between regions and pages.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output PATH            Output destination. If a directory is
                               specified, a separate text file is created for
                               each PageXML file, ignoring the page separator.
                               If a file is specified, the text from all files
                               is concatenated into that file. If omitted, the
                               text is printed to stdout.
  -i, --index INTEGER          Use only the text from TextEquiv elements at
                               the given index.
  -r, --region-separator TEXT  Separator string inserted between regions. Use
                               "" for an empty line, "\n" for two empty lines,
                               ...
  -p, --page-separator TEXT    Separator string inserted between pages when
                               outputting to a single file or stdout. Ignored
                               when outputting multiple files. Use "" for an
                               empty line, "\n" for two empty lines, ...
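
The separator behaviour can be pictured with a small standard-library sketch. It is not PyPXML's code: the function `page_text`, the sample document, and the choice to wrap the separator in newlines (inferred from the help wording that `""` yields one empty line) are assumptions.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Abbreviated, hypothetical PageXML fragment (not schema-complete).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>First line</Unicode></TextEquiv></TextLine>
      <TextLine id="l2"><TextEquiv><Unicode>Second line</Unicode></TextEquiv></TextLine>
    </TextRegion>
    <TextRegion id="r2">
      <TextLine id="l3"><TextEquiv><Unicode>Third line</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def page_text(xml_text, region_separator=None):
    """Join TextLine text per region (illustrative only, not PyPXML's API)."""
    root = ET.fromstring(xml_text)
    regions = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        lines = [
            line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            for line in region.iterfind("pc:TextLine", NS)
        ]
        regions.append("\n".join(lines))
    if region_separator is None:
        return "\n".join(regions)
    # Assumption: the separator sits on its own line between two regions.
    return f"\n{region_separator}\n".join(regions)


plain = page_text(SAMPLE)
spaced = page_text(SAMPLE, region_separator="")
ruled = page_text(SAMPLE, region_separator="---")
```

With an empty separator, the two regions in `spaced` are divided by a single empty line; `ruled` places `---` on its own line between them.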

regularize

regularize-codec

$ pypxml regularize-codec --help
Usage: pypxml regularize-codec [OPTIONS] FILES...

  Apply character replacement rules to text elements in PageXML files.

  Supports selecting PlainText or Unicode elements and limiting replacements
  to specific element levels.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output DIRECTORY          Directory to save the modified PageXML
                                  files. If omitted, input files will be
                                  overwritten.
  -i, --index INTEGER             Use only TextEquiv elements with the
                                  specified index. Defaults to all TextEquiv
                                  elements if not set.
  -l, --level [TextRegion|TextLine|Word|Glyph]
                                  PageXML element level to process.  [default:
                                  TextLine]
  --plaintext / --unicode         Select the text element to use.Choose from
                                  PlainText (without formatting) or Unicode
                                  (formatted).  [default: unicode]
  -r, --rule TEXT...              Define substring replacement rules. Each
                                  rule is a pair of strings: '--rule SOURCE
                                  TARGET'. Multiple rules can be specified by
                                  repeating the option.  [required]
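
The rule semantics (ordered SOURCE/TARGET substring pairs applied to text at one element level) can be sketched as follows. This uses only the standard library rather than PyPXML itself; `apply_rules`, `regularize_lines`, the sample text, and the two rules are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Abbreviated, hypothetical PageXML fragment (not schema-complete).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <TextLine id="l1"><TextEquiv><Unicode>Wa\u017f\u017fer und Kaese</Unicode></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def apply_rules(text, rules):
    """Apply (source, target) substring replacements in the given order."""
    for source, target in rules:
        text = text.replace(source, target)
    return text


def regularize_lines(root, rules):
    """Rewrite the Unicode text of every TextLine in place (illustrative only)."""
    for node in root.iterfind(".//pc:TextLine/pc:TextEquiv/pc:Unicode", NS):
        node.text = apply_rules(node.text or "", rules)


# Two example rules: long s -> s, "ae" -> a-umlaut.
RULES = [("\u017f", "s"), ("ae", "\u00e4")]

root = ET.fromstring(SAMPLE)
regularize_lines(root, RULES)
result = root.findtext(".//pc:Unicode", namespaces=NS)
```

Because the rules run in sequence, a later rule sees the output of an earlier one, so the order in which rules are given can change the result.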

regularize-regions

$ pypxml regularize-regions --help
Usage: pypxml regularize-regions [OPTIONS] FILES...

  This tool processes PageXML files and updates or removes regions based on
  specified rules.

  Regions are matched by their PageType and optional subtype. Regions matching
  the source specification are either updated to a new type or deleted if
  target is set to 'None'.

  FILES: List of PageXML file paths to process. Accepts individual files, glob
  wildcards, or directories.

Options:
  -o, --output DIRECTORY  Directory to save the modified PageXML files. If
                          omitted, input files will be overwritten.
  -r, --rule TEXT...      Define rules for region regularization. Format:
                          --rule SOURCE TARGET where SOURCE is the original
                          region type (e.g., TextRegion.paragraph,
                          ImageRegion), and TARGET is the new region type. Use
                          an 'None' TARGET to delete the region. Only region
                          PageTypes are allowed. Multiple rules can be
                          specified by repeating this option.  [required]
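
A minimal sketch of the matching logic described above, written against `xml.etree.ElementTree` rather than PyPXML's API. The helpers `split_spec` and `regularize_regions` are hypothetical, and two details are assumptions: a SOURCE without a subtype matches regions of any subtype, and a TARGET without a subtype drops the existing `type` attribute.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical PageXML fragment (not schema-complete).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.png" imageWidth="100" imageHeight="100">
    <TextRegion id="r1" type="heading"/>
    <TextRegion id="r2" type="paragraph"/>
    <ImageRegion id="r3"/>
  </Page>
</PcGts>"""


def split_spec(spec):
    """Split 'TextRegion.paragraph' into ('TextRegion', 'paragraph')."""
    name, _, subtype = spec.partition(".")
    return name, subtype or None


def regularize_regions(root, rules):
    """Update or delete regions in place; the first matching rule wins."""
    parents = {child: parent for parent in root.iter() for child in parent}
    for element in list(root.iter()):  # snapshot, since elements may be removed
        name = element.tag.rsplit("}", 1)[-1]
        prefix = element.tag[: len(element.tag) - len(name)]  # "{namespace}" or ""
        if not name.endswith("Region"):
            continue
        for source, target in rules:
            source_name, source_type = split_spec(source)
            if name != source_name:
                continue
            if source_type is not None and element.get("type") != source_type:
                continue
            if target == "None":
                parents[element].remove(element)
            else:
                target_name, target_type = split_spec(target)
                element.tag = prefix + target_name
                if target_type is not None:
                    element.set("type", target_type)
                else:
                    element.attrib.pop("type", None)  # assumption, see lead-in
            break


RULES = [("TextRegion.heading", "TextRegion.paragraph"), ("ImageRegion", "None")]

root = ET.fromstring(SAMPLE)
regularize_regions(root, RULES)
remaining = [
    (el.tag.rsplit("}", 1)[-1], el.get("type"))
    for el in root.iter()
    if el.tag.endswith("Region")
]
```

After the call, the heading has become a paragraph and the image region is gone, which mirrors `--rule TextRegion.heading TextRegion.paragraph --rule ImageRegion None`.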

ZPD

Developed at the Centre for Philology and Digitality (ZPD), University of Würzburg.
