PLAYA-PDF 0.6.0: Structure and text improvements

@dhdaines released this 13 Jun 16:52
· 81 commits to main since this release

What's Changed

  • Iterate over Form XObjects inside Form XObjects with .xobjects by @dhdaines in #126
  • Correct bbox on non-diagonal Type3 FontMatrix by @dhdaines in #127
  • Fixes and improvements to text extraction, marked content and logical structure by @dhdaines in #124
  • Add displacement property to text objects by @dhdaines in #129
  • Allow iteration over Type3 font programs by @dhdaines in #130
  • Extract images as PNMs if possible by @dhdaines in #131
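
For readers unfamiliar with PNM (the Netpbm family: PBM, PGM, PPM), the sketch below shows what the binary grayscale (P5) and RGB (P6) variants look like on disk: a short ASCII header followed by raw samples. This only illustrates the file format, assuming 8-bit samples; it is not PLAYA's extraction code, and the function name encode_pnm is hypothetical.

```python
def encode_pnm(width: int, height: int, samples: bytes, channels: int = 1) -> bytes:
    """Encode raw 8-bit samples as binary PGM (P5) or PPM (P6).

    Hypothetical helper for illustration only; not part of PLAYA.
    """
    if channels not in (1, 3):
        raise ValueError("only grayscale (1) or RGB (3) are sketched here")
    if len(samples) != width * height * channels:
        raise ValueError("sample buffer does not match the given dimensions")
    # Magic number selects the variant: P5 = grayscale, P6 = RGB.
    magic = b"P5" if channels == 1 else b"P6"
    # Header is ASCII: magic, width and height, then the maximum sample value.
    header = magic + b"\n%d %d\n255\n" % (width, height)
    # Raw samples follow the header directly, row by row, with no padding.
    return header + samples


# A 2x2 grayscale image: black, dark gray, light gray, white.
pgm = encode_pnm(2, 2, bytes([0, 85, 170, 255]))
```

Because the header is plain text and the payload is uncompressed, such files can be written without an imaging library, which is presumably why the format is convenient for images that are not already JPEG.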

Notes from CHANGELOG.md

  • Add structure to Page to access structure elements indexed by
    marked content IDs (convenience wrapper over the parent tree)
  • Add structure to XObjectObject for the same reason
  • Add parent to all ContentObject to access parent structure
    element (if any) via the parent tree
  • Descend into Form XObjects in Page.xobjects
  • Improve text extraction for rotated pages
  • Improve text extraction for tagged PDFs
  • Correct displacement and bbox for Type3 fonts with non-diagonal
    FontMatrix
  • Add displacement property to TextObject
  • Add functioning __iter__ to GlyphObject in the case of
    Type3 fonts, which works like XObjectObject
  • Extract non-JPEG images as PNM
  • BREAKING: Fix __len__ on PathObject, which incorrectly returned
    non-zero even though iteration is not possible
  • BREAKING: Remove misleading char_width, get_descent, and
    get_ascent methods and hscale and vscale properties from font
    objects
  • BREAKING: Do not guess basename for Type3 fonts (generally it
    isn't different from fontname for other subset fonts)
  • BREAKING: Element.contents contains both structure.ContentItem
    and structure.ContentObject
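
To illustrate why a non-diagonal FontMatrix matters for the bbox and displacement items above, here is a minimal geometric sketch. It assumes the usual PDF matrix convention [a b c d e f] and is not taken from PLAYA's source; the names apply_matrix, transform_bbox and transform_displacement are hypothetical. When b or c is non-zero (rotation or skew), scaling the box by a and d alone gives the wrong result, so all four corners have to be transformed.

```python
from typing import Tuple

Matrix = Tuple[float, float, float, float, float, float]
Rect = Tuple[float, float, float, float]


def apply_matrix(m: Matrix, x: float, y: float) -> Tuple[float, float]:
    """Transform a point by a PDF-style matrix [a b c d e f]."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)


def transform_bbox(m: Matrix, bbox: Rect) -> Rect:
    """Axis-aligned bounding box of a rectangle after transformation."""
    x0, y0, x1, y1 = bbox
    # Transform all four corners; two are not enough under rotation or skew.
    corners = [apply_matrix(m, x, y) for x in (x0, x1) for y in (y0, y1)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return (min(xs), min(ys), max(xs), max(ys))


def transform_displacement(m: Matrix, dx: float, dy: float) -> Tuple[float, float]:
    """Transform a vector: same as a point, but without the translation."""
    a, b, c, d, _, _ = m
    return (a * dx + c * dy, b * dx + d * dy)


# A matrix that rotates by 90 degrees and scales by 0.5 (a and d are zero).
rotated = (0.0, 0.5, -0.5, 0.0, 0.0, 0.0)
box = transform_bbox(rotated, (0.0, 0.0, 4.0, 2.0))
disp = transform_displacement(rotated, 4.0, 0.0)
```

With this matrix, a glyph whose advance is horizontal in glyph space ends up with a vertical displacement, and a diagonal-only calculation (using just a and d) would collapse the box to zero size.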

Full Changelog: v0.5.1...v0.6.0