PLAYA-PDF 0.6.0: Structure and text improvements
What's Changed
- Iterate over Form XObjects inside Form XObjects with `.xobjects` by @dhdaines in #126
- Correct bbox on non-diagonal Type3 `FontMatrix` by @dhdaines in #127
- Fixes and improvements to text extraction, marked content and logical structure by @dhdaines in #124
- Add displacement property to text objects by @dhdaines in #129
- Allow iteration over Type3 font programs by @dhdaines in #130
- Extract images as PNMs if possible by @dhdaines in #131
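
To illustrate why a non-diagonal `FontMatrix` matters for bounding boxes (#127): when a matrix rotates or skews, scaling two opposite corners is not enough. All four corners have to be transformed before taking the extremes. The following is a minimal sketch of that idea, not PLAYA's actual code; `transform_bbox` and the type aliases are hypothetical names, and the matrix is assumed to use the usual PDF `[a b c d e f]` layout.

```python
from typing import Tuple

# Assumed PDF-style affine matrix [a b c d e f]:
#   x' = a*x + c*y + e
#   y' = b*x + d*y + f
Matrix = Tuple[float, float, float, float, float, float]
Rect = Tuple[float, float, float, float]


def transform_bbox(matrix: Matrix, bbox: Rect) -> Rect:
    """Return the axis-aligned box enclosing `bbox` after applying `matrix`.

    Hypothetical helper for illustration only.
    """
    a, b, c, d, e, f = matrix
    x0, y0, x1, y1 = bbox
    # Transform all four corners, since rotation or skew can move any of them
    # to the extreme position.
    corners = [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]
    xs = [a * x + c * y + e for x, y in corners]
    ys = [b * x + d * y + f for x, y in corners]
    return (min(xs), min(ys), max(xs), max(ys))
```

With a diagonal matrix this reduces to plain scaling; with a 90-degree rotation such as `(0, 1, -1, 0, 0, 0)` the width and height of the box swap.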
Notes from CHANGELOG.md
- Add `structure` to `Page` to access structure elements indexed by marked content IDs (convenience wrapper over the parent tree)
- Add `structure` to `XObjectObject` for the same reason
- Add `parent` to all `ContentObject` to access parent structure element (if any) via the parent tree
- Descend into Form XObjects in `Page.xobjects`
- Improve text extraction for rotated pages
- Improve text extraction for tagged PDFs
- Correct displacement and bbox for Type3 fonts with non-diagonal `FontMatrix`
- Add `displacement` property to `TextObject`
- Add functioning `__iter__` to `GlyphObject` in the case of Type3 fonts, which works like `XObjectObject`
- Extract non-JPEG images as PNM
- BREAKING: Fix `__len__` on `PathObject` which incorrectly returned non-zero even though iteration is not possible
- BREAKING: Remove misleading `char_width`, `get_descent`, and `get_ascent` methods and `hscale` and `vscale` properties from font objects
- BREAKING: Do not guess `basename` for Type3 fonts (generally it isn't different from `fontname` for other subset fonts)
- BREAKING: `Element.contents` contains both `structure.ContentItem` and `structure.ContentObject`
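
For readers unfamiliar with the PNM output mentioned above: the binary PNM variants are just a short text header followed by raw samples, which is what makes them a convenient target for already-decoded, non-JPEG image data. The sketch below shows the format itself and is not PLAYA's implementation; `pnm_bytes` is a hypothetical helper, and it assumes 8-bit samples in either one (grayscale) or three (RGB) components.

```python
def pnm_bytes(width: int, height: int, ncomponents: int, samples: bytes) -> bytes:
    """Wrap raw 8-bit samples in a binary PGM (P5) or PPM (P6) header.

    Hypothetical helper for illustration only.
    """
    if ncomponents == 1:
        magic = b"P5"  # binary grayscale (PGM)
    elif ncomponents == 3:
        magic = b"P6"  # binary RGB (PPM)
    else:
        raise ValueError("only 1 or 3 components are handled in this sketch")
    expected = width * height * ncomponents
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    # Header: magic number, dimensions, maximum sample value, then raw data.
    header = b"%s\n%d %d\n255\n" % (magic, width, height)
    return header + samples
```

Writing the returned bytes to a `.pgm` or `.ppm` file should give something most image viewers can open directly.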
Full Changelog: v0.5.1...v0.6.0