Releases: dhdaines/playa
PLAYA-PDF 0.5.0: Breaking all the APIs again
There was a lot of rot and bug in various APIs, especially text and font related ones, and since ZeroVer and Reasons, it seemed like a good idea to get rid of that nonsense.
Changes from CHANGELOG.md
- Remove use of `object` in type annotations
- Add support for role map and standard structure types
- Refactor page.py as it was getting really unwieldy
- Add missing `ctm` to content objects in metadata API
- Somewhat improve untagged text extraction where the CTM is exotic
- Correct character and word spacing to apply after all glyphs
- Correct vertical writing to fully support glyph-specific position vectors, even totally absurd ones
- Correct horizontal scaling to apply to vertical writing, including the position vector
- Add `bbox` and `contents` to structure elements
- Add `origin` and `displacement` to glyphs
- Add `size` to glyphs and texts to get effective font size (still not entirely accurate when there is rotation or skewing)
- Support PDF 2.0 `Length` attribute on inline images
- Add `font` property to documents and pages
- BREAKING: `find` and `find_all` in structure search by standard structure types (roles)
- BREAKING: `parent_tree` moved to `playa.structure.Tree`
- BREAKING: `Point`, `Rect`, `Matrix` and `PDFObject` moved to `playa.pdftypes`
- BREAKING: `PathObject` no longer contains "subpaths", it is safe to recursively descend it now
- BREAKING: Content objects moved to `playa.content` and interpreter to `playa.interp`
- BREAKING: Text state no longer exists in the public API, text objects have immutable line matrix and glyph offset now, and everything else is in the graphic state
- BREAKING: `text_space_` properties are removed since what they returned was not actually text space (and maybe not useful either)
- BREAKING: `glyph_offset` is removed from glyphs and made private in text objects, as it is not in a well defined space.
- BREAKING: Glyph `bbox` now has a precise definition, which isn't exactly the glyph bounding box but is a lot closer. This means notably that adjacent glyphs may overlap or may not touch, which is why you should never use the `bbox` to detect word boundaries. Use `origin` and `displacement` instead, please!
- BREAKING: `cid2unicode` attribute of fonts is removed as it doesn't make any sense for Type3 or CID fonts.
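
The advice above about using `origin` and `displacement` rather than `bbox` for word boundaries can be sketched roughly as follows. This is only an illustration, not PLAYA's implementation: the `Glyph` dataclass is a hypothetical stand-in (it assumes `origin` and `displacement` are simple `(x, y)` pairs in the same space), and `split_words` and its `tolerance` parameter are invented for the example. The idea is that a glyph's origin plus its displacement gives the point where the next glyph is expected to start, so a jump away from that point suggests a break.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]


@dataclass
class Glyph:
    """Hypothetical stand-in for a glyph; NOT PLAYA's actual class."""
    text: str
    origin: Point        # where the glyph is placed
    displacement: Point  # how far the pen advances after drawing it


def split_words(glyphs: List[Glyph], tolerance: float = 1.0) -> List[str]:
    """Group glyphs into words by comparing each origin with the
    position predicted by the previous origin + displacement."""
    words: List[str] = []
    current = ""
    expected: Optional[Point] = None
    for g in glyphs:
        if expected is not None:
            dx = g.origin[0] - expected[0]
            dy = g.origin[1] - expected[1]
            # A gap larger than the tolerance starts a new word.
            if (dx * dx + dy * dy) ** 0.5 > tolerance and current:
                words.append(current)
                current = ""
        current += g.text
        expected = (g.origin[0] + g.displacement[0],
                    g.origin[1] + g.displacement[1])
    if current:
        words.append(current)
    return words
```

A fixed `tolerance` is a simplification; in practice one would probably scale it by the font size, and explicit space glyphs would need their own handling.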
What's Changed
- fix!: make type annotations much stricter by @dhdaines in #95
- feat!: Add support for role map and standard structure types by @dhdaines in #98
- Don't split PathObject into subpaths by @lambdalemon in #85
- XObjects inherit graphic state from surrounding by @lambdalemon in #96
- fix: correct ascent/descent for Type3 fonts by @dhdaines in #99
- refactor!: split playa.page into three modules by @dhdaines in #100
- refactor!: most of text state is just graphics state by @dhdaines in #101
- refactor!: drown text state in the bathtub by @dhdaines in #102
- Correct documentation and metadata for font, text, and glyph objects by @dhdaines in #105
- Fix text rendering matrix for GlyphObject by @lambdalemon in #107
- Correct glyph and text bboxes in vertical writing mode by @dhdaines in #110 (thanks @lambdalemon for a different version of this PR)
- Make benchmarks more useful by @dhdaines in #111
- feat!: Improve text extraction and add useful glyph and text properties by @dhdaines in #112
- Correct the handling of character and word spacing parameters by @dhdaines in #113
- feat: support PDF 2.0 inline images by @dhdaines in #115
Full Changelog: v0.4.3...v0.5.0
PLAYA-PDF 0.4.3: More bug fixes
- Correct ascent, descent, and glyph boxes for Type3 fonts
- Use ascent and descent (and not a single solitary text space unit, floating in a man's hat) to calculate glyph/text bbox height (thanks to @lambdalemon)
- XObjects inherit graphics state from surrounding content (by @lambdalemon)
Full Changelog: v0.4.2...v0.4.3
PLAYA-PDF 0.4.2: Bug fixes
What's Changed
- Correct `fontsize` and `scaling` in text state
- Correct `ValueError` on incorrect stream lengths for ASCII85 data
- Correct implicit font encodings for Type1 fonts
- Tolerate all sorts of illegal structure trees
- Allow accessing annotations and XObjects from structure tree
- Better encoding for SimpleFont by @lambdalemon in #82
- Improve error handling in font initialization by @dhdaines in #84
- Extra robustness for ascii85 and inline images by @dhdaines in #89
- fix: do not follow circular xobject references by @dhdaines in #90
- Fix a few annoyances in logical structure trees by @dhdaines in #74
- Fix bug in CFFFontProgram when using predefined encodings by @lambdalemon in #91
- Remove padding in AES encrypted strings by @dhdaines in #92
- Add the ability to access underlying objects in structure content objects by @dhdaines in #93
- Correct `asobj` for structure elements by @dhdaines in #94
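
For readers wondering what the AES padding fix in the list above is about: AES-encrypted PDF strings and streams are padded to a multiple of 16 bytes, with the final `n` bytes each holding the value `n`. A minimal sketch of stripping that padding is shown below. The function name `strip_padding` is hypothetical, and the choice to return malformed input unchanged rather than raise is an assumption made for the example, not a description of what PLAYA does.

```python
def strip_padding(data: bytes, block_size: int = 16) -> bytes:
    """Remove PKCS#5/PKCS#7-style padding from decrypted data.

    Malformed padding is tolerated: the input is returned unchanged
    (an assumption for this sketch; a stricter decoder could raise).
    """
    if not data:
        return data
    n = data[-1]  # the last byte states how many padding bytes there are
    if n < 1 or n > block_size or n > len(data):
        return data
    if data[-n:] != bytes([n]) * n:
        return data
    return data[:-n]
```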
New Contributors
- @lambdalemon made their first contribution in #82
Full Changelog: v0.4.1...v0.4.2
PLAYA-PDF 0.4.1: Minor but important cleanups
What's Changed
- Correct outlines in CLI
- Accept UTF-16LE in strings with BOM
- Speed up fallback xrefs in pathological PDFs
- Detect two PDFs in a trenchcoat
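
As an illustration of the UTF-16LE item above: PDF text strings are normally either UTF-16BE with a byte order mark or PDFDocEncoding, but some producers write little-endian UTF-16 with a BOM anyway. A tolerant decoder might look roughly like this sketch. The function name `decode_text_string` is hypothetical, and Latin-1 is used here only as a rough approximation of PDFDocEncoding, so this is not PLAYA's actual code.

```python
import codecs


def decode_text_string(data: bytes) -> str:
    """Decode a PDF text string, choosing the codec from its BOM."""
    if data.startswith(codecs.BOM_UTF16_BE):
        # The standard case: UTF-16BE with a leading FE FF.
        return data[2:].decode("utf-16-be", errors="replace")
    if data.startswith(codecs.BOM_UTF16_LE):
        # Not standard, but seen in the wild: FF FE then UTF-16LE.
        return data[2:].decode("utf-16-le", errors="replace")
    if data.startswith(codecs.BOM_UTF8):
        # PDF 2.0 also allows UTF-8 with a BOM.
        return data[3:].decode("utf-8", errors="replace")
    # Assumption: approximate PDFDocEncoding with Latin-1.
    return data.decode("latin-1")
```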
Full Changelog: v0.4.0...v0.4.1
PLAYA-PDF 0.4.0: More robustness and expanded CLI
What's Changed
- Export structured/typed metadata for use in CLI and clients by @dhdaines in #68
- Remove deprecated APIs for 0.4.0 (or maybe 1.0.0?) release by @dhdaines in #69
- Be extra robust to really broken PDFs in parsing
Full Changelog: v0.3.2...v0.4.0
PLAYA-PDF 0.3.2: Improved stability and bug fixes
PLAYA-PDF 0.3.1: Supporting some users
What's Changed
- feat: accept `bytes` as input (for async applications) by @dhdaines in #65
- Fix CTM in Form XObjects (and support pdfannots) by @dhdaines in #66
Full Changelog: v0.3.0...v0.3.1
PLAYA-PDF 0.3.0: Break all (well most of) the APIs!
What's Changed
- Remove deprecated APIs for upcoming PLAYA-PDF 0.3 series by @dhdaines in #52
- fix: accept empty name objects by @dhdaines in #54
- feat: extract text objects not text badly by @dhdaines in #53
- feat: support text extraction and make a benchmark by @dhdaines in #55
- feat!: make mcstack immutable to avoid surprises by @dhdaines in #57
- feat: Add backreferences to content objects by @dhdaines in #58
- Import and re-export a lot of types at top level by @dhdaines in #60
- Deprecate more APIs by @dhdaines in #59
- Lazy interface to logical structure tree by @dhdaines in #61
- feat: new APIs, flatten, extract_text, is_tagged by @dhdaines in #62
- New document outline and destination APIs by @dhdaines in #64
Full Changelog: v0.2.8...v0.3.0
PLAYA-PDF 0.2.10: Nope, more bugs to fix.
PLAYA 0.2.10: 2025-02-18
- Fix serious bug in rare ' and " text operators
- Fix robustness issues in structtree API
Full Changelog: v0.2.9...v0.2.10
PLAYA-PDF 0.2.9: Final (really) 0.2 release
What's Changed
- fix: Support the all-important empty name object
- feat!: Break the CLI again (ZeroVer YOLO) to better support page ranges
- feat: Support some limited and lossy text extraction in the CLI
- feat: Add necessary `.doc` property to page list
- fix: Correct type annotations for page list
Full Changelog: v0.2.8...v0.2.9