And You Have The Milk

@EliotJones EliotJones released this 03 Aug 15:16
· 1270 commits to master since this release

This release focuses primarily on bug fixes that improve the stability of text extraction. The main new features are full support for encrypted documents, Document Layout Analysis tools, and early-access path information.

  • Fix a bug where the Letters collection on every word was empty when using DefaultWordExtractor.
  • Support UTF-16 encoded strings in document content, such as document information dictionaries, and in HexToken-based strings.
  • Support all forms of document encryption up to and including revision 6 in the PDF 2.0 specification.
  • Prevent crashes when a PDF contains circular object references.
  • The new DocumentLayoutAnalysis namespace supports nearest-neighbour word extraction and recursive X-Y cut document segmentation. RecursiveXYCut.GetBlocks implements the Recursive X-Y cut algorithm (https://en.wikipedia.org/wiki/Recursive_X-Y_cut). NearestNeighbourWordExtractor can be passed to Page.GetWords to use a different word extraction technique.
  • Fix a bug where some letters had a width or height of zero.
  • More tolerant search for cross-reference offsets: if a cross-reference offset is incorrect, we search for the corresponding object instead.
  • Handle a case where CidFonts contained hex rather than string tokens for registry-ordering-supplement information.
  • Support cross-reference tables even if they appear after the first %%EOF end-of-file marker.
  • Support rotated pages. Page now contains a Rotation property indicating if the page is rotated at the top level. Valid values for rotation are 0, 90, 180 and 270. The currently reported PageSize does not take rotation into account yet. This also adds support for properly rotating letters and page content.
  • Change the internal letter point size calculation. Page.ExperimentalAccess.GetPointSize(Letter letter) now reports the point size using an updated calculation that handles rotated letters.
  • Map character codes directly to ASCII character values where there's no corresponding Unicode value. This matches PDFBox 1.8/9 behaviour where if no Unicode value can be found, the integer value is mapped directly to a character.
  • Expose PdfPath information from the page's content stream. This gives early access to path and geometry information parsed from the page's content. Use Page.ExperimentalAccess.Paths to access the lines, rectangles, curves, etc. declared by the page.
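
For readers unfamiliar with the recursive X-Y cut segmentation mentioned in the list above, the following is a minimal Python sketch of the general idea. It is not the library's RecursiveXYCut implementation. The box layout (left, bottom, right, top), the function names xy_cut and _find_gap, and the min_gap parameter are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical box layout for this sketch: (left, bottom, right, top).
Box = Tuple[float, float, float, float]


def _find_gap(boxes: List[Box], lo: int, hi: int, min_gap: float) -> Optional[float]:
    """Return a coordinate in the middle of the widest empty gap along one axis.

    lo/hi are the tuple indices of the low and high edge on that axis.
    Returns None when no gap is at least min_gap wide.
    """
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_width, best_split = 0.0, None
    reach = ordered[0][hi]  # furthest extent covered by the boxes seen so far
    for box in ordered[1:]:
        gap = box[lo] - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_split = gap, reach + gap / 2
        reach = max(reach, box[hi])
    return best_split


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[List[Box]]:
    """Recursively split boxes into blocks separated by wide empty gaps."""
    if len(boxes) <= 1:
        return [boxes] if boxes else []
    # Try a cut along the x axis first (between columns), then along the y axis.
    for lo, hi in ((0, 2), (1, 3)):
        split = _find_gap(boxes, lo, hi, min_gap)
        if split is not None:
            first = [b for b in boxes if b[hi] <= split]
            second = [b for b in boxes if b[lo] >= split]
            return xy_cut(first, min_gap) + xy_cut(second, min_gap)
    return [boxes]  # no gap is wide enough: this group is one block


# Two columns of two lines each; the column gap is 20, the line gap is 10.
words = [(0, 0, 40, 10), (0, 20, 40, 30), (60, 0, 100, 10), (60, 20, 100, 30)]
blocks = xy_cut(words, min_gap=15)
```

With min_gap=15 only the column gap qualifies, so the sketch yields one block per column. The sketch only groups boxes and makes no attempt to put blocks in reading order.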
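
The UTF-16 item in the list above concerns how PDF text strings are decoded. As a rough illustration in Python (not the library's code): a PDF text string that starts with the byte order mark FE FF is UTF-16BE, and a hex string is just another way of writing those bytes. The function name decode_hex_string is hypothetical, and falling back to Latin-1 is a simplification that stands in for PDFDocEncoding.

```python
def decode_hex_string(token: str) -> str:
    """Decode a PDF hex string token such as "<FEFF00480069>" into text."""
    # Drop the angle brackets and any whitespace inside the token.
    digits = "".join(token.strip().strip("<>").split())
    # A hex string with an odd number of digits is padded with a final zero.
    if len(digits) % 2:
        digits += "0"
    raw = bytes.fromhex(digits)
    if raw.startswith(b"\xfe\xff"):
        # The byte order mark signals UTF-16BE; skip it before decoding.
        return raw[2:].decode("utf-16-be")
    # Simplifying assumption: treat everything else as Latin-1 rather than
    # implementing the full PDFDocEncoding table.
    return raw.decode("latin-1")


title = decode_hex_string("<FEFF00480069>")
```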
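
The circular object reference fix in the list above can be pictured with a small Python sketch. These notes do not describe how the library detects cycles, so this shows only one common approach: remember which object numbers have been visited while following indirect references. The Ref class, the resolve function, and the dictionary used as an object table are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ref:
    """Hypothetical stand-in for an indirect reference to an object number."""
    number: int


def resolve(objects: dict, start: Ref):
    """Follow indirect references until a direct value is reached.

    Returns None when the chain loops back on itself or points at a
    missing object, instead of recursing forever.
    """
    seen = set()
    current = start
    while isinstance(current, Ref):
        if current.number in seen:
            return None  # cycle detected
        seen.add(current.number)
        current = objects.get(current.number)
    return current


ok = resolve({1: Ref(2), 2: "value"}, Ref(1))
looped = resolve({1: Ref(2), 2: Ref(1)}, Ref(1))
```

Returning None here is a design choice for the sketch; a parser could equally raise an error or log a warning and continue.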