And You Have The Milk
This release focuses primarily on bug fixes that improve the stability of text extraction. The main new features are full support for encrypted documents, Document Layout Analysis tools, and early-access path information.
- Fix a bug using `DefaultWordExtractor` where the `Letters` collection on all words would be empty.
- Supports UTF-16 encoded strings in document content, such as document information dictionaries, and in `HexToken` based strings.
- Supports all forms of document encryption up to and including revision 6 in PDF 2.0 spec.
- Prevents crashes where PDF contains circular object references.
- The new `DocumentLayoutAnalysis` namespace supports nearest-neighbour word extraction and recursive X-Y cut document segmentation. `RecursiveXYCut.GetBlocks` implements the Recursive X-Y cut algorithm (https://en.wikipedia.org/wiki/Recursive_X-Y_cut). `NearestNeighbourWordExtractor` can be provided to `Page.GetWords` for a different word extraction technique.
- Fix bug where some letters had a width or height of zero.
- More tolerant search for cross-reference offsets: if the cross-reference offsets are incorrect, we search for the corresponding object.
- Handle a case where CidFonts contained hex rather than string tokens for registry-ordering-supplement information.
- Support cross-reference tables even if they appear after the first `%%EOF` end of file marker.
- Support rotated pages. `Page` now contains a `Rotation` property indicating if the page is rotated at the top level. Valid values for rotation are 0, 90, 180 and 270. The currently reported `PageSize` does not take rotation into account yet. This also adds support for properly rotating letters and page content.
- Change internal letter point size calculation, `Page.ExperimentalAccess.GetPointSize(Letter letter)` now reports the point size with an updated calculation which handles rotated letters.
- Map character codes directly to ASCII character values where there's no corresponding Unicode value. This matches PDFBox 1.8/9 behaviour where if no Unicode value can be found, the integer value is mapped directly to a character.
- Expose `PdfPath` information from the page's content stream. Early access to path/geometry information parsed from the page's content. Use `Page.ExperimentalAccess.Paths` to access lines, rectangles, curves, etc. declared by the page.
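
For readers unfamiliar with the recursive X-Y cut segmentation mentioned in the `DocumentLayoutAnalysis` item, the following is a minimal, library-independent sketch of the general idea in Python. It is not the library's implementation: the names `recursive_xy_cut`, `Box` and `min_gap` are hypothetical, word bounding boxes are assumed to be `(left, bottom, right, top)` tuples, and the sketch simply cuts at the widest empty gap in the projection onto one axis, trying the x axis before the y axis.

```python
from typing import List, Optional, Tuple

# Hypothetical box type for this sketch: (left, bottom, right, top).
Box = Tuple[float, float, float, float]


def _find_gap(boxes: List[Box], lo: int, hi: int, min_gap: float) -> Optional[float]:
    """Project boxes onto one axis and return the midpoint of the widest
    empty gap that is strictly larger than min_gap, or None if there is none.
    lo/hi are the tuple indices of the box's start and end on that axis."""
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_width, best_cut = min_gap, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best_width:
            best_width, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_cut


def recursive_xy_cut(boxes: List[Box], min_gap: float = 1.0) -> List[List[Box]]:
    """Split boxes into blocks by cutting at empty gaps, recursively."""
    if len(boxes) <= 1:
        return [list(boxes)] if boxes else []
    # Try a vertical cut (gap along x) first, then a horizontal cut (gap along y).
    for lo, hi in ((0, 2), (1, 3)):
        cut = _find_gap(boxes, lo, hi, min_gap)
        if cut is not None:
            first = [b for b in boxes if b[hi] <= cut]
            second = [b for b in boxes if b[lo] >= cut]
            return recursive_xy_cut(first, min_gap) + recursive_xy_cut(second, min_gap)
    return [list(boxes)]  # no gap wide enough on either axis: this is a leaf block


# A full-width title above two columns of two closely spaced lines each.
title = (0, 200, 300, 220)
left1, left2 = (0, 150, 140, 160), (0, 138, 140, 148)
right1, right2 = (160, 150, 300, 160), (160, 138, 300, 148)

blocks = recursive_xy_cut([title, left1, left2, right1, right2], min_gap=5.0)
print(len(blocks))  # 3 blocks: left column, right column, title
```

In this sketch `min_gap` controls how aggressively the page is split: the 2-unit gap between lines in a column is below the threshold, so each column stays together as one block.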