Releases: sambitdash/PDFIO.jl
New pdPageExtractText Method
Changes this release:
- A new pdPageExtractText method is introduced, which produces a cleaner text conversion for complex PDFs, including non-tagged PDFs.
- Bug fixes
Text conversions carried out on 25,000+ files.
Text extraction from PDF files
This release provides the following functionality:
- Has a text extraction API: pdPageExtractText(page)
- Supports Unicode extraction from the font encoding as well as from the Unicode CMap. It does not read the internal encoding embedded in the font file.
- Supports Adobe’s encoding for Latin fonts (AdobeGlyphList). Symbol and ZapfDingbats encodings are supported as well.
- Does not do any special handling for tagged PDFs, but tagged PDFs may behave better because the creation order and reading order of their document objects are similar.
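
As a rough illustration of the text extraction API listed above, the following sketch collects the text of every page into one string. It assumes the calling convention shown in the PDFIO README, where pdPageExtractText takes an IO sink as its first argument (the note above abbreviates this to pdPageExtractText(page)); the signature may differ between releases. The helper name extract_pdf_text and the file name in the usage line are hypothetical and not part of PDFIO.

```julia
using PDFIO

# Hypothetical helper (not part of PDFIO): return the text of all pages
# of the PDF at `src` as a single String.
function extract_pdf_text(src::AbstractString)
    doc = pdDocOpen(src)
    try
        io = IOBuffer()
        for i in 1:pdDocGetPageCount(doc)
            page = pdDocGetPage(doc, i)
            # Assumption: the two-argument form that writes the page text
            # into an IO sink, as in the PDFIO README.
            pdPageExtractText(io, page)
        end
        return String(take!(io))
    finally
        # Close the document even if extraction fails part-way through.
        pdDocClose(doc)
    end
end
```

A call such as `text = extract_pdf_text("sample.pdf")` would then hold the extracted text, with quality depending on the font encodings described above.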
PDFIO v0.0.6
- Implementation of PDF Common Data types
  - Text Strings
  - Date
  - Name Tree
  - Number Tree
- Page Labels
- File attachments and annotations supported as custom scripts
- Cleaner implementation of show and print methods of PDF Objects
- Inline API documentation in the REPL