Text extraction from PDF files
This release provides the following functionality:
- Provides a text extraction API:
  pdPageExtractText(page)
- Supports extracting Unicode code points from the font encoding as well as from the Unicode CMap. It does not read the internal encoding embedded in the font file itself.
- Supports Adobe’s encoding for Latin fonts (AdobeGlyphList). Symbol and ZapfDingbats encodings are supported as well.
- Does not apply any special handling to tagged PDFs. Tagged PDFs may still give better results, because the creation order and reading order of their document objects tend to be similar.
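
The encoding-related items above can be illustrated with a small sketch. This is not the release's implementation. The function names, the dictionary-based data layout, and the choice to consult the Unicode CMap before the font encoding are all assumptions made for illustration, and the glyph table is a tiny excerpt of the Adobe Glyph List. The sketch resolves a character code through a CMap-style table when one is available, and otherwise maps the code to a glyph name and looks that name up in the glyph list.

```python
# Illustrative sketch only. All names here are hypothetical and are not
# part of the release's API.

# Small excerpt of the Adobe Glyph List: glyph name -> Unicode code point.
AGL_EXCERPT = {
    "A": 0x0041,
    "space": 0x0020,
    "eacute": 0x00E9,
    "fi": 0xFB01,
    "bullet": 0x2022,
}

HEX_DIGITS = set("0123456789ABCDEF")


def glyph_name_to_unicode(name):
    """Return the text for a glyph name, or None if the name is unknown."""
    if name in AGL_EXCERPT:
        return chr(AGL_EXCERPT[name])
    # Simplified handling of the 'uniXXXX' naming convention
    # (a single group of four uppercase hex digits only).
    if name.startswith("uni") and len(name) == 7 and set(name[3:]) <= HEX_DIGITS:
        return chr(int(name[3:], 16))
    return None


def decode_codes(codes, encoding, to_unicode=None):
    """Decode character codes to text.

    codes      -- iterable of integer character codes
    encoding   -- dict: character code -> glyph name (the font encoding)
    to_unicode -- optional dict: character code -> text (a Unicode CMap)
    """
    out = []
    for code in codes:
        text = None
        # Assumption: the CMap, when present, is consulted first.
        if to_unicode is not None:
            text = to_unicode.get(code)
        # Otherwise go through the font encoding and the glyph list.
        if text is None:
            name = encoding.get(code)
            if name is not None:
                text = glyph_name_to_unicode(name)
        # Unresolvable codes become U+FFFD (replacement character).
        out.append(text if text is not None else "\ufffd")
    return "".join(out)
```

For example, `decode_codes([65, 233], {65: "A", 233: "eacute"})` goes through the font encoding and the glyph list for both codes. Passing a `to_unicode` table lets a ligature code such as `fi` be returned as the two letters "fi" rather than the single ligature character.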