
Text extraction from PDF files

@sambitdash sambitdash released this 09 Sep 03:39

This release provides the following functionality:

  1. Provides a text extraction API, pdPageExtractText(page).
  2. Supports Unicode extraction from the font encoding as well as from a Unicode CMap. It does not read the internal encoding embedded in the font file.
  3. Supports Adobe’s encoding for Latin fonts (AdobeGlyphList), as well as the Symbol and ZapfDingbats encodings.
  4. Does not do any special handling for tagged PDFs, but tagged PDFs may behave better because the creation order and reading order of document objects are similar.
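
To make items 2 and 3 more concrete, here is a minimal, language-neutral sketch in Python of how a character code might be resolved to Unicode. It is not the library's implementation. The names `AGL_EXCERPT`, `glyph_name_to_unicode`, and `decode_codes` are hypothetical, the glyph table is a five-entry excerpt of the Adobe Glyph List, and the lookup order (Unicode CMap first, then the font encoding's glyph name) is an assumption based on the usual PDF text extraction convention.

```python
# Hypothetical illustration only; none of these names come from the library.

# Small excerpt of the Adobe Glyph List (glyph name -> Unicode string).
# The real list contains several thousand entries.
AGL_EXCERPT = {
    "A": "\u0041",
    "space": "\u0020",
    "eacute": "\u00E9",
    "fi": "\uFB01",
    "bullet": "\u2022",
}

REPLACEMENT = "\uFFFD"  # emitted when a code cannot be resolved


def glyph_name_to_unicode(name):
    """Map a glyph name to Unicode using the excerpt above.

    Also handles the simplest 'uniXXXX' form (exactly four hex digits).
    """
    if name in AGL_EXCERPT:
        return AGL_EXCERPT[name]
    if name.startswith("uni") and len(name) == 7:
        try:
            return chr(int(name[3:], 16))
        except ValueError:
            pass
    return REPLACEMENT


def decode_codes(codes, to_unicode=None, encoding=None):
    """Resolve character codes to a Unicode string.

    to_unicode: dict of code -> Unicode string (stands in for a Unicode CMap).
    encoding:   dict of code -> glyph name (stands in for the font encoding).
    Assumed order: CMap entry first, then the encoding's glyph name.
    """
    to_unicode = to_unicode or {}
    encoding = encoding or {}
    out = []
    for code in codes:
        if code in to_unicode:
            out.append(to_unicode[code])
        elif code in encoding:
            out.append(glyph_name_to_unicode(encoding[code]))
        else:
            out.append(REPLACEMENT)
    return "".join(out)


if __name__ == "__main__":
    cmap = {0x02: "\u00E9"}
    enc = {0x41: "A", 0x02: "bullet", 0x03: "fi", 0x04: "uni20AC"}
    print(decode_codes([0x41, 0x02, 0x03, 0x04], cmap, enc))
```

In this sketch a code that appears in both tables takes its value from the CMap, which is why code 0x02 decodes to "é" rather than the bullet named by the encoding. Glyphs whose meaning is recorded only inside the embedded font program would fall through to the replacement character, which mirrors the limitation noted in item 2.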