PDF Data Extraction using Python and Selenium

Author: Ayush Agarwal (@thisisayush), Letstream

License: GNU General Public License V3

Features

Extracts data from a PDF by loading the PDF in a browser and then extracting the data from the loaded page.

To run in headless mode, install any one of the backends listed at https://pyvirtualdisplay.readthedocs.io/en/latest/
A webdriver is required: Geckodriver (for Firefox) or Chromedriver (for Chrome).
Tested in headless mode with the xvfb backend (sudo apt install xvfb) and Chrome/Firefox on Ubuntu Server 18.04.
Tested in non-headless mode on Windows 10 with Chrome/Firefox.
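
Before running in headless mode, it can help to confirm that a display backend is actually installed. The following is a minimal sketch, not part of this package: the helper name available_backends is hypothetical, and it assumes the backends are detected by the X server executables Xvfb, Xephyr and Xvnc being present on PATH.

```python
import shutil

# Executable names of the X servers that PyVirtualDisplay can drive
# (assumption: a backend counts as installed if its binary is on PATH).
BACKEND_BINARIES = {"xvfb": "Xvfb", "xephyr": "Xephyr", "xvnc": "Xvnc"}


def available_backends(which=shutil.which):
    """Return the names of the backends whose binary is found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [name for name, binary in BACKEND_BINARIES.items() if which(binary)]


if __name__ == "__main__":
    found = available_backends()
    if found:
        print("Headless backends found:", ", ".join(found))
    else:
        print("No backend found; on Ubuntu try: sudo apt install xvfb")
```

If the list comes back empty, headless mode is unlikely to work until one of the backends is installed.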

Import the Extractor class and initialise it with the following parameters:

browser: Required. Must be one of Extractor.CHROME or Extractor.FIREFOX.
executable_path: Required. Must be the complete absolute path to the corresponding webdriver executable (chromedriver or geckodriver).
headless: Optional, defaults to True. Whether to run inside a PyVirtualDisplay (True) or in normal mode with a visible browser window (False).
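
The parameter rules above can be checked before the Extractor is created. This is a minimal sketch under stated assumptions, not the package's own code: the helper extractor_kwargs is hypothetical, and the CHROME and FIREFOX values below are stand-ins for Extractor.CHROME and Extractor.FIREFOX, whose real values are defined by the library.

```python
import os

# Stand-ins for Extractor.CHROME and Extractor.FIREFOX (assumed values;
# use the library's own constants in real code).
CHROME = "chrome"
FIREFOX = "firefox"


def extractor_kwargs(browser, executable_path, headless=True):
    """Validate the three documented parameters and return them as a dict."""
    if browser not in (CHROME, FIREFOX):
        raise ValueError("browser must be CHROME or FIREFOX, got %r" % (browser,))
    if not os.path.isabs(executable_path):
        raise ValueError("executable_path must be an absolute path, got %r"
                         % (executable_path,))
    return {
        "browser": browser,
        "executable_path": executable_path,
        "headless": bool(headless),
    }


if __name__ == "__main__":
    # Hypothetical driver location; abspath makes it absolute on any OS.
    driver = os.path.abspath(os.path.join("drivers", "chromedriver"))
    print(extractor_kwargs(CHROME, driver, headless=False))
```

Assuming the constructor accepts these names as keyword arguments, the returned dict could then be passed on as Extractor(**kwargs), with the library's constants substituted for the stand-ins.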
