Merge pull request #900 from devikabhapkar/main

pawangeek · web-flow · commit 0fc7018b3387 · 2022-10-21T10:15:50.000+05:30
Extract text from a PDF
diff --git a/extract_text_from_pdf/README.md b/extract_text_from_pdf/README.md
@@ -0,0 +1,5 @@
+# Extract text from a PDF
+
+Python can extract text from PDFs using the PyPDF2 package. Extracting text from a PDF file is useful for data mining, invoice reconciliation, or report generation, and the process can be automated in a few lines of code. Run `pip install PyPDF2` in your terminal to install the package. Example usage with PyPDF2:
+
+> $ python extract_text_from_pdf.py
diff --git a/extract_text_from_pdf/extract_text_from_pdf.py b/extract_text_from_pdf/extract_text_from_pdf.py
@@ -0,0 +1,19 @@
+# Extract the text of the first page of a PDF using PyPDF2.
+# Uses the PdfReader API; the older PdfFileReader, getPage()
+# and extractText() names are deprecated.
+import PyPDF2
+
+# Put 'example.pdf' in the working directory and open it in
+# binary read mode; the with block closes the file on exit.
+with open('example.pdf', 'rb') as pdf_file:
+    # Create a PdfReader object for the open file.
+    pdf_reader = PyPDF2.PdfReader(pdf_file)
+    # To print the total number of pages in the PDF:
+    # print(len(pdf_reader.pages))
+    # Pages are stored in a list-like sequence,
+    # so index 0 is the first page.
+    page = pdf_reader.pages[0]
+    # Extract the text of the page with extract_text().
+    text = page.extract_text()
+# Print the extracted text.
+print(text)
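
A possible follow-up, sketched here under assumptions rather than taken from this commit: the script reads only the first page of a hard-coded file, so a small helper could take the path as a parameter and join the text of every page. The names `join_page_text` and `extract_text_from_pdf` are hypothetical, and the sketch assumes a PyPDF2 release that provides `PdfReader`, `reader.pages`, and `extract_text()`.

```python
def join_page_text(pages, separator="\n"):
    """Concatenate the text of every page object (hypothetical helper).

    `pages` is any iterable of objects exposing extract_text(),
    such as the `pages` sequence of a PyPDF2 PdfReader.
    """
    # Fall back to "" so a page that yields no text does not break the join.
    return separator.join(page.extract_text() or "" for page in pages)


def extract_text_from_pdf(path):
    """Return the text of all pages of the PDF at `path` (hypothetical helper)."""
    # Imported lazily so join_page_text can be used without PyPDF2 installed.
    import PyPDF2

    # Keep the file open while pages are read, then close it on exit.
    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        return join_page_text(reader.pages)
```

With this in place, `print(extract_text_from_pdf('example.pdf'))` would print the text of the whole document instead of page 0 only. Scanned PDFs without a text layer would still produce empty output, since no OCR is involved.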