Commit 0fc7018

Merge pull request #900 from devikabhapkar/main
Extract text from a PDF
2 parents 69c6274 + f419be2 commit 0fc7018

2 files changed

+24
-0
lines changed

extract_text_from_pdf/README.md

Lines changed: 5 additions & 0 deletions
@@ -0,0 +1,5 @@
# Extract text from PDF

Python can extract text from PDF files with the PyPDF2 package. Pulling text out of a PDF is useful for data mining, invoice reconciliation, and report generation, and the extraction can be automated in a few lines of code. To install the package, run `pip install PyPDF2` in your terminal. Usage:

> $ extract_text_from_pdf filename
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# import the PyPDF2 module
import PyPDF2
# put 'example.pdf' in the working directory
# and open it in binary read mode
pdfFileObj = open('example.pdf', 'rb')
# create a PdfFileReader object
# and store it in pdfReader
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# to print the total number of pages in the PDF, uncomment:
# print(pdfReader.numPages)
# get a specific page of the PDF by passing its
# index, since pages are stored like a list;
# pass 0 to access the first page
pageObj = pdfReader.getPage(0)
# extract the text of the page object
# with the extractText() method
texts = pageObj.extractText()
# print the extracted text
print(texts)
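
The script above uses the older PyPDF2 names (`PdfFileReader`, `numPages`, `getPage`, `extractText`), which were removed in PyPDF2 3.0 in favor of `PdfReader`, `reader.pages`, and `extract_text()`. The following is a minimal sketch of the same first-page extraction under the assumption that PyPDF2 3.0 or later is installed. The helper name `extract_text` and its `page_index` parameter are hypothetical and not part of this commit.

```python
from pathlib import Path


def extract_text(pdf_path, page_index=0):
    """Return the text of one page of a PDF.

    Hypothetical helper for illustration; it is not part of the commit.
    """
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"No such PDF file: {path}")

    # Imported inside the function so the path check above still runs
    # when PyPDF2 is not installed. Assumes PyPDF2 >= 3.0.
    from PyPDF2 import PdfReader

    # The with-block closes the file even if extraction raises.
    with path.open("rb") as pdf_file:
        reader = PdfReader(pdf_file)
        page_count = len(reader.pages)
        if not 0 <= page_index < page_count:
            raise IndexError(
                f"page_index {page_index} out of range for {page_count} pages"
            )
        # Pages are indexed from 0, as in the original script.
        return reader.pages[page_index].extract_text()
```

Calling `print(extract_text("example.pdf"))` would mirror the original script's behavior. Passing the path as an argument rather than hard-coding it is one way to line the script up with the `filename` argument shown in the README.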
