@@ -36,18 +36,18 @@ cryptography, where standard open source libraries are being used.

The following are some of the benefits of utilizing this approach:

- 1. PDF files are in existence for over three decades. Implementations
- of the PDF writers are not always accurate to the specification.
+ 1. PDF files have been in existence for over three decades. Implementations
+ of PDF writers are not always accurate to the specification.
They may even vary significantly from vendor to vendor. Every time
someone gets a new PDF file, there is a possibility that file
- may not be interpreted as per the specification. A script based
+ may not be interpreted as per the specification. A script-based
language makes it easier for the consumers to quickly modify and
enhance the code to their specific needs.

- 2. When a higher level scripting language implements a C/C++ PDF
- library API, the scope is kept limited to achieving certain high
- level tasks like, graphics or text extraction; annotation or
- signature content extraction; or page extraction or merging.
+ 2. When a higher-level scripting language implements a C/C++ PDF
+ library API, the scope is limited to achieving certain high-level
+ tasks like graphics or text extraction, annotation or
+ signature content extraction, or page extraction or merging.

However, `PDFIO` represents the PDF specification as a model in the
Model, View and Controller parlance. A PDF file can be represented
@@ -59,7 +59,7 @@ The following are some of the benefits of utilizing this approach:
page content stream or inside PDF page annotations. An API like
`PDFIO` can create two categories of object types. One representing
the text object inside the content stream and the other for the
- text inside an annotation object. Thus, providing flexibility to
+ text inside an annotation object. Thus, it provides flexibility to
the API user.

3. Since, the API is written as an object model of PDF documents, it's
@@ -70,21 +70,21 @@ The following are some of the benefits of utilizing this approach:

There are also certain downsides to this approach:

- 1. Any API that represents an object model of a document, tends to
+ 1. Any API that represents an object model of a document tends to
carry the complexity of introducing abstract objects. They can be
- opaque objects (handles) that are representational specific to the
+ opaque objects (handles) that are representational-specific to the
API. They may not have any functional meaning. The methods are
- granular and may not complete one use level task. The amount of code
- needed to complete a user level task can be substantially higher.
+ granular and may not complete one use-level task. The amount of code
+ needed to complete a user-level task can be substantially higher.

A comparative presentation of such approach can be seen in the
- illustration given below. A text extraction task, that can be just
+ illustration given below. A text extraction task that can be
one simple method invocation in a competing library like `Taro`,
can involve more number of steps in `PDFIO`. For example, in `PDFIO`
the following steps have to be carried out:
a. Open the PDF document and obtain the document handle.
b. Query the document handle for all the pages in the document.
- c. Iterate the pages and obtain the page object handles for each of
+ c. Iterate the pages and obtain page object handles for each of
the pages.
d. Extract the text from the page objects and write to a file IO.
e. Close the document ensuring all the document resources are
@@ -100,7 +100,7 @@ There are also certain downsides to this approach:

## Illustration

- A popular package `Taro.jl` [@Avik:2013] that utilizes Java based [Apache
+ The popular package `Taro.jl` [@Avik:2013] that utilizes Java based [Apache
Tika](http://tika.apache.org/), [Apache POI](http://poi.apache.org/)
and [Apache FOP](https://xmlgraphics.apache.org/fop/) libraries for
reading PDF and other file types may need the following code to
@@ -113,7 +113,7 @@ meta, txtdata = Taro.extract("sample.pdf");

```

- While the same with `PDFIO` may look like below:
+ The same functionality with `PDFIO` may look like the code below:

```julia
function getPDFText(src, out)
@@ -131,19 +131,19 @@ function getPDFText(src, out)
end

```
- While `PDFIO` requires a larger number of lines of code, it definitely
- provides a more granular set of APIs to understand the PDF document
+ While `PDFIO` requires a larger number of lines of code, it
+ provides a more granular set of APIs for understanding the PDF document
structure.

# Functionality

`PDFIO` is implemented in layers enabling following features:

- 1. Extract and render the Contents in of a PDF page. This ensures the
- contents are organized in a hierarchical grouping, that can be used
+ 1. Extract and render the Contents in a PDF page. This ensures the
+ contents are organized in a hierarchical grouping that can be used
for rendering of the content. Rendering is used here in a generic
- sense and not confined to painting on a raster device. For example,
- extracting document text can also be considered as a rendering
+ sense and is not confined to painting on a raster device. For example,
+ extracting document text can also be considered a rendering
task. `pdPageExtractText` is an apt example of the same.
2. Provide functional tasks to PDF document access. A few of such
functionalities are:
@@ -155,7 +155,7 @@ structure.
- Extracting fonts and font attributes (`pdPageGetFonts`,
`pdFontIsItalic` etc.)
3. Access low level PDF objects (`CosObject`) and obtain information
- when high level APIs do not exist. These kinds of functionalities
+ when high-level APIs do not exist. These kinds of functionalities
are mostly related to the file structure of the PDF documents and
also known as the `COS` layer APIs.