Commit d8111c7

alexhanna authored and sambitdash committed
Update paper.md (#77)
Fixed typos, grammar.
1 parent 838b866 commit d8111c7

1 file changed: +23 -23 lines changed

paper/paper.md

Lines changed: 23 additions & 23 deletions
@@ -36,18 +36,18 @@ cryptography, where standard open source libraries are being used.
 
 The following are some of the benefits of utilizing this approach:
 
-1. PDF files are in existence for over three decades. Implementations
-of the PDF writers are not always accurate to the specification.
+1. PDF files have been in existence for over three decades. Implementations
+of PDF writers are not always accurate to the specification.
 They may even vary significantly from vendor to vendor. Every time
 someone gets a new PDF file, there is a possibility that file
-may not be interpreted as per the specification. A script based
+may not be interpreted as per the specification. A script-based
 language makes it easier for the consumers to quickly modify and
 enhance the code to their specific needs.
 
-2. When a higher level scripting language implements a C/C++ PDF
-library API, the scope is kept limited to achieving certain high
-level tasks like, graphics or text extraction; annotation or
-signature content extraction; or page extraction or merging.
+2. When a higher-level scripting language implements a C/C++ PDF
+library API, the scope is limited to achieving certain high- level
+tasks like graphics or text extraction, annotation or
+signature content extraction, or page extraction or merging.
 
 However, `PDFIO` represents the PDF specification as a model in the
 Model, View and Controller parlance. A PDF file can be represented
@@ -59,7 +59,7 @@ The following are some of the benefits of utilizing this approach:
 page content stream or inside PDF page annotations. An API like
 `PDFIO` can create two categories of object types. One representing
 the text object inside the content stream and the other for the
-text inside an annotation object. Thus, providing flexibility to
+text inside an annotation object. Thus, it provides flexibility to
 the API user.
 
 3. Since, the API is written as an object model of PDF documents, it's
@@ -70,21 +70,21 @@ The following are some of the benefits of utilizing this approach:
 
 There are also certain downsides to this approach:
 
-1. Any API that represents an object model of a document, tends to
+1. Any API that represents an object model of a document tends to
 carry the complexity of introducing abstract objects. They can be
-opaque objects (handles) that are representational specific to the
+opaque objects (handles) that are representational-specific to the
 API. They may not have any functional meaning. The methods are
-granular and may not complete one use level task. The amount of code
-needed to complete a user level task can be substantially higher.
+granular and may not complete one use-level task. The amount of code
+needed to complete a user-level task can be substantially higher.
 
 A comparative presentation of such approach can be seen in the
-illustration given below. A text extraction task, that can be just
+illustration given below. A text extraction task that can be
 one simple method invocation in a competing library like `Taro`,
 can involve more number of steps in `PDFIO`. For example, in `PDFIO`
 the following steps have to be carried out:
 a. Open the PDF document and obtain the document handle.
 b. Query the document handle for all the pages in the document.
-c. Iterate the pages and obtain the page object handles for each of
+c. Iterate the pages and obtain page object handles for each of
 the pages.
 d. Extract the text from the page objects and write to a file IO.
 e. Close the document ensuring all the document resources are
@@ -100,7 +100,7 @@ There are also certain downsides to this approach:
 
 ## Illustration
 
-A popular package `Taro.jl`[@Avik:2013] that utilizes Java based [Apache
+The popular package `Taro.jl`[@Avik:2013] that utilizes Java based [Apache
 Tika](http://tika.apache.org/), [Apache POI](http://poi.apache.org/)
 and [Apache FOP](https://xmlgraphics.apache.org/fop/) libraries for
 reading PDF and other file types may need the following code to
@@ -113,7 +113,7 @@ meta, txtdata = Taro.extract("sample.pdf");
 
 ```
 
-While the same with `PDFIO` may look like below:
+The same functionality with `PDFIO` may look like the code below:
 
 ```julia
 function getPDFText(src, out)
@@ -131,19 +131,19 @@ function getPDFText(src, out)
 end
 
 ```
-While `PDFIO` requires a larger number of lines of code, it definitely
-provides a more granular set of APIs to understand the PDF document
+While `PDFIO` requires a larger number of lines of code, it
+provides a more granular set of APIs for understanding the PDF document
 structure.
 
 # Functionality
 
 `PDFIO` is implemented in layers enabling following features:
 
-1. Extract and render the Contents in of a PDF page. This ensures the
-contents are organized in a hierarchical grouping, that can be used
+1. Extract and render the Contents in a PDF page. This ensures the
+contents are organized in a hierarchical grouping that can be used
 for rendering of the content. Rendering is used here in a generic
-sense and not confined to painting on a raster device. For example,
-extracting document text can also be considered as a rendering
+sense and is not confined to painting on a raster device. For example,
+extracting document text can also be considered a rendering
 task. `pdPageExtractText` is an apt example of the same.
 2. Provide functional tasks to PDF document access. A few of such
 functionalities are:
@@ -155,7 +155,7 @@ structure.
 - Extracting fonts and font attributes (`pdPageGetFonts`,
 `pdFontIsItalic` etc.)
 3. Access low level PDF objects (`CosObject`) and obtain information
-when high level APIs do not exist. These kinds of functionalities
+when high-level APIs do not exist. These kinds of functionalities
 are mostly related to the file structure of the PDF documents and
 also known as the `COS` layer APIs.
 