Commit be4f9b3

Update to the JOSS paper based on the comments.
1 parent b9cba45 commit be4f9b3

1 file changed, +63 -35 lines changed

paper/paper.md

Lines changed: 63 additions & 35 deletions
@@ -1,5 +1,5 @@
 ---
-title: 'PDFIO: PDF Reader Library for native Julia'
+title: PDFIO: PDF Reader Library for native Julia'
 tags:
 - Julia
 - PDF
@@ -22,54 +22,81 @@ bibliography: paper.bib
 # Summary

 Portable Document Format (PDF) is the most ubiquitous file format for
-text, scientific research, legal documentation and many other domains
-for information dissemination and presentation. Being a final form
+text, scientific research, legal documentation and many other fields
+for information presentation and dissemination. Being a final form
 format of choice, a large body of text is currently archived in this
 format. Julia is an upcoming programming language in the field of data
-sciences with focus on text analysis. Extracting archived content to
-text is highly beneficial to the language usage and adoption.
+sciences. Extracting archived content and understanding document
+metadata is beneficial to the language usage.

-``PDFIO`` is an API developed purely in Julia. Almost, all the
+`PDFIO` is an API developed purely in Julia. Almost, all the
 functionalities of PDF understanding is entirely written from scratch
-in Julia with only exception of certain (de)compression codecs and
+in Julia with the only exception of certain (de)compression codecs and
 cryptography, where standard open source libraries are being used.

 The following are some of the benefits of utilizing this approach:

 1. PDF files are in existence for over three decades. Implementations
-of the PDF writers are not always accurate to the specification or
-they may even vary significantly from vendor to vendor. Every time,
-someone gets a new PDF file there is a possibility that it may not
-work to the best interpretation of the specification. A script
-based language makes it easier for the consumers to quickly modify
-the code and enhance to their specific needs.
+of the PDF writers are not always accurate to the specification.
+They may even vary significantly from vendor to vendor. Every time
+someone gets a new PDF file, there is a possibility that file
+may not be interpreted as per the specification. A script based
+language makes it easier for the consumers to quickly modify and
+enhance the code to their specific needs.

 2. When a higher level scripting language implements a C/C++ PDF
-library API, the scope is confined to achieving certain high level
-application tasks like, graphics or text extraction; annotation or
-signature content extraction or page merging or
-extraction. However, this API represents the PDF specification as a
-model (in Model, View and Controller parlance). Every object in PDF
-specification can be represented in some form through these
-APIs. Hence, objects can be utilized effectively to understand
-document structure or correlate documents in more meaningful ways.
+library API, the scope is kept limited to achieving certain high
+level tasks like, graphics or text extraction; annotation or
+signature content extraction; or page extraction or merging.
+
+However, `PDFIO` represents the PDF specification as a model in the
+Model, View and Controller parlance. A PDF file can be represented
+as a collection of interconnected Julia structures. Those
+structures can be utilized in granular tasks or simply can be used
+to understand the structure of the PDF document.
+
+As per the PDF specification, text can be presented as part of the
+page content stream or inside PDF page annotations. An API like
+`PDFIO` can create two categories of object types. One representing
+the text object inside the content stream and the other for the
+text inside an annotation object. Thus, providing flexibility to
+the API user.

-3. Potential to be extended as a PDF generator. Since, the API is
-written as an object model of PDF documents, it's easier to extend
-with additional PDF write or update capabilities.
+3. Since, the API is written as an object model of PDF documents, it's
+easier to extend with additional PDF write or update capabilities.
+Although, the current implementation does not provide the PDF
+writing capabilities, the foundation has been laid for future
+extension.

 There are also certain downsides to this approach:

 1. Any API that represents an object model of a document, tends to
-carry the complexity of introducing abstract objects, often opaque
-objects (handles) that are merely representational for an API
-user. They may not have any functional meaning. The methods tend to
-be granular than a method that can complete a user level task.
-2. The user may need to refer to the PDF specification
-(PDF-32000-1:2008)[@Adobe:2008] for having a complete semantic
-understanding.
-3. The amount of code needed to carry out certain tasks can be
-substantially higher.
+carry the complexity of introducing abstract objects. They can be
+opaque objects (handles) that are representational specific to the
+API. They may not have any functional meaning. The methods are
+granular and may not complete one use level task. The amount of code
+needed to complete a user level task can be substantially higher.
+
+A comparative presentation of such approach can be seen in the
+illustration given below. A text extraction task, that can be just
+one simple method invocation in a competing library like `Taro`,
+can involve more number of steps in `PDFIO`. For example, in `PDFIO`
+the following steps have to be carried out:
+a. Open the PDF document and obtain the document handle.
+b. Query the document handle for all the pages in the document.
+c. Iterate the pages and obtain the page object handles for each of
+the pages.
+d. Extract the text from the page objects and write to a file IO.
+e. Close the document ensuring all the document resources are
+reclaimed.
+2. The API user may need to refer to the PDF specification
+(PDF-32000-1:2008)[@Adobe:2008] for semantic understanding of PDF
+files in accomplishing some of the tasks. For example, the workflow
+of PDF text extraction above is a natural extension from how text is
+represented in a PDF file as per the specification. A PDF file is
+composed of pages and text is represented inside each page content
+object. The object model of `PDFIO` is a Julia language
+representation of the PDF specification.

 ## Illustration

@@ -105,7 +132,8 @@ end

 ```
 While `PDFIO` requires a larger number of lines of code, it definitely
-provides a more granular set of APIs.
+provides a more granular set of APIs to understand the PDF document
+structure.

 # Functionality

@@ -132,7 +160,7 @@ provides a more granular set of APIs.
 also known as the `COS` layer APIs.


-# Acknowledgements
+# Acknowledgments

 We acknowledge contributions of all the community developers who have
 contributed to this effort. Their contribution can be viewed at:
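
For readers of this diff, the five-step text extraction workflow listed in the added lines of `paper/paper.md` (open, query pages, iterate, extract, close) might look roughly like the sketch below. It is not the illustration from the paper itself, which this diff elides. It assumes the `PDFIO` package is installed and uses the function names from its public documentation (`pdDocOpen`, `pdDocGetPageCount`, `pdDocGetPage`, `pdPageExtractText`, `pdDocClose`); the helper name `extract_text` and both path arguments are hypothetical.

```julia
using PDFIO

# Hypothetical helper: writes the text of every page of `src` to the file `out`.
function extract_text(src::AbstractString, out::AbstractString)
    doc = pdDocOpen(src)                       # a. open the document, obtain the handle
    try
        open(out, "w") do io
            npage = pdDocGetPageCount(doc)     # b. query the handle for the page count
            for i in 1:npage
                page = pdDocGetPage(doc, i)    # c. obtain the page object handle
                pdPageExtractText(io, page)    # d. extract the page text into the file IO
            end
        end
    finally
        pdDocClose(doc)                        # e. close the document, release resources
    end
    return out
end
```

The `try`/`finally` wrapper is a design choice of this sketch rather than something the paper prescribes: it keeps step (e) from being skipped if extraction fails on a malformed page.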
