Commit da9fb76

Modified the README.md based on JOSS comments.
1 parent 5044a4d commit da9fb76

1 file changed

+59
-72
lines changed

README.md

Lines changed: 59 additions & 72 deletions
Original file line numberDiff line numberDiff line change
@@ -22,65 +22,56 @@ The following are some of the benefits of utilizing this approach:
2222
of the specification. A script based language makes it easier for the
2323
consumers to quickly modify the code and enhance to their specific needs.
2424

25-
2. When a higher level scripting language implements a C/C++ PDF library API, the
26-
scope is confined to achieving certain high level application tasks like,
27-
graphics or text extraction; annotation or signature content extraction or
28-
page merging or extraction. However, this API represents the PDF
29-
specification as a model (in MVC parlance). Every object in PDF
30-
specification can be represented in some form through these APIs. Hence,
31-
objects can be utilized effectively to understand document structure or
32-
correlate documents in more meaningful ways.
33-
34-
3. Potential to be extended as a PDF generator. Since, the API is written as an
35-
object model of PDF documents, it's easier to extend with additional PDF
36-
write or update capabilities.
25+
2. When a higher level scripting language implements a C/C++ PDF
26+
library API, the scope is kept limited to achieving certain high
27+
level tasks like graphics or text extraction; annotation or
28+
signature content extraction; or page extraction or merging.
3729

30+
However, `PDFIO` represents the PDF specification as a model in the
31+
Model, View and Controller parlance. A PDF file can be represented
32+
as a collection of interconnected Julia structures. Those
33+
structures can be utilized in granular tasks or simply can be used
34+
to understand the structure of the PDF document.
35+
36+
As per the PDF specification, text can be presented as part of the
37+
page content stream or inside PDF page annotations. An API like
38+
`PDFIO` can create two categories of object types. One representing
39+
the text object inside the content stream and the other for the
40+
text inside an annotation object, thus providing flexibility to
41+
the API user.
42+
43+
3. Since the API is written as an object model of PDF documents, it's
44+
easier to extend with additional PDF write or update capabilities.
45+
Although the current implementation does not provide the PDF
46+
writing capabilities, the foundation has been laid for future
47+
extension.
3848

3949
There are also certain downsides to this approach:
4050

41-
1. Any API that represents an object model of a document, tends to carry the
42-
complexity of introducing abstract objects, often opaque objects (handles)
43-
that are merely representational for an API user. They may not have any
44-
functional meaning. The methods tend to be granular than a method that can
45-
complete a user level task.
46-
2. The user may need to refer to the PDF specification for having a complete
47-
semantic understanding.
48-
3. The amount of code needed to carry out certain tasks can be substantially
49-
higher.
51+
1. Any API that represents an object model of a document tends to
52+
carry the complexity of introducing abstract objects. They can be
53+
opaque objects (handles) that are representations specific to the
54+
API. They may not have any functional meaning. The methods are
55+
granular and may not complete one user level task. The amount of code
56+
needed to complete a user level task can be substantially higher.
5057

51-
### Illustration
52-
53-
A popular package `Taro.jl` that utilizes Java based [Apache
54-
Tika](http://tika.apache.org/), [Apache POI](http://poi.apache.org/) and [Apache
55-
FOP](https://xmlgraphics.apache.org/fop/) libraries for reading PDF and other
56-
file types may need the following code to extract text and other metadata from
57-
the document.
58+
In `PDFIO` the following steps have to be carried out:
59+
a. Open the PDF document and obtain the document handle.
60+
b. Query the document handle for all the pages in the document.
61+
c. Iterate the pages and obtain the page object handles for each of
62+
the pages.
63+
d. Extract the text from the page objects and write to a file IO.
64+
e. Close the document ensuring all the document resources are
65+
reclaimed.
66+
2. The API user may need to refer to the PDF specification
67+
(PDF-32000-1:2008)[@Adobe:2008] for semantic understanding of PDF
68+
files in accomplishing some of the tasks. For example, the workflow
69+
of PDF text extraction above is a natural extension from how text is
70+
represented in a PDF file as per the specification. A PDF file is
71+
composed of pages and text is represented inside each page content
72+
object. The object model of `PDFIO` is a Julia language
73+
representation of the PDF specification.
5874

59-
```julia
60-
using Taro
61-
Taro.init()
62-
meta, txtdata = Taro.extract("sample.pdf");
63-
64-
```
65-
66-
While the same with `PDFIO` may look like below:
67-
68-
```julia
69-
function getPDFText(src, out)
70-
doc = pdDocOpen(src)
71-
docinfo = pdDocGetInfo(doc)
72-
open(out, "w") do io
73-
npage = pdDocGetPageCount(doc)
74-
for i=1:npage
75-
page = pdDocGetPage(doc, i)
76-
pdPageExtractText(io, page)
77-
end
78-
end
79-
pdDocClose(doc)
80-
return docinfo
81-
end
82-
83-
```
8475

8576
## Installation
8677

@@ -106,34 +97,30 @@ The above mentioned code takes a PDF file `src` as input and writes the text dat
10697
return - A dictionary containing metadata of the document
10798
"""
10899
function getPDFText(src, out)
100+
# Handle that can be used for subsequent operations on the document.
109101
doc = pdDocOpen(src)
110-
```
111-
Provides `doc` handle that can be used for subsequence operations on the document.
112-
```julia
102+
103+
# Metadata extracted from the PDF document.
104+
# This value is retained and returned from the function.
113105
docinfo = pdDocGetInfo(doc)
114-
```
115-
Metadata extracted from the PDF document. This value is retained and returned as the return from the function.
116-
```julia
117106
open(out, "w") do io
107+
108+
# Returns number of pages in the document
118109
npage = pdDocGetPageCount(doc)
119-
```
120-
Returns number of pages in the document
121-
```julia
110+
122111
for i=1:npage
112+
113+
# Handle to the page at the given index.
123114
page = pdDocGetPage(doc, i)
124-
```
125-
Returns a `page` handle to the specific page given the number number index.
126-
```julia
115+
116+
# Extract text from the page and write it to the output file.
127117
pdPageExtractText(io, page)
128-
```
129-
Extract text from the page and write it to the output file.
130-
```julia
131-
end
118+
119+
end
132120
end
121+
# Close the document handle.
122+
# The doc handle should not be used after this call
133123
pdDocClose(doc)
134-
```
135-
Close the document handle. The `doc` handle should not be used after this call
136-
```julia
137124
return docinfo
138125
end
139126
```
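
The README text in this commit asks that the document be closed "ensuring all the document resources are reclaimed". The following is a minimal sketch of one way to guarantee that, by wrapping the page loop in `try`/`finally`. It is not part of the commit. It uses only the `PDFIO` calls that appear in the diff above, and it assumes that `using PDFIO` brings them into scope. The function name `getPDFTextSafe` is hypothetical.

```julia
using PDFIO  # assumption: exports the pdDoc*/pdPage* functions used below

# Hypothetical variant of getPDFText that closes the document
# even when page extraction throws an error.
function getPDFTextSafe(src, out)
    doc = pdDocOpen(src)
    try
        # Metadata is read first so it can be returned after the handle is closed.
        docinfo = pdDocGetInfo(doc)
        open(out, "w") do io
            for i in 1:pdDocGetPageCount(doc)
                page = pdDocGetPage(doc, i)
                pdPageExtractText(io, page)
            end
        end
        return docinfo
    finally
        # Runs on both normal return and error; doc must not be used afterwards.
        pdDocClose(doc)
    end
end
```

The `open(out, "w") do io ... end` form already closes the output file on error, so the `finally` block only needs to handle the document handle.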
