@@ -22,65 +22,56 @@ The following are some of the benefits of utilizing this approach:
22
22
of the specification. A script based language makes it easier for the
23
23
consumers to quickly modify the code and enhance it for their specific needs.
24
24
25
- 2 . When a higher level scripting language implements a C/C++ PDF library API, the
26
- scope is confined to achieving certain high level application tasks like,
27
- graphics or text extraction; annotation or signature content extraction or
28
- page merging or extraction. However, this API represents the PDF
29
- specification as a model (in MVC parlance). Every object in PDF
30
- specification can be represented in some form through these APIs. Hence,
31
- objects can be utilized effectively to understand document structure or
32
- correlate documents in more meaningful ways.
33
-
34
- 3 . Potential to be extended as a PDF generator. Since, the API is written as an
35
- object model of PDF documents, it's easier to extend with additional PDF
36
- write or update capabilities.
25
+ 2. When a higher-level scripting language wraps a C/C++ PDF
26
+ library API, the scope is usually limited to certain high-level
27
+ tasks, such as graphics or text extraction, annotation or
28
+ signature content extraction, or page extraction or merging.
37
29
30
+ However, `PDFIO` represents the PDF specification as a model, in
31
+ Model-View-Controller (MVC) parlance. A PDF file can be represented
32
+ as a collection of interconnected Julia structures. Those
33
+ structures can be used in granular tasks or simply
34
+ to understand the structure of the PDF document.
35
+
36
+ As per the PDF specification, text can be presented as part of the
37
+ page content stream or inside PDF page annotations. An API like
38
+ `PDFIO` can create two categories of object types: one representing
39
+ the text object inside the content stream and the other for the
40
+ text inside an annotation object. This provides flexibility to
41
+ the API user.
42
+
43
+ 3. Since the API is written as an object model of PDF documents, it is
44
+ easier to extend with additional PDF write or update capabilities.
45
+ Although the current implementation does not provide PDF
46
+ writing capabilities, the foundation has been laid for future
47
+ extension.
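
The two categories of text objects described in item 2 can be sketched with plain Julia types. This is only an illustration under assumed names: `PDFText`, `ContentStreamText`, `AnnotationText`, `source` and `describe` are hypothetical and are not the types or functions that `PDFIO` actually defines.

```julia
# Hypothetical types for illustration only; not PDFIO's actual definitions.
abstract type PDFText end

# Text drawn as part of a page content stream.
struct ContentStreamText <: PDFText
    page::Int
    text::String
end

# Text stored inside an annotation attached to a page.
struct AnnotationText <: PDFText
    page::Int
    text::String
end

# Multiple dispatch lets a caller distinguish the two categories...
source(::ContentStreamText) = "content stream"
source(::AnnotationText) = "annotation"

# ...or handle both uniformly through the abstract supertype.
describe(t::PDFText) = "page $(t.page) [$(source(t))]: $(t.text)"

items = PDFText[ContentStreamText(1, "Hello"), AnnotationText(1, "Reviewer note")]

# Select only the text that lives in annotations.
annotation_only = filter(t -> t isa AnnotationText, items)

foreach(t -> println(describe(t)), items)
```

Keeping the two kinds of text as separate types means a caller can filter on the category it cares about, while code that only needs the text can work against the shared supertype.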
38
48
39
49
There are also certain downsides to this approach:
40
50
41
- 1 . Any API that represents an object model of a document, tends to carry the
42
- complexity of introducing abstract objects, often opaque objects (handles)
43
- that are merely representational for an API user. They may not have any
44
- functional meaning. The methods tend to be granular than a method that can
45
- complete a user level task.
46
- 2 . The user may need to refer to the PDF specification for having a complete
47
- semantic understanding.
48
- 3 . The amount of code needed to carry out certain tasks can be substantially
49
- higher.
51
+ 1. Any API that represents an object model of a document tends to
52
+ carry the complexity of introducing abstract objects. They can be
53
+ opaque objects (handles) that are merely representational and specific to the
54
+ API. They may not have any functional meaning. The methods are
55
+ granular and may not complete a user-level task on their own. The amount of code
56
+ needed to complete a user-level task can be substantially higher.
50
57
51
- ### Illustration
52
-
53
- A popular package ` Taro.jl ` that utilizes Java based [ Apache
54
- Tika] ( http://tika.apache.org/ ) , [ Apache POI] ( http://poi.apache.org/ ) and [ Apache
55
- FOP] ( https://xmlgraphics.apache.org/fop/ ) libraries for reading PDF and other
56
- file types may need the following code to extract text and other metadata from
57
- the document.
58
+ In `PDFIO`, extracting text requires the following steps:
59
+ a. Open the PDF document and obtain the document handle.
60
+ b. Query the document handle for all the pages in the document.
61
+ c. Iterate over the pages and obtain the page object handle for each
62
+ page.
63
+ d. Extract the text from the page objects and write it to a file IO.
64
+ e. Close the document, ensuring all the document resources are
65
+ reclaimed.
66
+ 2. The API user may need to refer to the PDF specification
67
+ (PDF-32000-1:2008) [@Adobe:2008] for a semantic understanding of PDF
68
+ files when accomplishing some of the tasks. For example, the workflow
69
+ of PDF text extraction above follows naturally from how text is
70
+ represented in a PDF file as per the specification. A PDF file is
71
+ composed of pages, and text is represented inside each page content
72
+ object. The object model of `PDFIO` is a Julia language
73
+ representation of the PDF specification.
58
74
59
- ``` julia
60
- using Taro
61
- Taro. init ()
62
- meta, txtdata = Taro. extract (" sample.pdf" );
63
-
64
- ```
65
-
66
- While the same with ` PDFIO ` may look like below:
67
-
68
- ``` julia
69
- function getPDFText (src, out)
70
- doc = pdDocOpen (src)
71
- docinfo = pdDocGetInfo (doc)
72
- open (out, " w" ) do io
73
- npage = pdDocGetPageCount (doc)
74
- for i= 1 : npage
75
- page = pdDocGetPage (doc, i)
76
- pdPageExtractText (io, page)
77
- end
78
- end
79
- pdDocClose (doc)
80
- return docinfo
81
- end
82
-
83
- ```
84
75
85
76
## Installation
86
77
@@ -106,34 +97,30 @@ The above mentioned code takes a PDF file `src` as input and writes the text dat
106
97
return - A dictionary containing metadata of the document
107
98
"""
108
99
function getPDFText(src, out)
100
+ # `doc` is a handle that can be used for subsequent operations on the document.
109
101
doc = pdDocOpen(src)
110
- ```
111
- Provides ` doc ` handle that can be used for subsequence operations on the document.
112
- ``` julia
102
+
103
+ # Metadata extracted from the PDF document.
104
+ # This value is retained and returned from the function.
113
105
docinfo = pdDocGetInfo(doc)
114
- ```
115
- Metadata extracted from the PDF document. This value is retained and returned as the return from the function.
116
- ``` julia
117
106
open(out, "w") do io
107
+
108
+ # Returns number of pages in the document
118
109
npage = pdDocGetPageCount(doc)
119
- ```
120
- Returns number of pages in the document
121
- ``` julia
110
+
122
111
for i = 1:npage
112
+
113
+ # Get a handle to the specific page given its numeric index.
123
114
page = pdDocGetPage(doc, i)
124
- ```
125
- Returns a ` page ` handle to the specific page given the number number index.
126
- ``` julia
115
+
116
+ # Extract text from the page and write it to the output file.
127
117
pdPageExtractText(io, page)
128
- ```
129
- Extract text from the page and write it to the output file.
130
- ``` julia
131
- end
118
+
119
+ end
132
120
end
121
+ # Close the document handle.
122
+ # The `doc` handle should not be used after this call.
133
123
pdDocClose(doc)
134
- ```
135
- Close the document handle. The ` doc ` handle should not be used after this call
136
- ``` julia
137
124
return docinfo
138
125
end
139
126
```