Hi,
I have been shredding PDFs for some time using a variety of tools (including pdftohtml) and only recently discovered your library. I really like the API and the CMake-based build system which makes it very easy to incorporate into projects. Thanks!
I have a question about how best to add a feature that returns the text elements, together with their bounding boxes, as a JSON representation. (I plan to package this as a SQLite extension that would let you query the text contents of a PDF.)
Here is a sample of the XML generated by pdftohtml. I find this schema fine as-is and would like to do something similar with your library (taking advantage of how much easier PDFHummus is to incorporate than pdftohtml). I assume that this work is not particularly difficult, given what has already been done, but I wanted to get your advice before jumping into it.
<pdf2xml producer="poppler" version="20.11.0">
<page number="10" position="absolute" top="0" left="0" height="1188" width="918">
<fontspec id="0" size="54" family="ArialMT" color="#b6b6b6"/>
<fontspec id="1" size="16" family="ArialMT" color="#000000"/>
<fontspec id="2" size="23" family="TrebuchetMS" color="#000000"/>
<fontspec id="3" size="16" family="Arial" color="#000000"/>
<fontspec id="4" size="16" family="Arial" color="#999999"/>
<text top="11" left="459" width="15" height="60" font="0"> </text>
<text top="109" left="108" width="701" height="18" font="1">began in the 1960s with merely identifying the systems from space. Researchers then</text>
<text top="131" left="108" width="701" height="18" font="1">developed a technique to estimate intensity from the storm cloud structure and lifetime. See</text>
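To make the request concrete, here is a minimal sketch of how one `<text>` element from the sample above might map to a JSON object. It uses only the C++ standard library. The `TextElement` struct and the `escapeJson` and `toJson` helpers are hypothetical names chosen for illustration; none of them are part of PDFHummus or pdftohtml, and the field names simply mirror the XML attributes.

```cpp
#include <cstdio>
#include <string>

// Hypothetical record mirroring one <text> element of the pdftohtml XML.
// Not a PDFHummus type; the fields copy the XML attribute names.
struct TextElement {
    int top;
    int left;
    int width;
    int height;
    int font;          // index of the matching <fontspec> entry
    std::string text;  // text content, assumed to be UTF-8
};

// Escape the characters that JSON string literals require to be escaped.
std::string escapeJson(const std::string& s) {
    std::string out;
    for (unsigned char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:
                if (c < 0x20) {
                    // Remaining control characters become \u00XX.
                    char buf[7];
                    std::snprintf(buf, sizeof buf, "\\u%04x", c);
                    out += buf;
                } else {
                    out += static_cast<char>(c);
                }
        }
    }
    return out;
}

// Serialize one element as a flat JSON object.
std::string toJson(const TextElement& e) {
    return "{\"top\":" + std::to_string(e.top) +
           ",\"left\":" + std::to_string(e.left) +
           ",\"width\":" + std::to_string(e.width) +
           ",\"height\":" + std::to_string(e.height) +
           ",\"font\":" + std::to_string(e.font) +
           ",\"text\":\"" + escapeJson(e.text) + "\"}";
}
```

With nlohmann::json available, the hand-written escaping would presumably be replaced by building a `json` object and calling `dump()`; the sketch avoids that dependency only to stay self-contained.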
Let me know if you have any thoughts on whether to return a string (either a single document or one document per page) or an nlohmann::json data structure. Because the JSON support in SQLite is very good, I find it easy to write simple scalar functions that return a blob of JSON and then convert that into a table-like value via a Common Table Expression.
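The two string-returning shapes mentioned above could look roughly like the following sketch. It assumes each page's text elements have already been serialized to a JSON array; `PageJson`, `pageDocument`, `onePerPage`, and `singleDocument` are hypothetical names for illustration, not existing PDFHummus functions.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical holder: a page number plus an already serialized
// JSON array of that page's text elements.
struct PageJson {
    int number;
    std::string elements;  // e.g. "[{...},{...}]"
};

// Wrap one page as a JSON object.
std::string pageDocument(const PageJson& p) {
    return "{\"page\":" + std::to_string(p.number) +
           ",\"text\":" + p.elements + "}";
}

// Option A: one JSON document per page.
std::vector<std::string> onePerPage(const std::vector<PageJson>& pages) {
    std::vector<std::string> docs;
    docs.reserve(pages.size());
    for (const PageJson& p : pages) {
        docs.push_back(pageDocument(p));
    }
    return docs;
}

// Option B: a single JSON document, an array with one object per page.
std::string singleDocument(const std::vector<PageJson>& pages) {
    std::string out = "[";
    for (std::size_t i = 0; i < pages.size(); ++i) {
        if (i > 0) out += ",";
        out += pageDocument(pages[i]);
    }
    out += "]";
    return out;
}
```

On the SQLite side, a single array-shaped document like option B should be expandable into one row per page with the built-in `json_each` table-valued function, which is the kind of CTE-based conversion described above. Option A would fit a scalar function that takes a page number as an argument.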
I am delighted to have found your library and to start using it.