Docling and splitting #224

mophilly · 2025-01-25T00:37:37Z

mophilly
Jan 25, 2025
Collaborator

The docling project seems like a great fit for many IDP cases. I have just arrived at the need to split large files for submission to an LLM. It appears that splitting is a fundamental element in the docling examples.

Is splitting, as expressed in ExtractThinker, a complement to docling or would using docling replace it?

enoch3712 · 2025-01-27T08:56:48Z

enoch3712
Jan 27, 2025
Maintainer

Hello @mophilly.

The DocumentLoader works separately from the Splitter; I made sure of that in each loader.

It works great with Docling. The one I would not advise is MarkItDown, because it doesn't allow splitting by page out of the box; Docling does.

To understand a documentLoader, just read this:
https://enoch3712.github.io/ExtractThinker/core-concepts/document-loaders/

It always uses one function, load, which returns an array of pages, each page with content and an image (if vision is enabled).
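As a rough illustration of that shape, here is a minimal, self-contained sketch. The `load` function below is a toy stand-in, not ExtractThinker's loader: the form-feed page break, the `vision` flag, and the `"content"` / `"image"` key names are assumptions made for the example.

```python
from typing import Any


def load(text: str, vision: bool = False) -> list[dict[str, Any]]:
    """Toy stand-in for a loader's load(): returns one dict per page."""
    pages: list[dict[str, Any]] = []
    # Assumption: a form feed marks a page break in this toy input.
    for raw_page in text.split("\f"):
        page: dict[str, Any] = {"content": raw_page.strip()}
        if vision:
            # A real loader would store the rendered page image here.
            page["image"] = None
        pages.append(page)
    return pages


pages = load("Invoice 001\fInvoice 002\fTerms and conditions")
for number, page in enumerate(pages, start=1):
    print(number, page["content"])
```

Because every page is its own entry, anything downstream (such as a splitter) can work page by page without knowing which loader produced the list.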

If you want to use splitter, just take a look at:
https://github.com/enoch3712/ExtractThinker/blob/main/tests/test_process.py
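To show why page-level output matters for splitting, here is a small sketch of the idea only. It is not the ExtractThinker Splitter: `split_pages` and the `starts_new_doc` callback are hypothetical names, and the callback stands in for whatever decides that a page begins a new document.

```python
from typing import Callable

Page = dict  # assumed shape: {"content": str, ...}, one entry per loaded page


def split_pages(
    pages: list[Page],
    starts_new_doc: Callable[[Page], bool],
) -> list[list[Page]]:
    """Group consecutive pages into documents."""
    groups: list[list[Page]] = []
    for page in pages:
        # The first page always opens a group; later pages open one
        # only when the callback says a new document begins.
        if not groups or starts_new_doc(page):
            groups.append([page])
        else:
            groups[-1].append(page)
    return groups


pages = [
    {"content": "INVOICE 001 ..."},
    {"content": "... continued line items"},
    {"content": "INVOICE 002 ..."},
]
groups = split_pages(pages, lambda p: p["content"].startswith("INVOICE"))
print([len(group) for group in groups])
```

A loader that returns the whole file as one blob gives a function like this nothing to group, which is the limitation mentioned above for loaders without page output.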

PS: sorry for the delay, I'm finishing another article, and this one is a big boy!

0 replies

mophilly · 2025-02-26T16:13:07Z

mophilly
Feb 26, 2025
Collaborator Author

Hello, @enoch3712.

Docling was easy to implement in this project. It outputs pages, which is nice.
The markdown result appears to be complete on a small test.

I would like additional output options, such as json, css, and html. Docling offers support for these. How might this fit into the vision for the project?

Adding output options to document_loader_docling.py, def load() seems like one place to put that. On or about line 231 is the assignment of conv_result, which later in the code provides the export_to_markdown. A branching statement could be placed there to support other formats.
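A minimal sketch of that branching idea, written as a lookup table rather than an if/elif chain. The `export_document` helper and the `FakeDocument` class are hypothetical; the three `export_to_*` method names are assumed to match the export methods on Docling's document object, and the stub exists only so the sketch runs without Docling installed.

```python
from typing import Any

# Assumed method names on the converted Docling document.
EXPORTERS = {
    "markdown": "export_to_markdown",
    "html": "export_to_html",
    "json": "export_to_dict",
}


def export_document(document: Any, output_format: str = "markdown") -> Any:
    """Call the export method that matches output_format."""
    try:
        method_name = EXPORTERS[output_format]
    except KeyError:
        raise ValueError(f"unsupported output format: {output_format!r}") from None
    return getattr(document, method_name)()


class FakeDocument:
    """Hypothetical stand-in for a converted document."""

    def export_to_markdown(self) -> str:
        return "# Title"

    def export_to_html(self) -> str:
        return "<h1>Title</h1>"

    def export_to_dict(self) -> dict:
        return {"title": "Title"}


doc = FakeDocument()
print(export_document(doc))          # default: markdown
print(export_document(doc, "html"))
```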

OTOH, raising the scope of conv_result in the class would allow for additional methods to invoke specific output types. That might result in more concise code and flexibility.
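A sketch of this second option, under stated assumptions: `CachedConversion`, `fake_convert`, and `FakeDocument` are hypothetical names, and the wrapped `export_to_*` calls are assumed to exist on the converted document. The point is only that the conversion runs once and each output type gets its own method.

```python
from typing import Any, Callable


class CachedConversion:
    """Hypothetical wrapper: convert once, export in several formats."""

    def __init__(self, convert: Callable[[str], Any]) -> None:
        self._convert = convert
        self._document: Any = None

    def load(self, source: str) -> None:
        # Keep the converted document on the instance instead of in a local.
        self._document = self._convert(source)

    def _require_document(self) -> Any:
        if self._document is None:
            raise RuntimeError("call load() before exporting")
        return self._document

    def to_markdown(self) -> str:
        return self._require_document().export_to_markdown()

    def to_html(self) -> str:
        return self._require_document().export_to_html()


class FakeDocument:
    """Stand-in so the sketch runs without Docling installed."""

    def export_to_markdown(self) -> str:
        return "# Title"

    def export_to_html(self) -> str:
        return "<h1>Title</h1>"


calls: list[str] = []


def fake_convert(source: str) -> FakeDocument:
    calls.append(source)  # record each conversion to show it happens once
    return FakeDocument()


conversion = CachedConversion(fake_convert)
conversion.load("report.pdf")
markdown = conversion.to_markdown()
page_html = conversion.to_html()
print(len(calls))
```

With Docling itself, the `convert` callable would presumably wrap `DocumentConverter().convert(source).document`; that wiring is left out here because it has not been checked against the loader's code.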

3 replies

enoch3712 Feb 27, 2025
Maintainer

Hello @mophilly !

In terms of output options, I will keep it JSON. Why? Because the parsing comes from Pydantic -> JSON. From there you can create the other formats yourself. If you want to keep HTML or CSV, just use Docling.

Remember, that's a bit out of scope, because that is what Docling already does. In this project we take care of the Document Intelligence part.
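To make "JSON first, other formats afterwards" concrete, here is a small sketch that uses only the standard library. The `extracted_json` payload and its field names are invented for illustration; it is shaped like what a Pydantic model's `model_dump_json()` might return, and the helpers assume at least one row.

```python
import csv
import html
import io
import json

# Hypothetical extraction result serialized as JSON.
extracted_json = (
    '{"invoice_number": "001", "lines": ['
    '{"item": "Widget", "qty": 2}, {"item": "Gadget & Co", "qty": 1}]}'
)


def lines_to_csv(payload: str) -> str:
    """Render the "lines" array as CSV text."""
    rows = json.loads(payload)["lines"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def lines_to_html(payload: str) -> str:
    """Render the "lines" array as a minimal HTML table."""
    rows = json.loads(payload)["lines"]
    header = "".join(f"<th>{html.escape(key)}</th>" for key in rows[0])
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(value))}</td>" for value in row.values())
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{header}</tr>{body}</table>"


print(lines_to_csv(extracted_json))
print(lines_to_html(extracted_json))
```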

mophilly Feb 28, 2025
Collaborator Author

I don't understand where docling fits in the extract thinker system.
The current docling document loader returns markdown. The Process for classification does not accept that; it accepts a file path. Likewise with the Extractor.

enoch3712 Feb 28, 2025
Maintainer

Docling is the DocumentLoader. It only extracts the content from the document; ET does the rest. Beyond that:
