task flow from egg to chicken #241

mophilly · 2025-02-06T00:19:14Z

mophilly
Feb 6, 2025
Collaborator

While adding splitting to handle large PDF files, I hit a classic question: which comes first?
classification ->> split pages ->> extraction ->> validation
or
split pages ->> classification ->> extraction ->> validation

I have been crafting a test script for each task. That is good for learning, but I wonder if I am too deep in the weeds. Issue #237 is a great addition. It does, however, make the distinction between classification, splitting and extraction less visible.

Is there a task flow more in line with the current project capabilities?

enoch3712 · 2025-02-06T21:13:40Z

enoch3712
Feb 6, 2025
Maintainer

Hello @mophilly!

Classification always comes first, then splitting, then extraction. The flow is always the same.

result = (process.load_file(BULK_DOC_PATH)
    .split(my_classifications, strategy=SplittingStrategy.EAGER)
    .extract(vision=True))

Inside the split, the pages are classified first and then aggregated; extraction runs on the aggregated pages.
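To make that order concrete, here is a rough, library-independent sketch. It is not the project's code: `classify_page`, `split_pages`, and `extract` are hypothetical stand-ins, a "page" is just a string, and the classifier is a toy keyword check. It only shows the sequence: classify each page, aggregate consecutive pages that share a label, then extract once per group.

```python
from itertools import groupby

# Hypothetical stand-ins for illustration only; none of these names
# come from the project. A "page" here is simply its text.
Page = str


def classify_page(page: Page) -> str:
    """Toy classifier: label a page by a keyword it contains."""
    return "invoice" if "invoice" in page.lower() else "license"


def split_pages(pages: list[Page]) -> list[tuple[str, list[Page]]]:
    """Classify every page, then aggregate consecutive pages sharing a label."""
    labelled = [(classify_page(page), page) for page in pages]
    return [
        (label, [page for _, page in run])
        for label, run in groupby(labelled, key=lambda item: item[0])
    ]


def extract(groups: list[tuple[str, list[Page]]]) -> list[dict]:
    """Toy extraction: produce one record per aggregated group."""
    return [{"type": label, "page_count": len(pages)} for label, pages in groups]


pages = ["Invoice 001, page 1", "Invoice 001, page 2", "Driver license"]
groups = split_pages(pages)   # classification happens inside the split
records = extract(groups)     # extraction runs on the aggregated groups
print(records)
# [{'type': 'invoice', 'page_count': 2}, {'type': 'license', 'page_count': 1}]
```

The point of the sketch is that splitting cannot decide where one document ends and the next begins without per-page labels, which is why classification sits inside the split step rather than after it.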

> Is there a task flow more in line with the current project capabilities?

I think you already know everything, but it is all done inside a Process. Take a look at test_process.

Tomorrow I should publish a release that includes Issue #237, which should make everything clearer to use.

1 reply

mophilly Feb 6, 2025
Collaborator Author

Thank you for the reply. Helpful, as always. I am very excited as I approach the first “real world” test using this project.
