Hey folks, awesome tool here. It provides pretty accurate metadata and nice OCR integration.
I was interested in getting better OCR after seeing how bad it sometimes is with pictures. While the OCR is great, I would love to get the ingested PDF with the OCR'd text overlaid (I think Google and Azure support this).
My main problem with the current workflow is that (due to limitations of the Paperless API, AFAIK) the OCR and tagging only run when a certain tag is seen. This happens only after ingestion, so to update the original PDF with the OCR version, it needs to be replaced using PDF_REPLACE, which uploads the new version and deletes the old one.
This works, but it leaves me slightly unsatisfied, as we are:
1. Ingesting the document
2. Adding the paperless-gpt auto tags
3. Running paperless-gpt OCR
4. Uploading the OCR version of the PDF
5. Deleting the old version of the PDF
6. Re-ingesting the new PDF
It's more a problem of principle. The only "real" downside is that I lose my original filename, which I would have liked to keep.
There's been an interesting discussion in the paperless repo about alternative OCR integration (paperless-ngx/paperless-ngx#5128 (reply in thread)). The creator of the Google OCR tool that, AFAIK, is reused by paperless-gpt suggested using a pre-consumption script to run OCR before consumption.
I think a pre-consumption script is a great first step toward running OCR before the actual ingestion by paperless.
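To make the idea concrete, here is a minimal sketch of what such a pre-consumption script could look like. It assumes paperless-ngx runs the script configured via `PAPERLESS_PRE_CONSUME_SCRIPT` and passes the path of the working copy in the `DOCUMENT_WORKING_PATH` environment variable. `external_ocr` is a hypothetical placeholder, not part of paperless-ngx or paperless-gpt; a real version would call an OCR service and return a PDF with a text layer.

```python
#!/usr/bin/env python3
"""Sketch of a pre-consumption script that swaps in an OCR'd PDF."""
import os
import tempfile
from pathlib import Path
from typing import Callable


def external_ocr(pdf_bytes: bytes) -> bytes:
    """Hypothetical placeholder for a call to an external OCR service.

    A real implementation would return a PDF with an overlaid text layer.
    This stub returns the input unchanged.
    """
    return pdf_bytes


def replace_with_ocr(working_path: Path, ocr: Callable[[bytes], bytes]) -> bool:
    """Overwrite the working copy in place. Returns True if it was replaced."""
    if working_path.suffix.lower() != ".pdf":
        return False  # leave non-PDF files alone
    result = ocr(working_path.read_bytes())
    if not result.startswith(b"%PDF"):
        return False  # OCR did not return a PDF, so keep the original
    # Write to a temp file in the same directory, then swap it in atomically.
    fd, tmp_name = tempfile.mkstemp(dir=working_path.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(result)
    os.replace(tmp_name, working_path)
    return True


if __name__ == "__main__":
    # Assumption: paperless-ngx sets this variable for pre-consume scripts.
    working = os.environ.get("DOCUMENT_WORKING_PATH")
    if working:
        replace_with_ocr(Path(working), external_ocr)
```

Because the file is modified before paperless consumes it, this approach would not need the upload, delete, and re-ingest round trip.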
Ideally, there should be an option to choose between different OCR services, much like in paperless-gpt. The paperless folks are investigating such an integration with Azure right now, see paperless-ngx/paperless-ngx#5128 (comment).
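As a rough illustration of what "choosing between OCR services" could mean in a script, the sketch below picks a backend by name. Everything in it is hypothetical: the `OCR_PROVIDER` variable, the provider names, and the stub functions are assumptions for illustration, not existing paperless-ngx or paperless-gpt settings.

```python
import os
from typing import Callable, Dict, Optional

OcrFunc = Callable[[bytes], bytes]


def ocr_passthrough(pdf: bytes) -> bytes:
    """No external OCR: hand the PDF back unchanged."""
    return pdf


def ocr_google(pdf: bytes) -> bytes:
    """Hypothetical stub: a real version would call a Google OCR service."""
    raise NotImplementedError("Google OCR backend not implemented in this sketch")


def ocr_azure(pdf: bytes) -> bytes:
    """Hypothetical stub: a real version would call an Azure OCR service."""
    raise NotImplementedError("Azure OCR backend not implemented in this sketch")


# Registry mapping a provider name to the function that handles it.
PROVIDERS: Dict[str, OcrFunc] = {
    "none": ocr_passthrough,
    "google": ocr_google,
    "azure": ocr_azure,
}


def select_ocr(name: Optional[str] = None) -> OcrFunc:
    """Return the backend for `name`, falling back to the OCR_PROVIDER env var."""
    key = (name or os.environ.get("OCR_PROVIDER", "none")).lower()
    if key not in PROVIDERS:
        raise ValueError(f"Unknown OCR provider: {key!r}")
    return PROVIDERS[key]
```

Adding another service would then only mean registering one more function in `PROVIDERS`.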
My points only cover OCR for text, not the other cool things LLMs could do, like understanding images and document structure.
Posting this to raise visibility for the ongoing external OCR discussion in paperless, hoping to get more people interested 😁
I would never have known my OCR'd text was so bad had I not first set up paperless-gpt 🤣