External OCR should be a Paperless native feature (and might become one) #409

nstrelow · 2025-05-18T15:00:50Z


Hey folks, awesome tool here: it provides pretty accurate metadata and a nice OCR integration.

I got interested in better OCR after seeing how bad the built-in OCR sometimes is with pictures. While paperless-gpt's OCR is great, I would love to get the ingested PDF with the OCR'd text overlaid (I think Google and Azure support this).

My main problem is with the current workflow: due to limitations of the Paperless API (AFAIK), OCR and tagging only run once a certain tag is seen. That only happens after ingestion, so to update the original PDF with the OCR'd version, it has to be replaced using PDF_REPLACE, which uploads the new version and deletes the old one.

This works, but it leaves me slightly unsatisfied, as we are:

1. Ingesting the document
2. Adding the paperless-gpt auto tags
3. Running paperless-gpt OCR
4. Uploading the OCR'd version of the PDF
5. Deleting the old version of the PDF
6. Re-ingesting the new PDF

It's more a problem of principle. The only "real" downside is that I lose my original filename, which I would have liked to keep.

There's been an interesting discussion in the paperless repo about alternative OCR integration (paperless-ngx/paperless-ngx#5128 (reply in thread)). The creator of the Google OCR tool, which AFAIK is reused by paperless-gpt, suggested using a pre-consumption script to run OCR before consumption.

I think a pre-consumption script is a great first step toward running OCR before the actual ingestion by Paperless.
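
To make the idea concrete, here is a minimal sketch of what such a pre-consumption script could look like. It assumes that Paperless passes the path of the working copy in the `DOCUMENT_WORKING_PATH` environment variable (please check the paperless-ngx docs for your version), and `external_ocr` is a purely hypothetical placeholder for whichever OCR service you would call. It is not how paperless-gpt or the Google OCR tool works today.

```python
import os
import tempfile
from pathlib import Path
from typing import Callable, Mapping


def external_ocr(pdf_bytes: bytes) -> bytes:
    """Hypothetical placeholder: send the PDF to an OCR service and
    return a PDF with an overlaid text layer."""
    raise NotImplementedError("plug in your OCR service here")


def pre_consume(env: Mapping[str, str], ocr: Callable[[bytes], bytes]) -> Path:
    """Replace the working copy with its OCR'd version before ingestion."""
    # Assumption: the consumer exposes the working copy under this variable.
    working = Path(env["DOCUMENT_WORKING_PATH"])

    # Only PDFs are handled in this sketch; other files pass through untouched.
    if working.suffix.lower() != ".pdf":
        return working

    searchable = ocr(working.read_bytes())

    # Write next to the target and swap it in, so a failed write never
    # leaves a half-written document behind.
    fd, tmp_name = tempfile.mkstemp(dir=working.parent, suffix=".tmp")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(searchable)
    os.replace(tmp_name, working)
    return working


# Only run as a script when the consumer actually provided the variable.
if __name__ == "__main__" and "DOCUMENT_WORKING_PATH" in os.environ:
    pre_consume(os.environ, external_ocr)
```

Because the file is rewritten in place before consumption, the document is ingested only once and there is no upload/delete/re-ingest cycle, so the original filename should survive.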

Ideally, there should be an option to choose between different OCR services, much like in paperless-gpt. The paperless folks are investigating such an integration with Azure right now, see paperless-ngx/paperless-ngx#5128 (comment).

My points only cover OCR for text, not the other cool things LLMs could do, like understanding images and document structure.

Posting this to raise visibility for the ongoing external OCR discussion in Paperless, in the hope of getting more people interested 😁

I would never have known my OCR'd text was so bad had I not first set up paperless-gpt 🤣
