Documents: Add automatic background processing #8

@helje5


The document (and note) types now support storing a plain text representation of a document. This is intended for three things:

  • Full-text search (FTS)
  • OCR
  • Faster access
    OGo Obj/C stores the document blobs in the filesystem. The storage is technically pluggable, though, so we might be able to refer out to e.g. a WebDAV server (or other document providers).

Columns:

  • text_content
  • text_content_type (SMALLINT, 0=plain?, 1=markdown, 2=html, ..., should that be a MIME type?)
  • text_content_object_version (the version of the document the content relates to)
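
As a rough illustration of how these three columns could fit together, here is a minimal sketch. The enum values mirror the tentative numbering above (which the issue itself still questions), and `object_version` is an assumed name for the document's current version; neither is a settled schema.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class TextContentType(IntEnum):
    """Tentative values for text_content_type (SMALLINT), as listed above."""
    PLAIN = 0
    MARKDOWN = 1
    HTML = 2


@dataclass
class DocumentRow:
    """Only the columns relevant here; object_version is an assumed name."""
    object_version: int
    text_content: Optional[str] = None
    text_content_type: Optional[TextContentType] = None
    text_content_object_version: Optional[int] = None


def needs_text_update(doc: DocumentRow) -> bool:
    """True when the text is missing or was derived from another version."""
    if doc.text_content is None or doc.text_content_object_version is None:
        return True
    return doc.text_content_object_version != doc.object_version
```

The point of `text_content_object_version` in this sketch is that a background job can tell stale text from current text without looking at the blob itself.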

Those fields should be filled asynchronously, either using a queue or just by a job started from cron. It could do various things:

  • OCR PDFs and images, e.g. using Tesseract or MarkItDown
  • Transcribe audio attachments, e.g. using Whisper
  • Generate document thumbnails (where would we store them, as sub-documents, own column?)
  • It could also create the ts_vector as part of the update of text_content; the related column would still have to be created
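
The cron-style variant could look roughly like the sketch below. It is an assumption-heavy illustration, not the planned implementation: the `document` table, its `id`, `mime_type` and `object_version` columns, the `load_blob` callback, and the extractor registry are all hypothetical, and SQLite stands in for the real database so the example stays self-contained. Real extractors (OCR, transcription) would be registered next to the plain-text one.

```python
import sqlite3
from typing import Callable, Dict, Tuple

# An extractor turns raw bytes into (text, text_content_type).
Extractor = Callable[[bytes], Tuple[str, int]]

PLAIN = 0  # tentative text_content_type value for plain text


def extract_plain(data: bytes) -> Tuple[str, int]:
    """Trivial extractor: decode the blob as UTF-8 text."""
    return data.decode("utf-8", errors="replace"), PLAIN


# Hypothetical registry keyed by MIME type. OCR or transcription
# wrappers would be added here; none are implemented in this sketch.
EXTRACTORS: Dict[str, Extractor] = {
    "text/plain": extract_plain,
}


def process_pending(db: sqlite3.Connection,
                    load_blob: Callable[[int], bytes],
                    limit: int = 10) -> int:
    """Fill the text columns for documents whose text is missing or stale.

    load_blob fetches the document bytes by id (filesystem, WebDAV, ...).
    Returns the number of rows updated.
    """
    rows = db.execute(
        "SELECT id, mime_type, object_version FROM document "
        "WHERE text_content_object_version IS NULL "
        "   OR text_content_object_version <> object_version "
        "LIMIT ?",
        (limit,),
    ).fetchall()

    updated = 0
    for doc_id, mime_type, version in rows:
        extractor = EXTRACTORS.get(mime_type)
        if extractor is None:
            # No extractor for this type; the row stays pending and
            # will be looked at again on the next run.
            continue
        text, content_type = extractor(load_blob(doc_id))
        # Guard on object_version so a document changed in the meantime
        # is not marked as current.
        cur = db.execute(
            "UPDATE document SET text_content = ?, text_content_type = ?, "
            "text_content_object_version = ? "
            "WHERE id = ? AND object_version = ?",
            (text, content_type, version, doc_id, version),
        )
        updated += cur.rowcount
    db.commit()
    return updated
```

A cron entry would simply call `process_pending` on a schedule. A queue-based variant would receive document ids instead of polling, but could reuse the same extractor registry.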

Labels: enhancement