feat(ocr): add OCR #5834

perfectra1n · 2025-06-21T17:55:28Z

This PR integrates OCR capabilities by orchestrating interactions between a new client-side UI, a set of server-side API endpoints, a core OCR service, the Tesseract.js library, and the existing database schema.

Key Features:

OCR Service: A new OCRService is introduced, using the Tesseract.js library to perform OCR on images.
API Endpoints: Several new API endpoints are added to manage OCR tasks:
- POST /api/ocr/process-note/{noteId}: Triggers OCR processing for a specific image note.
- POST /api/ocr/process-attachment/{attachmentId}: Triggers OCR for a specific image attachment.
- GET /api/ocr/search: Searches for text within the extracted OCR data.
- POST /api/ocr/batch-process: Initiates a batch job to process all images that haven't been OCR'd yet.
- GET /api/ocr/batch-progress: Retrieves the progress of the ongoing batch OCR job.
- GET /api/ocr/stats: Provides statistics on OCR'd files.
- DELETE /api/ocr/delete/{blobId}: Deletes the OCR data for a specific image.
Client-Side UI: The image options have been updated to include:
- Enabling/disabling OCR.
- Setting the OCR language.
- Configuring a minimum confidence threshold for OCR results.
- A "Batch OCR" button to trigger the processing of all images.
- A progress bar to monitor the batch OCR process.
Database Integration: The extracted OCR text is stored in the blobs table, in a new ocr_text column. This allows for efficient searching of image content.

Implementation Details:

The OCRService is responsible for all OCR-related logic, including initialization of Tesseract.js, text extraction, and database interaction.
The service supports a variety of image formats, including JPEG, PNG, GIF, BMP, TIFF, and WEBP.
The client-side implementation in apps/client/src/widgets/type_widgets/options/images/images.ts provides an interface for managing OCR settings and initiating batch processing.
The API routes in apps/server/src/routes/api/ocr.ts expose the OCR functionality to the client.

Data Storage and Schema

The extracted text from an image is stored directly in the database.
Implementation:
- A new column, ocr_text (of type TEXT), has been added to the existing blobs table. The blobs table stores the actual file content (the image itself), so this new column adds the extracted text alongside the binary data it was derived from.
- Writing: The OCRService.storeOCRResult() method is responsible for persistence. It executes the SQL command: UPDATE blobs SET ocr_text = ? WHERE blobId = ?.
- Reading/Checking: To avoid reprocessing, the OCRService.getStoredOCRResult() method checks if text already exists using: SELECT ocr_text FROM blobs WHERE blobId = ?.
- Searching: The core search functionality leverages this new column. The OCRService.searchOCRResults() method performs a LIKE query to find matches: SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ?.
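
As a rough illustration of the search step, the sketch below builds the parameter bound to the LIKE query. The helper name `buildLikePattern` and the `ESCAPE` clause are assumptions added for illustration and are not part of the PR; only the base `SELECT` statement is quoted from the description above.

```typescript
// Search statement from the description, with an ESCAPE clause added.
// The ESCAPE clause is an assumption of this sketch, not something the PR includes.
const SEARCH_SQL =
    "SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ? ESCAPE '\\'";

/**
 * Hypothetical helper: wraps the search text in wildcards so that LIKE
 * matches it anywhere inside ocr_text, and escapes the LIKE metacharacters
 * (% and _) plus the escape character itself so they match literally.
 */
function buildLikePattern(searchText: string): string {
    const escaped = searchText.replace(/[\\%_]/g, (ch) => `\\${ch}`);
    return `%${escaped}%`;
}
```

Without the escaping step, a search for `100%` would treat the trailing `%` as a wildcard rather than a literal character.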

Core Logic: `OCRService` (`apps/server/src/services/ocr/ocr_service.ts`)

This class contains the primary business logic and orchestrates the entire OCR process.

How is it implemented?
- Initialization (initialize): The service doesn't initialize Tesseract on application startup. Instead, it's initialized on-demand the first time an OCR operation is requested. It correctly configures the paths for the Tesseract worker (worker-script/node/index.js) and the WebAssembly core (tesseract-core.wasm.js).
- Text Extraction (extractTextFromImage): This is the heart of the process. It takes a Buffer of image data, passes it to the Tesseract worker's recognize() method, and awaits the result. It then formats the output into a structured OCRResult object, converting Tesseract's confidence score from a 0-100 scale to a 0-1 decimal.
- Processing Logic (processNoteOCR, processAttachmentOCR): These methods act as controllers. They fetch the relevant note or attachment from the database using the becca service, verify its MIME type is a supported image format, and check if OCR text already exists in the blobs table. If all checks pass, they retrieve the image content via .getContent() and pass the resulting buffer to extractTextFromImage. Finally, they persist the result using storeOCRResult.
- Batch Processing (startBatchProcessing, processBatchInBackground):
  - When a batch process is started, the service first queries the database to get a count of all image notes and attachments that do not have existing OCR data.
  - It stores the progress in an in-memory object: this.batchProcessingState. This object tracks the total number of images, the number processed, and the start time. Using in-memory state is efficient for tracking the live progress of a single, ongoing task.
  - The actual processing (processBatchInBackground) runs asynchronously without blocking the main thread. It iterates through the unprocessed images, calls the appropriate processing method (processNoteOCR or processAttachmentOCR) for each, and increments the processed count in the batchProcessingState.
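
The batch flow and the confidence conversion described above could look roughly like the following. This is a minimal sketch, not the PR's code: the class name `BatchOcrSketch`, the `ImageJob` type, and the decision to count failed images as processed are all assumptions made for illustration.

```typescript
interface BatchProgress {
    inProgress: boolean;
    total: number;
    processed: number;
    startTime?: number;
}

// Hypothetical stand-in for a call to processNoteOCR or processAttachmentOCR.
type ImageJob = () => Promise<void>;

class BatchOcrSketch {
    // In-memory state, reset whenever a new batch starts.
    private state: BatchProgress = { inProgress: false, total: 0, processed: 0 };

    /** Converts a 0-100 confidence score to the 0-1 range, clamping out-of-range input. */
    static normalizeConfidence(raw: number): number {
        return Math.min(Math.max(raw, 0), 100) / 100;
    }

    /** Records the totals, then starts processing without awaiting completion. */
    startBatchProcessing(jobs: ImageJob[]): BatchProgress {
        if (this.state.inProgress) {
            throw new Error("Batch OCR is already running");
        }
        this.state = { inProgress: true, total: jobs.length, processed: 0, startTime: Date.now() };
        void this.processBatchInBackground(jobs);
        return this.getBatchProgress();
    }

    /** Returns a copy so callers cannot mutate the live state. */
    getBatchProgress(): BatchProgress {
        return { ...this.state };
    }

    private async processBatchInBackground(jobs: ImageJob[]): Promise<void> {
        for (const job of jobs) {
            try {
                await job();
            } catch {
                // Assumption: one failed image should not abort the whole batch.
            }
            this.state.processed++;
        }
        this.state.inProgress = false;
    }
}
```

Because the state lives only in memory, a server restart mid-batch loses the progress counters; images already written to the database would still be skipped on the next run by the existing-text check.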

Server API (`apps/server/src/routes/api/ocr.ts`)

This file acts as a thin routing layer, exposing the OCRService's functionality via HTTP endpoints.

How is it implemented?
- Each function (e.g., processNoteOCR, batchProcessOCR, getBatchProgress) corresponds to an API endpoint.
- It performs initial request validation (e.g., checking for required parameters like noteId).
- It calls the corresponding method in the ocrService (e.g., ocrService.startBatchProcessing()).
- It formats the response from the service into a JSON object and sends it back to the client with the appropriate HTTP status code.
- The getBatchProgress endpoint is particularly simple: it just calls ocrService.getBatchProgress() and returns the in-memory state object, allowing the client to poll for updates efficiently.
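
To make the "thin routing layer" idea concrete, here is a framework-free sketch of one handler. The `HttpResult` shape, the `OcrServiceLike` interface, the `validateNoteId` helper, and the response bodies are hypothetical; the PR's actual handlers and status codes may differ.

```typescript
interface OcrResultSketch {
    text: string;
    confidence: number;
}

// Hypothetical slice of the service that this handler depends on.
interface OcrServiceLike {
    processNoteOCR(noteId: string): Promise<OcrResultSketch | null>;
}

interface HttpResult {
    status: number;
    body: Record<string, unknown>;
}

/** Request validation: returns an error response, or null when the request is fine. */
function validateNoteId(params: { noteId?: string }): HttpResult | null {
    if (!params.noteId || params.noteId.trim() === "") {
        return { status: 400, body: { success: false, message: "noteId is required" } };
    }
    return null;
}

/** Validates, delegates to the service, and maps the outcome to a status and JSON body. */
async function handleProcessNote(
    params: { noteId?: string },
    service: OcrServiceLike
): Promise<HttpResult> {
    const invalid = validateNoteId(params);
    if (invalid) {
        return invalid;
    }
    try {
        const result = await service.processNoteOCR(params.noteId as string);
        return { status: 200, body: { success: true, result } };
    } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return { status: 500, body: { success: false, message } };
    }
}
```

Keeping the handler free of OCR logic means the service can be exercised in unit tests without an HTTP server.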

Client-Side UI (`apps/client/src/widgets/type_widgets/options/images/images.ts`)

This widget provides the user interface for interacting with the OCR features.

How is it implemented?
- It uses jQuery to manipulate the DOM, adding event listeners to checkboxes, dropdowns, and buttons.
- Starting a Batch Job (startBatchOcr): When the user clicks the "Start Batch OCR" button, this function is called. It first makes a POST request to the /api/ocr/batch-process endpoint to initiate the process on the server.
- Polling for Progress (pollBatchOcrProgress): Upon a successful response from the server, it begins polling. It calls itself recursively using setTimeout every second. In each call, it makes a GET request to /api/ocr/batch-progress.
- It uses the data from the polling response to update the UI in real-time, adjusting the width of the progress bar and updating the status text (e.g., "Processed 5 of 100 images").
- Once the polling response indicates inProgress: false, it stops the polling loop and displays a completion message.
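
The polling loop described above can be sketched as follows. The request, rendering, and timer functions are injected as parameters so the sketch stays self-contained; `describeProgress`, the parameter names, and the completion and error messages are assumptions, not the widget's actual code.

```typescript
interface BatchProgress {
    inProgress: boolean;
    total: number;
    processed: number;
}

interface ProgressView {
    percent: number;
    statusText: string;
}

/** Turns a polling response into the progress-bar width and status text. */
function describeProgress(progress: BatchProgress): ProgressView {
    const percent = progress.total > 0
        ? Math.round((progress.processed / progress.total) * 100)
        : 0;
    const statusText = progress.inProgress
        ? `Processed ${progress.processed} of ${progress.total} images`
        : "Batch OCR completed";
    return { percent, statusText };
}

/**
 * Polls until the server reports inProgress: false.
 * fetchProgress stands in for the GET /api/ocr/batch-progress call, and
 * schedule stands in for setTimeout; both are injected for illustration.
 */
function pollBatchOcrProgress(
    fetchProgress: () => Promise<BatchProgress>,
    render: (view: ProgressView) => void,
    schedule: (fn: () => void, ms: number) => void,
    intervalMs: number = 1000
): void {
    fetchProgress()
        .then((progress) => {
            render(describeProgress(progress));
            if (progress.inProgress) {
                // Re-arm the timer only while the batch is still running.
                schedule(
                    () => pollBatchOcrProgress(fetchProgress, render, schedule, intervalMs),
                    intervalMs
                );
            }
        })
        .catch(() => {
            // Assumption: a failed request ends the loop instead of retrying forever.
            render({ percent: 0, statusText: "Failed to fetch batch OCR progress" });
        });
}
```

In the widget, `schedule` would simply be `setTimeout`. Scheduling the next request only after the previous one resolves avoids overlapping requests when the server is slow.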

Data Flow (Mermaid Diagram)


perfectra1n · 2025-07-14T15:53:56Z

I'll label you a merge conflict next 😂


apps/client/src/widgets/type_widgets/options/images/images.ts

+            const languageTags = selectedLanguages.map(lang => 
+                `<span class="language-code">${lang}</span>`
+            ).join('');
+            this.$ocrLanguageDisplay.html(languageTags);


To fix the issue, we need to ensure that any user-controlled input is properly escaped before being inserted into the DOM as HTML. The best approach is to use a method that treats the input as plain text rather than HTML, such as text() in jQuery, or to explicitly escape the input using a utility function.

Steps to fix:

Replace the use of html() with text() for inserting plain text into the DOM. This ensures that any special characters in the input are treated as literal text rather than HTML.

Alternatively, use a utility function to escape the input before constructing the HTML string. This is necessary if the HTML structure (e.g., <span>) must be preserved.

For this specific case, using text() is the simplest and safest solution since the lang values can be displayed as plain text without requiring HTML tags.
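
If the `<span class="language-code">` markup does need to be preserved, the second option (escaping before building the HTML string) might look like the sketch below. `escapeHtml` and `renderLanguageTags` are hypothetical helpers written for illustration; the project may already ship an equivalent utility.

```typescript
const HTML_ESCAPES: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;"
};

/** Replaces the characters that are significant in HTML with their entities. */
function escapeHtml(value: string): string {
    return value.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

/** Builds the language tags, escaping each value before it is interpolated. */
function renderLanguageTags(selectedLanguages: string[]): string {
    return selectedLanguages
        .map((lang) => `<span class="language-code">${escapeHtml(lang)}</span>`)
        .join("");
}
```

The resulting string can then be passed to `html()`, because every interpolated value has been neutralized first.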

eliandoran

In addition, we would need tests for all the processors, since they rely on third-party libs that are going to get updated by Renovate, and we would otherwise have no coverage to ensure they don't break at some point.

.github/workflows/playwright.yml

eliandoran · 2025-07-19T12:15:29Z

apps/client/src/widgets/type_widgets/options/images/images.ts

+            <label>${t("images.ocr_language")}</label>
+            <p class="form-text">${t("images.ocr_multi_language_description")}</p>
+            <div class="ocr-language-checkboxes">
+                <label class="tn-checkbox">


Either use the content language settings which are already available or refactor it to be generated dynamically from the list of locales in Commons.

eliandoran · 2025-07-19T12:17:57Z

apps/server/src/becca/entities/bblob.ts

    }

    content!: string | Buffer;
    contentLength!: number;
+    ocr_text?: string | null;


Please rename to textRepresentation as per our discussion. Make sure to use camel case.

eliandoran · 2025-07-19T12:19:47Z

apps/server/src/migrations/migrations.ts

+        version: 234,
+        sql: /*sql*/`\
+            -- Add OCR text column to blobs table
+            ALTER TABLE blobs ADD COLUMN ocr_text TEXT DEFAULT NULL;


Same here, column names should not contain the term OCR, as we will reuse the column for other ways of extracting text (for example, for PDFs that already contain the text information instead of simply containing pictures).

eliandoran · 2025-07-19T20:57:43Z

apps/server/src/services/ocr/processors/office_processor.ts

+     */
+    private getDefaultOCRLanguage(): string {
+        try {
+            const options = require('../../options.js').default;


Please avoid the use of require. Use sync imports if needed.

eliandoran · 2025-07-19T20:58:31Z

apps/server/src/services/ocr/processors/image_processor.ts

+    private isInitialized = false;
+
+    canProcess(mimeType: string): boolean {
+        const supportedTypes = [


Extract outside function.
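
One way to apply this suggestion is to hoist the list into a module-level constant. This is a sketch only: the MIME strings are inferred from the formats named in the PR description (JPEG, PNG, GIF, BMP, TIFF, WEBP), and the constant and class names are hypothetical.

```typescript
// Built once at module load instead of on every canProcess() call.
// The exact MIME strings are an assumption based on the formats listed in the PR description.
const SUPPORTED_IMAGE_MIME_TYPES: ReadonlySet<string> = new Set([
    "image/jpeg",
    "image/png",
    "image/gif",
    "image/bmp",
    "image/tiff",
    "image/webp"
]);

class ImageProcessorSketch {
    /** Case-insensitive membership check against the shared constant. */
    canProcess(mimeType: string): boolean {
        return SUPPORTED_IMAGE_MIME_TYPES.has(mimeType.toLowerCase());
    }
}
```

A shared constant also gives other modules a single place to import the supported types from, rather than duplicating the list.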

eliandoran · 2025-07-19T20:59:28Z

apps/server/src/services/ocr/processors/office_processor.ts

+    }
+
+    canProcess(mimeType: string): boolean {
+        const supportedTypes = [


Extract outside function.


perfectra1n added 10 commits June 10, 2025 19:12

feat(ocr): add unit tests, resolve double sent headers, and fix the wonderful tesseract.js path issues

c4a0219

fix(package): referenced wrong tesseract.js lol

33a5492

feat(ocr): drop confidence down a little bit

864543e

fix(unit): resolve typecheck errors

a4adc51

feat(unit): ocr unit tests almost pass

f135622

feat(unit): ocr tests almost pass...

d20b3d8

feat(unit): ocr tests almost pass...

80a9182

fix(unit): also fix broken llm test

7868ebe

fix(ocr): obviously don't need this migration file anymore

09196c0

Update playwright.yml

4b5e8d3

This was referenced Jun 21, 2025

feat(ocr): add ocr TriliumNext/Notes#2254

Closed

Feature request: OCR images #1622

Open

eliandoran added the merge-conflicts label Jul 12, 2025

perfectra1n added 2 commits July 14, 2025 16:15

feat(ocr): swap from custom table to using the blobs table, with a new column

9029f59

merge main into feature branch

893be24

eliandoran removed the merge-conflicts label Jul 14, 2025

perfectra1n added 7 commits July 14, 2025 16:41

fix(dev): resolve issues with pnpm-lock.yaml

2a8c887

Merge branch 'main' into feat/add-ocr-capabilities

0298083

Merge branch 'main' into feat/add-ocr-capabilities

a7878dd

feat(ocr): add officeparser, pdf-parse, and sharp dependencies for ocr

e040865

feat(ocr): update this new migration to also add a `ocr_last_processed` column

508cbea

feat(ocr): implement new language selection form

6722d2d

feat(ocr): add additional processors for OCR feature

ca8cbf8

github-advanced-security bot found potential problems Jul 16, 2025


eliandoran added the merge-conflicts label Jul 18, 2025

eliandoran requested changes Jul 19, 2025


eliandoran added 3 commits July 26, 2025 10:33

Merge remote-tracking branch 'origin/main' into feat/add-ocr-capabilities

99fa5d8

chore(ci): remove unnecessary change

2adfc1d

feat(ocr): basic processing of new files

11e9b09

eliandoran added 7 commits July 26, 2025 11:51

refactor(ocr): deduplicate mime types partially

090b175

refactor(ocr): unnecessary initialization logic

c55aa6e

feat(ocr): add an option to display OCR text

422d318

feat(ocr): add a button to trigger an OCR manually

69b0973

fix(ocr): search error due to scoring

f295592

feat(ocr): display OCR text in search results

6212ea0

feat(ocr): display OCR text only in search results

925c9c1

eliandoran removed the merge-conflicts label Jul 26, 2025

eliandoran added 6 commits July 26, 2025 13:48

chore(deps): move workspace dependencies to server

08ca86c

feat(ocr): automatically process images

72cea24

feat(ocr): run the image operation in the background

2cb4e5e

feat(ocr): auto-process images only if enabled in settings

65b58c3

chore(ocr): improve ocr search result style

55ac1e0

feat(ocr): filter out text based on confidence

5ec6141

@@ -352,8 +352,8 @@
                     if (selectedLanguages.length === 0) {
-                        this.$ocrLanguageDisplay.html(`<span class="placeholder-text">${t("images.ocr_no_languages_selected")}</span>`);
+                        this.$ocrLanguageDisplay.text(t("images.ocr_no_languages_selected"));
                     } else {
                         const languageTags = selectedLanguages.map(lang =>
-                            `<span class="language-code">${lang}</span>`
-                        ).join('');
-                        this.$ocrLanguageDisplay.html(languageTags);
+                            lang
+                        ).join(', ');
+                        this.$ocrLanguageDisplay.text(languageTags);
                     }
