feat(ocr): add OCR #5834

Draft: wants to merge 35 commits into base: main


@perfectra1n (Member) commented Jun 21, 2025

This PR integrates OCR capabilities by orchestrating interactions between a new client-side UI, a set of server-side API endpoints, a core OCR service, the Tesseract.js library, and the existing database schema.

Key Features:

  • OCR Service: A new OCRService is introduced; it uses the Tesseract.js library to perform OCR on images.
  • API Endpoints: Several new API endpoints are added to manage OCR tasks:
    • POST /api/ocr/process-note/{noteId}: Triggers OCR processing for a specific image note.
    • POST /api/ocr/process-attachment/{attachmentId}: Triggers OCR for a specific image attachment.
    • GET /api/ocr/search: Searches for text within the extracted OCR data.
    • POST /api/ocr/batch-process: Initiates a batch job to process all images that haven't been OCR'd yet.
    • GET /api/ocr/batch-progress: Retrieves the progress of the ongoing batch OCR job.
    • GET /api/ocr/stats: Provides statistics on OCR'd files.
    • DELETE /api/ocr/delete/{blobId}: Deletes the OCR data for a specific image.
  • Client-Side UI: The image options have been updated to include:
    • Enabling/disabling OCR.
    • Setting the OCR language.
    • Configuring a minimum confidence threshold for OCR results.
    • A "Batch OCR" button to trigger the processing of all images.
    • A progress bar to monitor the batch OCR process.
  • Database Integration: The extracted OCR text is stored in the blobs table, in a new ocr_text column. This allows
    for efficient searching of image content.

Implementation Details:

  • The OCRService is responsible for all OCR-related logic, including initialization of Tesseract.js, text
    extraction, and database interaction.
  • The service supports a variety of image formats, including JPEG, PNG, GIF, BMP, TIFF, and WEBP.
  • The client-side implementation in apps/client/src/widgets/type_widgets/options/images/images.ts provides an interface for managing OCR settings and initiating batch processing.
  • The API routes in apps/server/src/routes/api/ocr.ts expose the OCR functionality to the client.

Data Storage and Schema

  • The extracted text from an image is stored directly in the database.
  • Implementation:
    • A new column, ocr_text (of type TEXT), has been added to the existing blobs table. The blobs table stores the actual file content (the image itself), so this new column adds the extracted text alongside the binary data it was derived from.
    • Writing: The OCRService.storeOCRResult() method is responsible for persistence. It executes the SQL command: UPDATE blobs SET ocr_text = ? WHERE blobId = ?.
    • Reading/Checking: To avoid reprocessing, the OCRService.getStoredOCRResult() method checks if text already exists using: SELECT ocr_text FROM blobs WHERE blobId = ?.
    • Searching: The core search functionality leverages this new column. The OCRService.searchOCRResults() method performs a LIKE query to find matches: SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ?.
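
The three statements above could sit behind a small persistence helper. This is a minimal sketch, not the PR's actual code: the `SqlExecutor` interface and the `OcrTextStore` class are hypothetical names, and wrapping the search term in `%` wildcards is an assumption (the description only says a `LIKE ?` query is used). Only the SQL strings are taken from the description.

```typescript
// Hypothetical minimal database interface; the real project has its own SQL service.
interface SqlExecutor {
    execute(sql: string, params: unknown[]): void;
    getRows<T>(sql: string, params: unknown[]): T[];
}

interface OcrSearchRow {
    blobId: string;
    ocr_text: string | null;
}

class OcrTextStore {
    private readonly sql: SqlExecutor;

    constructor(sql: SqlExecutor) {
        this.sql = sql;
    }

    /** Persists extracted text next to the blob it was derived from. */
    store(blobId: string, text: string): void {
        this.sql.execute("UPDATE blobs SET ocr_text = ? WHERE blobId = ?", [text, blobId]);
    }

    /** Returns previously stored text, or null if the blob is unknown or not yet processed. */
    get(blobId: string): string | null {
        const rows = this.sql.getRows<{ ocr_text: string | null }>(
            "SELECT ocr_text FROM blobs WHERE blobId = ?",
            [blobId]
        );
        return rows.length > 0 ? rows[0].ocr_text : null;
    }

    /** Substring search; the surrounding % wildcards are an assumption. */
    search(term: string): OcrSearchRow[] {
        return this.sql.getRows<OcrSearchRow>(
            "SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ?",
            [`%${term}%`]
        );
    }
}
```

Note that a `LIKE` pattern with a leading wildcard generally cannot use a regular index, so on large databases this query scans every row of `blobs`.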

Core Logic: OCRService (apps/server/src/services/ocr/ocr_service.ts)

This class contains the primary business logic and orchestrates the entire OCR process.

  • How is it implemented?
    • Initialization (initialize): The service doesn't initialize Tesseract on application startup. Instead, it's initialized on-demand the first time an OCR operation is requested. It correctly configures the paths for the Tesseract worker (worker-script/node/index.js) and the WebAssembly core (tesseract-core.wasm.js).
    • Text Extraction (extractTextFromImage): This is the heart of the process. It takes a Buffer of image data, passes it to the Tesseract.worker.recognize() function, and awaits the result. It then formats the output into a structured OCRResult object, converting Tesseract's confidence score from a 0-100 scale to a 0-1 decimal.
    • Processing Logic (processNoteOCR, processAttachmentOCR): These methods act as controllers. They fetch the relevant note or attachment from the database using the becca service, verify its MIME type is a supported image format, and check if OCR text already exists in the blobs table. If all checks pass, they retrieve the image content via .getContent() and pass the resulting buffer to extractTextFromImage. Finally, they persist the result using storeOCRResult.
    • Batch Processing (startBatchProcessing, processBatchInBackground):
      • When a batch process is started, the service first queries the database to get a count of all image notes and attachments that do not have existing OCR data.
      • It stores the progress in an in-memory object: this.batchProcessingState. This object tracks the total number of images, the number processed, and the start time. Using in-memory state is efficient for tracking the live progress of a single, ongoing task.
      • The actual processing (processBatchInBackground) runs asynchronously without blocking the main thread. It iterates through the unprocessed images, calls the appropriate processing method (processNoteOCR or processAttachmentOCR) for each, and increments the processed count in the batchProcessingState.
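
The batch flow described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: `BatchTracker` and `OcrTask` are hypothetical names, and each task stands in for a call to `processNoteOCR` or `processAttachmentOCR`. The state fields mirror the ones named in the description (total, processed, start time) plus the `inProgress` flag that the client polls for.

```typescript
interface BatchProcessingState {
    inProgress: boolean;
    total: number;
    processed: number;
    startTime?: Date;
}

// Hypothetical work item: performs OCR for one note or attachment.
type OcrTask = () => Promise<void>;

class BatchTracker {
    private state: BatchProcessingState = { inProgress: false, total: 0, processed: 0 };

    /** Returns false if a batch is already running; otherwise starts one in the background. */
    start(tasks: OcrTask[]): boolean {
        if (this.state.inProgress) {
            return false;
        }
        this.state = { inProgress: true, total: tasks.length, processed: 0, startTime: new Date() };
        // Deliberately not awaited, so the caller (e.g. an HTTP handler) returns immediately.
        void this.runInBackground(tasks);
        return true;
    }

    /** Returns a copy so callers cannot mutate the live state. */
    getProgress(): BatchProcessingState {
        return { ...this.state };
    }

    private async runInBackground(tasks: OcrTask[]): Promise<void> {
        for (const task of tasks) {
            try {
                await task();
            } catch (err) {
                // One failed image should not abort the whole batch.
                console.error("OCR task failed:", err);
            }
            this.state.processed++;
        }
        this.state.inProgress = false;
    }
}
```

Because the state lives only in memory, progress is lost if the server restarts mid-batch; images that were already processed keep their stored text and would be skipped on the next run.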

Server API (apps/server/src/routes/api/ocr.ts)

This file acts as a thin routing layer, exposing the OCRService's functionality via HTTP endpoints.

  • How is it implemented?
    • Each function (e.g., processNoteOCR, batchProcessOCR, getBatchProgress) corresponds to an API endpoint.
    • It performs initial request validation (e.g., checking for required parameters like noteId).
    • It calls the corresponding method in the ocrService (e.g., ocrService.startBatchProcessing()).
    • It formats the response from the service into a JSON object and sends it back to the client with the appropriate HTTP status code.
    • The getBatchProgress endpoint is particularly simple: it just calls ocrService.getBatchProgress() and returns the in-memory state object, allowing the client to poll for updates efficiently.
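
A thin handler of this kind might look like the sketch below. The request shape, the `OcrServiceLike` interface, the `[status, body]` return convention, and the error messages are all assumptions made for illustration; the actual signatures in `ocr.ts` may differ.

```typescript
// Hypothetical shapes; the real route signature in the PR may differ.
interface OcrRequest {
    params: { noteId?: string };
}

interface OcrServiceLike {
    processNoteOCR(noteId: string): Promise<unknown>;
}

type HandlerResult = [number, Record<string, unknown>];

/** Returns an error result if the request is invalid, or null if it may proceed. */
function validateNoteRequest(req: OcrRequest): HandlerResult | null {
    if (!req.params.noteId) {
        return [400, { success: false, message: "noteId is required" }];
    }
    return null;
}

async function processNoteOcrHandler(req: OcrRequest, ocrService: OcrServiceLike): Promise<HandlerResult> {
    const invalid = validateNoteRequest(req);
    if (invalid) {
        return invalid;
    }
    try {
        // Validation above guarantees noteId is a non-empty string.
        const result = await ocrService.processNoteOCR(req.params.noteId as string);
        return [200, { success: true, result }];
    } catch (err) {
        const message = err instanceof Error ? err.message : String(err);
        return [500, { success: false, message }];
    }
}
```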

Client-Side UI (apps/client/src/widgets/type_widgets/options/images/images.ts)

This widget provides the user interface for interacting with the OCR features.

  • How is it implemented?
    • It uses jQuery to manipulate the DOM, adding event listeners to checkboxes, dropdowns, and buttons.
    • Starting a Batch Job (startBatchOcr): When the user clicks the "Start Batch OCR" button, this function is called. It first makes a POST request to the /api/ocr/batch-process endpoint to initiate the process on the server.
    • Polling for Progress (pollBatchOcrProgress): Upon a successful response from the server, it begins polling. It calls itself recursively using setTimeout every second. In each call, it makes a GET request to /api/ocr/batch-progress.
    • It uses the data from the polling response to update the UI in real-time, adjusting the width of the progress bar and updating the status text (e.g., "Processed 5 of 100 images").
    • Once the polling response indicates inProgress: false, it stops the polling loop and displays a completion message.
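
The polling loop described above can be sketched without jQuery by injecting the fetch and render steps. Everything here is illustrative: `describeProgress`, `pollBatchProgress`, and the `BatchProgress` field names are assumptions based on the description, not the widget's actual code.

```typescript
interface BatchProgress {
    inProgress: boolean;
    total: number;
    processed: number;
}

interface ProgressView {
    percent: number;
    statusText: string;
}

/** Pure helper: turns a progress response into what the UI needs to display. */
function describeProgress(p: BatchProgress): ProgressView {
    const percent = p.total > 0 ? Math.round((p.processed / p.total) * 100) : 0;
    return { percent, statusText: `Processed ${p.processed} of ${p.total} images` };
}

/**
 * Polls until the server reports inProgress: false.
 * fetchProgress would wrap the GET /api/ocr/batch-progress call;
 * onUpdate would set the progress bar width and status text.
 */
function pollBatchProgress(
    fetchProgress: () => Promise<BatchProgress>,
    onUpdate: (view: ProgressView, done: boolean) => void,
    intervalMs = 1000
): void {
    fetchProgress()
        .then((progress) => {
            const done = !progress.inProgress;
            onUpdate(describeProgress(progress), done);
            if (!done) {
                // Schedule the next poll only after this one has completed,
                // so slow responses never pile up.
                setTimeout(() => pollBatchProgress(fetchProgress, onUpdate, intervalMs), intervalMs);
            }
        })
        .catch((err) => {
            // Stop polling on errors rather than retrying forever.
            console.error("Polling failed, stopping:", err);
        });
}
```

Scheduling with `setTimeout` after each response, rather than using `setInterval`, keeps at most one request in flight at a time.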

Data Flow (Mermaid Diagram)

@perfectra1n (Member, Author) commented:

I'll label you a merge conflict next 😂

const languageTags = selectedLanguages.map(lang =>
`<span class="language-code">${lang}</span>`
).join('');
this.$ocrLanguageDisplay.html(languageTags);

Check warning: Code scanning / CodeQL

DOM text reinterpreted as HTML (Medium)

DOM text is reinterpreted as HTML without escaping meta-characters.

Copilot Autofix

AI 10 days ago

To fix the issue, we need to ensure that any user-controlled input is properly escaped before being inserted into the DOM as HTML. The best approach is to use a method that treats the input as plain text rather than HTML, such as text() in jQuery, or to explicitly escape the input using a utility function.

Steps to fix:

  1. Replace the use of html() with text() for inserting plain text into the DOM. This ensures that any special characters in the input are treated as literal text rather than HTML.
  2. Alternatively, use a utility function to escape the input before constructing the HTML string. This is necessary if the HTML structure (e.g., <span>) must be preserved.

For this specific case, using text() is the simplest and safest solution since the lang values can be displayed as plain text without requiring HTML tags.
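
If the `<span class="language-code">` wrappers need to stay for styling, the second option (escaping before building the HTML string) could look roughly like this. The helper names `escapeHtml` and `renderLanguageTags` are hypothetical; the project may already ship an escaping utility that should be preferred.

```typescript
const HTML_ESCAPES: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;"
};

/** Replaces the five HTML meta-characters with their entity equivalents. */
function escapeHtml(value: string): string {
    return value.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}

/** Builds the tag markup while treating every language code as plain text. */
function renderLanguageTags(languages: string[]): string {
    return languages
        .map((lang) => `<span class="language-code">${escapeHtml(lang)}</span>`)
        .join("");
}
```

The result of `renderLanguageTags(selectedLanguages)` could then be passed to `html()` without the selected values being interpreted as markup.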

Suggested changeset 1
apps/client/src/widgets/type_widgets/options/images/images.ts

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/apps/client/src/widgets/type_widgets/options/images/images.ts b/apps/client/src/widgets/type_widgets/options/images/images.ts
--- a/apps/client/src/widgets/type_widgets/options/images/images.ts
+++ b/apps/client/src/widgets/type_widgets/options/images/images.ts
@@ -352,8 +352,8 @@
         if (selectedLanguages.length === 0) {
-            this.$ocrLanguageDisplay.html(`<span class="placeholder-text">${t("images.ocr_no_languages_selected")}</span>`);
+            this.$ocrLanguageDisplay.text(t("images.ocr_no_languages_selected"));
         } else {
             const languageTags = selectedLanguages.map(lang => 
-                `<span class="language-code">${lang}</span>`
-            ).join('');
-            this.$ocrLanguageDisplay.html(languageTags);
+                lang
+            ).join(', ');
+            this.$ocrLanguageDisplay.text(languageTags);
         }
EOF
Copilot is powered by AI and may make mistakes. Always verify output.

@eliandoran (Contributor) left a comment:

In addition, we would need tests for all the processors, since they rely on third-party libs that will be updated by Renovate; without tests we would have no coverage to ensure they don't break at some point.

<label>${t("images.ocr_language")}</label>
<p class="form-text">${t("images.ocr_multi_language_description")}</p>
<div class="ocr-language-checkboxes">
<label class="tn-checkbox">

Review comment (Contributor):

Either use the content language settings which are already available or refactor it to be generated dynamically from the list of locales in Commons.

}

content!: string | Buffer;
contentLength!: number;
ocr_text?: string | null;

Review comment (Contributor):

Please rename to textRepresentation as per our discussion. Make sure to use camel case.

version: 234,
sql: /*sql*/`\
-- Add OCR text column to blobs table
ALTER TABLE blobs ADD COLUMN ocr_text TEXT DEFAULT NULL;

Review comment (Contributor):

Same here: column names should not contain the term OCR, since we will reuse the column for other ways of extracting text (for example, PDFs that already contain text information rather than just pictures).

*/
private getDefaultOCRLanguage(): string {
try {
const options = require('../../options.js').default;

Review comment (Contributor):

Please avoid the use of require. Use sync imports if needed.

private isInitialized = false;

canProcess(mimeType: string): boolean {
const supportedTypes = [

Review comment (Contributor):

Extract outside function.
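
One way to read this request is to hoist the list into a module-level constant so it is built once rather than on every call. The sketch below assumes the MIME types that correspond to the formats named in the PR description (JPEG, PNG, GIF, BMP, TIFF, WEBP); the actual array contents are cut off in the snippet above, and `ImageProcessor` is a placeholder class name.

```typescript
// Assumed MIME types, derived from the formats listed in the PR description.
const SUPPORTED_IMAGE_MIME_TYPES: ReadonlySet<string> = new Set([
    "image/jpeg",
    "image/png",
    "image/gif",
    "image/bmp",
    "image/tiff",
    "image/webp"
]);

// Placeholder class; stands in for whichever processor defines canProcess.
class ImageProcessor {
    canProcess(mimeType: string): boolean {
        // A Set gives constant-time lookup and is created once at module load.
        return SUPPORTED_IMAGE_MIME_TYPES.has(mimeType.toLowerCase());
    }
}
```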

}

canProcess(mimeType: string): boolean {
const supportedTypes = [

Review comment (Contributor):

Extract outside function.
