feat(ocr): add OCR #5834
…onderful tesseract.js path issues
I'll label you a merge conflict next 😂
const languageTags = selectedLanguages.map(lang =>
    `<span class="language-code">${lang}</span>`
).join('');
this.$ocrLanguageDisplay.html(languageTags);
Check warning
Code scanning / CodeQL
DOM text reinterpreted as HTML Medium
DOM text
Copilot Autofix
AI 10 days ago
To fix the issue, we need to ensure that any user-controlled input is properly escaped before being inserted into the DOM as HTML. The best approach is to use a method that treats the input as plain text rather than HTML, such as `text()` in jQuery, or to explicitly escape the input using a utility function.
Steps to fix:
- Replace the use of `html()` with `text()` for inserting plain text into the DOM. This ensures that any special characters in the input are treated as literal text rather than HTML.
- Alternatively, use a utility function to escape the input before constructing the HTML string. This is necessary if the HTML structure (e.g., `<span>`) must be preserved.
For this specific case, using `text()` is the simplest and safest solution since the `lang` values can be displayed as plain text without requiring HTML tags.
@@ -352,8 +352,8 @@
 if (selectedLanguages.length === 0) {
-    this.$ocrLanguageDisplay.html(`<span class="placeholder-text">${t("images.ocr_no_languages_selected")}</span>`);
+    this.$ocrLanguageDisplay.text(t("images.ocr_no_languages_selected"));
 } else {
     const languageTags = selectedLanguages.map(lang =>
-        `<span class="language-code">${lang}</span>`
-    ).join('');
-    this.$ocrLanguageDisplay.html(languageTags);
+        lang
+    ).join(', ');
+    this.$ocrLanguageDisplay.text(languageTags);
 }
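The autofix above takes the `text()` route, which drops the `<span class="language-code">` wrappers. If the per-language markup needs to stay, the alternative it mentions (escaping before building the HTML string) might look like the sketch below. `escapeHtml` and `renderLanguageTags` are hypothetical helpers written for illustration; they are not functions from this PR or from jQuery.

```typescript
// Hypothetical helper: escape the five characters that are significant
// in HTML text and attribute contexts.
const HTML_ESCAPES: Record<string, string> = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;"
};

function escapeHtml(value: string): string {
    return value.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// Hypothetical helper: same markup as the original code, but every
// language code is escaped before it is interpolated into the string.
function renderLanguageTags(selectedLanguages: string[]): string {
    return selectedLanguages
        .map((lang) => `<span class="language-code">${escapeHtml(lang)}</span>`)
        .join("");
}
```

With helpers like these, the call site could keep `html()`: `this.$ocrLanguageDisplay.html(renderLanguageTags(selectedLanguages));`.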
In addition, we would need tests for all the processors, since they rely on third-party libraries that Renovate is going to keep updating; otherwise we would have no coverage to ensure they don't break at some point.
<label>${t("images.ocr_language")}</label>
<p class="form-text">${t("images.ocr_multi_language_description")}</p>
<div class="ocr-language-checkboxes">
    <label class="tn-checkbox">
Either use the content language settings which are already available or refactor it to be generated dynamically from the list of locales in Commons.
}

content!: string | Buffer;
contentLength!: number;
ocr_text?: string | null;
Please rename to `textRepresentation` as per our discussion. Make sure to use camel case.
version: 234,
sql: /*sql*/`\
    -- Add OCR text column to blobs table
    ALTER TABLE blobs ADD COLUMN ocr_text TEXT DEFAULT NULL;
Same here: column names should not contain the term OCR, as we will reuse the column for other ways of extracting text (for example, PDFs that already contain the text information rather than just pictures).
 */
private getDefaultOCRLanguage(): string {
    try {
        const options = require('../../options.js').default;
Please avoid the use of require. Use sync imports if needed.
private isInitialized = false;

canProcess(mimeType: string): boolean {
    const supportedTypes = [
Extract outside function.
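One way to apply this, as a sketch: hoist the list to a module-level constant so it is allocated once rather than on every `canProcess` call. The MIME types listed and the class name `ExampleProcessor` are placeholders for illustration; the actual list in this processor is not shown in the excerpt above.

```typescript
// Placeholder values: the real list lives in the processor under review.
const SUPPORTED_MIME_TYPES: ReadonlySet<string> = new Set([
    "image/jpeg",
    "image/png",
    "image/gif"
]);

// Hypothetical processor, reduced to the method being discussed.
class ExampleProcessor {
    canProcess(mimeType: string): boolean {
        // Set lookup replaces rebuilding and scanning an array on each call.
        return SUPPORTED_MIME_TYPES.has(mimeType);
    }
}
```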
}

canProcess(mimeType: string): boolean {
    const supportedTypes = [
Extract outside function.
This PR integrates OCR capabilities by orchestrating interactions between a new client-side UI, a set of server-side API endpoints, a core OCR service, the Tesseract.js library, and the existing database schema.
Key Features:
- `/api/ocr/process-note/{noteId}`: Triggers OCR processing for a specific image note.
- `/api/ocr/process-attachment/{attachmentId}`: Triggers OCR for a specific image attachment.
- `/api/ocr/search`: Searches for text within the extracted OCR data.
- `/api/ocr/batch-process`: Initiates a batch job to process all images that haven't been OCR'd yet.
- `/api/ocr/batch-progress`: Retrieves the progress of the ongoing batch OCR job.
- `/api/ocr/stats`: Provides statistics on OCR'd files.
- `/api/ocr/delete/{blobId}`: Deletes the OCR data for a specific image.
- … `ocr_text` column. This allows for efficient searching of image content.
Implementation Details:
- … extraction, and database interaction.
- … `JPEG, PNG, GIF, BMP, TIFF`, and `WEBP`.
- `apps/client/src/widgets/type_widgets/options/images/images.ts` provides an interface for managing OCR settings and initiating batch processing.
Data Storage and Schema
- … `ocr_text` (of type `TEXT`), has been added to the existing `blobs` table. The `blobs` table stores the actual file content (the image itself), so this new column adds the extracted text alongside the binary data it was derived from.
- The `OCRService.storeOCRResult()` method is responsible for persistence. It executes the SQL command: `UPDATE blobs SET ocr_text = ? WHERE blobId = ?`.
- The `OCRService.getStoredOCRResult()` method checks if text already exists using: `SELECT ocr_text FROM blobs WHERE blobId = ?`.
- The `OCRService.searchOCRResults()` method performs a `LIKE` query to find matches: `SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ?`.
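The `LIKE ?` query takes its pattern as a bound parameter, so the caller decides how the search term is wrapped. A minimal sketch of one way to build that parameter follows. `buildLikePattern` is a hypothetical helper, and the `ESCAPE '\'` clause it relies on is an assumption: the query quoted above does not include one.

```typescript
// Hypothetical helper: turn a user search term into a "contains" pattern for
//   SELECT blobId, ocr_text FROM blobs WHERE ocr_text LIKE ? ESCAPE '\'
// The ESCAPE clause is an assumption; the query in the PR summary has none.
function buildLikePattern(term: string): string {
    // Prefix backslash, % and _ with a backslash in a single pass, so that
    // wildcards typed by the user are matched literally.
    const escaped = term.replace(/[\\%_]/g, (ch) => `\\${ch}`);
    // Wrap in % on both sides for a substring match.
    return `%${escaped}%`;
}
```

Without the escaping step, a search for a literal `%` or `_` would be interpreted as a wildcard and match far more rows than intended.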
Core Logic: `OCRService` (`apps/server/src/services/ocr/ocr_service.ts`)
This class contains the primary business logic and orchestrates the entire OCR process.
- `initialize`: The service doesn't initialize Tesseract on application startup. Instead, it's initialized on-demand the first time an OCR operation is requested. It correctly configures the paths for the Tesseract worker (`worker-script/node/index.js`) and the WebAssembly core (`tesseract-core.wasm.js`).
- `extractTextFromImage`: This is the heart of the process. It takes a `Buffer` of image data, passes it to the `Tesseract.worker.recognize()` function, and awaits the result. It then formats the output into a structured `OCRResult` object, converting Tesseract's confidence score from a 0-100 scale to a 0-1 decimal.
- `processNoteOCR`, `processAttachmentOCR`: These methods act as controllers. They fetch the relevant note or attachment from the database using the `becca` service, verify its MIME type is a supported image format, and check if OCR text already exists in the `blobs` table. If all checks pass, they retrieve the image content via `.getContent()` and pass the resulting buffer to `extractTextFromImage`. Finally, they persist the result using `storeOCRResult`.
- `startBatchProcessing`, `processBatchInBackground`:
  - … `this.batchProcessingState`. This object tracks the total number of images, the number processed, and the start time. Using in-memory state is efficient for tracking the live progress of a single, ongoing task.
  - … (`processBatchInBackground`) runs asynchronously without blocking the main thread. It iterates through the unprocessed images, calls the appropriate processing method (`processNoteOCR` or `processAttachmentOCR`) for each, and increments the `processed` count in the `batchProcessingState`.
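A compact sketch of three behaviours described above: on-demand initialization, the 0-100 to 0-1 confidence conversion, and in-memory batch progress. Everything here is illustrative: `Recognizer` and `SketchOcrService` are hypothetical stand-ins, and the field names on `OCRResult` and `BatchState` are assumed from the summary rather than copied from the PR. No Tesseract.js API is called.

```typescript
interface OCRResult { text: string; confidence: number; }
interface BatchState { inProgress: boolean; total: number; processed: number; startTime?: Date; }

// Hypothetical stand-in for the OCR engine; reports confidence on a 0-100 scale.
interface Recognizer {
    recognize(image: Uint8Array): Promise<{ text: string; confidence: number }>;
}

class SketchOcrService {
    private createRecognizer: () => Promise<Recognizer>;
    private recognizerPromise: Promise<Recognizer> | null = null;
    private batchState: BatchState = { inProgress: false, total: 0, processed: 0 };

    constructor(createRecognizer: () => Promise<Recognizer>) {
        this.createRecognizer = createRecognizer;
    }

    // Lazy initialization: the engine is created on first use, and the
    // cached promise keeps concurrent callers from creating it twice.
    private initialize(): Promise<Recognizer> {
        if (!this.recognizerPromise) {
            this.recognizerPromise = this.createRecognizer();
        }
        return this.recognizerPromise;
    }

    async extractTextFromImage(image: Uint8Array): Promise<OCRResult> {
        const recognizer = await this.initialize();
        const raw = await recognizer.recognize(image);
        // Convert the engine's 0-100 confidence to a 0-1 decimal.
        return { text: raw.text, confidence: raw.confidence / 100 };
    }

    // Resets the in-memory state and starts the loop. The returned promise
    // can be ignored by a route handler or awaited by a test.
    startBatchProcessing(images: Uint8Array[]): Promise<void> {
        this.batchState = { inProgress: true, total: images.length, processed: 0, startTime: new Date() };
        return this.processBatchInBackground(images);
    }

    private async processBatchInBackground(images: Uint8Array[]): Promise<void> {
        try {
            for (const image of images) {
                await this.extractTextFromImage(image);
                this.batchState.processed++;
            }
        } finally {
            this.batchState.inProgress = false;
        }
    }

    // Returns a copy so callers cannot mutate the live state.
    getBatchProgress(): BatchState {
        return { ...this.batchState };
    }
}
```

One consequence of keeping progress in memory, in this sketch at least, is that it does not survive a server restart; the summary's point is that this is acceptable for a single, ongoing task.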
Server API (`apps/server/src/routes/api/ocr.ts`)
This file acts as a thin routing layer, exposing the `OCRService`'s functionality via HTTP endpoints.
- … (`processNoteOCR`, `batchProcessOCR`, `getBatchProgress`) corresponds to an API endpoint.
- … `noteId`).
- … `ocrService` (e.g., `ocrService.startBatchProcessing()`).
- The `getBatchProgress` endpoint is particularly simple: it just calls `ocrService.getBatchProgress()` and returns the in-memory state object, allowing the client to poll for updates efficiently.
Client-Side UI (`apps/client/src/widgets/type_widgets/options/images/images.ts`)
This widget provides the user interface for interacting with the OCR features.
- `startBatchOcr`: When the user clicks the "Start Batch OCR" button, this function is called. It first makes a `POST` request to the `/api/ocr/batch-process` endpoint to initiate the process on the server.
- `pollBatchOcrProgress`: Upon a successful response from the server, it begins polling. It calls itself recursively using `setTimeout` every second. In each call, it makes a `GET` request to `/api/ocr/batch-progress`.
- … `inProgress: false`, it stops the polling loop and displays a completion message.
Data Flow (Mermaid Diagram)
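As a rough illustration of the client/server flow described above (start a batch with `POST /api/ocr/batch-process`, then poll `GET /api/ocr/batch-progress` until `inProgress` is false), the sketch below injects an `OcrApi` object in place of the real HTTP client. `OcrApi`, `BatchProgress`, and the function signatures are hypothetical; the response shape is assumed from the summary, and the 1000 ms default mirrors the one-second interval mentioned above.

```typescript
// Assumed response shape of GET /api/ocr/batch-progress.
interface BatchProgress { inProgress: boolean; total: number; processed: number; }

// Hypothetical abstraction over the two HTTP calls, so the flow can be
// exercised without a server.
interface OcrApi {
    startBatch(): Promise<void>;            // POST /api/ocr/batch-process
    getProgress(): Promise<BatchProgress>;  // GET  /api/ocr/batch-progress
}

// Polls until the server reports inProgress: false, rescheduling itself
// with setTimeout after each response.
function pollBatchOcrProgress(
    api: OcrApi,
    onProgress: (progress: BatchProgress) => void,
    intervalMs = 1000
): Promise<BatchProgress> {
    return new Promise((resolve, reject) => {
        const tick = async () => {
            try {
                const progress = await api.getProgress();
                onProgress(progress);
                if (!progress.inProgress) {
                    resolve(progress); // batch finished: stop polling
                    return;
                }
                setTimeout(tick, intervalMs);
            } catch (err) {
                reject(err); // surface request failures instead of looping
            }
        };
        void tick();
    });
}

// Starts the batch, then polls until completion.
async function startBatchOcr(
    api: OcrApi,
    onProgress: (progress: BatchProgress) => void,
    intervalMs = 1000
): Promise<BatchProgress> {
    await api.startBatch();
    return pollBatchOcrProgress(api, onProgress, intervalMs);
}
```

Scheduling the next poll only after the previous response arrives means slow responses never overlap, at the cost of the effective interval being slightly longer than `intervalMs`.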