OCR integration #13313

Draft: wants to merge 59 commits into base: main
59 commits
a51e3b0
Initial implementation using tess4j
Kaan0029 Jun 12, 2025
aca504a
merge
Siedlerchr Jun 12, 2025
48ffb06
add brew to jna path
Siedlerchr Jun 12, 2025
5a256ae
Adapted Exception handling and configured tessdata variable
Kaan0029 Jun 19, 2025
f80cec8
addressed mentor code feedback after 19.06.25
Kaan0029 Jul 3, 2025
db1f577
Merge branch 'upstream-main' into gsoc-ocr-tess4j-initial-implementation
Kaan0029 Jul 10, 2025
4020d3a
fix java modules
Siedlerchr Jul 10, 2025
4987978
fix tessdata path
Siedlerchr Jul 10, 2025
9842dd4
fix tessdata path and module info for lept4j
Siedlerchr Jul 10, 2025
8b133e6
fix tessdata path and module info for lept4j
Siedlerchr Jul 10, 2025
42704ea
fix module access
Siedlerchr Jul 10, 2025
a64e1ea
fix module access
Siedlerchr Jul 10, 2025
6069bc1
fix module access
Siedlerchr Jul 10, 2025
5734252
Avoid throwing exception in configureTessdata when tessdata is missing
Kaan0029 Jul 10, 2025
ab98a3a
Use Path.of instead of Paths.get for modern Java style
Kaan0029 Jul 10, 2025
62dce25
Add OCR tessdata path setting to AI preferences tab and add abstracti…
Kaan0029 Jul 10, 2025
0949300
Merge remote-tracking branch 'origin/gsoc-ocr-tess4j-initial-implemen…
Kaan0029 Jul 10, 2025
e4e45f3
fix(ocr): restore deleted files
InAnYan Jul 10, 2025
15272a6
fix(ocr): restore deleted IconThemes
InAnYan Jul 10, 2025
14332af
Update gradle to 9.1.0-jabref
koppor Jul 11, 2025
5488deb
fix(submodules): fix submodules
InAnYan Jul 11, 2025
26a8500
Update gradle
koppor Jul 11, 2025
e0da447
Merge branch 'gsoc-ocr-tess4j-initial-implementation' of https://gith…
koppor Jul 11, 2025
f1a06ac
Update gradle
koppor Jul 11, 2025
42d34ca
Workaround for gradle bug
calixtus Jul 12, 2025
3ffa1c1
Merge remote-tracking branch 'upstream/main' into gsoc-ocr-tess4j-ini…
calixtus Jul 12, 2025
59d87a2
Workaround for gradle bug
calixtus Jul 12, 2025
bc36a94
Fix submodules
calixtus Jul 12, 2025
e7a043d
Added Debugging Testing file for jna
Kaan0029 Jul 13, 2025
caa48f1
Attempt to enforce specific jna version
Kaan0029 Jul 13, 2025
9b51b23
Merge remote-tracking branch 'origin/gsoc-ocr-tess4j-initial-implemen…
Kaan0029 Jul 13, 2025
c115803
Fix jna include
calixtus Jul 13, 2025
f16c1f4
Add jna-jpms
koppor Jul 13, 2025
8c51016
fix arm64 crash
Kaan0029 Jul 15, 2025
5843c69
fix tessdata path issue
Kaan0029 Jul 15, 2025
1bd2e5b
revert previous tessdata change
Kaan0029 Jul 15, 2025
c7fcc6b
fixed lifecycle
Kaan0029 Jul 16, 2025
a8bab25
fixed bug related to lifecycle issue
Kaan0029 Jul 16, 2025
9b9c7b7
Deleted unnecessary comments
Kaan0029 Jul 16, 2025
edc0343
use StringUtil.isBlank
Kaan0029 Jul 16, 2025
695e2b4
Use @NonNull
Kaan0029 Jul 16, 2025
e5e651e
Added TODO comment
Kaan0029 Jul 16, 2025
9ffdbff
delete /tessdata from .gitignore
Kaan0029 Jul 16, 2025
facb463
delete TO-DO comments in FilePreferences
Kaan0029 Jul 16, 2025
265a312
Delete unnecessary comment in OcrProvider
Kaan0029 Jul 16, 2025
e4d16c2
Adjust AI preferences and make them fit with jabref conventions
Kaan0029 Jul 17, 2025
cd52669
Merge remote-tracking branch 'upstream/main' into gsoc-ocr-tess4j-ini…
calixtus Jul 17, 2025
2eb2df2
Fix submodules
calixtus Jul 17, 2025
809e5a1
Remove error_log.txt
calixtus Jul 17, 2025
d0a20e7
Fix gradle issues
calixtus Jul 17, 2025
ce0328d
Resolve uncaught exception issue in JabRefGUI.stop()
Kaan0029 Jul 21, 2025
da6b8c8
fix tessdata path issue by removing getParent call
Kaan0029 Jul 21, 2025
a4e5d20
delete OcrException and return optional or result (OcrResult) instead
Kaan0029 Jul 21, 2025
af085b9
adjust commenting style to be in line with JabRef conventions
Kaan0029 Jul 21, 2025
dd8d211
implemented string constant for 'tessdata' as it is used more than once
Kaan0029 Jul 21, 2025
a3a90a8
separate final and non-final fields
Kaan0029 Jul 21, 2025
e7dba5e
store whole exception object instead of only exception message
Kaan0029 Jul 21, 2025
d13ea24
implement OcrBackgroundTask class
Kaan0029 Jul 21, 2025
14cf72d
add method for getting property in filepreferences
Kaan0029 Jul 21, 2025
35 changes: 35 additions & 0 deletions jablib/src/main/java/org/jabref/logic/ocr/OcrResult.java
@@ -0,0 +1,35 @@
+package org.jabref.logic.ocr;
+
+import java.util.Optional;
+
+public class OcrResult {
+    private final boolean success;
+    private final String text;
+    private final String errorMessage;
+
+    private OcrResult(boolean success, String text, String errorMessage) {
+        this.success = success;
+        this.text = text;
+        this.errorMessage = errorMessage;
+    }
+
+    public static OcrResult success(String text) {
+        return new OcrResult(true, text, null);
+    }
+
+    public static OcrResult failure(String errorMessage) {
+        return new OcrResult(false, null, errorMessage);
+    }
+
+    public boolean isSuccess() {
+        return success;
+    }
+
+    public Optional<String> getText() {
+        return Optional.ofNullable(text);
+    }
+
+    public Optional<String> getErrorMessage() {
+        return Optional.ofNullable(errorMessage);
+    }
+}
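A caller of the new result type might branch on it roughly as follows. This is only a sketch: `Result` is a trimmed stand-in for the `OcrResult` class above so the example compiles on its own, and `describe` is a hypothetical helper that is not part of this PR.

```java
import java.util.Optional;

final class OcrResultDemo {

    // Minimal stand-in for OcrResult from the diff above (assumption: same shape, package-private).
    static final class Result {
        private final boolean success;
        private final String text;
        private final String errorMessage;

        private Result(boolean success, String text, String errorMessage) {
            this.success = success;
            this.text = text;
            this.errorMessage = errorMessage;
        }

        static Result success(String text) {
            return new Result(true, text, null);
        }

        static Result failure(String errorMessage) {
            return new Result(false, null, errorMessage);
        }

        boolean isSuccess() {
            return success;
        }

        Optional<String> getText() {
            return Optional.ofNullable(text);
        }

        Optional<String> getErrorMessage() {
            return Optional.ofNullable(errorMessage);
        }
    }

    // Hypothetical helper: maps a result to a one-line status message instead of catching an exception.
    static String describe(Result result) {
        if (result.isSuccess()) {
            return "Extracted " + result.getText().orElse("").length() + " characters";
        }
        return "OCR failed: " + result.getErrorMessage().orElse("unknown error");
    }

    public static void main(String[] args) {
        System.out.println(describe(Result.success("Hello OCR")));
        System.out.println(describe(Result.failure("File does not exist: scan.pdf")));
    }
}
```

Because both accessors return `Optional`, the caller never sees the `null` that the factory methods store internally.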
106 changes: 81 additions & 25 deletions jablib/src/main/java/org/jabref/logic/ocr/OcrService.java
@@ -19,31 +19,87 @@
 public class OcrService {
     private static final Logger LOGGER = LoggerFactory.getLogger(OcrService.class);
     private static final String JNA_LIBRARY_PATH = "jna.library.path";
+    private static final String TESSDATA_PREFIX = "TESSDATA_PREFIX";

     // The OCR engine instance
     private final Tesseract tesseract;

     /**
      * Constructs a new OcrService with default settings.
      * Currently uses Tesseract with English language support.
      */
-    public OcrService() {
+    public OcrService() throws OcrException {
+        configureLibraryPath();
+
+        try {
+            this.tesseract = new Tesseract();
+            tesseract.setLanguage("eng");
+            configureTessdata();
+            LOGGER.debug("Initialized OcrService with Tesseract");
+        } catch (Exception e) {
+            throw new OcrException("Failed to initialize OCR engine", e);
+        }
+    }
+
+    private void configureLibraryPath() {
         if (Platform.isMac()) {
+            String originalPath = System.getProperty(JNA_LIBRARY_PATH, "");
             if (Platform.isARM()) {
-                System.setProperty(JNA_LIBRARY_PATH, JNA_LIBRARY_PATH + File.pathSeparator + "/opt/homebrew/lib/");
+                System.setProperty(JNA_LIBRARY_PATH,
+                        originalPath + File.pathSeparator + "/opt/homebrew/lib/");
+            } else {
+                System.setProperty(JNA_LIBRARY_PATH,
+                        originalPath + File.pathSeparator + "/usr/local/cellar/");
+            }
+        }
+    }
+
+    private void configureTessdata() throws OcrException {
+        // First, check environment variable
+        String tessdataPath = System.getenv(TESSDATA_PREFIX);
+
+        if (tessdataPath != null && !tessdataPath.isEmpty()) {
+            File tessdataDir = new File(tessdataPath);
+            if (tessdataDir.exists() && tessdataDir.isDirectory()) {
+                // Tesseract expects the parent directory of tessdata
+                if (tessdataDir.getName().equals("tessdata")) {
+                    tesseract.setDatapath(tessdataDir.getParent());
+                } else {
+                    tesseract.setDatapath(tessdataPath);
+                }
+                LOGGER.info("Using tessdata from environment variable: {}", tessdataPath);
+                return;
             } else {
-                System.setProperty(JNA_LIBRARY_PATH, JNA_LIBRARY_PATH + File.pathSeparator + "/usr/local/cellar/");
+                LOGGER.warn("TESSDATA_PREFIX points to non-existent directory: {}", tessdataPath);
             }
         }
-        this.tesseract = new Tesseract();

-        // Configure Tesseract
-        tesseract.setLanguage("eng");
+        // Fall back to system locations
+        String systemPath = findSystemTessdata();
+        if (systemPath != null) {
+            tesseract.setDatapath(systemPath);
+            LOGGER.info("Using system tessdata at: {}", systemPath);
+        } else {
+            throw new OcrException("Could not find tessdata directory. Please set TESSDATA_PREFIX environment variable.");
+        }
+    }
+
+    private String findSystemTessdata() {
+        String[] possiblePaths = {
+                "/usr/local/share", // Homebrew Intel
+                "/opt/homebrew/share", // Homebrew ARM
+                "/usr/share" // System
+        };

-        // TODO: This path needs to be configurable and bundled properly
-        // For now, we'll use a relative path that works during development
-        tesseract.setDatapath("tessdata");
+        for (String path : possiblePaths) {
+            File tessdata = new File(path, "tessdata");
+            File engData = new File(tessdata, "eng.traineddata");
+            if (tessdata.exists() && engData.exists()) {
+                return path; // Return parent of tessdata
+            }
+        }

-        LOGGER.debug("Initialized OcrService with Tesseract");
+        return null;
     }

     /**
@@ -53,35 +109,35 @@ public OcrService() {
      * @return The extracted text, or empty string if no text found
      * @throws OcrException if OCR processing fails
      */
-    public String performOcr(Path pdfPath) throws OcrException {
-        // Validate input
+    public OcrResult performOcr(Path pdfPath) {
+        // User error - not an exception
         if (pdfPath == null) {
-            throw new OcrException("PDF path cannot be null");
+            LOGGER.warn("PDF path is null");
+            return OcrResult.failure("No file path provided");
         }

         File pdfFile = pdfPath.toFile();
+
+        // User error - not an exception
         if (!pdfFile.exists()) {
-            throw new OcrException("PDF file does not exist: " + pdfPath);
+            LOGGER.warn("PDF file does not exist: {}", pdfPath);
+            return OcrResult.failure("File does not exist: " + pdfPath.getFileName());
         }

         try {
             LOGGER.info("Starting OCR for file: {}", pdfFile.getName());

             // Perform OCR
             String result = tesseract.doOCR(pdfFile);

             // Clean up the result (remove extra whitespace, etc.)
             result = StringUtil.isBlank(result) ? "" : result.trim();

             LOGGER.info("OCR completed successfully. Extracted {} characters", result.length());
-            return result;
-        } catch (
-                TesseractException e) {
-            LOGGER.error("OCR failed for file: {}", pdfFile.getName(), e);
-            throw new OcrException(
-                    "Failed to perform OCR on file: " + pdfFile.getName() +
-                            ". Error: " + e.getMessage(), e
-            );
+            return OcrResult.success(result);
+
+        } catch (TesseractException e) {
+            // This could be either a user error (corrupt PDF) or our bug
+            // Log it as error but return as failure, not exception
+            LOGGER.error("OCR processing failed for file: {}", pdfFile.getName(), e);
+            return OcrResult.failure("Failed to extract text from PDF: " + e.getMessage());
         }
     }
 }
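The tessdata lookup and the `jna.library.path` handling in this diff could also be written with `java.nio.file` and `Optional`, in line with the later commits that switch to `Path.of` and drop the `null` return. The sketch below is an illustration, not the code in this PR: the class name `TessdataLocatorSketch` and both method names are hypothetical, and the candidate directories are passed in by the caller rather than hard-coded.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;

final class TessdataLocatorSketch {
    private static final String TESSDATA = "tessdata";
    private static final String ENG_DATA = "eng.traineddata";

    private TessdataLocatorSketch() {
    }

    // Returns the first candidate that contains tessdata/eng.traineddata.
    // The returned path is the parent of tessdata, matching what the diff passes to setDatapath.
    static Optional<Path> findTessdataParent(List<Path> candidates) {
        return candidates.stream()
                         .filter(parent -> Files.isRegularFile(parent.resolve(TESSDATA).resolve(ENG_DATA)))
                         .findFirst();
    }

    // Appends one entry to a path list; a blank existing value yields just the new entry,
    // so the result never starts with a separator.
    static String appendLibraryPath(String existing, String extra, String separator) {
        if (existing == null || existing.trim().isEmpty()) {
            return extra;
        }
        return existing + separator + extra;
    }
}
```

A call site could then use `findTessdataParent(List.of(Path.of("/opt/homebrew/share"), Path.of("/usr/share")))` and report a failure when the result is empty, and `appendLibraryPath(System.getProperty("jna.library.path", ""), "/opt/homebrew/lib", File.pathSeparator)` to extend the JNA search path.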