Bug Report: Silent Native Crash when textItem.h is undefined
- Title:
pdf2json causes a silent native crash (likely a segfault) when processing a textItem with an `h: undefined` property in large/complex PDFs.
- Environment:
- Node.js Version: (Please insert your Node.js version, e.g., v18.x.x, v20.x.x)
- pdf2json Version: (Please insert your pdf2json package version from package.json, e.g., ^3.0.0)
- Operating System: (e.g., Linux, macOS, Windows)
- System RAM: 32GB (29GB Free)
- Description of Problem:
When attempting to parse a large, complex PDF file (e.g., test2.pdf: 25MB, 5231 pages), the Node.js process using pdf2json terminates silently without throwing a JavaScript error (neither uncaughtException nor unhandledRejection fires, and try-catch blocks capture nothing). Debugging shows the crash consistently occurs when pdf2json supplies textItem objects where the h (height) property is undefined.
- Steps to Reproduce:
- Prepare a Node.js project that uses pdf2json to parse PDF files.
- Use the provided parser.js extractBlocks function, which includes granular debug logging and a try-catch block around textItem processing:
```javascript
// ... (rest of parser.js above extractBlocks)
async function extractBlocks(pdfData) {
  logger.info(`Extracting blocks from ${pdfData.Pages.length} pages...`);
  const allBlocks = [];
  let blockIdCounter = 0;

  for (let pageIndex = 0; pageIndex < pdfData.Pages.length; pageIndex++) {
    logger.debug(`--- Processing Page: ${pageIndex + 1}/${pdfData.Pages.length} ---`);
    const page = pdfData.Pages[pageIndex];
    const pageHeight = normalizeCoordinate(page.Height);
    const pageWidth = normalizeCoordinate(page.Width);

    if (!page.Texts) {
      logger.debug(`No text items found on page ${pageIndex + 1}.`);
      continue;
    }

    for (let textItemIndex = 0; textItemIndex < page.Texts.length; textItemIndex++) {
      const textItem = page.Texts[textItemIndex];
      try {
        logger.debug(`Processing TextItem: Page ${pageIndex + 1}, Item ${textItemIndex + 1}/${page.Texts.length}`);
        logger.debug(`Raw TextItem properties: {x: ${textItem.x}, y: ${textItem.y}, w: ${textItem.w}, h: ${textItem.h}, T: "${textItem.R && textItem.R[0] ? textItem.R[0].T : 'N/A'}"}`);

        const rawText = textItem.R && textItem.R[0] && textItem.R[0].T ? textItem.R[0].T : '';
        logger.debug(`Raw block content before cleaning for ID ${textItem.id || 'undefined'}: "${rawText}"`);

        const cleanedText = cleanText(decodeURIComponent(rawText));
        logger.debug(`Cleaned block content for ID ${textItem.id || 'undefined'}: "${cleanedText}" (Length: ${cleanedText.length})`);

        if (cleanedText.length < config.MIN_OUTPUT_LENGTH) {
          logger.debug(`Discarding block (too short after clean): "${cleanedText.substring(0, Math.min(cleanedText.length, 50))}..."`);
          continue;
        }

        const fontSize = textItem.R && textItem.R[0] && textItem.R[0].TS && textItem.R[0].TS[1] !== undefined
          ? normalizeCoordinate(textItem.R[0].TS[1])
          : 0;
        const normalizedHeight = normalizeCoordinate(textItem.h);
        const normalizedWidth = normalizeCoordinate(textItem.w);

        if (normalizedHeight < config.MIN_BLOCK_DIMENSION_PX || normalizedWidth < config.MIN_BLOCK_DIMENSION_PX) {
          logger.debug(`Discarding block (dimensions too small): H=${normalizedHeight.toFixed(2)}, W=${normalizedWidth.toFixed(2)}`);
          continue;
        }

        if (isLikelyHeaderFooter(textItem, pageHeight)) {
          logger.debug(`Discarding block (likely header/footer): "${cleanedText.substring(0, Math.min(cleanedText.length, 50))}..."`);
          continue;
        }

        allBlocks.push({
          id: `block-${pageIndex}-${blockIdCounter++}`,
          page_index: pageIndex,
          text_content: cleanedText,
          font_size: fontSize,
          x: normalizeCoordinate(textItem.x),
          y: normalizeCoordinate(textItem.y),
          width: normalizedWidth,
          height: normalizedHeight,
          is_title_candidate:
            fontSize >= config.FONT_SIZE_TITLE_THRESHOLD &&
            cleanedText.length >= config.MIN_TITLE_LENGTH &&
            cleanedText.length <= config.MAX_TITLE_LENGTH
        });
      } catch (error) {
        logger.error(`❌ Error processing text item on Page ${pageIndex + 1}, Item ${textItemIndex + 1}. Skipping this item.`, {
          error: error.message,
          rawTextItem: textItem,
          stack: error.stack
        });
        continue;
      }
    }
  }

  logger.info(`Extracted ${allBlocks.length} relevant blocks.`);
  return allBlocks;
}
// ... (rest of parser.js below extractBlocks)
```
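Since the crash follows items with `h: undefined`, one caller-side mitigation worth trying is to normalize each textItem before any downstream logic touches it. This is a minimal sketch; `sanitizeTextItem` is a hypothetical helper, not part of pdf2json's API, and it does not prevent a crash that happens inside pdf2json itself:

```javascript
// Hypothetical caller-side guard (not part of pdf2json): coerce missing
// numeric fields on a textItem to 0 before downstream processing.
function sanitizeTextItem(textItem) {
  const safe = { ...textItem };
  for (const key of ["x", "y", "w", "h"]) {
    const value = Number(safe[key]);
    safe[key] = Number.isFinite(value) ? value : 0; // undefined/NaN -> 0
  }
  if (!Array.isArray(safe.R)) safe.R = []; // guard malformed run arrays too
  return safe;
}
```

Applied as `const textItem = sanitizeTextItem(page.Texts[textItemIndex]);`, this at least removes `undefined` from the caller's own arithmetic and logging paths.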
- Obtain a large PDF file that triggers the issue (test2.pdf used here: 25MB, 5231 pages). A minimal reproducible PDF would be ideal for the pdf2json developers, if one can be produced.
- Run the script:

```
node parser.js --pdf test2.pdf --llm_url http://localhost:11434/api/generate --llm_model gemma:2b --validate --reformulate --export_rag --log_level debug
```
- Expected Behavior:
The PDF should be parsed successfully, or a catchable JavaScript error (e.g., PDFParserError, TypeError) should be thrown, allowing graceful error handling.
- Actual Behavior:
The Node.js process terminates silently without any JavaScript error being logged or caught. The last log output shows the crash occurs immediately after processing textItem objects where the h (height) property is undefined.
- Relevant Logs/Error Messages (from debug level):
```
debug: Processing TextItem: Page 4, Item 47/50 {"timestamp":"2025-06-29T22:55:43.732Z"}
debug: Raw TextItem properties: {x: 2.57, y: 42.551, w: 63.538, h: undefined, T: "%C2%A9%20Intel%20Corporation.%20Intel%2C%20the%20Intel%20logo%2C%20and%20other%20Intel%20marks%20are%20trademarks%20of%20Intel%2}
debug: Raw block content before cleaning for ID undefined: "%C2%A9%20Intel%20Corporation.%20Intel%2C%20the%20Intel%20logo%2C%20and%20other%20Intel%20marks%20are%20trademarks%20of%20Intel%20Corporation%20or%20it}
debug: Cleaned block content for ID undefined: "© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Othe" (Length: 127) {"timestamp":"2025-}
debug: Processing TextItem: Page 4, Item 48/50 {"timestamp":"2025-06-29T22:55:43.732Z"}
debug: Raw TextItem properties: {x: 34.224, y: 42.551, w: 0.798, h: undefined, T: "r%20"} {"timestamp":"2025-06-29T22:55:43.732Z"}
debug: Raw block content before cleaning for ID undefined: "r%20" {"timestamp":"2025-06-29T22:55:43.732Z"}
```
(The process terminates after the last log for Item 48, without any further output or error message from the Node.js side).
- Analysis/Hypothesis:
The crash is a native error within pdf2json's parsing logic. It appears to be triggered when pdf2json produces textItem objects with a missing or undefined h (height) property (and potentially other malformed textItem structures, such as id: undefined). Despite ample system memory, an increased Node.js heap size, and normalizeCoordinate handling undefined gracefully, the underlying native components of pdf2json appear to reach an unrecoverable state, leading to a segmentation fault or similar crash.
- Workarounds:
- Splitting the PDF: The most effective workaround identified is to split the large PDF into smaller, more manageable chunks (e.g., 50-100 pages each) using external tools like pdftk or qpdf, and then process each chunk sequentially. This reduces the load on pdf2json and allows the pipeline to complete.
- Using pdftotext: For raw text extraction, using pdftotext (from poppler-utils) as a pre-processing step is a highly memory-efficient alternative that bypasses pdf2json entirely.
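The splitting workaround can be sketched in Node.js. The chunk size, output file names, and `buildQpdfCommand` helper are illustrative assumptions, not anything pdf2json or qpdf requires; the qpdf invocation follows its documented `--pages` syntax, where `.` refers to the primary input file:

```javascript
// Sketch of the splitting workaround: compute qpdf page ranges that cut a
// large PDF into fixed-size chunks, to be parsed sequentially afterwards.
function buildChunkRanges(totalPages, pagesPerChunk) {
  const ranges = [];
  for (let start = 1; start <= totalPages; start += pagesPerChunk) {
    const end = Math.min(start + pagesPerChunk - 1, totalPages);
    ranges.push(`${start}-${end}`);
  }
  return ranges;
}

// e.g. ["qpdf", "test2.pdf", "--pages", ".", "1-100", "--", "chunk-001.pdf"]
function buildQpdfCommand(inputPdf, range, outputPdf) {
  return ["qpdf", inputPdf, "--pages", ".", range, "--", outputPdf];
}
```

Each command array can then be handed to `child_process.execFile("qpdf", args.slice(1))` and the resulting chunk files fed to pdf2json one at a time.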
- Impact:
This bug prevents reliable processing of large and/or complex PDF documents with pdf2json unless external pre-processing steps are added, reducing the library's utility for such use cases.