
Silent Native Crash when textItem.h is undefined #394


Description

@olivier37520

Bug Report: Silent Native Crash when textItem.h is undefined

  1. Title:
    pdf2json causes a silent native crash (likely a segfault) when processing textItem objects whose h property is undefined, on large/complex PDFs.
  2. Environment:
  • Node.js Version: (Please insert your Node.js version, e.g., v18.x.x, v20.x.x)
  • pdf2json Version: (Please insert your pdf2json package version from package.json, e.g., ^3.0.0)
  • Operating System: (e.g., Linux, macOS, Windows)
  • System RAM: 32GB (29GB Free)
  3. Description of Problem:
    When parsing a large, complex PDF file (e.g., test2.pdf: 25MB, 5231 pages), the Node.js process using pdf2json terminates silently without throwing any JavaScript error (uncaughtException and unhandledRejection never fire, and try-catch blocks capture nothing). Debugging shows the crash consistently occurs when pdf2json delivers textItem objects whose h (height) property is undefined.
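    A minimal harness (a sketch, assuming pdf2json's documented event API; the file path is illustrative) makes the silent exit observable: none of the hooks below fire before the process dies.

    // crash-repro.js: minimal sketch of the silent exit (hypothetical file name)
    const PDFParser = require("pdf2json");

    process.on("uncaughtException", (err) => console.error("uncaughtException:", err));
    process.on("unhandledRejection", (err) => console.error("unhandledRejection:", err));

    const pdfParser = new PDFParser();
    pdfParser.on("pdfParser_dataError", (err) => console.error("dataError:", err.parserError));
    pdfParser.on("pdfParser_dataReady", (pdfData) => {
        console.log(`Parsed ${pdfData.Pages.length} pages`);
    });
    pdfParser.loadPDF("./test2.pdf");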
  4. Steps to Reproduce:
  • Prepare a Node.js project using pdf2json to parse PDF files.

  • Use the provided parser.js extractBlocks function, which includes granular debug logging and a try-catch block around textItem processing (sketches of its elided helper functions follow this list):
    // ... (rest of parser.js above extractBlocks)
    async function extractBlocks(pdfData) {
        logger.info(`Extracting blocks from ${pdfData.Pages.length} pages...`);
        const allBlocks = [];
        let blockIdCounter = 0;

        for (let pageIndex = 0; pageIndex < pdfData.Pages.length; pageIndex++) {
            logger.debug(`--- Processing Page: ${pageIndex + 1}/${pdfData.Pages.length} ---`);
            const page = pdfData.Pages[pageIndex];
            const pageHeight = normalizeCoordinate(page.Height);
            const pageWidth = normalizeCoordinate(page.Width);

            if (!page.Texts) {
                logger.debug(`No text items found on page ${pageIndex + 1}.`);
                continue;
            }

            for (let textItemIndex = 0; textItemIndex < page.Texts.length; textItemIndex++) {
                const textItem = page.Texts[textItemIndex];

                try {
                    logger.debug(`Processing TextItem: Page ${pageIndex + 1}, Item ${textItemIndex + 1}/${page.Texts.length}`);
                    logger.debug(`Raw TextItem properties: {x: ${textItem.x}, y: ${textItem.y}, w: ${textItem.w}, h: ${textItem.h}, T: "${textItem.R && textItem.R[0] ? textItem.R[0].T : 'N/A'}"}`);

                    const rawText = textItem.R && textItem.R[0] && textItem.R[0].T ? textItem.R[0].T : '';

                    logger.debug(`Raw block content before cleaning for ID ${textItem.id || 'undefined'}: "${rawText}"`);

                    const cleanedText = cleanText(decodeURIComponent(rawText));

                    logger.debug(`Cleaned block content for ID ${textItem.id || 'undefined'}: "${cleanedText}" (Length: ${cleanedText.length})`);

                    if (cleanedText.length < config.MIN_OUTPUT_LENGTH) {
                        logger.debug(`Discarding block (too short after clean): "${cleanedText.substring(0, Math.min(cleanedText.length, 50))}..."`);
                        continue;
                    }

                    const fontSize = textItem.R && textItem.R[0] && textItem.R[0].TS && textItem.R[0].TS[1] !== undefined
                        ? normalizeCoordinate(textItem.R[0].TS[1])
                        : 0;

                    const normalizedHeight = normalizeCoordinate(textItem.h);
                    const normalizedWidth = normalizeCoordinate(textItem.w);

                    if (normalizedHeight < config.MIN_BLOCK_DIMENSION_PX || normalizedWidth < config.MIN_BLOCK_DIMENSION_PX) {
                        logger.debug(`Discarding block (dimensions too small): H=${normalizedHeight.toFixed(2)}, W=${normalizedWidth.toFixed(2)}`);
                        continue;
                    }

                    if (isLikelyHeaderFooter(textItem, pageHeight)) {
                        logger.debug(`Discarding block (likely header/footer): "${cleanedText.substring(0, Math.min(cleanedText.length, 50))}..."`);
                        continue;
                    }

                    allBlocks.push({
                        id: `block-${pageIndex}-${blockIdCounter++}`,
                        page_index: pageIndex,
                        text_content: cleanedText,
                        font_size: fontSize,
                        x: normalizeCoordinate(textItem.x),
                        y: normalizeCoordinate(textItem.y),
                        width: normalizedWidth,
                        height: normalizedHeight,
                        is_title_candidate: fontSize >= config.FONT_SIZE_TITLE_THRESHOLD &&
                            cleanedText.length >= config.MIN_TITLE_LENGTH &&
                            cleanedText.length <= config.MAX_TITLE_LENGTH
                    });
                } catch (error) {
                    logger.error(`❌ Error processing text item on Page ${pageIndex + 1}, Item ${textItemIndex + 1}. Skipping this item.`, {
                        error: error.message,
                        rawTextItem: textItem,
                        stack: error.stack
                    });
                    continue;
                }
            }
        }
        logger.info(`Extracted ${allBlocks.length} relevant blocks.`);
        return allBlocks;
    }
    // ... (rest of parser.js below extractBlocks)

  • Obtain a large PDF file that triggers the issue (test2.pdf used here: 25MB, 5231 pages). A minimal reproducible PDF would be ideal for the pdf2json developers, if one can be isolated.

  • Run the script:
    node parser.js --pdf test2.pdf --llm_url http://localhost:11434/api/generate --llm_model gemma:2b --validate --reformulate --export_rag --log_level debug
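    For completeness, the helpers elided from the listing above are assumed to behave roughly as follows (illustrative sketches; the real parser.js implementations may differ). The key point is that normalizeCoordinate tolerates undefined, so the h: undefined items do not throw in user code:

    // Assumed shape of the elided helpers; illustrative only.
    function normalizeCoordinate(value) {
        // Coerce a pdf2json page unit to a finite number; undefined/NaN
        // collapse to 0, so `h: undefined` passes through extractBlocks
        // without raising a JavaScript error.
        const n = Number(value);
        return Number.isFinite(n) ? n : 0;
    }

    function cleanText(text) {
        // Placeholder: collapse whitespace and trim.
        return text.replace(/\s+/g, " ").trim();
    }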

  5. Expected Behavior:
    The PDF should be parsed successfully, or a catchable JavaScript error (e.g., PDFParserError, TypeError) should be thrown, allowing for graceful error handling.
  6. Actual Behavior:
    The Node.js process terminates silently without any JavaScript error being logged or caught. The last log output indicates the crash occurs immediately after processing textItem objects where the h (height) property is undefined.
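    That the termination is a native crash rather than a clean exit can be verified by running parser.js in a child process and inspecting the terminating signal (a sketch; probe.js is a hypothetical wrapper):

    // probe.js: reports how the parser process ended.
    const { spawn } = require("child_process");

    const child = spawn(process.execPath, ["parser.js", "--pdf", "test2.pdf"], {
        stdio: "inherit",
    });

    child.on("close", (code, signal) => {
        // A native crash surfaces as a signal (e.g., "SIGSEGV" or "SIGABRT");
        // an ordinary JavaScript error would exit with a non-zero code instead.
        console.error(`parser.js exited: code=${code} signal=${signal}`);
    });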
  7. Relevant Logs/Error Messages (from debug level):
    debug: Processing TextItem: Page 4, Item 47/50 {"timestamp":"2025-06-29T22:55:43.732Z"}
    debug: Raw TextItem properties: {x: 2.57, y: 42.551, w: 63.538, h: undefined, T: "%C2%A9%20Intel%20Corporation.%20Intel%2C%20the%20Intel%20logo%2C%20and%20other%20Intel%20marks%20are%20trademarks%20of%20Intel%2}
    debug: Raw block content before cleaning for ID undefined: "%C2%A9%20Intel%20Corporation.%20Intel%2C%20the%20Intel%20logo%2C%20and%20other%20Intel%20marks%20are%20trademarks%20of%20Intel%20Corporation%20or%20it}
    debug: Cleaned block content for ID undefined: "© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Othe" (Length: 127) {"timestamp":"2025-}
    debug: Processing TextItem: Page 4, Item 48/50 {"timestamp":"2025-06-29T22:55:43.732Z"}
    debug: Raw TextItem properties: {x: 34.224, y: 42.551, w: 0.798, h: undefined, T: "r%20"} {"timestamp":"2025-06-29T22:55:43.732Z"}
    debug: Raw block content before cleaning for ID undefined: "r%20" {"timestamp":"2025-06-29T22:55:43.732Z"}

(The process terminates after the last log line for Item 48, with no further output or error message from the Node.js side.)
  8. Analysis/Hypothesis:
    The crash appears to be a native error inside pdf2json's parsing logic, triggered when pdf2json emits textItem objects with a missing or undefined h (height) property (and possibly other malformed textItem structures, such as id: undefined). Despite ample system memory, an increased Node.js heap size, and normalizeCoordinate handling undefined gracefully, pdf2json's underlying native components seem to reach an unrecoverable state, resulting in a segmentation fault or similar crash.
  9. Workarounds:

  • Splitting the PDF: The most effective workaround identified is to split the large PDF into smaller, more manageable chunks (e.g., 50-100 pages each) using external tools like pdftk or qpdf, and then process each chunk sequentially. This reduces the load on pdf2json and allows the pipeline to complete.
  • Using pdftotext: For raw text extraction, using pdftotext (from poppler-utils) as a pre-processing step is a highly memory-efficient alternative that bypasses pdf2json entirely.
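  A rough sketch of the splitting workaround (assuming qpdf is installed and on PATH; file and chunk names are illustrative):

    // split-and-parse.js: split the PDF into ~100-page chunks with qpdf,
    // then run the existing pipeline over each chunk sequentially.
    const { execFileSync } = require("child_process");
    const fs = require("fs");

    execFileSync("qpdf", ["--split-pages=100", "test2.pdf", "chunk-%d.pdf"]);

    const chunks = fs
        .readdirSync(".")
        .filter((name) => /^chunk-\d+\.pdf$/.test(name))
        .sort(); // qpdf zero-pads %d, so lexicographic order matches page order

    for (const chunk of chunks) {
        execFileSync(process.execPath, ["parser.js", "--pdf", chunk], {
            stdio: "inherit",
        });
    }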
  10. Impact:
    This bug prevents reliable processing of large and/or complex PDF documents with pdf2json unless external pre-processing steps are added, reducing the library's utility for such use cases.
