
@Aayushjshah Aayushjshah commented Sep 12, 2025

Description

| Aspect | Old cleanText (commented) | New Implementation |
|--------|---------------------------|--------------------|
| Scope | Simple control character removal | Comprehensive text normalization |
| Unicode | No Unicode normalization | Uses NFC normalization |
| Whitespace | Basic line ending normalization | Advanced whitespace handling (nbsp, zero-width, tabs) |
| Letter spacing | No handling | Smart letter-spacing correction via smartDespaceLine |
| Structure | Single function | Modular approach (normalizeText + smartDespaceLine + cleanText) |
| PDF-specific issues | Limited handling | Addresses common PDF extraction problems |

Current cleanText Function: The new cleanText function (now active, replacing the commented-out version) combines multiple steps:

  1. Calls normalizeText for basic cleaning
  2. Handles hyphenation across line breaks
  3. Manages paragraph breaks vs. line breaks
  4. Applies smartDespaceLine to fix spacing issues
  5. Removes spaces before punctuation
  6. Normalizes excessive spacing

The evolution shows a progression from basic text cleaning to sophisticated PDF text normalization that handles real-world extraction challenges like letter-spacing artifacts, inconsistent Unicode representations, and complex whitespace patterns.
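The six steps above can be sketched roughly as follows. This is an illustrative pipeline, not the actual implementation: the regexes and the bodies of normalizeText and smartDespaceLine in server/pdfChunks.ts differ in detail.

```typescript
// Illustrative sketch of the cleanText pipeline; real helpers differ in detail.
const normalizeText = (input: string): string =>
  input
    .normalize("NFC")
    .replace(/[^\P{C}\n\t]/gu, "") // strip control chars except \n and \t
    .replace(/\u00A0/g, " ")       // non-breaking space -> regular space
    .replace(/\u200B/g, "")        // remove zero-width spaces
    .replace(/\t+/g, " ")          // runs of tabs -> single space
    .trim()

const smartDespaceLine = (line: string): string =>
  // join runs of 3+ single spaced letters: "N A S A" -> "NASA"
  line.replace(/\b(?:[A-Za-z] ){2,}[A-Za-z]\b/g, (m) => m.replace(/ /g, ""))

const cleanText = (input: string): string =>
  normalizeText(input)                      // 1. basic cleaning
    .replace(/(\w)-\n(\w)/g, "$1$2")        // 2. rejoin hyphenation across line breaks
    .replace(/\n{2,}/g, "\u0000")           // 3. protect paragraph breaks...
    .replace(/\n/g, " ")                    //    ...while folding in-paragraph line breaks
    .replace(/\u0000/g, "\n\n")             //    ...then restore paragraph breaks
    .split("\n")
    .map(smartDespaceLine)                  // 4. fix letter-spacing per line
    .join("\n")
    .replace(/\s+([.,;:!?])/g, "$1")        // 5. no space before punctuation
    .replace(/[ ]{2,}/g, " ")               // 6. collapse excessive spacing
```

For example, `cleanText("exam-\nple text ,  done")` yields `"example text, done"`.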


I've analyzed the text processing functions in server/pdfChunks.ts. Here's a comprehensive explanation of each:

input.normalize("NFC")

This applies Unicode Normalization Form C (NFC, canonical composition) to ensure characters are in their canonical composed form. For example, if "é" is represented as two code points (base "e" plus combining accent "´"), NFC combines them into a single composed character. This is crucial for PDFs, which often contain inconsistent Unicode representations of the same text.
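A quick self-contained illustration of what NFC composition does to the "é" example:

```typescript
// "é" built from two code points (decomposed form) vs. one composed code point
const decomposed = "e\u0301" // "e" + U+0301 COMBINING ACUTE ACCENT
const composed = decomposed.normalize("NFC")

console.log(decomposed.length) // 2 code units before normalization
console.log(composed.length)   // 1 after NFC composition
console.log(composed === "\u00E9") // true: U+00E9 LATIN SMALL LETTER E WITH ACUTE
```

Without this step, two visually identical strings can compare unequal, which breaks downstream chunk deduplication and search.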

normalizeText Function

A foundational text cleaning function that:

  1. Unicode normalization - Uses NFC as explained above

  2. Control character removal - Strips control characters except newlines/tabs using /[^\P{C}\n\t]/gu

  3. Whitespace normalization:

    • Converts non-breaking spaces (\u00A0) to regular spaces
    • Removes zero-width spaces (\u200B)
    • Converts multiple tabs to single spaces
  4. Trim - Removes leading/trailing whitespace
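The four steps can be condensed into a sketch like the following; the exact order and regexes in server/pdfChunks.ts may differ, but the listed regex for control characters is used as shown.

```typescript
// Hedged sketch of normalizeText as described above.
const normalizeText = (input: string): string =>
  input
    .normalize("NFC")              // 1. canonical Unicode composition
    .replace(/[^\P{C}\n\t]/gu, "") // 2. strip control chars, keeping \n and \t
    .replace(/\u00A0/g, " ")       // 3a. non-breaking space -> regular space
    .replace(/\u200B/g, "")        // 3b. remove zero-width spaces
    .replace(/\t+/g, " ")          // 3c. runs of tabs -> single space
    .trim()                        // 4. trim leading/trailing whitespace

// e.g. normalizeText("\u00A0foo\u200B\tbar\u0007 ") yields "foo bar"
```

Note that `/[^\P{C}\n\t]/gu` matches any character in Unicode category C (control, format, etc.) except newline and tab, so it also catches format characters like U+200B; the explicit zero-width replacement is then a harmless belt-and-braces step.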

smartDespaceLine Function

Fixes letter-spacing artifacts common in PDF text extraction. PDFs often extract text with artificial spacing like "N A S A" instead of "NASA". This function:

  • Reconstructs acronyms - Joins 3+ spaced letters: "N A S A" → "NASA"
  • Fixes short words - Joins two lowercase letters: "i s" → "is"
  • Preserves intentional spacing - Doesn't join inappropriately
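One way to express those three rules with two regexes; this is a simplified stand-in, since the real heuristics in server/pdfChunks.ts are likely more guarded about joining genuine one-letter words.

```typescript
// Simplified stand-in for smartDespaceLine.
const smartDespaceLine = (line: string): string =>
  line
    // join 3+ single letters separated by single spaces: "N A S A" -> "NASA"
    .replace(/\b(?:[A-Za-z] ){2,}[A-Za-z]\b/g, (m) => m.replace(/ /g, ""))
    // join exactly two spaced single lowercase letters: "i s" -> "is"
    .replace(/\b([a-z]) ([a-z])\b/g, "$1$2")

// Intentional spacing survives because both patterns require the word
// boundaries to fall around *single* letters:
// smartDespaceLine("N A S A launched a cat") -> "NASA launched a cat"
```

In "a cat" the "c" is followed by more letters, so neither pattern matches and the article is left alone.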

Difference with Commented cleanText

The commented version was simpler and only handled basic control character removal and line ending normalization. The current implementation is much more sophisticated:

| Aspect | Old (commented) | New Implementation |
|--------|-----------------|--------------------|
| Scope | Simple control char removal | Comprehensive normalization |
| Unicode | None | NFC normalization |
| Whitespace | Basic line endings | Advanced (nbsp, zero-width, tabs) |
| Letter spacing | No handling | Smart correction via smartDespaceLine |
| Structure | Single function | Modular approach |
| PDF-specific | Limited | Addresses real PDF extraction issues |

The current cleanText function combines all of these steps and additionally handles hyphenation across line breaks, paragraph management, punctuation spacing, and excessive-space normalization, making it much better suited to real-world PDF text extraction.

Image based changes overview

CTM (Current Transformation Matrix) and Enhanced Image Capturing

CTM stands for Current Transformation Matrix - a 2D transformation matrix used in PDF rendering to track coordinate system transformations. It's represented as [a, b, c, d, e, f] corresponding to:

[a  c  e]
[b  d  f] 
[0  0  1]

CTM Tracking Enhancement:

The CTM tracking enhances image capturing by:

  1. Accurate positioning: Tracks where images are placed on the page
  2. Transformation awareness: Handles rotated, scaled, or translated images
  3. Coordinate system management: Maintains proper spatial relationships
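A minimal sketch of the matrix utilities this implies (a multiplyMatrices compose function and a point transform), using the [a, b, c, d, e, f] layout shown above. The names match the excerpt below, but the actual implementation in server/pdfChunks.ts may differ.

```typescript
type Matrix = [number, number, number, number, number, number]

// Compose two affine matrices in the [a, b, c, d, e, f] layout.
// Applying the result is equivalent to applying m2 first, then m1.
const multiplyMatrices = (m1: Matrix, m2: Matrix): Matrix => [
  m1[0] * m2[0] + m1[2] * m2[1],
  m1[1] * m2[0] + m1[3] * m2[1],
  m1[0] * m2[2] + m1[2] * m2[3],
  m1[1] * m2[2] + m1[3] * m2[3],
  m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
  m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
]

// Map a point through the matrix: x' = a*x + c*y + e, y' = b*x + d*y + f
const applyToPoint = (m: Matrix, x: number, y: number) => ({
  x: m[0] * x + m[2] * y + m[4],
  y: m[1] * x + m[3] * y + m[5],
})

const translate: Matrix = [1, 0, 0, 1, 10, 20]
const scale: Matrix = [2, 0, 0, 2, 0, 0]
const ctm = multiplyMatrices(translate, scale) // scale, then translate
const p = applyToPoint(ctm, 3, 4)              // { x: 16, y: 28 }
```

This is how an image's placement rectangle can be recovered: transform the unit square's corners through the CTM in effect when the paint operator runs.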

Code Block Analysis:

1. Transform and Save/Restore Operations:

case PDFJS.OPS.transform: {
  try {
    if (
      Array.isArray(args) &&
      args.length >= 6 &&
      args.every((n: any) => typeof n === "number")
    ) {
      currentCTM = multiplyMatrices(currentCTM, args as number[])
    }
  } catch {}
  break
}
case PDFJS.OPS.save: {
  ctmStack.push([...currentCTM])
  break
}
case PDFJS.OPS.restore: {
  if (ctmStack.length) currentCTM = ctmStack.pop()!
  break
}

What this does:

  • Transform: Applies transformation matrices by multiplying them with the current CTM
  • Save: Pushes current CTM state onto a stack (like save state in graphics)
  • Restore: Pops the last saved CTM from stack (restore previous state)
  • This mimics PDF's graphics state management for accurate coordinate tracking
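The save/restore semantics can be exercised in isolation; this is a hypothetical, stripped-down version of the stack handling (the real code manipulates currentCTM/ctmStack inside the operator loop).

```typescript
type Matrix = [number, number, number, number, number, number]

let currentCTM: Matrix = [1, 0, 0, 1, 0, 0] // identity
const ctmStack: Matrix[] = []

// q operator: snapshot the graphics state
const save = () => ctmStack.push([...currentCTM] as Matrix)
// Q operator: revert to the last snapshot
const restore = () => {
  if (ctmStack.length) currentCTM = ctmStack.pop()!
}

currentCTM = [1, 0, 0, 1, 10, 20] // e.g. after a translate transform
save()                             // snapshot before a nested transform
currentCTM = [2, 0, 0, 2, 10, 20]  // e.g. after an additional scale
restore()                          // back to the pre-save state
// currentCTM is [1, 0, 0, 1, 10, 20] again, so subsequent images on the
// page are positioned relative to the correct coordinate system
```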

2. resolveImageByName Function:

const resolveImageByName = async (name: string): Promise<any | null> => {
  try {
    // Some builds expose has method
    if (
      typeof (page.objs as any).has === "function" &&
      (page.objs as any).has(name)
    ) {
      return (page.objs as any).get(name)
    }
    const obj = (page.objs as any).get(name)
    return obj || null
  } catch (e) {
    return null
  }
}

Enhancement: This provides a robust way to retrieve image objects by name from PDF.js's internal object store, handling different PDF.js builds and API variations.

3. isCanvasLike and isImageLike Logic:

const isCanvasLike = (obj: any) =>
  obj &&
  typeof obj.getContext === "function" &&
  typeof obj.width === "number" &&
  typeof obj.height === "number"

const isImageLike = (obj: any) =>
  obj &&
  typeof obj.width === "number" &&
  typeof obj.height === "number" &&
  typeof obj.getContext !== "function"

What these do:

  • isCanvasLike: Detects HTML5 Canvas objects that can be directly converted to PNG via toBuffer()
  • isImageLike: Detects image-like objects (have dimensions but aren't canvas) that need to be drawn onto a canvas first
  • Enhancement: Provides fast-path processing for different image object types returned by PDF.js

4. GRAYSCALE_1BPP Processing:

if (kind === pdfjsLib.ImageKind.GRAYSCALE_1BPP) {
  // Handle 1 bit per pixel grayscale (bit-packed data)
  let pixelIndex = 0
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const byteIndex = Math.floor(pixelIndex / 8)
      const bitIndex = 7 - (pixelIndex % 8) // MSB first
      const bit =
        byteIndex < uint8Data.length
          ? (uint8Data[byteIndex] >> bitIndex) & 1
          : 0
      const gray = bit ? 255 : 0 // Convert bit to full pixel value

      const dstIdx = pixelIndex * 4
      rgbaData[dstIdx] = gray // R
      rgbaData[dstIdx + 1] = gray // G
      rgbaData[dstIdx + 2] = gray // B
      rgbaData[dstIdx + 3] = 255 // A
      pixelIndex++
    }
  }
}

What this does:

  • Handles 1-bit per pixel grayscale images (black and white)
  • Bit unpacking: Extracts individual bits from packed byte data
  • MSB first: Reads most significant bit first (standard PDF format)
  • Pixel conversion: Converts 1-bit values (0/1) to full grayscale (0/255)
  • RGBA output: Creates full 4-channel RGBA data for canvas rendering
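The unpacking loop can be exercised standalone. Below is a simplified, hypothetical harness wrapping the same logic, with a one-byte example showing the MSB-first bit order.

```typescript
// Unpack 1-bit-per-pixel, MSB-first grayscale into RGBA, mirroring the loop above.
function unpack1bppToRgba(
  uint8Data: Uint8Array,
  width: number,
  height: number,
): Uint8ClampedArray {
  const rgbaData = new Uint8ClampedArray(width * height * 4)
  let pixelIndex = 0
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      const byteIndex = Math.floor(pixelIndex / 8)
      const bitIndex = 7 - (pixelIndex % 8) // MSB first
      const bit =
        byteIndex < uint8Data.length ? (uint8Data[byteIndex] >> bitIndex) & 1 : 0
      const gray = bit ? 255 : 0
      const dstIdx = pixelIndex * 4
      rgbaData[dstIdx] = gray     // R
      rgbaData[dstIdx + 1] = gray // G
      rgbaData[dstIdx + 2] = gray // B
      rgbaData[dstIdx + 3] = 255  // A
      pixelIndex++
    }
  }
  return rgbaData
}

// 0b10100000 packs the four pixels of a 4x1 image as 1, 0, 1, 0 (4 bits unused)
const rgba = unpack1bppToRgba(new Uint8Array([0b10100000]), 4, 1)
// red channels come out as 255, 0, 255, 0
```

Note that row widths not divisible by 8 are padded per row in some PDF streams; this sketch assumes tightly packed pixels across rows.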

5. Image Hash and Description Generation:

const imageHash = crypto
  .createHash("md5")
  .update(new Uint8Array(buffer))
  .digest("hex")

let description: string

if (seenHashDescriptions.has(imageHash)) {
  description = seenHashDescriptions.get(imageHash)!
  console.log("IMAGE DEBUG: Reusing cached description...")
} else {
  console.log("IMAGE DEBUG: Generating new description...")
  if (describeImages) {
    try {
      description = await describeImageWithllm(buffer)
    } catch (e) {
      description = "This is an image from the PDF."
    }
  } else {
    description = "This is an image."
  }
  seenHashDescriptions.set(imageHash, description)
}

What this does:

  • Deduplication: Creates MD5 hash to identify identical images across the PDF
  • Caching: Stores AI-generated descriptions to avoid re-processing same images
  • AI Integration: Uses LLM to generate meaningful image descriptions when enabled
  • Fallback: Provides default descriptions when AI fails or is disabled
  • Performance: Significantly speeds up processing of PDFs with repeated images
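The caching effect is easy to verify in isolation. In this hypothetical standalone version, the describeImageWithllm call is replaced by a counter so the deduplication is observable; the hashing call matches the excerpt above.

```typescript
import { createHash } from "crypto"

const seenHashDescriptions = new Map<string, string>()
let llmCalls = 0 // stand-in for counting real describeImageWithllm round-trips

function describeOnce(buffer: Buffer): string {
  const imageHash = createHash("md5").update(new Uint8Array(buffer)).digest("hex")
  const cached = seenHashDescriptions.get(imageHash)
  if (cached !== undefined) return cached // cache hit: skip the LLM entirely
  llmCalls++
  const description = `image-${imageHash.slice(0, 8)}` // placeholder description
  seenHashDescriptions.set(imageHash, description)
  return description
}

const first = describeOnce(Buffer.from([0x89, 0x50, 0x4e, 0x47]))
const second = describeOnce(Buffer.from([0x89, 0x50, 0x4e, 0x47]))
// first === second, and llmCalls stays at 1: identical bytes hit the cache
```

Since the key is a hash of the PNG bytes, a logo repeated on every page of a 100-page deck costs one LLM call instead of 100. (MD5 is fine here because it is used for dedup, not security.)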

Overall Enhancement Impact:

These changes transform basic image extraction into a sophisticated system that:

  1. Accurately tracks image positioning through CTM management
  2. Handles diverse image formats through type detection
  3. Optimizes performance through caching and fast-path processing
  4. Provides intelligent descriptions through AI integration
  5. Ensures robustness through comprehensive error handling

The result is much more reliable and efficient image extraction from complex PDF documents.

Summary by CodeRabbit

  • New Features

    • Paragraph-aware PDF extraction plus new public text utilities for normalization/cleaning; extracted text can include image markers by default.
  • Improvements

    • Expanded image extraction (format handling, size validation, hashing/caching, optional AI descriptions), improved page-to-page continuity, transform-aware placement, richer logging, and better per-page lifecycle handling.
  • Tests

    • Added a PDF processing test harness comparing text-only and image-enabled flows.
  • Style

    • Minor formatting-only change in an integration (no behavior change).


coderabbitai bot commented Sep 12, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Replaces op-list-centric PDF parsing with a per-page textContent paragraph pipeline, adds normalizeText/cleanText and CTM matrix utilities, expands image extraction/validation/PNG conversion/MD5 caching and optional LLM descriptions, preserves page overlap across chunks, and adds a direct test harness script.

Changes

Cohort / File(s) Summary
PDF extraction & image processing
server/pdfChunks.ts
Reworks extraction to per-page getTextContent with buildParagraphsFromPage, adds normalizeText/cleanText, CTM/matrix utilities and stack handling, per-page paragraph→chunk assembly with pageOverlap, richer image extraction (inline/objs), min-dimension and max-MB checks, PNG conversion, MIME detection, MD5 hashing + caching, optional describeImageWithllm, image saving under downloads/.../{docid}, image placeholders ([[IMG#n]]) when includeImageMarkersInText true, and extensive logging and cleanup.
Direct test harness
server/scripts/testPdfDirect.ts
New script that reads a PDF and runs two flows: FileProcessorService.processFile (KB flow) and extractTextAndImagesWithChunksFromPDF (image-enabled). Compares and logs text/image chunk counts, samples, and exits on error.
Formatting-only change
server/integrations/microsoft/index.ts
Cosmetic reformat of a ternary assignment in insertCalendarEvents (no behavior change).

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Dev
  participant TestScript as testPdfDirect.ts
  participant FPS as FileProcessorService
  participant Extractor as extractTextAndImagesWithChunksFromPDF
  participant PDFJS as pdf.js
  participant LLM as describeImageWithllm
  participant Disk as downloads/.../{docid}

  Dev->>TestScript: run testPdfDirect()
  TestScript->>FPS: processFile(pdfBuffer, "application/pdf", ...)
  FPS-->>TestScript: KB-flow results

  TestScript->>Extractor: extractTextAndImagesWithChunksFromPDF(uint8array, docId, extractImages=true, describeImages=true)
  Extractor->>PDFJS: loadDocument -> getPage N
  loop per page
    PDFJS-->>Extractor: page.getTextContent()
    Extractor->>Extractor: buildParagraphsFromPage -> normalizeText/cleanText -> create text chunks (handle pageOverlap)
    PDFJS-->>Extractor: page.objs / inline images
    alt image meets checks
      Extractor->>Extractor: convert to PNG -> detect MIME -> md5(hash)
      opt describeImages
        Extractor->>LLM: describeImageWithllm(imageBuffer, hash)
        LLM-->>Extractor: description
      end
      Extractor->>Disk: save image + metadata
      Extractor->>Extractor: append image placeholder to text chunks (if enabled)
    else
      Extractor-->>Extractor: skip image & log
    end
  end
  Extractor-->>TestScript: {text_chunks, image_chunks, positions}
  TestScript-->>Dev: print comparison & samples

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • kalpadhwaryu

Poem

I nibble bytes and stitch each line,
I fix the breaks and mark the vine.
I hash each carrot, save it bright,
I tuck [[IMG#n]] in text at night.
Hopping through pages, logs alight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43%, below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "pdfChunks file updated for better text processing" accurately reflects the primary work in this changeset (substantial improvements to PDF text extraction and cleaning in server/pdfChunks.ts) and is concise and clear for a reviewer scanning history. It does not enumerate the secondary image/CTM enhancements, but that level of detail is not required for a short PR title. |



@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces significant improvements to the PDF processing pipeline, focusing on enhancing the accuracy and robustness of both text and image extraction. The changes aim to provide a more reliable foundation for chunking PDF content, ensuring better data quality for downstream applications. The refactored text cleaning and extraction methods, coupled with a more sophisticated image handling mechanism, contribute to a more comprehensive and debuggable PDF parsing solution.

Highlights

  • Improved Text Cleaning: The cleanText function has been completely refactored, introducing normalizeText for initial Unicode and whitespace normalization, and smartDespaceLine for intelligent handling of letter spacing within lines, leading to more accurate text representation.
  • Enhanced PDF Text Extraction: The PDF text extraction logic now leverages PDF.js's getTextContent API to robustly build paragraphs, accounting for line breaks, Y-position changes, and blank lines, which significantly improves text chunking, especially for documents exported from tools like PowerPoint or Word.
  • Refined Image Extraction: Image extraction has been made more comprehensive, including the introduction of a MIN_IMAGE_DIM_PX environment variable to filter out small images, improved image object resolution, handling of inline images, and more robust processing of various image data formats. It also includes CTM tracking for better image bounds calculation.
  • Persistent Cross-Image Overlap: The logic for crossImageOverlap has been updated to append image placeholders ([[IMG#<id>]]) rather than replacing previous overlap text, ensuring better continuity of context across images.
  • Extensive Debugging and Testing: Numerous debug logs have been added throughout the pdfChunks.ts file to provide better visibility into the text and image processing flow. A new script, server/scripts/testPdfDirect.ts, has also been added to facilitate direct testing and comparison of PDF processing with and without image extraction.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant improvements to PDF text processing by implementing a more sophisticated text cleaning pipeline and switching to getTextContent for more reliable text extraction. The image extraction logic has also been made more robust. While these are great enhancements, the current changes include a substantial amount of debugging code (console.log) that must be removed before merging. There are also some critical issues, such as the use of hardcoded magic numbers for PDF operators, and a new test script with hardcoded absolute paths that will break for other developers. I've provided comments on these points and other opportunities to improve code quality and maintainability.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 8

🧹 Nitpick comments (10)
server/pdfChunks.ts (8)

347-351: Avoid duplicate getTextContent() calls per page.

You fetch textContent inside buildParagraphsFromPage and again for textOperatorCount. Cache once.

Minimal change:

-        let paragraphs: string[] = await buildParagraphsFromPage(page)
-        let currentParagraph = ""
-        let textOperatorCount = (await page.getTextContent()).items.length
+        const textContent = await page.getTextContent({
+          includeMarkedContent: false,
+          disableNormalization: false,
+        })
+        const paragraphs = await (async () => {
+          // Inline the builder using textContent to avoid extra fetch
+          // (copy of buildParagraphsFromPage body but using textContent)
+          // Or refactor builder to accept textContent.
+          const items = textContent.items as any[]
+          const lines: string[] = []
+          let current = "", prevY: number | null = null, prevH: number | null = null
+          for (const item of items) {
+            const str: string = item && typeof item.str === "string" ? item.str : ""
+            if (!str) continue
+            const tr = Array.isArray(item.transform) ? item.transform : []
+            const y = typeof tr[5] === "number" ? tr[5] : null
+            const h = typeof item.height === "number" ? item.height : null
+            const tol = Math.max(prevH || 0, h || 0, 10) * 0.4
+            const newLine = prevY != null && y != null && Math.abs(y - prevY) > tol
+            if (newLine || (item as any).hasEOL) { if (current) lines.push(current); current = str } else { current += str }
+            prevY = y; prevH = h
+          }
+          if (current.trim()) lines.push(current)
+          const paras: string[] = []
+          let buf: string[] = []
+          for (const ln of lines) { if (ln.trim()) buf.push(ln); else if (buf.length) { paras.push(buf.join("\n")); buf = [] } }
+          if (buf.length) paras.push(buf.join("\n"))
+          return paras.map(cleanText).filter((p) => p.length > 0)
+        })()
+        let currentParagraph = ""
+        const textOperatorCount = textContent.items.length

Or refactor buildParagraphsFromPage to accept a pre-fetched textContent.

Also applies to: 409-413


435-469: Remove unused CTM tracking or wire it to real positioning logic.

currentCTM/ctmStack are computed but never used. Dead code adds complexity.

-        let currentCTM: [number, number, number, number, number, number] = [1, 0, 0, 1, 0, 0]
-        const ctmStack: [number, number, number, number, number, number][] = []
-        const mul = (m1: number[], m2: number[]): [number, number, number, number, number, number] => { ... }
-        const applyToPoint = (m: number[], x: number, y: number): { x: number; y: number } => { ... }
+        // CTM tracking removed: re-add only when bounding boxes are needed
@@
-            case PDFJS.OPS.save: {
-              ctmStack.push([...currentCTM])
-              break
-            }
-            case PDFJS.OPS.restore: {
-              if (ctmStack.length) currentCTM = ctmStack.pop()!
-              break
-            }
+            // save/restore CTM handling not needed currently

Also applies to: 556-563


1191-1193: Use debug level for cache hits instead of warn.

Cache reuse is expected, not a warning.

-Logger.warn(
-  `Reusing description for repeated image ${imageName} on page ${pageNum}`,
-)
+Logger.debug(
+  `Reusing description for repeated image ${imageName} on page ${pageNum}`,
+)

497-499: Remove unused variables.

imagesOnPage and vectorOpsDetected are set but never used.

-let imagesOnPage = 0
-let vectorOpsDetected = false
+// removed unused counters

1281-1284: crossImageOverlap is appended but never consumed; drop or cap it.

Avoid unbounded growth of a never-used string; either remove the feature or cap to a small window.

- crossImageOverlap += ` [[IMG#${globalSeq.value}]] `
+ // crossImageOverlap removed (unused)
@@
- if (overlapText.trim()) {
-   const previousCrossImageOverlap = crossImageOverlap
-   crossImageOverlap = crossImageOverlap
-     ? `${crossImageOverlap} ${overlapText}`
-     : overlapText
- } else {
-   // no update
- }
+ // overlapText not accumulated; handled within chunking only

Also applies to: 1329-1334


1375-1376: Replace stray console.log with Logger or remove.

“Calling destroy” console log is noisy in production.

-    console.log("Calling destroy")
+    Logger.debug("Destroying PDF document")

165-214: Remove dead text-operator extraction helpers.

validateTextItem and extractTextFromArgs are unused after switching to textContent; delete to reduce surface area and logging.

-/**
- * Validate text item
- */
-function validateTextItem(item: any): boolean { ... }
-
-/**
- * Extract text from various PDF.js text operators with enhanced validation
- */
-function extractTextFromArgs(args: any[]): string { ... }

17-18: Bound the hash cache to avoid unbounded memory use.

seenHashDescriptions can grow indefinitely across documents. Consider LRU with a max size.

I can add a tiny LRU (e.g., QuickLRU) or a capped Map eviction policy if you want.

server/scripts/testPdfDirect.ts (2)

2-2: Drop unused import.

resolve is unused.

-import { resolve } from "path"

35-40: Print counts only; avoid dumping large arrays.

Keep this harness light—counts are already sufficient.

No diff needed if you keep counts and remove any future array dumps.

Also applies to: 52-57

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8f5eee0 and feb056f.

📒 Files selected for processing (2)
  • server/pdfChunks.ts (12 hunks)
  • server/scripts/testPdfDirect.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/scripts/testPdfDirect.ts (2)
server/services/fileProcessor.ts (1)
  • FileProcessorService (22-158)
server/pdfChunks.ts (1)
  • extractTextAndImagesWithChunksFromPDF (282-1378)
server/pdfChunks.ts (3)
server/chunks.ts (1)
  • chunkTextByParagraph (9-112)
server/integrations/dataSource/config.ts (1)
  • DATASOURCE_CONFIG (3-71)
server/lib/describeImageWithllm.ts (1)
  • describeImageWithllm (56-112)
🔇 Additional comments (3)
server/pdfChunks.ts (2)

694-696: Confirmed — image-type default export is async; await usage is correct.
image-type(input: ArrayBuffer | Uint8Array | Buffer) -> Promise<{ext: string, mime: string} | undefined>, so the await usage at lines 694–696 (and 787–789, 1115–1132) is valid.


13-16: Verify WASM asset paths in all deployment targets

File: server/pdfChunks.ts (lines 13–16)

const openjpegWasmPath =
  path.join(__dirname, "../node_modules/pdfjs-dist/wasm/") + "/"
const qcmsWasmPath =
  path.join(__dirname, "../node_modules/pdfjs-dist/wasm/") + "/"

Ensure pdfjs-dist/wasm is present at runtime for all deployment targets (bundled/serverless). If not, make these paths configurable (env var) or copy the wasm assets into a stable runtime location during your build/deploy pipeline.

server/scripts/testPdfDirect.ts (1)

3-5: Ensure the script's runner resolves tsconfig path aliases (@/*).

  • server/tsconfig.json defines "@/" → ["./"] and server/scripts/testPdfDirect.ts imports "@/services/..." and "@/pdfChunks".
  • Bun's runtime respects tsconfig "paths", so running the script with Bun will resolve these aliases. (bun.com)
  • I found no tsconfig-paths/module-alias/ts-node helpers in the repo; if you run this with Node/ts-node, add a runtime mapper (e.g. tsconfig-paths) or switch to relative imports.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (3)
server/pdfChunks.ts (3)

1111-1117: Respect Uint8Array view offsets when creating Buffer.

Using the underlying ArrayBuffer can include unrelated bytes. Create Buffer from the view.

-const imgBuffer = Buffer.from(uint8Data.buffer)
+const imgBuffer = Buffer.from(uint8Data.buffer, uint8Data.byteOffset, uint8Data.byteLength)

236-239: Don’t log raw extracted text (PII/log bloat).

Replace with length-only, routed via Logger at debug level.

-  const result = typeof text === "string" ? text : ""
-  console.log("EXTRACT TEXT DEBUG: Final extracted text:", result)
-  return result
+  const result = typeof text === "string" ? text : ""
+  Logger.debug(`extractTextFromArgs: len=${result.length}`)
+  return result

1433-1447: Remove plaintext dumps of all chunks and positions.

These can leak sensitive content and balloon logs. Keep summaries only.

-    console.log("FINAL DEBUG: PDF processing completed for", docid)
-    console.log("FINAL DEBUG: Processing summary:", {
-      totalTextChunks: text_chunks.length,
-      totalImageChunks: image_chunks.length,
-      textChunkPositions: text_chunk_pos.length,
-      imageChunkPositions: image_chunk_pos.length,
-      extractImages,
-      describeImages,
-    })
-
-    console.log("FINAL DEBUG: All text chunks:", text_chunks)
-    console.log("FINAL DEBUG: All text chunk positions:", text_chunk_pos)
-    console.log("FINAL DEBUG: All image chunks:", image_chunks)
-    console.log("FINAL DEBUG: All image chunk positions:", image_chunk_pos)
+    Logger.debug({
+      msg: "PDF processing summary",
+      docid,
+      totalTextChunks: text_chunks.length,
+      totalImageChunks: image_chunks.length,
+      textChunkPositions: text_chunk_pos.length,
+      imageChunkPositions: image_chunk_pos.length,
+      extractImages,
+      describeImages,
+    })
🧹 Nitpick comments (7)
server/pdfChunks.ts (7)

394-396: Use the declared constant instead of the magic number.

You introduced MIN_LINE_HEIGHT_FOR_TOLERANCE (Line 20) but still use 10 here. Replace to keep intent clear and single-sourced.

- const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
+ const tol = Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) * 0.4 // dynamic tolerance

Also applies to: 20-20


119-124: Text normalization: broaden zero-width character removal.

Handle additional zero-width characters and BOM to reduce artifacts.

-  normalized = normalized.replace(/\u200B/g, "") // zero-width space
+  normalized = normalized.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "") // zero-widths & BOM

Also applies to: 30-44
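The broadened character class can be sanity-checked in isolation. This is an illustrative helper, not the actual normalizeText from server/pdfChunks.ts; it strips exactly the zero-width characters and BOM named in the diff above:

```typescript
// Illustrative sketch, not project code: strip zero-width characters and BOM.
// U+200B ZWSP, U+200C ZWNJ, U+200D ZWJ, U+2060 word joiner, U+FEFF BOM.
function stripZeroWidth(input: string): string {
  return input.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "")
}
```

The original `/\u200B/g` pattern would leave a leading BOM and any joiner characters in place; the class form removes them in one pass.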


506-512: Avoid second getTextContent() per page just for a count.

buildParagraphsFromPage already fetched textContent; compute textOperatorCount there and return it, or derive from opList if feasible. Saves a heavy call per page.

Would you like me to propose a small refactor to return { paragraphs, textItemCount } from buildParagraphsFromPage and wire it through?


519-526: Remove @ts-ignore around page.objs access; add safe typing.

Use a local const objs: any = (page as any).objs with guards.

-// @ts-ignore
-if (
-  typeof (page.objs as any).has === "function" &&
-  (page.objs as any).has(name)
-) {
-  // @ts-ignore
-  return (page.objs as any).get(name)
-}
-const obj = (page.objs as any).get(name)
+const objs: any = (page as any).objs
+if (objs && typeof objs.has === "function" && objs.has(name)) {
+  return objs.get(name)
+}
+const obj = objs?.get ? objs.get(name) : undefined
 return obj || null

552-578: vectorOpsDetected is set but never used.

Drop the flag and related checks to reduce noise.


617-621: Switch readability: avoid case extractImages ? ... : null.

Guard with if (!extractImages) break at the top of the image block and keep plain OPS case labels; this simplifies control flow.


1454-1455: Replace stray console.log with Logger.

-  console.log("Calling destroy")
+  Logger.debug("Destroying PDF document")
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between feb056f and 60bfdb9.

📒 Files selected for processing (1)
  • server/pdfChunks.ts (10 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-09-11T08:30:51.408Z
Learnt from: Aman-Asrani-Juspay
PR: xynehq/xyne#777
File: server/api/agent/workflowAgentUtils.ts:95-103
Timestamp: 2025-09-11T08:30:51.408Z
Learning: The development team prefers to keep detailed logging (including user emails, attachment IDs, and content previews) during integration phases for debugging purposes, even if it includes potentially sensitive information. They prioritize debugging capability over log privacy during development.

Applied to files:

  • server/pdfChunks.ts
📚 Learning: 2025-08-11T08:44:08.412Z
Learnt from: MayankBansal2004
PR: xynehq/xyne#719
File: server/api/chat/chat.ts:4023-4029
Timestamp: 2025-08-11T08:44:08.412Z
Learning: In server/api/chat/chat.ts and similar files, MayankBansal2004 prefers not to add explicit type definitions for metadata.usage structures when the receiving array is already typed and default values ensure type safety through the || 0 pattern.

Applied to files:

  • server/pdfChunks.ts
🧬 Code graph analysis (1)
server/pdfChunks.ts (4)
server/logger/index.ts (1)
  • getLogger (33-90)
server/chunks.ts (1)
  • chunkTextByParagraph (9-112)
server/integrations/dataSource/config.ts (1)
  • DATASOURCE_CONFIG (3-71)
server/lib/describeImageWithllm.ts (1)
  • describeImageWithllm (56-112)
🔇 Additional comments (3)
server/pdfChunks.ts (3)

318-321: Good: docid is sanitized before filesystem use.

Using safeDocId eliminates path traversal risk. Nice hardening.


107-145: Paragraph placeholder approach is robust; good tradeoff.

Using PUA sentinels + runtime-unique token reduces collision risk while preserving double-newline paragraphs. Nice.


741-744: Don't remove await — image-type(buffer) is async; suggested change is incorrect.

imageType(buffer) returns a Promise and must be awaited; retain the existing await in server/pdfChunks.ts (lines 741–744, 835–837, 1193–1211). Fix typings (or cast) if @ts-ignore is only masking a type issue — do not remove await.

Likely an incorrect or invalid review comment.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
server/pdfChunks.ts (2)

1390-1393: Stop logging full chunks (PII risk + log bloat).

Dumping entire text/image chunks can leak sensitive data and explode logs. Log counts/metrics instead.

-    Logger.debug("All text chunks", { text_chunks })
-    Logger.debug("All text chunk positions", { text_chunk_pos })
-    Logger.debug("All image chunks", { image_chunks })
-    Logger.debug("All image chunk positions", { image_chunk_pos })
+    Logger.debug("Chunk summaries", {
+      text_chunks_len: text_chunks.length,
+      image_chunks_len: image_chunks.length,
+      text_pos_len: text_chunk_pos.length,
+      image_pos_len: image_chunk_pos.length,
+    })

1077-1085: Respect Uint8Array view offsets when creating Buffer.

Buffer.from(uint8Data.buffer) may include unrelated bytes. Use offset/length (or the view directly).

-                        const imgBuffer = Buffer.from(uint8Data.buffer)
+                        const imgBuffer = Buffer.from(
+                          uint8Data.buffer,
+                          uint8Data.byteOffset,
+                          uint8Data.byteLength,
+                        )
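The pitfall this diff fixes is easy to reproduce in isolation. In this standalone sketch (not project code), a Uint8Array view over part of a larger ArrayBuffer shares the full backing buffer, so wrapping `view.buffer` pulls in bytes outside the view:

```typescript
import { Buffer } from "node:buffer"

// An 8-byte backing ArrayBuffer with a 2-byte Uint8Array view at offset 3.
const backing = new Uint8Array([0, 1, 2, 3, 4, 5, 6, 7]).buffer
const view = new Uint8Array(backing, 3, 2) // view over bytes [3, 4]

// Wrong: wraps the entire backing buffer, including bytes outside the view.
const wrong = Buffer.from(view.buffer)
// Right: honor the view's offset and length.
const right = Buffer.from(view.buffer, view.byteOffset, view.byteLength)
```

`Buffer.from(view)` (passing the Uint8Array itself) also copies only the viewed bytes and is often the simplest fix.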
🧹 Nitpick comments (12)
server/pdfChunks.ts (12)

19-21: Replace magic numbers with named constants for line-break tolerance.

Use the defined MIN_LINE_HEIGHT_FOR_TOLERANCE and a named factor; make the factor configurable.

-const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+const LINE_BREAK_TOLERANCE_FACTOR = parseFloat(
+  process.env.LINE_BREAK_TOLERANCE_FACTOR || "0.4",
+)
-          const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
+          const tol =
+            Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) *
+            LINE_BREAK_TOLERANCE_FACTOR

Also applies to: 392-394


270-272: Use central config for chunk size (avoid hardcoding 512).

Keeps behavior consistent with server/integrations/dataSource/config.ts.

-  const chunks = chunkTextByParagraph(cleanedText, 512, 128)
+  const chunks = chunkTextByParagraph(
+    cleanedText,
+    DATASOURCE_CONFIG.MAX_CHUNK_SIZE,
+    128,
+  )

28-42: Normalize CRLF to LF in normalizeText.

Prevents stray CR from PDFs/Office exports.

   let normalized = input.normalize("NFC")
 
   // Strip control chars except newline/tab
   normalized = normalized.replace(/[^\P{C}\n\t]/gu, "")
 
   // Normalize whitespace
+  normalized = normalized.replace(/\r\n?/g, "\n") // CRLF/CR -> LF
   normalized = normalized.replace(/\u00A0/g, " ") // nbsp → space
   normalized = normalized.replace(/\u200B/g, "") // zero-width space
   normalized = normalized.replace(/\t+/g, " ") // tabs → single space

65-73: Handle any whitespace between spaced letters (not just a single space).

parts[j] === " " misses double spaces; use a whitespace check.

-        parts[j] === " " &&
+        /^\s+$/.test(parts[j]) &&
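For contrast, here is the difference between the strict single-space comparison and the whitespace-run test, as a standalone illustration (not the project's code):

```typescript
// Strict equality misses multi-space separators between spaced-out letters;
// a whitespace-run regex matches any all-whitespace token.
const isSingleSpace = (token: string) => token === " "
const isWhitespaceRun = (token: string) => /^\s+$/.test(token)
```

A tokenizer that splits on whitespace boundaries can emit `"  "` (double space) or `"\t"` as a separator token; only the regex form treats those as joinable gaps.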

577-586: Simplify switch: avoid conditional case labels.

Use plain OPS cases and guard with extractImages inside the block for readability and type-safety.

-            case extractImages ? PDFJS.OPS.paintImageXObject : null:
-            case extractImages ? PDFJS.OPS.paintImageXObjectRepeat : null:
-            case extractImages ? PDFJS.OPS.paintInlineImageXObject : null:
-            case extractImages ? PDFJS.OPS.paintImageMaskXObject : null: {
+            case PDFJS.OPS.paintImageXObject:
+            case PDFJS.OPS.paintImageXObjectRepeat:
+            case PDFJS.OPS.paintInlineImageXObject:
+            case PDFJS.OPS.paintImageMaskXObject: {
+              if (!extractImages) break

Also applies to: 615-618
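The hazard with `case cond ? X : null:` is that when the condition is false every such label collapses to `case null`, so a null switch value would unexpectedly enter the branch. A contrived sketch (illustrative only, not the pdfChunks.ts switch):

```typescript
// When the flag is false, the label evaluates to null, so op === null
// matches the "image" branch; numeric ops merely skip it silently.
function classify(op: number | null, extractImages: boolean): string {
  switch (op) {
    case extractImages ? 1 : null:
      return "image"
    default:
      return "other"
  }
}
```

Guarding with `if (!extractImages) break` inside plain `case PDFJS.OPS.…:` labels avoids both the surprise match and the dead labels.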


1226-1229: Demote noisy warning to debug when reusing cached description.

Cache hits are normal; don’t warn.

-                      Logger.warn(
-                        `Reusing description for repeated image ${imageName} on page ${pageNum}`,
-                      )
+                      Logger.debug(
+                        `Reusing description for repeated image ${imageName} on page ${pageNum}`,
+                      )

520-527: Replace @ts-ignore with explicit reasoning or safer typing.

Where unavoidable (page.objs.has), add a one-line comment explaining the pdf.js typing gap; elsewhere prefer type assertions.

Examples:

-            // @ts-ignore
-            if (
+            // pdf.js page.objs is untyped; some builds expose .has(name)
+            if (
               typeof (page.objs as any).has === "function" &&
               (page.objs as any).has(name)
             ) {
-              // @ts-ignore
-              return (page.objs as any).get(name)
+              return (page.objs as any).get(name)
             }
-        // @ts-ignore draw directly
-        cctx.drawImage(imgLike, 0, 0)
+        cctx.drawImage(imgLike as any, 0, 0)
-                    // @ts-ignore
-                    let type = await imageType(buffer)
+                    let type = imageType(buffer)

Also applies to: 822-823, 1156-1156


551-552: Remove unused flags (imagesOnPage, vectorOpsDetected) or use them.

They’re set but never read; simplify to reduce noise.

-        let imagesOnPage = 0
-        let vectorOpsDetected = false
+        // (removed unused variables)

And drop their increments/assignments.

Also applies to: 575-576


503-504: Avoid double getTextContent per page.

buildParagraphsFromPage calls getTextContent; you call it again for textOperatorCount. Consider passing textContent down or counting from the first call.

Also applies to: 372-376


1212-1217: Hashing: update directly with Buffer (micro-optimization).

-                    const imageHash = crypto
-                      .createHash("md5")
-                      .update(new Uint8Array(buffer))
+                    const imageHash = crypto
+                      .createHash("md5")
+                      .update(buffer)
                       .digest("hex")
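Both forms yield the same digest; updating with the Buffer directly simply skips the intermediate Uint8Array copy. A quick standalone check:

```typescript
import { createHash } from "node:crypto"
import { Buffer } from "node:buffer"

const buffer = Buffer.from("example image bytes")

// Equivalent digests: Buffer is itself a Uint8Array subclass.
const viaUint8 = createHash("md5").update(new Uint8Array(buffer)).digest("hex")
const direct = createHash("md5").update(buffer).digest("hex")
```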

17-18: Consider bounding seenHashDescriptions (memory).

Unbounded Map may grow across process lifetime; use LRU/TTL.
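One way to bound it is a small LRU built on Map's insertion-order guarantee. A minimal sketch — the class name and API here are illustrative, not existing project code:

```typescript
// Minimal LRU sketch: Map preserves insertion order, so the first key
// is always the least recently used entry.
class LruCache<K, V> {
  private map = new Map<K, V>()

  constructor(private readonly maxSize: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined
    const value = this.map.get(key)!
    // Re-insert to mark the entry as most recently used.
    this.map.delete(key)
    this.map.set(key, value)
    return value
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key)
    this.map.set(key, value)
    if (this.map.size > this.maxSize) {
      // Evict the least recently used entry (first in insertion order).
      const oldest = this.map.keys().next().value as K
      this.map.delete(oldest)
    }
  }

  get size(): number {
    return this.map.size
  }
}
```

Wrapping seenHashDescriptions in something like this (or adding a TTL sweep) would cap memory across long-lived processes.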


540-549: Dead code: currentParagraph/flushParagraph no-ops.

You never append to currentParagraph. Remove to reduce confusion, or wire it up.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 60bfdb9 and 5ab007e.

📒 Files selected for processing (2)
  • server/integrations/microsoft/index.ts (1 hunks)
  • server/pdfChunks.ts (10 hunks)
✅ Files skipped from review due to trivial changes (1)
  • server/integrations/microsoft/index.ts
🔇 Additional comments (4)
server/pdfChunks.ts (4)

317-324: Doc ID sanitization looks good.

Safe FS usage with safeDocId; nice.


989-1076: Correct 1bpp handling and ImageData(width,height) fix look solid.

Bit-packed grayscale expansion and explicit height are correct.


103-147: Paragraph placeholder approach is robust.

Unique PUA-wrapped token avoids collisions; cleanup logic is correct.


557-575: Using PDFJS.OPS constants fixes prior magic-number issue.

Readability and forward-compat improved.

Also applies to: 577-586

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (4)
server/pdfChunks.ts (4)

19-20: Replace magic numbers in line-gap tolerance with named constants.

You already have MIN_LINE_HEIGHT_FOR_TOLERANCE; also name the tolerance factor.

-// Minimum line height used for calculating line break detection tolerance (in PDF units)
-const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+// Minimum line height used for calculating line break detection tolerance (in PDF units)
+const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+// Fraction of line height used as Y-delta tolerance for line breaks
+const LINE_GAP_TOLERANCE_FACTOR = parseFloat(
+  process.env.LINE_GAP_TOLERANCE_FACTOR || "0.4",
+)
-          const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
+          const tol =
+            Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) *
+            LINE_GAP_TOLERANCE_FACTOR

Also applies to: 392-394


1383-1386: Avoid dumping full chunk contents to logs; gate behind an env flag.

Large plaintext logs risk noise/PII. Keep your dev-time verbosity but guard it.

-    Logger.debug("All text chunks", { text_chunks })
-    Logger.debug("All text chunk positions", { text_chunk_pos })
-    Logger.debug("All image chunks", { image_chunks })
-    Logger.debug("All image chunk positions", { image_chunk_pos })
+    if (process.env.DEBUG_PDF_CONTENT === "true") {
+      Logger.debug("All text chunks", { text_chunks })
+      Logger.debug("All text chunk positions", { text_chunk_pos })
+      Logger.debug("All image chunks", { image_chunks })
+      Logger.debug("All image chunk positions", { image_chunk_pos })
+    } else {
+      Logger.debug("Chunk counts", {
+        textChunks: text_chunks.length,
+        imageChunks: image_chunks.length,
+        textPos: text_chunk_pos.length,
+        imagePos: image_chunk_pos.length,
+      })
+    }

727-731: Drop unnecessary await and @ts-ignore on imageType; type it and keep default.

image-type is sync; the await and suppressions hide type issues.

-                        // @ts-ignore
-                        let type = await imageType(buffer)
-                        if (!type) type = { mime: "image/png", ext: "png" }
+                        let type = imageType(buffer) as { mime: string; ext?: string } | null
+                        if (!type) type = { mime: "image/png", ext: "png" }
-                        // @ts-ignore
-                        let type = await imageType(buffer)
-                        if (!type) type = { mime: "image/png", ext: "png" }
+                        let type = imageType(buffer) as { mime: string; ext?: string } | null
+                        if (!type) type = { mime: "image/png", ext: "png" }
-                    // @ts-ignore
-                    let type = await imageType(buffer)
+                    let type = imageType(buffer) as { mime: string; ext?: string } | null
+                    if (!type) {
+                      Logger.debug("Could not determine MIME type, defaulting to image/png", { imageName })
+                      type = { mime: "image/png", ext: "png" }
+                    }

Also applies to: 826-829, 1164-1183


1087-1093: Respect Uint8Array view offsets when creating Buffer.

Using the underlying ArrayBuffer can include extra bytes.

-                        const imgBuffer = Buffer.from(uint8Data.buffer)
+                        const imgBuffer = Buffer.from(
+                          uint8Data.buffer,
+                          uint8Data.byteOffset,
+                          uint8Data.byteLength,
+                        )
🧹 Nitpick comments (5)
server/pdfChunks.ts (5)

236-238: Inline the trivial return.

Minor readability nit.

-  const result = typeof text === "string" ? text : ""
-  return result
+  return typeof text === "string" ? text : ""

270-270: Use configured chunk size instead of hardcoded 512/128.

Aligns with DATASOURCE_CONFIG.MAX_CHUNK_SIZE and makes behavior tunable.

-  const chunks = chunkTextByParagraph(cleanedText, 512, 128)
+  const chunks = chunkTextByParagraph(
+    cleanedText,
+    DATASOURCE_CONFIG.MAX_CHUNK_SIZE,
+    Math.min(128, Math.floor(DATASOURCE_CONFIG.MAX_CHUNK_SIZE / 4)),
+  )

544-548: CTM tracking is computed but not consumed.

currentCTM/ctmStack are updated but never used for placement decisions or cropping. Either wire this into downstream logic or remove to reduce complexity.

Also applies to: 589-604


1234-1236: Adjust log level for cache hits.

Reusing an image description is expected; warn is noisy. Suggest debug or info.

-                      Logger.warn(
+                      Logger.debug(
                         `Reusing description for repeated image ${imageName} on page ${pageNum}`,
                       )

28-42: Broaden zero-width character removal (optional).

Add other common ZW chars to avoid stray artifacts in output.

-  normalized = normalized.replace(/\u200B/g, "") // zero-width space
+  normalized = normalized.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "") // zero-width chars
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5ab007e and 9f816a0.

📒 Files selected for processing (1)
  • server/pdfChunks.ts (8 hunks)
🔇 Additional comments (1)
server/pdfChunks.ts (1)

513-521: Ordering contract: verify consumer merges by *_pos.

Images are emitted during operator pass; text chunks are appended after page processing. Ensure downstream merges by pos to preserve reading order; otherwise, order will skew.

Also applies to: 1350-1356, 1358-1359

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (5)
server/pdfChunks.ts (5)

613-616: Remove ?: null in switch cases; guard inside the case.

The case extractImages ? OPS : null creates dead case null branches.

-            case extractImages ? PDFJS.OPS.paintImageXObject : null:
-            case extractImages ? PDFJS.OPS.paintImageXObjectRepeat : null:
-            case extractImages ? PDFJS.OPS.paintInlineImageXObject : null:
-            case extractImages ? PDFJS.OPS.paintImageMaskXObject : null: {
+            case PDFJS.OPS.paintImageXObject:
+            case PDFJS.OPS.paintImageXObjectRepeat:
+            case PDFJS.OPS.paintInlineImageXObject:
+            case PDFJS.OPS.paintImageMaskXObject: {
+              if (!extractImages) break

1397-1400: Don’t log full chunk contents (PII risk, heavy logs).

Even at debug, avoid dumping arrays of content. Log counts/lengths only or behind an explicit env flag.

-    Logger.debug("All text chunks", { text_chunks })
-    Logger.debug("All text chunk positions", { text_chunk_pos })
-    Logger.debug("All image chunks", { image_chunks })
-    Logger.debug("All image chunk positions", { image_chunk_pos })
+    // Avoid dumping raw content; counts are already logged above.

522-541: Explain or eliminate @ts-ignore usages around page.objs.

Either narrow types with assertions or add a brief why-it’s-safe comment per occurrence to prevent type regressions.


883-889: Also exit early for small image-like objects.

Same fall-through risk here; add a continue.

-                    if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) {
+                    if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) {
                       Logger.debug("Skipped small image-like object", {
                         imageName,
                         width,
                         height,
                         minRequired: MIN_IMAGE_DIM_PX,
                       })
-                    } else {
+                      continue
+                    } else {

783-799: Ensure “skip” actually skips further processing (canvas-like images).

On small or too-large canvas images you log but don’t exit; control falls through to other fallbacks and may still process the image.

-                    if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) {
+                    if (width < MIN_IMAGE_DIM_PX || height < MIN_IMAGE_DIM_PX) {
                       Logger.debug("Skipped small canvas image", {
                         imageName,
                         width,
                         height,
                         minRequired: MIN_IMAGE_DIM_PX,
                       })
-                    } else {
+                      continue
+                    } else {
                       const buffer = c.toBuffer("image/png")
                       // Run all filters BEFORE attempting LLM description
                       if (
                         buffer.length >
                         DATASOURCE_CONFIG.MAX_IMAGE_FILE_SIZE_MB * 1024 * 1024
                       ) {
                         Logger.warn(
                           `Skipping objs/canvas image due to size ${(buffer.length / (1024 * 1024)).toFixed(2)} MB: ${imageName}`,
                         )
-                      } else {
+                        continue
+                      } else {
🧹 Nitpick comments (6)
server/pdfChunks.ts (6)

18-20: Use the defined constant in tolerance calc (avoid magic number).

You introduced MIN_LINE_HEIGHT_FOR_TOLERANCE but still hardcode 10 in the tolerance. Use the constant for readability and single-point tuning.

-const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
+const MIN_LINE_HEIGHT_FOR_TOLERANCE = 10
@@
-          const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
+          const tol = Math.max(prevH || 0, h || 0, MIN_LINE_HEIGHT_FOR_TOLERANCE) * 0.4 // dynamic tolerance

Also applies to: 392-394


369-424: Avoid duplicate getTextContent call per page.

You fetch textContent inside buildParagraphsFromPage and again for textOperatorCount. Cache once and reuse (pass item count out or return alongside paragraphs).

Also applies to: 502-503


549-574: Remove or use vectorOpsDetected.

Flag is set but never read; drop it or hook it into heuristics.


800-803: Drop unnecessary await on imageType; it’s synchronous.

Removes @ts-ignore, improves perf and types.

-                        // @ts-ignore
-                        let type = await imageType(buffer)
+                        let type = imageType(buffer)
-                        // @ts-ignore
-                        let type = await imageType(buffer)
+                        let type = imageType(buffer)
-                    // @ts-ignore
-                    let type = await imageType(buffer)
+                    let type = imageType(buffer)

Also applies to: 909-912, 1176-1182


1099-1105: Respect Uint8Array view offsets when creating Buffer.

Using the underlying ArrayBuffer can include unrelated bytes.

-                        const imgBuffer = Buffer.from(uint8Data.buffer)
+                        const imgBuffer = Buffer.from(
+                          uint8Data.buffer,
+                          uint8Data.byteOffset,
+                          uint8Data.byteLength,
+                        )

1247-1249: Use debug (not warn) when reusing cached image descriptions.

Re-use is expected; warn is too noisy.

-                      Logger.warn(
-                        `Reusing description for repeated image ${imageName} on page ${pageNum}`,
-                      )
+                      Logger.debug(
+                        `Reusing description for repeated image ${imageName} on page ${pageNum}`,
+                      )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f816a0 and 2f497d0.

📒 Files selected for processing (1)
  • server/pdfChunks.ts (8 hunks)
🔇 Additional comments (6)
server/pdfChunks.ts (6)

28-42: normalizeText: solid Unicode and whitespace normalization.

NFC + control-char stripping (except NL/TAB), nbsp/ZWSP handling, and trim look good.


112-125: cleanText pipeline is well-structured and safer placeholder.

Private-use sentinel for paragraph breaks + hyphenation and punctuation fixes are sensible. Good step-up from magic strings like [[PARA]].

Also applies to: 133-147


46-99: Validate despacing behavior for mixed letter-digit runs.

The current rule merges any run of 3+ single-character tokens, including digits: "B 2 B" -> "B2B" and "3 D" -> "3D". If that's intended, fine; otherwise restrict the merge to alphabetic tokens.
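The distinction can be sketched with a toy despacing rule (illustrative only — not the actual smartDespaceLine): merge a run of three or more single-character tokens, optionally restricted to letters.

```typescript
// Toy rule: join a run of 3+ single-character tokens; the lettersOnly flag
// excludes digit tokens from merging, as the restriction would.
function despaceRun(tokens: string[], lettersOnly: boolean): string {
  const allSingle = tokens.every((t) => t.length === 1)
  const eligible =
    allSingle &&
    tokens.length >= 3 &&
    (!lettersOnly || tokens.every((t) => /[A-Za-z]/.test(t)))
  return eligible ? tokens.join("") : tokens.join(" ")
}
```

With lettersOnly off, "B 2 B" collapses to "B2B"; with it on, mixed letter-digit runs are left spaced.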


306-324: Ordering guarantee: confirm consumer uses positions to reconstruct interleaving.

Images are pushed during operator iteration; text is flushed after the page. Ensure downstream merges by sequence (pos arrays) and not by array order. If not, we’ll need intra-page flushing of text around image ops.

Also applies to: 1332-1345


158-186: Matrix math utility at module scope: good move.

Reusability + testability improved.


1009-1096: Pixel-format handling incl. 1bpp unpack: looks correct.

Expected length computation and MSB-first bit expansion are handled correctly; ImageData(width, height) usage fixed.

Comment on lines 613 to 616
case extractImages ? PDFJS.OPS.paintImageXObject : null:
case extractImages ? PDFJS.OPS.paintImageXObjectRepeat : null:
case extractImages ? PDFJS.OPS.paintInlineImageXObject : null:
case extractImages ? PDFJS.OPS.paintImageMaskXObject : null: {
💡 Verification agent

🧩 Analysis chain

One-shot verification script for the above changes.

Run to spot remaining patterns across the repo.

Also applies to: 835-847, 948-956, 1308-1318, 783-799, 883-889, 800-803, 909-912, 1099-1105, 1397-1400


🏁 Script executed:

#!/bin/bash
set -euo pipefail
echo "Searching for switch cases with '?: null'..."
rg -n 'case\s+[^:\n]+\?\s*[^:]+:\s*null' -C1 --type=ts

echo -e "\nFinding Buffer.from(ArrayBuffer) without offsets..."
rg -nP 'Buffer\.from\(\s*[^,()]+\.buffer\s*\)' --type=ts -C2

echo -e "\nFinding 'await imageType(' occurrences..."
rg -n 'await\s+imageType\(' --type=ts -C2

echo -e "\nFinding debug logs dumping chunk arrays..."
rg -nP 'Logger\.debug\([^)]*(text_chunks|image_chunks|text_chunk_pos|image_chunk_pos)' --type=ts -C2

echo -e "\nChecking for path.join(baseDir, docid) writes..."
rg -nP 'path\.(join|resolve)\([^)]*docid[^)]*\).*writeFile' --type=ts -C3

Length of output: 3184


Fix critical Buffer.from(ArrayBuffer) bug; remove ternary case labels; verify imageType usage; stop dumping large arrays to logs

  • Critical — server/pdfChunks.ts:1099: Buffer.from(uint8Data.buffer) uses the entire backing ArrayBuffer (ignores byteOffset/length). Replace with Buffer.from(uint8Data) or Buffer.from(uint8Data.buffer, uint8Data.byteOffset, uint8Data.byteLength).
  • Logic — server/pdfChunks.ts:613–616: lines like case extractImages ? PDFJS.OPS.paintImageXObject : null: produce case null when extractImages is false; remove the ternary and either guard extractImages outside the switch or check inside the case block.
  • Verify — server/pdfChunks.ts:801–803, 909–912, 1177–1179 and server/integrations/dataSource/index.ts:547: multiple await imageType(...) uses. Confirm whether your image-type dependency is async; if it’s synchronous, drop await and the @ts-ignore; if async, consider reducing per-image overhead.
  • Logging — server/pdfChunks.ts:1397–1400: Logger.debug dumps full chunk arrays (text_chunks, image_chunks, …). Avoid logging large arrays (truncate or gate behind debug-level sampling).
🤖 Prompt for AI Agents
In server/pdfChunks.ts around lines 613–616, 801–803, 909–912, 1099, 1177–1179
and around 1397–1400: remove the ternary expressions from the switch case labels
(e.g., `case extractImages ? PDFJS.OPS.paintImageXObject : null:`) and instead
either guard the switch with `if (extractImages) { ... }` or keep the case
labels as the OPS constants and check `if (!extractImages) break` inside the
case body; fix the Buffer.from(ArrayBuffer) bug at line ~1099 by using the
Uint8Array directly (Buffer.from(uint8Data)) or pass the buffer with offsets
(Buffer.from(uint8Data.buffer, uint8Data.byteOffset, uint8Data.byteLength)) so
you don't include the entire backing ArrayBuffer; review uses of `await
imageType(...)` at the listed lines and either remove `await` (and the
@ts-ignore) if imageType is synchronous or keep it but batch/reduce per-image
calls if it is async; and stop dumping full large arrays in Logger.debug around
lines 1397–1400—truncate, sample, or gate those logs behind a strict debug flag
to avoid huge outputs.
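The switch-case fix described in the prompt can be sketched as follows (the opcode value and function names here are made up for illustration; they are not the real `PDFJS.OPS` constants):

```typescript
// Hypothetical stand-in for PDFJS.OPS — the real constant values differ.
const OPS = { paintImageXObject: 85 }

function handle(op: number, extractImages: boolean): string {
  switch (op) {
    // Bad: `case extractImages ? OPS.paintImageXObject : null:` compares
    // op against null whenever extractImages is false. Instead keep the
    // constant as the case label and guard inside the case body:
    case OPS.paintImageXObject:
      if (!extractImages) break
      return "image"
  }
  return "skipped"
}

console.log(handle(85, true)) // "image"
console.log(handle(85, false)) // "skipped"
```

The key point is that case labels should be constants; a ternary in a label silently changes which value the switch matches rather than disabling the branch.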

Comment on lines +835 to +847
const baseDir = path.resolve(
process.env.IMAGE_DIR ||
"downloads/xyne_images_db",
)
const outputDir = path.join(baseDir, docid)
await fsPromises.mkdir(outputDir, {
recursive: true,
})
const imageFilename = `${globalSeq.value}.${type.ext || "png"}`
const imagePath = path.join(
outputDir,
imageFilename,
)

⚠️ Potential issue

Block path traversal via docid when writing images.

The current `path.join(baseDir, docid)` can escape `baseDir` when `docid` contains `..` segments or an absolute path. Keep the raw `docid` (per product decision) but validate the resolved path before writing.

-                            const baseDir = path.resolve(
+                            const baseDir = path.resolve(
                               process.env.IMAGE_DIR ||
                                 "downloads/xyne_images_db",
                             )
-                            const outputDir = path.join(baseDir, docid)
+                            const resolvedOutputDir = path.resolve(baseDir, docid)
+                            const rel = path.relative(baseDir, resolvedOutputDir)
+                            if (rel.startsWith("..") || path.isAbsolute(rel)) {
+                              Logger.warn("Refusing to write image outside base dir", { baseDir, docid, resolvedOutputDir })
+                              break
+                            }
+                            const outputDir = resolvedOutputDir
-                      const baseDir = path.resolve(
+                      const baseDir = path.resolve(
                         process.env.IMAGE_DIR || "downloads/xyne_images_db",
                       )
-                      const outputDir = path.join(baseDir, docid)
+                      const resolvedOutputDir = path.resolve(baseDir, docid)
+                      const rel = path.relative(baseDir, resolvedOutputDir)
+                      if (rel.startsWith("..") || path.isAbsolute(rel)) {
+                        Logger.warn("Refusing to write image outside base dir", { baseDir, docid, resolvedOutputDir })
+                        continue
+                      }
+                      const outputDir = resolvedOutputDir

Also applies to: 948-956, 1308-1318
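Since the same resolve-and-validate logic recurs at every write site, it can be factored into one helper. This is a minimal sketch; `isInsideBaseDir` is a hypothetical name, not code from the PR:

```typescript
import path from "node:path"

// Returns true only if resolving `candidate` against `baseDir` stays
// inside `baseDir` — rejects `..` traversal and absolute-path escapes.
function isInsideBaseDir(baseDir: string, candidate: string): boolean {
  const base = path.resolve(baseDir)
  const resolved = path.resolve(base, candidate)
  const rel = path.relative(base, resolved)
  return !rel.startsWith("..") && !path.isAbsolute(rel)
}

console.log(isInsideBaseDir("/data/images", "doc-123")) // true
console.log(isInsideBaseDir("/data/images", "../../etc/passwd")) // false
```

Each call site would then check `isInsideBaseDir(baseDir, docid)` before `mkdir`/write, logging and skipping the image on failure, which keeps the three occurrences consistent.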

🤖 Prompt for AI Agents
In server/pdfChunks.ts around lines 835-847, the code uses path.join(baseDir,
docid) which allows docid to escape baseDir; keep using the raw docid per
product decision but replace the join with path.resolve(baseDir, docid) and
validate that the resulting resolved path is inside baseDir (e.g., compute
path.relative(baseDir, resolvedPath) and ensure it does not start with '..' and
that resolvedPath starts with baseDir + path.sep or equals baseDir); if
validation fails, log/error and abort the write. Apply the same
resolve+validation logic to the other occurrences at lines 948-956 and
1308-1318, and only call fsPromises.mkdir/write after the path has been
validated.

@zereraz zereraz merged commit d1046d5 into main Sep 15, 2025
3 of 4 checks passed
@zereraz zereraz deleted the feat/pdfIngestFix branch September 15, 2025 09:49
@Aayushjshah Aayushjshah restored the feat/pdfIngestFix branch September 15, 2025 12:48
This was referenced Sep 15, 2025
@coderabbitai coderabbitai bot mentioned this pull request Oct 1, 2025