Feat/pdf ingest fix #835

Aayushjshah · 2025-09-15T12:52:01Z

Removed older Text processing

Summary by CodeRabbit

Refactor
- Improved PDF text extraction to produce cleaner, more consistent paragraphs per page, reducing duplicate or missing lines.
- Maintains page-to-page continuity so text flows smoothly across page breaks.
- Simplified handling when mixing text and images while preserving existing image support.
- No changes to public APIs; behavior remains compatible.

gemini-code-assist · 2025-09-15T12:52:05Z

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

coderabbitai · 2025-09-15T13:00:25Z

Caution

Review failed

The pull request is closed.

Walkthrough

Refactors PDF text extraction to use per-page textContent paragraphs via buildParagraphsFromPage, removing operator-args parsing and fallback paths. Updates per-page processing, overlap handling, and logging; adds a Logger.debug call for all text chunks. Image handling and public API signatures remain unchanged.

Changes

Cohort / File(s)	Summary
PDF text extraction refactor `server/pdfChunks.ts`	Removes operator-level extraction helpers (`extractTextFromArgs`, `extractFallbackTextFromOperators`, `combineTextSources`) and `MIN_LINE_HEIGHT_FOR_TOLERANCE`; switches to `buildParagraphsFromPage` (textContent) as primary source; per-page flow now prepends `pageOverlap` to first paragraph and uses `processTextParagraphs` for overlap continuity; simplifies image/text continuity logic; updates logging to reference primary `paragraphs.length` and adds `Logger.debug("All text chunks", { text_chunks })`; no exported signature changes.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant PDFProcessor as PDF Chunks Processor
  participant PDFJS as PDF.js
  participant Logger as Logs/Metrics

  Client->>PDFProcessor: generateChunksFromPdf(pdf)
  loop For each page
    PDFProcessor->>PDFJS: getPage(pageNum)
    PDFProcessor->>PDFJS: page.getTextContent()
    Note over PDFProcessor: buildParagraphsFromPage(textContent)
    PDFProcessor->>PDFProcessor: prepend pageOverlap to first paragraph
    PDFProcessor->>PDFProcessor: processTextParagraphs(paragraphs, prevOverlap)
    alt Page has images
      PDFProcessor->>PDFJS: extractImages(page)
      PDFProcessor->>PDFProcessor: integrate image chunks (image markers insertion)
    end
    PDFProcessor->>Logger: log page info (paragraphs.length, images count)
  end
  PDFProcessor->>Logger: Logger.debug("All text chunks", { text_chunks })
  PDFProcessor-->>Client: chunks

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

pdfChunks file updated for better text processing #819 — Overlapping edits to server/pdfChunks.ts towards a textContent-based paragraph pipeline and changed overlap/logging behavior.
fix: fixed chunking size issues and improved docx chunking #711 — Changes touching text-extraction helpers and chunking imports related to the removed/refactored helpers.

Suggested reviewers

shivamashtikar
junaid-shirur
kalpadhwaryu
zereraz
devesh-juspay

Poem

I nibble lines from each PDF page bright,
Stitch overlaps tidy, keep paragraphs tight.
Operators hushed, textContent leads the way,
Images tuck in, logs hum at the end of day.
Hop—chunk—deliver! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check	✅ Passed	The title "Feat/pdf ingest fix" is on-topic because the changeset centers on PDF ingestion and a refactor of PDF text extraction, so it correctly signals a PDF-ingest-related change; however, it is somewhat generic since "fix" does not describe the specific refactor or scope and could be more informative for readers scanning history.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c98b7a and bb60269.

📒 Files selected for processing (1)

server/pdfChunks.ts (4 hunks)

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

server/pdfChunks.ts

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

server/pdfChunks.ts (1)

333-389: Paragraph construction misses intra-line spacing and causes readability regressions; also compute itemCount here to avoid a second getTextContent call

Current logic concatenates item.str without considering X gaps, which can collapse words in many PDFs. Add a simple X-gap heuristic and return the item count so the caller doesn’t call getTextContent twice.

-    const buildParagraphsFromPage = async (
-      page: pdfjsLib.PDFPageProxy,
-    ): Promise<string[]> => {
+    const buildParagraphsFromPage = async (
+      page: pdfjsLib.PDFPageProxy,
+    ): Promise<{ paragraphs: string[]; itemCount: number }> => {
       const textContent = await page.getTextContent({
         includeMarkedContent: false,
         disableNormalization: false,
       })
 
       // Build lines using hasEOL and Y-position changes (handles PPT/DOC exports)
       const lines: string[] = []
       let current = ""
       let prevY: number | null = null
       let prevH: number | null = null
+      let prevX: number | null = null
       for (const item of textContent.items as any[]) {
         const str: string = item && typeof item.str === "string" ? item.str : ""
         if (!str) continue
 
         const tr = Array.isArray(item.transform) ? item.transform : []
+        const x = typeof tr[4] === "number" ? tr[4] : null
         const y = typeof tr[5] === "number" ? tr[5] : null
         const h = typeof item.height === "number" ? item.height : null
 
         let newLine = false
         if (prevY != null && y != null) {
           const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
           if (Math.abs(y - prevY) > tol) newLine = true
         }
 
         if (newLine || (item as any).hasEOL) {
           if (current.length > 0) lines.push(current)
           current = str
+          prevX = x
         } else {
-          current += str
+          // Insert a space when there's a notable horizontal gap between chunks.
+          const gapTol = Math.max(prevH || 0, h || 0, 10) * 0.6
+          if (
+            prevX != null &&
+            x != null &&
+            x > prevX &&
+            x - prevX > gapTol &&
+            current.length > 0 &&
+            !/\s$/.test(current) &&
+            !/^\s/.test(str)
+          ) {
+            current += " "
+          }
+          current += str
+          prevX = x
         }
 
         prevY = y
         prevH = h
       }
       if (current.trim().length > 0) lines.push(current)
 
       // Group lines into paragraphs separated by blank lines
       const paragraphs: string[] = []
       let buf: string[] = []
       const pushPara = () => {
         if (buf.length === 0) return
         paragraphs.push(buf.join("\n"))
         buf = []
       }
       for (const ln of lines) {
         if (ln.trim().length === 0) pushPara()
         else buf.push(ln)
       }
       pushPara()
 
       // Filter raw paragraphs - check trimmed length but don't apply full cleaning yet
-      return paragraphs.filter((p) => p.trim().length > 0)
+      return {
+        paragraphs: paragraphs.filter((p) => p.trim().length > 0),
+        itemCount: (textContent.items as any[]).length,
+      }
     }

🧹 Nitpick comments (5)

server/pdfChunks.ts (5)
400-406: Avoid duplicate page.getTextContent call; use the count returned by buildParagraphsFromPage

This saves one heavy API call per page.
-        let paragraphs: string[] = await buildParagraphsFromPage(page)
-
-        
-
-        let textOperatorCount = (await page.getTextContent()).items.length
+        const { paragraphs, itemCount: textOperatorCount } =
+          await buildParagraphsFromPage(page)
415-420: Tighten log keys and drop redundant field

"primaryParagraphs" vs "finalParagraphs" are identical post-refactor. Use a single "paragraphs" key.
         Logger.debug("Text extraction summary for page", {
           pageNum,
-          primaryParagraphs: paragraphs.length,
-         
-          finalParagraphs: paragraphs.length,
+          paragraphs: paragraphs.length,
           textOperatorCount,
           initialPageOverlap: pageOverlap,
         })
1300-1304: Also avoid emitting full payloads at debug level

Consider logging counts and hashes, not full arrays, to reduce noise and accidental PII leakage.

17-21: LGTM on removing the old line-height constant; keep env validation for MIN_IMAGE_DIM_PX

The remaining env-driven default is fine. Consider clamping to a sane range (e.g., 64–2048) and logging the effective value once at startup.

703-705: Fix image-type usage and drop ts-ignore

image-type is synchronous; the await + ts-ignore combo weakens type safety. Either:

keep image-type and call it synchronously, or

switch to file-type (async) if you need broader detection.
-                        // @ts-ignore
-                        let type = await imageType(buffer)
+                        let type = imageType(buffer)
                         if (!type) type = { mime: "image/png", ext: "png" }
Repeat similarly at the other call sites.

Also applies to: 812-814, 1078-1097

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ddb1479 and 0c98b7a.

📒 Files selected for processing (1)

server/pdfChunks.ts (4 hunks)

coderabbitai · 2025-09-15T13:07:09Z

server/pdfChunks.ts

      describeImages,
    })

+    console.log("All text chunks", { text_chunks })


⚠️ Potential issue

Remove console.log dumping all text chunks (PII/log volume risk)

This prints full extracted content to stdout in production. Use structured debug logs with sampling or summaries instead.

- console.log("All text chunks", { text_chunks })

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

console.log("All text chunks", { text_chunks })

🤖 Prompt for AI Agents

In server/pdfChunks.ts around line 1299, remove the plain console.log that dumps all text_chunks to stdout; instead use the application's structured logger at debug/trace level (or guard behind a verbose env flag) and log either a safe summary (e.g., count of chunks, lengths, or hashes) or a sampled/truncated subset to avoid exposing PII and high-volume output. Ensure the log call does not include full chunk contents, and add a comment noting the change for auditing.

Aayushjshah added 8 commits September 12, 2025 13:07

pdfChunks file updated for better text processing

feb056f

comment fixes

60bfdb9

comment fixes

29508fa

comment fixes

498c544

comment fixes

5ab007e

comment fixes

9f816a0

comment fixes

2f497d0

Merge branch 'main' into feat/pdfIngestFix

9dab216

removing old logic

0c98b7a

Aayushjshah requested review from devesh-juspay, junaid-shirur, kalpadhwaryu, shivamashtikar and zereraz as code owners September 15, 2025 13:00

Himanshvarma reviewed Sep 15, 2025

View reviewed changes

server/pdfChunks.ts Outdated Show resolved Hide resolved

Himanshvarma approved these changes Sep 15, 2025

View reviewed changes

removing old logic

bb60269

junaid-shirur approved these changes Sep 15, 2025

View reviewed changes

junaid-shirur merged commit 4933a37 into main Sep 15, 2025
3 checks passed

junaid-shirur deleted the feat/pdfIngestFix branch September 15, 2025 13:06

coderabbitai bot reviewed Sep 15, 2025

View reviewed changes

coderabbitai bot mentioned this pull request Sep 18, 2025

Feat/pdf layout #889

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/pdf ingest fix #835

Feat/pdf ingest fix #835

Uh oh!

Aayushjshah commented Sep 15, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

gemini-code-assist bot commented Sep 15, 2025

Uh oh!

coderabbitai bot commented Sep 15, 2025 •

edited

Loading

Review failed

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Sep 15, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Feat/pdf ingest fix #835

Feat/pdf ingest fix #835

Uh oh!

Conversation

Aayushjshah commented Sep 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

gemini-code-assist bot commented Sep 15, 2025

Uh oh!

coderabbitai bot commented Sep 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Sep 15, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Aayushjshah commented Sep 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 15, 2025 •

edited

Loading