Skip to content

Conversation

Aayushjshah
Copy link
Contributor

@Aayushjshah Aayushjshah commented Sep 15, 2025

Removed older Text processing

Summary by CodeRabbit

  • Refactor
    • Improved PDF text extraction to produce cleaner, more consistent paragraphs per page, reducing duplicate or missing lines.
    • Maintains page-to-page continuity so text flows smoothly across page breaks.
    • Simplified handling when mixing text and images while preserving existing image support.
    • No changes to public APIs; behavior remains compatible.

@gemini-code-assist
Copy link
Contributor

Note

Gemini is unable to generate a summary for this pull request due to the file types involved not being currently supported.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Sep 15, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

Refactors PDF text extraction to use per-page textContent paragraphs via buildParagraphsFromPage, removing operator-args parsing and fallback paths. Updates per-page processing, overlap handling, and logging; adds a Logger.debug call for all text chunks. Image handling and public API signatures remain unchanged.

Changes

Cohort / File(s) Summary
PDF text extraction refactor
server/pdfChunks.ts
Removes operator-level extraction helpers (extractTextFromArgs, extractFallbackTextFromOperators, combineTextSources) and MIN_LINE_HEIGHT_FOR_TOLERANCE; switches to buildParagraphsFromPage (textContent) as primary source; per-page flow now prepends pageOverlap to first paragraph and uses processTextParagraphs for overlap continuity; simplifies image/text continuity logic; updates logging to reference primary paragraphs.length and adds Logger.debug("All text chunks", { text_chunks }); no exported signature changes.

Sequence Diagram(s)

sequenceDiagram
  participant Client
  participant PDFProcessor as PDF Chunks Processor
  participant PDFJS as PDF.js
  participant Logger as Logs/Metrics

  Client->>PDFProcessor: generateChunksFromPdf(pdf)
  loop For each page
    PDFProcessor->>PDFJS: getPage(pageNum)
    PDFProcessor->>PDFJS: page.getTextContent()
    Note over PDFProcessor: buildParagraphsFromPage(textContent)
    PDFProcessor->>PDFProcessor: prepend pageOverlap to first paragraph
    PDFProcessor->>PDFProcessor: processTextParagraphs(paragraphs, prevOverlap)
    alt Page has images
      PDFProcessor->>PDFJS: extractImages(page)
      PDFProcessor->>PDFProcessor: integrate image chunks (image markers insertion)
    end
    PDFProcessor->>Logger: log page info (paragraphs.length, images count)
  end
  PDFProcessor->>Logger: Logger.debug("All text chunks", { text_chunks })
  PDFProcessor-->>Client: chunks
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • shivamashtikar
  • junaid-shirur
  • kalpadhwaryu
  • zereraz
  • devesh-juspay

Poem

I nibble lines from each PDF page bright,
Stitch overlaps tidy, keep paragraphs tight.
Operators hushed, textContent leads the way,
Images tuck in, logs hum at the end of day.
Hop—chunk—deliver! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Feat/pdf ingest fix" is on-topic because the changeset centers on PDF ingestion and a refactor of PDF text extraction, so it correctly signals a PDF-ingest-related change; however, it is somewhat generic since "fix" does not describe the specific refactor or scope and could be more informative for readers scanning history.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0c98b7a and bb60269.

📒 Files selected for processing (1)
  • server/pdfChunks.ts (4 hunks)

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@junaid-shirur junaid-shirur merged commit 4933a37 into main Sep 15, 2025
3 checks passed
@junaid-shirur junaid-shirur deleted the feat/pdfIngestFix branch September 15, 2025 13:06
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/pdfChunks.ts (1)

333-389: Paragraph construction misses intra-line spacing and causes readability regressions; also compute itemCount here to avoid a second getTextContent call

Current logic concatenates item.str without considering X gaps, which can collapse words in many PDFs. Add a simple X-gap heuristic and return the item count so the caller doesn’t call getTextContent twice.

-    const buildParagraphsFromPage = async (
-      page: pdfjsLib.PDFPageProxy,
-    ): Promise<string[]> => {
+    const buildParagraphsFromPage = async (
+      page: pdfjsLib.PDFPageProxy,
+    ): Promise<{ paragraphs: string[]; itemCount: number }> => {
       const textContent = await page.getTextContent({
         includeMarkedContent: false,
         disableNormalization: false,
       })
 
       // Build lines using hasEOL and Y-position changes (handles PPT/DOC exports)
       const lines: string[] = []
       let current = ""
       let prevY: number | null = null
       let prevH: number | null = null
+      let prevX: number | null = null
       for (const item of textContent.items as any[]) {
         const str: string = item && typeof item.str === "string" ? item.str : ""
         if (!str) continue
 
         const tr = Array.isArray(item.transform) ? item.transform : []
+        const x = typeof tr[4] === "number" ? tr[4] : null
         const y = typeof tr[5] === "number" ? tr[5] : null
         const h = typeof item.height === "number" ? item.height : null
 
         let newLine = false
         if (prevY != null && y != null) {
           const tol = Math.max(prevH || 0, h || 0, 10) * 0.4 // dynamic tolerance
           if (Math.abs(y - prevY) > tol) newLine = true
         }
 
         if (newLine || (item as any).hasEOL) {
           if (current.length > 0) lines.push(current)
           current = str
+          prevX = x
         } else {
-          current += str
+          // Insert a space when there's a notable horizontal gap between chunks.
+          const gapTol = Math.max(prevH || 0, h || 0, 10) * 0.6
+          if (
+            prevX != null &&
+            x != null &&
+            x > prevX &&
+            x - prevX > gapTol &&
+            current.length > 0 &&
+            !/\s$/.test(current) &&
+            !/^\s/.test(str)
+          ) {
+            current += " "
+          }
+          current += str
+          prevX = x
         }
 
         prevY = y
         prevH = h
       }
       if (current.trim().length > 0) lines.push(current)
 
       // Group lines into paragraphs separated by blank lines
       const paragraphs: string[] = []
       let buf: string[] = []
       const pushPara = () => {
         if (buf.length === 0) return
         paragraphs.push(buf.join("\n"))
         buf = []
       }
       for (const ln of lines) {
         if (ln.trim().length === 0) pushPara()
         else buf.push(ln)
       }
       pushPara()
 
       // Filter raw paragraphs - check trimmed length but don't apply full cleaning yet
-      return paragraphs.filter((p) => p.trim().length > 0)
+      return {
+        paragraphs: paragraphs.filter((p) => p.trim().length > 0),
+        itemCount: (textContent.items as any[]).length,
+      }
     }
🧹 Nitpick comments (5)
server/pdfChunks.ts (5)

400-406: Avoid duplicate page.getTextContent call; use the count returned by buildParagraphsFromPage

This saves one heavy API call per page.

-        let paragraphs: string[] = await buildParagraphsFromPage(page)
-
-        
-
-        let textOperatorCount = (await page.getTextContent()).items.length
+        const { paragraphs, itemCount: textOperatorCount } =
+          await buildParagraphsFromPage(page)

415-420: Tighten log keys and drop redundant field

"primaryParagraphs" vs "finalParagraphs" are identical post-refactor. Use a single "paragraphs" key.

         Logger.debug("Text extraction summary for page", {
           pageNum,
-          primaryParagraphs: paragraphs.length,
-         
-          finalParagraphs: paragraphs.length,
+          paragraphs: paragraphs.length,
           textOperatorCount,
           initialPageOverlap: pageOverlap,
         })

1300-1304: Also avoid emitting full payloads at debug level

Consider logging counts and hashes, not full arrays, to reduce noise and accidental PII leakage.


17-21: LGTM on removing the old line-height constant; keep env validation for MIN_IMAGE_DIM_PX

The remaining env-driven default is fine. Consider clamping to a sane range (e.g., 64–2048) and logging the effective value once at startup.


703-705: Fix image-type usage and drop ts-ignore

image-type is synchronous; the await + ts-ignore combo weakens type safety. Either:

  • keep image-type and call it synchronously, or
  • switch to file-type (async) if you need broader detection.
-                        // @ts-ignore
-                        let type = await imageType(buffer)
+                        let type = imageType(buffer)
                         if (!type) type = { mime: "image/png", ext: "png" }

Repeat similarly at the other call sites.

Also applies to: 812-814, 1078-1097

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ddb1479 and 0c98b7a.

📒 Files selected for processing (1)
  • server/pdfChunks.ts (4 hunks)

describeImages,
})

console.log("All text chunks", { text_chunks })
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Remove console.log dumping all text chunks (PII/log volume risk)

This prints full extracted content to stdout in production. Use structured debug logs with sampling or summaries instead.

-    console.log("All text chunks", { text_chunks })
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
console.log("All text chunks", { text_chunks })
🤖 Prompt for AI Agents
In server/pdfChunks.ts around line 1299, remove the plain console.log that dumps
all text_chunks to stdout; instead use the application's structured logger at
debug/trace level (or guard behind a verbose env flag) and log either a safe
summary (e.g., count of chunks, lengths, or hashes) or a sampled/truncated
subset to avoid exposing PII and high-volume output. Ensure the log call does
not include full chunk contents, and add a comment noting the change for
auditing.

@coderabbitai coderabbitai bot mentioned this pull request Sep 18, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants