Feat/paddle for all pdf #1118

Aayushjshah · 2025-10-16T06:43:10Z

Description

Testing

Additional Notes

Summary by CodeRabbit

Refactor
- Migrated PDF processing pipeline across integrations to an improved extraction framework with updated field mapping for text and image chunks.
- Enhanced OCR batching logic with refined code structure and error handling.
- Streamlined PDF data extraction workflows with improved consistency across processing stages.

coderabbitai · 2025-10-16T06:43:34Z

Walkthrough

The PR replaces the Gemini-based PDF extraction mechanism with a new PdfProcessor.processWithFallback workflow across three integration files (dataSource, Google, Microsoft), updating function calls, result field mappings, and imports. Minor formatting refinements are applied to OCR chunking logic, and obsolete documentation comments are removed.

Changes

Cohort / File(s)	Change Summary
PDF processor migration `server/integrations/dataSource/index.ts`, `server/integrations/google/worker-utils.ts`, `server/integrations/microsoft/attachment-utils.ts`	Replaces `extractTextAndImagesWithChunksFromPDFviaGemini` calls with `PdfProcessor.processWithFallback`, updates result field references (text_chunks → chunks, image_chunk_pos → image_chunks_pos), converts input to `Buffer.from()`, and adjusts imports.
Library refinements `server/lib/chunkByOCR.ts`	Formatting and spacing adjustments to constant definitions, loop headers, logging calls, and string concatenations without altering control flow or logic.
Comment cleanup `server/lib/chunkPdfWithGemini.ts`	Removes obsolete Gemini-backed PDF extractor documentation comment blocks.

Sequence Diagram

sequenceDiagram
    actor Integration as Integration Layer
    participant Old as Gemini Extractor<br/>(Previous)
    participant New as PdfProcessor<br/>(New)
    participant Result as Result Handler

    rect rgb(220, 240, 250)
    note over Integration,Result: Previous Flow
    Integration->>Old: extractTextAndImagesWithChunksFromPDFviaGemini<br/>(pdfBuffer, docId)
    Old->>Old: Gemini-based chunking
    Old->>Result: {text_chunks, image_chunks,<br/>text_chunk_pos, image_chunk_pos}
    end

    rect rgb(240, 220, 250)
    note over Integration,Result: New Flow
    Integration->>New: PdfProcessor.processWithFallback<br/>(Buffer.from(pdfBuffer), ...)
    New->>New: PdfProcessor logic with fallback
    New->>Result: {chunks, image_chunks,<br/>chunks_pos, image_chunks_pos,<br/>processingMethod}
    end

    Result->>Result: Map fields & construct<br/>VespaDataSourceFile

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes follow a consistent replacement pattern across three integration files, but require verification that field mappings (text_chunks → chunks, etc.) are correctly applied throughout. The OCR file changes are cosmetic. Medium heterogeneity across files offset by repetitive nature of core substitution pattern.

Possibly related PRs

feat: XYN-100 adding falback for pdf based changes and storing in db #1099: Directly replaces Gemini-based PDF chunking with PdfProcessor.processWithFallback API and updates result field mappings identically.
Feat/paddle #1019: Modifies the PDF processing pipeline with same output field changes (chunks, image_chunks, chunks_pos).
Feat/pdf layout #889: Modifies the same PDF extraction call sites but in opposite direction (introduced extractTextAndImagesWithChunksFromPDFviaGemini).

Suggested reviewers

zereraz
shivamashtikar
kalpadhwaryu
junaid-shirur
devesh-juspay

Poem

🐰 The Gemini days are past, hooray!
PdfProcessor shows the modern way,
Chunks and images dance in harmony,
No more gemstone magic—just clean efficiency! ✨
Three integrations leap and play!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title Check	⚠️ Warning	The title uses the term "paddle" which doesn’t correspond to any component or workflow in the changes and fails to summarize the key migration from Gemini-based PDF extraction to a PdfProcessor-based workflow.	Please update the PR title to clearly reflect the main change, for example “feat(pdf): migrate PDF extraction from Gemini to PdfProcessor” so that reviewers immediately understand the purpose of the pull request.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/paddleForAllPdf

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee84f74 and 6b60ae4.

📒 Files selected for processing (1)

server/lib/chunkByOCR.ts (12 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

server/lib/chunkByOCR.ts

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2025-10-16T06:43:35Z

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's PDF processing capabilities by introducing a new PdfProcessor abstraction and an advanced OCR batch dispatching mechanism. The core change involves replacing previous PDF extraction methods with a more generalized and resilient pipeline that can intelligently process PDF documents in batches, leveraging dynamic concurrency, retry logic, and instance status monitoring. This refactoring improves scalability, reliability, and maintainability of PDF content ingestion across different data sources.

Highlights

Centralized PDF Processing: Introduced a new PdfProcessor class to centralize and abstract PDF content extraction, replacing direct calls to Gemini-specific functions across various integration points.
Advanced OCR Batch Dispatcher: Implemented a robust OCR batch dispatching system (dispatchOCRBatches) in chunkByOCR.ts that supports both sequential and concurrent processing of PDF page batches.
Dynamic Concurrency and Retry Logic: The new dispatcher includes features like monitoring idle OCR instances, dynamic batch dispatching based on available capacity, configurable timeouts, exponential backoff with jitter for retries, and comprehensive error handling.
Enhanced PDF Batching: The splitPdfIntoBatches function was updated to return richer PdfPageBatch objects, including startPage and endPage information, facilitating better tracking and processing of PDF segments.
Comprehensive Testing Utilities: A new script (testOcrDispatcher.ts) was added to provide extensive testing and demonstration of the OCR dispatcher's behavior under various simulated conditions, including fluctuating idle capacities and server unreachability.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a PdfProcessor class to handle PDF processing with fallback mechanisms, including OCR, Gemini, and PDF.js. It also includes a new OCR dispatcher to manage OCR processing batches and a test suite for the dispatcher logic. The changes aim to improve the robustness and efficiency of PDF processing. I have provided review comments for issues of medium, high, and critical severity.

coderabbitai

Actionable comments posted: 6

🧹 Nitpick comments (1)

server/scripts/testOcrDispatcher.ts (1)

158-202: Stabilize test determinism and resource usage (optional).

Expose a fixed seed or a configurable failureRate for sendBatch to avoid flaky runs.

Consider mocking per-batch buffers (e.g., Buffer.alloc(16)) vs. real PDF copying to reduce memory/time for large PDFs.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1cb30d8 and 30c80a9.

📒 Files selected for processing (6)

server/integrations/dataSource/index.ts (2 hunks)
server/integrations/google/worker-utils.ts (2 hunks)
server/integrations/microsoft/attachment-utils.ts (2 hunks)
server/lib/chunkByOCR.ts (11 hunks)
server/lib/chunkPdfWithGemini.ts (0 hunks)
server/scripts/testOcrDispatcher.ts (1 hunks)

💤 Files with no reviewable changes (1)

server/lib/chunkPdfWithGemini.ts

🧰 Additional context used

🧬 Code graph analysis (5)

server/integrations/dataSource/index.ts (2)

server/lib/pdfProcessor.ts (1)

PdfProcessor (42-281)

server/integrations/dataSource/errors.ts (1)

ContentExtractionError (60-68)

server/integrations/microsoft/attachment-utils.ts (1)

server/lib/pdfProcessor.ts (1)

PdfProcessor (42-281)

server/integrations/google/worker-utils.ts (1)

server/lib/pdfProcessor.ts (1)

PdfProcessor (42-281)

server/scripts/testOcrDispatcher.ts (1)

server/lib/chunkByOCR.ts (6)

PdfOcrBatch (116-122)

DispatchOptions (124-140)

DispatchReport (142-155)

dispatchOCRBatches (813-880)

error (171-171)

error (1627-1633)

server/lib/chunkByOCR.ts (2)

server/scripts/testOcrDispatcher.ts (1)

error (218-221)

server/lib/pdfProcessor.ts (1)

ProcessingResult (22-30)

🔇 Additional comments (3)

server/integrations/google/worker-utils.ts (1)

52-60: PdfProcessor migration looks correct.

Buffer conversion, params, and result.chunks mapping are appropriate for Gmail attachments.

server/integrations/microsoft/attachment-utils.ts (1)

52-60: PdfProcessor migration looks correct.

Buffer conversion, params, and result.chunks usage match the intended behavior.

server/integrations/dataSource/index.ts (1)

211-233: Good integration with PdfProcessor + image support.

Correct parameters, robust empty-result check, and proper mapping of positions/method.

server/lib/chunkByOCR.ts

server/scripts/testOcrDispatcher.ts

coderabbitai

Actionable comments posted: 2

♻️ Duplicate comments (3)

server/lib/chunkByOCR.ts (3)
596-597: waitForAll() blocks throughput by awaiting all promises.

This function awaits all in-flight promises before returning, preventing incremental scheduling and throttling throughput. A better approach is to wait for at least one completion and return only completed items.

This issue was previously flagged. Apply the suggested fix to race for the first completion:
-  await Promise.allSettled(inFlight.map((state) => state.promise))
-  return inFlight
+  // Wait until at least one completes, then return completed states
+  await Promise.race(inFlight.map((s) => s.promise.then(() => undefined)))
+  return inFlight.filter((s) => s.done)
632-637: Hard dependency on status endpoint in sequential mode.

Sequential dispatch calls fetchIdleInstances, creating a hard dependency on the status server. If the status server fails, the entire run is aborted, even though sequential mode doesn't rely on capacity information for correctness.

This issue was previously flagged. Wrap the status fetch in a try/catch or skip it entirely in sequential mode:
-  const idle = await fetchIdleInstances(statusEndpoint, logger, options.metrics)
-  logger.info(
-    `[cycle=sequential] idle=${idle} remaining=${batches.length} inflight=0 dispatching=1`,
-  )
+  let idle = 0
+  try {
+    idle = await fetchIdleInstances(statusEndpoint, logger, options.metrics)
+  } catch (e) {
+    logger.warn(`[cycle=sequential] status poll failed, proceeding without capacity info`)
+  }
+  logger.info(`[cycle=sequential] idle=${idle} remaining=${batches.length} inflight=0 dispatching=1`)
715-717: Fractional dispatch count causes extra loop iterations.

Math.min(idle/2, remaining) produces a fractional value when idle/2 isn't an integer (e.g., idle=5 → 2.5), causing the loop at line 723 to dispatch one extra task unintentionally.

This issue was previously flagged. Use integer capacity:
-        toDispatch = idle > 0 ? Math.min(idle/2, remaining) : 1
+        const capacity = idle > 0 ? Math.max(0, Math.floor(idle / 2)) : 0
+        toDispatch = capacity > 0 ? Math.min(capacity, remaining) : 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30c80a9 and fba575a.

📒 Files selected for processing (1)

server/lib/chunkByOCR.ts (10 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/lib/chunkByOCR.ts (2)

server/lib/pdfProcessor.ts (1)

ProcessingResult (22-30)

server/services/fileProcessor.ts (1)

ProcessingResult (33-41)

server/lib/chunkByOCR.ts

This reverts commit f80d6fa.

This reverts commit fba575a.

This reverts commit f80d6fa.

Aayushjshah added 2 commits October 16, 2025 12:08

feat: XYNE-189 instance based requests for large pdfswith retry /backoff

676cd88

feat: XYN-203 enable for drive pdfs mail attachments etc

30c80a9

Aayushjshah requested review from devesh-juspay, junaid-shirur, kalpadhwaryu, shivamashtikar and zereraz as code owners October 16, 2025 06:43

gemini-code-assist bot reviewed Oct 16, 2025

View reviewed changes

coderabbitai bot reviewed Oct 16, 2025

View reviewed changes

Aayushjshah added 2 commits October 16, 2025 13:15

feat: XYN-189 removed local testing function

fba575a

feat: XYN-189 remved local file paths

f80d6fa

coderabbitai bot reviewed Oct 16, 2025

View reviewed changes

server/lib/chunkByOCR.ts Outdated Show resolved Hide resolved

server/lib/chunkByOCR.ts Outdated Show resolved Hide resolved

Aayushjshah added 3 commits October 16, 2025 14:10

Revert "feat: XYN-189 remved local file paths"

0fa6588

This reverts commit f80d6fa.

Revert "feat: XYN-189 removed local testing function"

ee84f74

This reverts commit fba575a.

Revert "feat: XYN-189 remved local file paths"

6b60ae4

This reverts commit f80d6fa.

naSim087 approved these changes Oct 16, 2025

View reviewed changes

junaid-shirur approved these changes Oct 16, 2025

View reviewed changes

junaid-shirur merged commit 00d9fac into main Oct 16, 2025
4 checks passed

junaid-shirur deleted the feat/paddleForAllPdf branch October 16, 2025 14:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/paddle for all pdf #1118

Feat/paddle for all pdf #1118

Uh oh!

Aayushjshah commented Oct 16, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 16, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot commented Oct 16, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Feat/paddle for all pdf #1118

Feat/paddle for all pdf #1118

Uh oh!

Conversation

Aayushjshah commented Oct 16, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Additional Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 16, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

gemini-code-assist bot commented Oct 16, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Aayushjshah commented Oct 16, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 16, 2025 •

edited

Loading