Skip to content

Conversation

Aayushjshah
Copy link
Contributor

@Aayushjshah Aayushjshah commented Oct 16, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • Refactor
    • Migrated PDF processing pipeline across integrations to an improved extraction framework with updated field mapping for text and image chunks.
    • Enhanced OCR batching logic with refined code structure and error handling.
    • Streamlined PDF data extraction workflows with improved consistency across processing stages.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 16, 2025

Walkthrough

The PR replaces the Gemini-based PDF extraction mechanism with a new PdfProcessor.processWithFallback workflow across three integration files (dataSource, Google, Microsoft), updating function calls, result field mappings, and imports. Minor formatting refinements are applied to OCR chunking logic, and obsolete documentation comments are removed.

Changes

Cohort / File(s) Change Summary
PDF processor migration
server/integrations/dataSource/index.ts, server/integrations/google/worker-utils.ts, server/integrations/microsoft/attachment-utils.ts
Replaces extractTextAndImagesWithChunksFromPDFviaGemini calls with PdfProcessor.processWithFallback, updates result field references (text_chunks → chunks, image_chunk_pos → image_chunks_pos), converts input to Buffer.from(), and adjusts imports.
Library refinements
server/lib/chunkByOCR.ts
Formatting and spacing adjustments to constant definitions, loop headers, logging calls, and string concatenations without altering control flow or logic.
Comment cleanup
server/lib/chunkPdfWithGemini.ts
Removes obsolete Gemini-backed PDF extractor documentation comment blocks.

Sequence Diagram

sequenceDiagram
    actor Integration as Integration Layer
    participant Old as Gemini Extractor<br/>(Previous)
    participant New as PdfProcessor<br/>(New)
    participant Result as Result Handler

    rect rgb(220, 240, 250)
    note over Integration,Result: Previous Flow
    Integration->>Old: extractTextAndImagesWithChunksFromPDFviaGemini<br/>(pdfBuffer, docId)
    Old->>Old: Gemini-based chunking
    Old->>Result: {text_chunks, image_chunks,<br/>text_chunk_pos, image_chunk_pos}
    end

    rect rgb(240, 220, 250)
    note over Integration,Result: New Flow
    Integration->>New: PdfProcessor.processWithFallback<br/>(Buffer.from(pdfBuffer), ...)
    New->>New: PdfProcessor logic with fallback
    New->>Result: {chunks, image_chunks,<br/>chunks_pos, image_chunks_pos,<br/>processingMethod}
    end

    Result->>Result: Map fields & construct<br/>VespaDataSourceFile
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

The changes follow a consistent replacement pattern across three integration files, but require verification that field mappings (text_chunks → chunks, etc.) are correctly applied throughout. The OCR file changes are cosmetic. Medium heterogeneity across files offset by repetitive nature of core substitution pattern.

Possibly related PRs

Suggested reviewers

  • zereraz
  • shivamashtikar
  • kalpadhwaryu
  • junaid-shirur
  • devesh-juspay

Poem

🐰 The Gemini days are past, hooray!
PdfProcessor shows the modern way,
Chunks and images dance in harmony,
No more gemstone magic—just clean efficiency! ✨
Three integrations leap and play!

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The title uses the term "paddle" which doesn’t correspond to any component or workflow in the changes and fails to summarize the key migration from Gemini-based PDF extraction to a PdfProcessor-based workflow. Please update the PR title to clearly reflect the main change, for example “feat(pdf): migrate PDF extraction from Gemini to PdfProcessor” so that reviewers immediately understand the purpose of the pull request.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/paddleForAllPdf

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ee84f74 and 6b60ae4.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/lib/chunkByOCR.ts

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the system's PDF processing capabilities by introducing a new PdfProcessor abstraction and an advanced OCR batch dispatching mechanism. The core change involves replacing previous PDF extraction methods with a more generalized and resilient pipeline that can intelligently process PDF documents in batches, leveraging dynamic concurrency, retry logic, and instance status monitoring. This refactoring improves scalability, reliability, and maintainability of PDF content ingestion across different data sources.

Highlights

  • Centralized PDF Processing: Introduced a new PdfProcessor class to centralize and abstract PDF content extraction, replacing direct calls to Gemini-specific functions across various integration points.
  • Advanced OCR Batch Dispatcher: Implemented a robust OCR batch dispatching system (dispatchOCRBatches) in chunkByOCR.ts that supports both sequential and concurrent processing of PDF page batches.
  • Dynamic Concurrency and Retry Logic: The new dispatcher includes features like monitoring idle OCR instances, dynamic batch dispatching based on available capacity, configurable timeouts, exponential backoff with jitter for retries, and comprehensive error handling.
  • Enhanced PDF Batching: The splitPdfIntoBatches function was updated to return richer PdfPageBatch objects, including startPage and endPage information, facilitating better tracking and processing of PDF segments.
  • Comprehensive Testing Utilities: A new script (testOcrDispatcher.ts) was added to provide extensive testing and demonstration of the OCR dispatcher's behavior under various simulated conditions, including fluctuating idle capacities and server unreachability.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a PdfProcessor class to handle PDF processing with fallback mechanisms, including OCR, Gemini, and PDF.js. It also includes a new OCR dispatcher to manage OCR processing batches and a test suite for the dispatcher logic. The changes aim to improve the robustness and efficiency of PDF processing. I have provided review comments for issues of medium, high, and critical severity.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Nitpick comments (1)
server/scripts/testOcrDispatcher.ts (1)

158-202: Stabilize test determinism and resource usage (optional).

  • Expose a fixed seed or a configurable failureRate for sendBatch to avoid flaky runs.
  • Consider mocking per-batch buffers (e.g., Buffer.alloc(16)) vs. real PDF copying to reduce memory/time for large PDFs.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1cb30d8 and 30c80a9.

📒 Files selected for processing (6)
  • server/integrations/dataSource/index.ts (2 hunks)
  • server/integrations/google/worker-utils.ts (2 hunks)
  • server/integrations/microsoft/attachment-utils.ts (2 hunks)
  • server/lib/chunkByOCR.ts (11 hunks)
  • server/lib/chunkPdfWithGemini.ts (0 hunks)
  • server/scripts/testOcrDispatcher.ts (1 hunks)
💤 Files with no reviewable changes (1)
  • server/lib/chunkPdfWithGemini.ts
🧰 Additional context used
🧬 Code graph analysis (5)
server/integrations/dataSource/index.ts (2)
server/lib/pdfProcessor.ts (1)
  • PdfProcessor (42-281)
server/integrations/dataSource/errors.ts (1)
  • ContentExtractionError (60-68)
server/integrations/microsoft/attachment-utils.ts (1)
server/lib/pdfProcessor.ts (1)
  • PdfProcessor (42-281)
server/integrations/google/worker-utils.ts (1)
server/lib/pdfProcessor.ts (1)
  • PdfProcessor (42-281)
server/scripts/testOcrDispatcher.ts (1)
server/lib/chunkByOCR.ts (6)
  • PdfOcrBatch (116-122)
  • DispatchOptions (124-140)
  • DispatchReport (142-155)
  • dispatchOCRBatches (813-880)
  • error (171-171)
  • error (1627-1633)
server/lib/chunkByOCR.ts (2)
server/scripts/testOcrDispatcher.ts (1)
  • error (218-221)
server/lib/pdfProcessor.ts (1)
  • ProcessingResult (22-30)
🔇 Additional comments (3)
server/integrations/google/worker-utils.ts (1)

52-60: PdfProcessor migration looks correct.

Buffer conversion, params, and result.chunks mapping are appropriate for Gmail attachments.

server/integrations/microsoft/attachment-utils.ts (1)

52-60: PdfProcessor migration looks correct.

Buffer conversion, params, and result.chunks usage match the intended behavior.

server/integrations/dataSource/index.ts (1)

211-233: Good integration with PdfProcessor + image support.

Correct parameters, robust empty-result check, and proper mapping of positions/method.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (3)
server/lib/chunkByOCR.ts (3)

596-597: waitForAll() blocks throughput by awaiting all promises.

This function awaits all in-flight promises before returning, preventing incremental scheduling and throttling throughput. A better approach is to wait for at least one completion and return only completed items.

This issue was previously flagged. Apply the suggested fix to race for the first completion:

-  await Promise.allSettled(inFlight.map((state) => state.promise))
-  return inFlight
+  // Wait until at least one completes, then return completed states
+  await Promise.race(inFlight.map((s) => s.promise.then(() => undefined)))
+  return inFlight.filter((s) => s.done)

632-637: Hard dependency on status endpoint in sequential mode.

Sequential dispatch calls fetchIdleInstances, creating a hard dependency on the status server. If the status server fails, the entire run is aborted, even though sequential mode doesn't rely on capacity information for correctness.

This issue was previously flagged. Wrap the status fetch in a try/catch or skip it entirely in sequential mode:

-  const idle = await fetchIdleInstances(statusEndpoint, logger, options.metrics)
-  logger.info(
-    `[cycle=sequential] idle=${idle} remaining=${batches.length} inflight=0 dispatching=1`,
-  )
+  let idle = 0
+  try {
+    idle = await fetchIdleInstances(statusEndpoint, logger, options.metrics)
+  } catch (e) {
+    logger.warn(`[cycle=sequential] status poll failed, proceeding without capacity info`)
+  }
+  logger.info(`[cycle=sequential] idle=${idle} remaining=${batches.length} inflight=0 dispatching=1`)

715-717: Fractional dispatch count causes extra loop iterations.

Math.min(idle/2, remaining) produces a fractional value when idle/2 isn't an integer (e.g., idle=52.5), causing the loop at line 723 to dispatch one extra task unintentionally.

This issue was previously flagged. Use integer capacity:

-        toDispatch = idle > 0 ? Math.min(idle/2, remaining) : 1
+        const capacity = idle > 0 ? Math.max(0, Math.floor(idle / 2)) : 0
+        toDispatch = capacity > 0 ? Math.min(capacity, remaining) : 1
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 30c80a9 and fba575a.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/lib/chunkByOCR.ts (2)
server/lib/pdfProcessor.ts (1)
  • ProcessingResult (22-30)
server/services/fileProcessor.ts (1)
  • ProcessingResult (33-41)

@junaid-shirur junaid-shirur merged commit 00d9fac into main Oct 16, 2025
4 checks passed
@junaid-shirur junaid-shirur deleted the feat/paddleForAllPdf branch October 16, 2025 14:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants