-
Notifications
You must be signed in to change notification settings - Fork 56
Feat/paddle for all pdf #1118
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/paddle for all pdf #1118
Conversation
WalkthroughThe PR replaces the Gemini-based PDF extraction mechanism with a new Changes
Sequence DiagramsequenceDiagram
actor Integration as Integration Layer
participant Old as Gemini Extractor<br/>(Previous)
participant New as PdfProcessor<br/>(New)
participant Result as Result Handler
rect rgb(220, 240, 250)
note over Integration,Result: Previous Flow
Integration->>Old: extractTextAndImagesWithChunksFromPDFviaGemini<br/>(pdfBuffer, docId)
Old->>Old: Gemini-based chunking
Old->>Result: {text_chunks, image_chunks,<br/>text_chunk_pos, image_chunk_pos}
end
rect rgb(240, 220, 250)
note over Integration,Result: New Flow
Integration->>New: PdfProcessor.processWithFallback<br/>(Buffer.from(pdfBuffer), ...)
New->>New: PdfProcessor logic with fallback
New->>Result: {chunks, image_chunks,<br/>chunks_pos, image_chunks_pos,<br/>processingMethod}
end
Result->>Result: Map fields & construct<br/>VespaDataSourceFile
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes The changes follow a consistent replacement pattern across three integration files, but require verification that field mappings (text_chunks → chunks, etc.) are correctly applied throughout. The OCR file changes are cosmetic. Medium heterogeneity across files offset by repetitive nature of core substitution pattern. Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Aayushjshah, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the system's PDF processing capabilities by introducing a new PdfProcessor abstraction and an advanced OCR batch dispatching mechanism. The core change involves replacing previous PDF extraction methods with a more generalized and resilient pipeline that can intelligently process PDF documents in batches, leveraging dynamic concurrency, retry logic, and instance status monitoring. This refactoring improves scalability, reliability, and maintainability of PDF content ingestion across different data sources. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces a PdfProcessor
class to handle PDF processing with fallback mechanisms, including OCR, Gemini, and PDF.js. It also includes a new OCR dispatcher to manage OCR processing batches and a test suite for the dispatcher logic. The changes aim to improve the robustness and efficiency of PDF processing. I have provided review comments for issues of medium
, high
, and critical
severity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 6
🧹 Nitpick comments (1)
server/scripts/testOcrDispatcher.ts (1)
158-202
: Stabilize test determinism and resource usage (optional).
- Expose a fixed seed or a configurable failureRate for sendBatch to avoid flaky runs.
- Consider mocking per-batch buffers (e.g., Buffer.alloc(16)) vs. real PDF copying to reduce memory/time for large PDFs.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
server/integrations/dataSource/index.ts
(2 hunks)server/integrations/google/worker-utils.ts
(2 hunks)server/integrations/microsoft/attachment-utils.ts
(2 hunks)server/lib/chunkByOCR.ts
(11 hunks)server/lib/chunkPdfWithGemini.ts
(0 hunks)server/scripts/testOcrDispatcher.ts
(1 hunks)
💤 Files with no reviewable changes (1)
- server/lib/chunkPdfWithGemini.ts
🧰 Additional context used
🧬 Code graph analysis (5)
server/integrations/dataSource/index.ts (2)
server/lib/pdfProcessor.ts (1)
PdfProcessor
(42-281)server/integrations/dataSource/errors.ts (1)
ContentExtractionError
(60-68)
server/integrations/microsoft/attachment-utils.ts (1)
server/lib/pdfProcessor.ts (1)
PdfProcessor
(42-281)
server/integrations/google/worker-utils.ts (1)
server/lib/pdfProcessor.ts (1)
PdfProcessor
(42-281)
server/scripts/testOcrDispatcher.ts (1)
server/lib/chunkByOCR.ts (6)
PdfOcrBatch
(116-122)DispatchOptions
(124-140)DispatchReport
(142-155)dispatchOCRBatches
(813-880)error
(171-171)error
(1627-1633)
server/lib/chunkByOCR.ts (2)
server/scripts/testOcrDispatcher.ts (1)
error
(218-221)server/lib/pdfProcessor.ts (1)
ProcessingResult
(22-30)
🔇 Additional comments (3)
server/integrations/google/worker-utils.ts (1)
52-60
: PdfProcessor migration looks correct.Buffer conversion, params, and result.chunks mapping are appropriate for Gmail attachments.
server/integrations/microsoft/attachment-utils.ts (1)
52-60
: PdfProcessor migration looks correct.Buffer conversion, params, and result.chunks usage match the intended behavior.
server/integrations/dataSource/index.ts (1)
211-233
: Good integration with PdfProcessor + image support.Correct parameters, robust empty-result check, and proper mapping of positions/method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (3)
server/lib/chunkByOCR.ts (3)
596-597
:waitForAll()
blocks throughput by awaiting all promises.This function awaits all in-flight promises before returning, preventing incremental scheduling and throttling throughput. A better approach is to wait for at least one completion and return only completed items.
This issue was previously flagged. Apply the suggested fix to race for the first completion:
- await Promise.allSettled(inFlight.map((state) => state.promise)) - return inFlight + // Wait until at least one completes, then return completed states + await Promise.race(inFlight.map((s) => s.promise.then(() => undefined))) + return inFlight.filter((s) => s.done)
632-637
: Hard dependency on status endpoint in sequential mode.Sequential dispatch calls
fetchIdleInstances
, creating a hard dependency on the status server. If the status server fails, the entire run is aborted, even though sequential mode doesn't rely on capacity information for correctness.This issue was previously flagged. Wrap the status fetch in a try/catch or skip it entirely in sequential mode:
- const idle = await fetchIdleInstances(statusEndpoint, logger, options.metrics) - logger.info( - `[cycle=sequential] idle=${idle} remaining=${batches.length} inflight=0 dispatching=1`, - ) + let idle = 0 + try { + idle = await fetchIdleInstances(statusEndpoint, logger, options.metrics) + } catch (e) { + logger.warn(`[cycle=sequential] status poll failed, proceeding without capacity info`) + } + logger.info(`[cycle=sequential] idle=${idle} remaining=${batches.length} inflight=0 dispatching=1`)
715-717
: Fractional dispatch count causes extra loop iterations.
Math.min(idle/2, remaining)
produces a fractional value whenidle/2
isn't an integer (e.g.,idle=5
→2.5
), causing the loop at line 723 to dispatch one extra task unintentionally.This issue was previously flagged. Use integer capacity:
- toDispatch = idle > 0 ? Math.min(idle/2, remaining) : 1 + const capacity = idle > 0 ? Math.max(0, Math.floor(idle / 2)) : 0 + toDispatch = capacity > 0 ? Math.min(capacity, remaining) : 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/lib/chunkByOCR.ts
(10 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/lib/chunkByOCR.ts (2)
server/lib/pdfProcessor.ts (1)
ProcessingResult
(22-30)server/services/fileProcessor.ts (1)
ProcessingResult
(33-41)
Description
Testing
Additional Notes
Summary by CodeRabbit