Fix/ocr chunking pages and images #1049
Conversation
Caution: Review failed. The pull request is closed.

Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Changed chunk metadata from a single numeric `page_number` to an array `page_numbers` for both text and image chunks.
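The `page_number` to `page_numbers` migration described in the walkthrough can be sketched as a minimal shape change. The field names come from the PR; the trimmed-down interfaces and the migration helper below are illustrative (the real `ChunkMetadata` type has additional fields):

```typescript
// Before: each chunk carried a single page number.
interface ChunkMetadataV1 {
  page_number: number
}

// After: a chunk that spans pages records every contributing page.
interface ChunkMetadataV2 {
  page_numbers: number[]
}

// Hypothetical migration helper: an old single-page chunk becomes
// a new chunk whose page array holds exactly that one page.
function migrateMetadata(old: ChunkMetadataV1): ChunkMetadataV2 {
  return { page_numbers: [old.page_number] }
}
```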
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant OCR as chunkByOCRFromBuffer
    participant Seq as processBatchesSequentially
    participant Con as processBatchesConcurrently
    participant LPA as LayoutParsingAPI
    participant Asm as ChunkAssembler
    Caller->>OCR: start(buffer)
    OCR->>OCR: read USE_SEQUENTIAL_BATCH_PROCESSING
    alt Sequential
        OCR->>Seq: process(batches)
        loop per batch
            Seq->>LPA: parse(batch)
            LPA-->>Seq: blocks/images
        end
        Seq-->>OCR: results[]
    else Concurrent
        OCR->>Con: process(batches)
        par each batch
            Con->>LPA: parse(batch)
            LPA-->>Con: blocks/images
        end
        Con-->>OCR: results[]
    end
    note right of OCR: skip images without description
    OCR->>Asm: assemble(results) with page_numbers[]
    Asm-->>Caller: chunks_map & image_chunks_map (page_numbers arrays)
```
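The branching in the diagram can be sketched as follows. Only `USE_SEQUENTIAL_BATCH_PROCESSING`, `processBatchesSequentially`, and `processBatchesConcurrently` are names from the PR; the signatures, `dispatch` wrapper, and result shape are assumptions for illustration:

```typescript
type Batch = Buffer
type BatchResult = { blocks: unknown[]; images: unknown[] }

// Sequential strategy: one batch in flight at a time.
async function processBatchesSequentially(
  batches: Batch[],
  parse: (b: Batch) => Promise<BatchResult>,
): Promise<BatchResult[]> {
  const results: BatchResult[] = []
  for (const batch of batches) {
    results.push(await parse(batch)) // await before starting the next batch
  }
  return results
}

// Concurrent strategy: all batches in flight at once.
async function processBatchesConcurrently(
  batches: Batch[],
  parse: (b: Batch) => Promise<BatchResult>,
): Promise<BatchResult[]> {
  return Promise.all(batches.map(parse))
}

// Mode selection as read from the flag at the top of the diagram.
async function dispatch(
  batches: Batch[],
  parse: (b: Batch) => Promise<BatchResult>,
  sequential: boolean,
): Promise<BatchResult[]> {
  return sequential
    ? processBatchesSequentially(batches, parse)
    : processBatchesConcurrently(batches, parse)
}
```

Both strategies return results in batch order, so the assembler downstream does not need to know which mode produced them.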
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the OCR chunking functionality, particularly for multi-page documents. The core change involves transitioning from a single page number association to an array of page numbers for both text and image chunks, providing a more granular and accurate representation of content location. Additionally, it introduces a configurable mechanism for processing large PDF files in batches, allowing for either concurrent or sequential execution, and refines image handling by filtering out images without meaningful descriptions.
Code Review

This pull request refactors the OCR chunking logic to correctly handle page numbers for chunks that may span multiple pages, by changing `page_number` from a single number to an array `page_numbers`. It also introduces the ability to process PDF batches sequentially for debugging and stability purposes. The changes are logical and move toward a more robust implementation. My review includes suggestions to improve logging practices, improve code maintainability by removing duplication, ensure proper configuration management, and fix a potential issue in the Vespa schema where new fields might not be indexed correctly.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/lib/chunkByOCR.ts (1)

Lines 770-815: Reset page context between pages.

`currentPageNumbers` is only cleared on the final document chunk, so numbers accumulate across pages. Insert a clear at the start of each page loop: for example, before the `currentPageNumbers.add(pageNumber)` call (line 824), call `currentPageNumbers.clear()` so each page's chunks only include that page. Handle empty pages similarly by clearing state even when there are no blocks.
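The suggested per-page reset can be sketched as follows. Only `currentPageNumbers` is a name from the PR; the loop structure, input shape, and return value are illustrative stand-ins for the real page-processing code:

```typescript
// Illustrative page loop showing where the reviewer's reset goes:
// clear the set at the top of each iteration so page numbers from a
// previous page never leak into the next page's chunks.
function collectPageNumbers(
  pages: { pageNumber: number; blocks: string[] }[],
): number[][] {
  const perChunk: number[][] = []
  const currentPageNumbers = new Set<number>()
  for (const page of pages) {
    currentPageNumbers.clear() // reset before processing each page
    currentPageNumbers.add(page.pageNumber)
    // ...process page.blocks, possibly emitting chunks...
    perChunk.push(Array.from(currentPageNumbers).sort((a, b) => a - b))
  }
  return perChunk
}
```

Clearing even for pages with no blocks keeps empty pages from inheriting stale state, which is the second half of the reviewer's point.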
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (6)

- server/api/files.ts (1 hunks)
- server/lib/chunkByOCR.ts (19 hunks)
- server/package.json (1 hunks)
- server/services/fileProcessor.ts (1 hunks)
- server/types.ts (1 hunks)
- server/vespa/schemas/kb_items.sd (1 hunks)
🧰 Additional context used

🧬 Code graph analysis (1)

server/services/fileProcessor.ts (1)
- server/types.ts (1): `ChunkMetadata` (608-612)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (7)

server/package.json (1)

Line 81: LGTM! The Vespa client version bump aligns with the schema changes in this PR (migrating from `page_number` to `page_numbers`).

server/api/files.ts (1)

Lines 283-292: LGTM! The migration from `page_number: 0` to `page_numbers: [0]` correctly aligns with the updated `ChunkMetadata` type. Both `chunks_map` and `image_chunks_map` are updated consistently.

server/services/fileProcessor.ts (1)

Lines 156-166: LGTM! The migration to `page_numbers: [-1]` for non-PDF files correctly implements the type change. The sentinel value `-1` appropriately indicates that page information is not applicable for these file types.

server/lib/chunkByOCR.ts (2)

Lines 580-651: LGTM! The new helper functions `processBatchesConcurrently` and `processBatchesSequentially` are well-structured and provide clear separation of concerns for the two batch processing strategies.

Line 761: LGTM! The migration from the scalar `lastPageNumber` to `Set<number> currentPageNumbers` correctly enables tracking multiple pages for chunks that span page boundaries. The implementation properly:

- Adds pages to the set (line 824)
- Converts to a sorted array for metadata (line 802)
- Uses single-element arrays for image chunks (line 898)

Also applies to: 802-802, 824-824, 898-898
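The Set-to-sorted-array conversion mentioned above (line 802) amounts to a one-liner; this standalone sketch is illustrative, not the PR's exact code:

```typescript
// Convert the accumulated page set into the sorted page_numbers array
// stored in chunk metadata. A numeric comparator is needed: the default
// Array.prototype.sort is lexicographic and would order 10 before 2.
function toPageNumbersArray(pages: Set<number>): number[] {
  return Array.from(pages).sort((a, b) => a - b)
}
```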
server/vespa/schemas/kb_items.sd (1)

Lines 4-8: Ensure data migration for breaking schema change.

The change from `page_number: int` to `page_numbers: array<int>` is a breaking schema update. Reindex or migrate existing Vespa documents to populate the new array field; no code or queries reference the old `page_number` field.
608-612
: No leftoverChunkMetadata.page_number
usages found
The onlypage_number
occurrence in chat.ts is a span attribute for telemetry, not the removed type field.
Actionable comments posted: 1
♻️ Duplicate comments (3)
server/lib/chunkByOCR.ts (3)
Line 690: Flag currently forces sequential path.

Until line 20 is fixed, processingMode will always be 'sequential' and the sequential branch will always execute. After fixing, please verify the flag behavior in your environments.

Also applies to: 693-695
Lines 858-867: Behavior change: images without descriptions are skipped. Document this or gate it via a flag.

This is a good default for reducing noise, but it can impact completeness. Add a brief comment and/or a toggle (e.g., OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION=true) with a WARN log when skipping, to aid observability.

Based on learnings
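One way to realize the suggested toggle, sketched below. The flag name is the reviewer's proposal, not an existing variable, and `shouldKeepImage` is a hypothetical helper; the real code filters inline:

```typescript
// Proposed env flag (reviewer's suggestion): parsed once at module load,
// defaults to keeping all images unless explicitly set to "true".
const SKIP_IMAGES_WITHOUT_DESCRIPTION =
  (process.env.OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION ?? '').toLowerCase() === 'true'

// Decide whether to keep an image chunk, warning when one is dropped
// so the skip is visible in logs.
function shouldKeepImage(
  description: string | undefined,
  skipUndescribed: boolean,
  warn: (msg: string) => void,
): boolean {
  if (skipUndescribed && !description?.trim()) {
    warn('Skipping image without description')
    return false
  }
  return true
}
```

Passing the flag as a parameter (rather than reading the module constant inside the helper) keeps the filtering logic testable in isolation.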
Lines 20-22: Feature flag must be boolean (not string) and parsed from env.

As written, the string 'true' forces sequential mode and bypasses configurability.

Apply this diff:

```diff
-const USE_SEQUENTIAL_BATCH_PROCESSING='true'
-// Control whether to process batches concurrently (all at once) or sequentially (one by one)
-// const USE_SEQUENTIAL_BATCH_PROCESSING = USE_SEQUENTIAL_BATCH_PROCESSING === 'true' || false
+// Control whether to process batches concurrently (all at once) or sequentially (one by one)
+const USE_SEQUENTIAL_BATCH_PROCESSING =
+  (process.env.USE_SEQUENTIAL_BATCH_PROCESSING ?? '').toLowerCase() === 'true'
```

Based on learnings
🧹 Nitpick comments (2)

server/lib/chunkByOCR.ts (2)

Lines 580-612: Concurrent processing: limit fan-out and handle partial failures.

Unbounded Promise.all can overwhelm the OCR service for large PDFs. Consider a small concurrency limit (e.g., 4-8) and logging per-batch failures via allSettled, or backoff/retry on 429/5xx.

Example (no new deps):

```typescript
const CONCURRENCY = Number.parseInt(process.env.LAYOUT_PARSING_CONCURRENCY ?? '4', 10) || 4
const results: LayoutParsingApiPayload[] = []
let i = 0
while (i < batches.length) {
  const slice = batches.slice(i, i + CONCURRENCY)
  const settled = await Promise.allSettled(
    slice.map((batch, idx) => {
      const batchIndex = i + idx + 1
      Logger.info("Processing PDF batch", {
        fileName,
        batchIndex,
        totalBatches: batches.length,
        batchSizeBytes: batch.length,
      })
      return callLayoutParsingApi(batch, `${fileName}_batch_${batchIndex}`)
    }),
  )
  for (const s of settled) {
    if (s.status === 'fulfilled') results.push(s.value)
    else Logger.error(s.reason, "Batch failed", { fileName })
  }
  i += CONCURRENCY
}
return results
```
Lines 613-651: Minor: deduplicate batch-processing code paths.

Both the sequential and concurrent branches share the same logging and result shape. Extract a shared processSingleBatch helper to reduce duplication.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)

- server/lib/chunkByOCR.ts (12 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (1)

server/lib/chunkByOCR.ts (1)

Line 898: Confirm indexing convention for page_numbers.

`page_numbers: [pageNumber]` appears 0-based (consistent with ocrResponse keys). Verify downstream expects 0-based; if not, adjust (+1) here.
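If downstream consumers turn out to expect 1-based pages, the adjustment is a single map at assembly time. This sketch is illustrative; the PR itself keeps the 0-based keys from ocrResponse:

```typescript
// Convert 0-based page indices (as keyed in the OCR response) to
// 1-based display page numbers for consumers that expect them.
function toOneBasedPages(pageNumbers: number[]): number[] {
  return pageNumbers.map((p) => p + 1)
}
```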
```diff
 let currentTextBuffer = ""
 let currentBlockLabels: string[] = []
-let lastPageNumber = -1
+let currentPageNumbers: Set<number> = new Set()
```
Fix page_numbers semantics: add pages only when content is actually appended.

Adding the page at line 825 marks the chunk with pages even if the page contributes no text or placeholder (e.g., only images without description). Move page tracking to the points where content gets appended to currentTextBuffer.

Apply this diff to stop pre-marking pages:

```diff
-      currentPageNumbers.add(pageNumber)
```

Then add page tagging where content is appended (outside the changed ranges; illustrative snippets):

For images (after appending the placeholder at line 903):

```typescript
currentTextBuffer += `${currentTextBuffer ? " " : ""}[IMG#${block.image_index}]`
currentPageNumbers.add(pageNumber)
```

For text (just before the line 919 append):

```typescript
if (!currentPageNumbers.has(pageNumber)) {
  currentPageNumbers.add(pageNumber)
}
currentTextBuffer += (currentTextBuffer ? " " : "") + normalizedText
```

Keeping addChunk(true) for mid-page splits and addChunk(false) at the final flush remains correct. This ensures page_numbers reflects only pages contributing content to the chunk.

Also applies to: 770-777, 812-814, 825-825, 916-916, 928-928
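The tagging discipline from the two snippets above can be condensed into a small append helper. This is an illustrative sketch; the real code appends inline rather than through a helper, and the state shape is assumed:

```typescript
// Append content to the chunk buffer and record the page only when
// something was actually added, so page_numbers never includes pages
// that contributed no text or image placeholder.
function appendWithPage(
  state: { buffer: string; pages: Set<number> },
  content: string,
  pageNumber: number,
): void {
  if (!content) return // nothing appended, so the page is not tagged
  state.buffer += (state.buffer ? ' ' : '') + content
  state.pages.add(pageNumber)
}
```

Routing every buffer write through one place like this makes the "pages reflect only contributing content" invariant hard to violate by accident.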