Fix/ocr chunking pages and images #1049
Conversation
Caution: Review failed. The pull request is closed.

Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Changed chunk metadata from a single numeric `page_number` to an array `page_numbers` for both text and image chunks.
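The `page_number` to `page_numbers` migration described in the walkthrough can be sketched as a minimal shape change. The field names come from the PR; the trimmed-down interfaces and the migration helper below are illustrative (the real `ChunkMetadata` type has additional fields):

```typescript
// Before: each chunk carried a single page number.
interface ChunkMetadataV1 {
  page_number: number
}

// After: a chunk that spans pages records every contributing page.
interface ChunkMetadataV2 {
  page_numbers: number[]
}

// Hypothetical migration helper: an old single-page chunk becomes
// a new chunk whose page array holds exactly that one page.
function migrateMetadata(old: ChunkMetadataV1): ChunkMetadataV2 {
  return { page_numbers: [old.page_number] }
}
```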
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Caller
    participant OCR as chunkByOCRFromBuffer
    participant Seq as processBatchesSequentially
    participant Con as processBatchesConcurrently
    participant LPA as LayoutParsingAPI
    participant Asm as ChunkAssembler
    Caller->>OCR: start(buffer)
    OCR->>OCR: read USE_SEQUENTIAL_BATCH_PROCESSING
    alt Sequential
        OCR->>Seq: process(batches)
        loop per batch
            Seq->>LPA: parse(batch)
            LPA-->>Seq: blocks/images
        end
        Seq-->>OCR: results[]
    else Concurrent
        OCR->>Con: process(batches)
        par each batch
            Con->>LPA: parse(batch)
            LPA-->>Con: blocks/images
        end
        Con-->>OCR: results[]
    end
    note right of OCR: skip images without description
    OCR->>Asm: assemble(results) with page_numbers[]
    Asm-->>Caller: chunks_map & image_chunks_map (page_numbers arrays)
```
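The branching in the diagram can be sketched as follows. Only `USE_SEQUENTIAL_BATCH_PROCESSING`, `processBatchesSequentially`, and `processBatchesConcurrently` are names from the PR; the signatures, `dispatch` wrapper, and result shape are assumptions for illustration:

```typescript
type Batch = Buffer
type BatchResult = { blocks: unknown[]; images: unknown[] }

// Sequential strategy: one batch in flight at a time.
async function processBatchesSequentially(
  batches: Batch[],
  parse: (b: Batch) => Promise<BatchResult>,
): Promise<BatchResult[]> {
  const results: BatchResult[] = []
  for (const batch of batches) {
    results.push(await parse(batch)) // await before starting the next batch
  }
  return results
}

// Concurrent strategy: all batches in flight at once.
async function processBatchesConcurrently(
  batches: Batch[],
  parse: (b: Batch) => Promise<BatchResult>,
): Promise<BatchResult[]> {
  return Promise.all(batches.map(parse))
}

// Mode selection as read from the flag at the top of the diagram.
async function dispatch(
  batches: Batch[],
  parse: (b: Batch) => Promise<BatchResult>,
  sequential: boolean,
): Promise<BatchResult[]> {
  return sequential
    ? processBatchesSequentially(batches, parse)
    : processBatchesConcurrently(batches, parse)
}
```

Both strategies return results in batch order, so the assembler downstream does not need to know which mode produced them.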
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the OCR chunking functionality, particularly for multi-page documents. The core change involves transitioning from a single page number association to an array of page numbers for both text and image chunks, providing a more granular and accurate representation of content location. Additionally, it introduces a configurable mechanism for processing large PDF files in batches, allowing for either concurrent or sequential execution, and refines image handling by filtering out images without meaningful descriptions.
Code Review

This pull request refactors the OCR chunking logic to correctly handle page numbers for chunks that may span multiple pages, by changing `page_number` from a single number to an array `page_numbers`. It also introduces the ability to process PDF batches sequentially for debugging and stability purposes. The changes are logical and move toward a more robust implementation. My review includes suggestions to improve logging practices, improve code maintainability by removing duplication, ensure proper configuration management, and fix a potential issue in the Vespa schema where new fields might not be indexed correctly.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/lib/chunkByOCR.ts (1)

Lines 770-815: Reset page context between pages.

`currentPageNumbers` is only cleared on the final document chunk, so numbers accumulate across pages. Insert a clear at the start of each page loop: for example, before the `currentPageNumbers.add(pageNumber)` call (line 824), call `currentPageNumbers.clear()` so each page's chunks only include that page. Handle empty pages similarly by clearing state even when there are no blocks.
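The suggested per-page reset can be sketched as follows. Only `currentPageNumbers` is a name from the PR; the loop structure, input shape, and return value are illustrative stand-ins for the real page-processing code:

```typescript
// Illustrative page loop showing where the reviewer's reset goes:
// clear the set at the top of each iteration so page numbers from a
// previous page never leak into the next page's chunks.
function collectPageNumbers(
  pages: { pageNumber: number; blocks: string[] }[],
): number[][] {
  const perChunk: number[][] = []
  const currentPageNumbers = new Set<number>()
  for (const page of pages) {
    currentPageNumbers.clear() // reset before processing each page
    currentPageNumbers.add(page.pageNumber)
    // ...process page.blocks, possibly emitting chunks...
    perChunk.push(Array.from(currentPageNumbers).sort((a, b) => a - b))
  }
  return perChunk
}
```

Clearing even for pages with no blocks keeps empty pages from inheriting stale state, which is the second half of the reviewer's point.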
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (6)

- server/api/files.ts (1 hunks)
- server/lib/chunkByOCR.ts (19 hunks)
- server/package.json (1 hunks)
- server/services/fileProcessor.ts (1 hunks)
- server/types.ts (1 hunks)
- server/vespa/schemas/kb_items.sd (1 hunks)
🧰 Additional context used

🧬 Code graph analysis (1)

server/services/fileProcessor.ts (1)
- server/types.ts (1): `ChunkMetadata` (608-612)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (7)

server/package.json (1)

Line 81: LGTM! The Vespa client version bump aligns with the schema changes in this PR (migrating from `page_number` to `page_numbers`).

server/api/files.ts (1)

Lines 283-292: LGTM! The migration from `page_number: 0` to `page_numbers: [0]` correctly aligns with the updated `ChunkMetadata` type. Both `chunks_map` and `image_chunks_map` are updated consistently.

server/services/fileProcessor.ts (1)

Lines 156-166: LGTM! The migration to `page_numbers: [-1]` for non-PDF files correctly implements the type change. The sentinel value `-1` appropriately indicates that page information is not applicable for these file types.

server/lib/chunkByOCR.ts (2)

Lines 580-651: LGTM! The new helper functions `processBatchesConcurrently` and `processBatchesSequentially` are well-structured and provide clear separation of concerns for the two batch processing strategies.

Line 761: LGTM! The migration from the scalar `lastPageNumber` to `Set<number> currentPageNumbers` correctly enables tracking multiple pages for chunks that span page boundaries. The implementation properly:

- Adds pages to the set (line 824)
- Converts to a sorted array for metadata (line 802)
- Uses single-element arrays for image chunks (line 898)

Also applies to: 802-802, 824-824, 898-898
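The Set-to-sorted-array conversion mentioned above (line 802) amounts to a one-liner; this standalone sketch is illustrative, not the PR's exact code:

```typescript
// Convert the accumulated page set into the sorted page_numbers array
// stored in chunk metadata. A numeric comparator is needed: the default
// Array.prototype.sort is lexicographic and would order 10 before 2.
function toPageNumbersArray(pages: Set<number>): number[] {
  return Array.from(pages).sort((a, b) => a - b)
}
```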
server/vespa/schemas/kb_items.sd (1)

Lines 4-8: Ensure data migration for breaking schema change.

The change from `page_number: int` to `page_numbers: array<int>` is a breaking schema update. Reindex or migrate existing Vespa documents to populate the new array field; no code or queries reference the old `page_number` field.
608-612
: No leftoverChunkMetadata.page_number
usages found
The onlypage_number
occurrence in chat.ts is a span attribute for telemetry, not the removed type field.
Actionable comments posted: 1
♻️ Duplicate comments (3)
server/lib/chunkByOCR.ts (3)
Line 690: Flag currently forces sequential path.

Until line 20 is fixed, processingMode will always be 'sequential' and the sequential branch will always execute. After fixing, please verify the flag behavior in your environments.

Also applies to: 693-695
Lines 858-867: Behavior change: images without descriptions are skipped. Document this or gate it via a flag.

This is a good default for reducing noise, but it can impact completeness. Add a brief comment and/or a toggle (e.g., OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION=true) with a WARN log when skipping, to aid observability.

Based on learnings
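One way to realize the suggested toggle, sketched below. The flag name is the reviewer's proposal, not an existing variable, and `shouldKeepImage` is a hypothetical helper; the real code filters inline:

```typescript
// Proposed env flag (reviewer's suggestion): parsed once at module load,
// defaults to keeping all images unless explicitly set to "true".
const SKIP_IMAGES_WITHOUT_DESCRIPTION =
  (process.env.OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION ?? '').toLowerCase() === 'true'

// Decide whether to keep an image chunk, warning when one is dropped
// so the skip is visible in logs.
function shouldKeepImage(
  description: string | undefined,
  skipUndescribed: boolean,
  warn: (msg: string) => void,
): boolean {
  if (skipUndescribed && !description?.trim()) {
    warn('Skipping image without description')
    return false
  }
  return true
}
```

Passing the flag as a parameter (rather than reading the module constant inside the helper) keeps the filtering logic testable in isolation.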
Lines 20-22: Feature flag must be boolean (not string) and parsed from env.

As written, the string 'true' forces sequential mode and bypasses configurability.

Apply this diff:

```diff
-const USE_SEQUENTIAL_BATCH_PROCESSING='true'
-// Control whether to process batches concurrently (all at once) or sequentially (one by one)
-// const USE_SEQUENTIAL_BATCH_PROCESSING = USE_SEQUENTIAL_BATCH_PROCESSING === 'true' || false
+// Control whether to process batches concurrently (all at once) or sequentially (one by one)
+const USE_SEQUENTIAL_BATCH_PROCESSING =
+  (process.env.USE_SEQUENTIAL_BATCH_PROCESSING ?? '').toLowerCase() === 'true'
```

Based on learnings
🧹 Nitpick comments (2)

server/lib/chunkByOCR.ts (2)

Lines 580-612: Concurrent processing: limit fan-out and handle partial failures.

Unbounded Promise.all can overwhelm the OCR service for large PDFs. Consider a small concurrency limit (e.g., 4-8) and logging per-batch failures via allSettled, or backoff/retry on 429/5xx.

Example (no new deps):

```typescript
const CONCURRENCY = Number.parseInt(process.env.LAYOUT_PARSING_CONCURRENCY ?? '4', 10) || 4
const results: LayoutParsingApiPayload[] = []
let i = 0
while (i < batches.length) {
  const slice = batches.slice(i, i + CONCURRENCY)
  const settled = await Promise.allSettled(
    slice.map((batch, idx) => {
      const batchIndex = i + idx + 1
      Logger.info("Processing PDF batch", {
        fileName,
        batchIndex,
        totalBatches: batches.length,
        batchSizeBytes: batch.length,
      })
      return callLayoutParsingApi(batch, `${fileName}_batch_${batchIndex}`)
    }),
  )
  for (const s of settled) {
    if (s.status === 'fulfilled') results.push(s.value)
    else Logger.error(s.reason, "Batch failed", { fileName })
  }
  i += CONCURRENCY
}
return results
```
Lines 613-651: Minor: deduplicate batch-processing code paths.

Both the sequential and concurrent branches share the same logging and result shape. Extract a shared processSingleBatch helper to reduce duplication.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📒 Files selected for processing (1)

- server/lib/chunkByOCR.ts (12 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (1)

server/lib/chunkByOCR.ts (1)

Line 898: Confirm indexing convention for page_numbers.

`page_numbers: [pageNumber]` appears 0-based (consistent with ocrResponse keys). Verify downstream expects 0-based; if not, adjust (+1) here.
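If downstream consumers turn out to expect 1-based pages, the adjustment is a single map at assembly time. This sketch is illustrative; the PR itself keeps the 0-based keys from ocrResponse:

```typescript
// Convert 0-based page indices (as keyed in the OCR response) to
// 1-based display page numbers for consumers that expect them.
function toOneBasedPages(pageNumbers: number[]): number[] {
  return pageNumbers.map((p) => p + 1)
}
```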
```diff
 let currentTextBuffer = ""
 let currentBlockLabels: string[] = []
-let lastPageNumber = -1
+let currentPageNumbers: Set<number> = new Set()
```
Fix page_numbers semantics: add pages only when content is actually appended.

Adding the page at line 825 marks the chunk with pages even if the page contributes no text or placeholder (e.g., only images without description). Move page tracking to the points where content gets appended to currentTextBuffer.

Apply this diff to stop pre-marking pages:

```diff
-      currentPageNumbers.add(pageNumber)
```

Then add page tagging where content is appended (outside the changed ranges; illustrative snippets):

For images (after appending the placeholder at line 903):

```typescript
currentTextBuffer += `${currentTextBuffer ? " " : ""}[IMG#${block.image_index}]`
currentPageNumbers.add(pageNumber)
```

For text (just before the line 919 append):

```typescript
if (!currentPageNumbers.has(pageNumber)) {
  currentPageNumbers.add(pageNumber)
}
currentTextBuffer += (currentTextBuffer ? " " : "") + normalizedText
```

Keeping addChunk(true) for mid-page splits and addChunk(false) at the final flush remains correct. This ensures page_numbers reflects only pages contributing content to the chunk.

Also applies to: 770-777, 812-814, 825-825, 916-916, 928-928
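The tagging discipline from the two snippets above can be condensed into a small append helper. This is an illustrative sketch; the real code appends inline rather than through a helper, and the state shape is assumed:

```typescript
// Append content to the chunk buffer and record the page only when
// something was actually added, so page_numbers never includes pages
// that contributed no text or image placeholder.
function appendWithPage(
  state: { buffer: string; pages: Set<number> },
  content: string,
  pageNumber: number,
): void {
  if (!content) return // nothing appended, so the page is not tagged
  state.buffer += (state.buffer ? ' ' : '') + content
  state.pages.add(pageNumber)
}
```

Routing every buffer write through one place like this makes the "pages reflect only contributing content" invariant hard to violate by accident.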