
Conversation

Contributor

@Aayushjshah Aayushjshah commented Oct 6, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features
    • Multi-page support: chunk and image metadata now reference one or more page numbers, improving page-aware search, previews, and grouping.
  • Bug Fixes
    • Skip OCR images with no useful text to reduce noise and improve extraction accuracy.
  • Refactor
    • OCR batch processing improved with selectable sequential/concurrent modes and better preservation/clearing of page associations across mid-page splits.
  • Chores
    • Vespa client dependency upgraded for compatibility and stability.

Contributor

coderabbitai bot commented Oct 6, 2025

Caution

Review failed

The pull request is closed.

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Changed chunk metadata from a single numeric page_number to an array page_numbers across types, OCR chunking, file processing, API ingestion, and Vespa schema; added a feature flag to choose sequential vs concurrent OCR batch processing, improved logging and skip logic for images, and bumped the Vespa client dependency.
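For quick orientation, here is a minimal sketch of that metadata change; the page_number and page_numbers field names come from this PR, while the surrounding interfaces are illustrative only and not the real ChunkMetadata from server/types.ts:

// Illustrative only: the real ChunkMetadata in server/types.ts carries more fields.

// Before: each chunk referenced a single page.
interface ChunkMetadataBefore {
  page_number: number
}

// After: a chunk that crosses a page boundary lists every page it touches.
interface ChunkMetadataAfter {
  page_numbers: number[] // e.g. [3, 4] for a chunk split across pages 3 and 4
}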

Changes

  • Types & Schema (server/types.ts, server/vespa/schemas/kb_items.sd): Replace page_number: number with page_numbers: number[] in ChunkMetadata and update the Vespa chunk_meta struct to use page_numbers (array); remove the page_number struct-field usage.
  • OCR Batch Processing (server/lib/chunkByOCR.ts): Add a USE_SEQUENTIAL_BATCH_PROCESSING flag; introduce processBatchesSequentially and processBatchesConcurrently; preserve/clear page_numbers across chunking; convert chunk metadata to use page_numbers; skip OCR images without descriptions; add logging and enhanced error traces (see the sketch after this list).
  • File Processing (non-PDF) (server/services/fileProcessor.ts): Initialize chunk metadata with page_numbers: [-1] for chunks_map and image_chunks_map instead of page_number: -1.
  • API Ingestion (server/api/files.ts): Change the payload sent to Vespa to include page_numbers arrays (e.g., [0]) in chunks_map and image_chunks_map instead of page_number.
  • Dependency (server/package.json): Bump @xyne/vespa-ts from 1.0.11 to 1.0.12.
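A rough sketch of how the flag-driven choice between the two batch strategies could look; processBatchesSequentially, processBatchesConcurrently, and USE_SEQUENTIAL_BATCH_PROCESSING are named in the walkthrough, while the env parsing, the callLayoutParsingApi signature, and the result type are assumptions:

// Sketch only: callLayoutParsingApi and BatchResult stand in for the real ones in chunkByOCR.ts.
type BatchResult = unknown
declare function callLayoutParsingApi(batch: Buffer, label: string): Promise<BatchResult>

const USE_SEQUENTIAL_BATCH_PROCESSING =
  (process.env.USE_SEQUENTIAL_BATCH_PROCESSING ?? "").toLowerCase() === "true"

async function processBatchesSequentially(batches: Buffer[], fileName: string): Promise<BatchResult[]> {
  const results: BatchResult[] = []
  for (const [i, batch] of batches.entries()) {
    // One request in flight at a time: slower, but easier to debug and gentler on the OCR service.
    results.push(await callLayoutParsingApi(batch, `${fileName}_batch_${i + 1}`))
  }
  return results
}

async function processBatchesConcurrently(batches: Buffer[], fileName: string): Promise<BatchResult[]> {
  // All batches dispatched at once; fastest, but the fan-out is unbounded.
  return Promise.all(batches.map((batch, i) => callLayoutParsingApi(batch, `${fileName}_batch_${i + 1}`)))
}

async function processBatches(batches: Buffer[], fileName: string): Promise<BatchResult[]> {
  return USE_SEQUENTIAL_BATCH_PROCESSING
    ? processBatchesSequentially(batches, fileName)
    : processBatchesConcurrently(batches, fileName)
}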

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant OCR as chunkByOCRFromBuffer
  participant Seq as processBatchesSequentially
  participant Con as processBatchesConcurrently
  participant LPA as LayoutParsingAPI
  participant Asm as ChunkAssembler

  Caller->>OCR: start(buffer)
  OCR->>OCR: read USE_SEQUENTIAL_BATCH_PROCESSING
  alt Sequential
    OCR->>Seq: process(batches)
    loop per batch
      Seq->>LPA: parse(batch)
      LPA-->>Seq: blocks/images
    end
    Seq-->>OCR: results[]
  else Concurrent
    OCR->>Con: process(batches)
    par each batch
      Con->>LPA: parse(batch)
      LPA-->>Con: blocks/images
    end
    Con-->>OCR: results[]
  end

  note right of OCR: skip images without description
  OCR->>Asm: assemble(results) with page_numbers[]
  Asm-->>Caller: chunks_map & image_chunks_map (page_numbers arrays)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • zereraz
  • kalpadhwaryu
  • devesh-juspay
  • junaid-shirur

Poem

A rabbit hops through pages, tidy and bright,
I gather page_numbers in arrays by moonlight.
Sequential or speedy, I nibble each batch,
Skipping empty carrots, no chunk left to patch.
Vespa hums as docs land — a soft thump of delight. 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): the title accurately describes the primary change (correcting OCR chunking for pages and images) and aligns with the code shift to array-based page references. It is concise, focused on the main fix, and clear to a reader scanning the history.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 38c3792 and 6a9ec9a.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (12 hunks)


@gemini-code-assist
Contributor

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the OCR chunking functionality, particularly for multi-page documents. The core change involves transitioning from a single page number association to an array of page numbers for both text and image chunks, providing a more granular and accurate representation of content location. Additionally, it introduces a configurable mechanism for processing large PDF files in batches, allowing for either concurrent or sequential execution, and refines image handling by filtering out images without meaningful descriptions.

Highlights

  • Multi-page Chunk Association: The system now supports associating text and image chunks with an array of page numbers (page_numbers) instead of a single page_number. This allows for more accurate representation of content that spans across multiple pages.
  • Configurable PDF Batch Processing: New logic has been introduced to process large PDF files in batches, with the option to perform this processing either concurrently or sequentially, controlled by a new configuration flag (USE_SEQUENTIAL_BATCH_PROCESSING).
  • Image Processing Refinement: The OCR chunking process now includes a check to skip images that do not have any descriptive content (block_content), preventing the creation of empty or irrelevant image chunks.
  • Logging Update: Several Logger.info calls related to OCR processing have been replaced with console.log.
  • Dependency Update: The @xyne/vespa-ts dependency has been updated to version 1.0.12.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the OCR chunking logic to correctly handle page numbers for chunks that may span multiple pages, changing page_number from a single number to an array page_numbers. It also introduces the ability to process PDF batches sequentially for debugging and stability purposes. The changes are logical and move towards a more robust implementation. My review includes suggestions to improve logging practices, remove duplicated code, ensure proper configuration management, and fix a potential issue in the Vespa schema where new fields might not be indexed correctly.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/lib/chunkByOCR.ts (1)

770-815: Reset page context between pages
currentPageNumbers is only cleared on the final document chunk, so page numbers accumulate across pages. Insert a clear at the start of each page loop (for example, call currentPageNumbers.clear() just before the currentPageNumbers.add(pageNumber) call at line 824) so each page's chunks only include that page. Handle empty pages the same way by clearing the state even when there are no blocks.
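A minimal sketch of that reset, assuming a per-page loop shaped roughly like the one in chunkByOCR.ts; everything except currentPageNumbers is a placeholder:

// Sketch of the suggested reset; block handling and chunk emission are elided.
const currentPageNumbers = new Set<number>()

function processPages(pages: Map<number, string[]>): void {
  for (const [pageNumber, blocks] of pages) {
    // Clear the page context first so chunks flushed on this page
    // cannot inherit page numbers accumulated from earlier pages.
    currentPageNumbers.clear()
    currentPageNumbers.add(pageNumber)

    for (const block of blocks) {
      // ...buffer block text and emit chunks tagged with
      //    Array.from(currentPageNumbers).sort((a, b) => a - b)
    }
  }
}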

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45295a0 and 277b51c.

📒 Files selected for processing (6)
  • server/api/files.ts (1 hunks)
  • server/lib/chunkByOCR.ts (19 hunks)
  • server/package.json (1 hunks)
  • server/services/fileProcessor.ts (1 hunks)
  • server/types.ts (1 hunks)
  • server/vespa/schemas/kb_items.sd (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/services/fileProcessor.ts (1)
server/types.ts (1)
  • ChunkMetadata (608-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (7)
server/package.json (1)

81-81: LGTM!

The Vespa client version bump aligns with the schema changes in this PR (migrating from page_number to page_numbers).

server/api/files.ts (1)

283-292: LGTM!

The migration from page_number: 0 to page_numbers: [0] correctly aligns with the updated ChunkMetadata type. Both chunks_map and image_chunks_map are updated consistently.

server/services/fileProcessor.ts (1)

156-166: LGTM!

The migration to page_numbers: [-1] for non-PDF files correctly implements the type change. The sentinel value -1 appropriately indicates that page information is not applicable for these file types.
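Roughly what that initialization amounts to for a non-paginated file; the interface here is a trimmed placeholder, not the full ChunkMetadata from server/types.ts:

// Trimmed placeholder; the real ChunkMetadata has additional fields.
interface ChunkMetadataSketch {
  page_numbers: number[]
}

function initNonPdfChunkMetadata(): ChunkMetadataSketch {
  // -1 is a sentinel: spreadsheets, plain text, etc. have no page concept.
  return { page_numbers: [-1] }
}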

server/lib/chunkByOCR.ts (2)

580-651: LGTM!

The new helper functions processBatchesConcurrently and processBatchesSequentially are well-structured and provide clear separation of concerns for the two batch processing strategies.


761-761: LGTM!

The migration from scalar lastPageNumber to Set<number> currentPageNumbers correctly enables tracking multiple pages for chunks that span page boundaries. The implementation properly:

  • Adds pages to the set (line 824)
  • Converts to sorted array for metadata (line 802)
  • Uses single-element arrays for image chunks (line 898)

Also applies to: 802-802, 824-824, 898-898
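Sketched under the assumption that metadata is assembled from the tracked set roughly like this (variable names are illustrative):

// Text chunks: every page that contributed content, as a sorted array.
const currentPageNumbers = new Set<number>([4, 3])
const textChunkPages = Array.from(currentPageNumbers).sort((a, b) => a - b) // [3, 4]

// Image chunks: a single-element array for the page the image appears on.
const imagePageNumber = 3
const imageChunkPages = [imagePageNumber] // [3]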

server/vespa/schemas/kb_items.sd (1)

4-8: Ensure data migration for breaking schema change
The change from page_number: int to page_numbers: array<int> is a breaking schema update. Reindex or migrate existing Vespa documents to populate the new array field; no code or queries reference the old page_number field.

server/types.ts (1)

608-612: No leftover ChunkMetadata.page_number usages found
The only page_number occurrence in chat.ts is a span attribute for telemetry, not the removed type field.

@junaid-shirur junaid-shirur merged commit 6cff523 into main Oct 6, 2025
2 of 3 checks passed
@junaid-shirur junaid-shirur deleted the fix/ocrChunkingPagesAndImages branch October 6, 2025 13:32
Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

♻️ Duplicate comments (3)
server/lib/chunkByOCR.ts (3)

690-690: Flag currently forces sequential path.

Until Line 20 is fixed, processingMode will always be 'sequential' and the sequential branch will always execute. After fixing, please verify the flag behavior in your environments.

Also applies to: 693-695


858-867: Behavior change: images without descriptions are skipped — document or gate via flag.

Good default to reduce noise, but it can impact completeness. Add a brief comment and/or a toggle (e.g., OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION=true) with a WARN when skipping to aid observability.

Based on learnings
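One possible shape for such a toggle; the OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION name is the reviewer's suggestion above, and the block shape and logging call are placeholders:

// Sketch of an env-gated skip with a warning for observability.
const SKIP_IMAGES_WITHOUT_DESCRIPTION =
  (process.env.OCR_SKIP_IMAGES_WITHOUT_DESCRIPTION ?? "true").toLowerCase() === "true"

interface ImageBlockSketch {
  image_index: number
  block_content?: string
}

function shouldSkipImage(block: ImageBlockSketch): boolean {
  if (SKIP_IMAGES_WITHOUT_DESCRIPTION && !block.block_content?.trim()) {
    console.warn(`Skipping OCR image ${block.image_index}: no description extracted`)
    return true
  }
  return false
}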


20-22: Feature flag must be boolean (not string) and parsed from env.

As written, the string 'true' forces sequential mode and bypasses configurability.

Apply this diff:

-const USE_SEQUENTIAL_BATCH_PROCESSING='true'
-// Control whether to process batches concurrently (all at once) or sequentially (one by one)
-// const USE_SEQUENTIAL_BATCH_PROCESSING = USE_SEQUENTIAL_BATCH_PROCESSING === 'true' || false
+// Control whether to process batches concurrently (all at once) or sequentially (one by one)
+const USE_SEQUENTIAL_BATCH_PROCESSING =
+  (process.env.USE_SEQUENTIAL_BATCH_PROCESSING ?? '').toLowerCase() === 'true'

Based on learnings

🧹 Nitpick comments (2)
server/lib/chunkByOCR.ts (2)

580-612: Concurrent processing: limit fan-out and handle partial failures.

Unbounded Promise.all can overwhelm the OCR service for large PDFs. Consider a small concurrency limit (e.g., 4–8) and logging per-batch failures via allSettled, or backoff/retry on 429/5xx.

Example (no new deps):

const CONCURRENCY = Number.parseInt(process.env.LAYOUT_PARSING_CONCURRENCY ?? '4', 10) || 4
const results: LayoutParsingApiPayload[] = []
let i = 0
while (i < batches.length) {
  const slice = batches.slice(i, i + CONCURRENCY)
  const settled = await Promise.allSettled(slice.map((batch, idx) => {
    const batchIndex = i + idx + 1
    Logger.info("Processing PDF batch", { fileName, batchIndex, totalBatches: batches.length, batchSizeBytes: batch.length })
    return callLayoutParsingApi(batch, `${fileName}_batch_${batchIndex}`)
  }))
  for (const s of settled) {
    if (s.status === 'fulfilled') results.push(s.value)
    else Logger.error(s.reason, "Batch failed", { fileName })
  }
  i += CONCURRENCY
}
return results

613-651: Minor: deduplicate batch-processing code paths.

Both sequential and concurrent branches share logging/shape. Extract a shared processSingleBatch helper to reduce duplication.
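A sketch of that shared helper, reusing names from the concurrency example above where they exist; the declared signatures are assumptions, not the real ones in chunkByOCR.ts:

// Both strategies call this for each batch, so logging and the API call live in one place.
declare function callLayoutParsingApi(batch: Buffer, label: string): Promise<unknown>
declare const Logger: { info: (msg: string, meta?: object) => void }

async function processSingleBatch(
  batch: Buffer,
  batchIndex: number,
  totalBatches: number,
  fileName: string,
): Promise<unknown> {
  Logger.info("Processing PDF batch", {
    fileName,
    batchIndex,
    totalBatches,
    batchSizeBytes: batch.length,
  })
  return callLayoutParsingApi(batch, `${fileName}_batch_${batchIndex}`)
}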

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 277b51c and 38c3792.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (12 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (1)
server/lib/chunkByOCR.ts (1)

898-898: Confirm indexing convention for page_numbers.

page_numbers: [pageNumber] appears 0-based (consistent with ocrResponse keys). Verify downstream expects 0-based; if not, adjust (+1) here.

let currentTextBuffer = ""
let currentBlockLabels: string[] = []
let lastPageNumber = -1
let currentPageNumbers: Set<number> = new Set()

⚠️ Potential issue | 🟠 Major

Fix page_numbers semantics: add pages only when content is actually appended.

Adding the page at Line 825 marks the chunk with pages even if the page contributes no text/placeholder (e.g., only images without description). Move page tracking to the points where content gets appended to currentTextBuffer.

Apply this diff to stop pre-marking pages:

-    currentPageNumbers.add(pageNumber)

Then, add page tagging where content is appended (outside the changed ranges; illustrative snippets):

  • For images (after appending the placeholder at Line 903):
currentTextBuffer += `${currentTextBuffer ? " " : ""}[IMG#${block.image_index}]`
currentPageNumbers.add(pageNumber)
  • For text (just before Line 919 append):
if (!currentPageNumbers.has(pageNumber)) {
  currentPageNumbers.add(pageNumber)
}
currentTextBuffer += (currentTextBuffer ? " " : "") + normalizedText

Keeping addChunk(true) for mid-page splits and addChunk(false) at final flush remains correct. This ensures page_numbers reflects only pages contributing content to the chunk.

Also applies to: 770-777, 812-814, 825-825, 916-916, 928-928
