Feat/paddle #1026

Aayushjshah · 2025-10-01T14:16:12Z

Description

Testing

Additional Notes

Summary by CodeRabbit

New Features
- Adds image context to summaries when available, including prioritized image snippets and listed image file names.
Refactor
- Processing errors are now surfaced (errors rethrown) instead of returning fallback results, enabling clearer failure handling and retries.
Chores
- Increased layout parsing timeout, normalized API URL handling, improved parsing success/error logging and messages, and updated an internal dependency to a patched release.

coderabbitai · 2025-10-01T14:16:20Z

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds image-chunk selection and filename lines to collection file context, increases OCR layout parsing timeout and logging with improved error context, changes file processor to rethrow processing errors instead of returning fallback chunks, and bumps @xyne/vespa-ts dependency.

Changes

Cohort / File(s)	Summary of Changes
Context construction for image chunks `server/ai/context.ts`	Adds image chunk handling: computes `maxImageChunks` (<=5), selects image chunks via `getSortedScoredImageChunks` when `matchfeatures` exist or builds placeholders otherwise, limits and sorts by original indices, assembles `imageContent`, and appends an "Image File Names" line when `image_chunks_summary` is present.
OCR layout parsing client `server/lib/chunkByOCR.ts`	Normalizes API URL composition (avoids duplicate slashes), increases layout parsing timeout (120000 → 300000 ms), logs success on 200, and enhances error handling with contextual logging (fileName, fileSize, apiUrl) and descriptive rethrows.
File processing error handling `server/services/fileProcessor.ts`	Removes fallback chunk creation on processing errors; now rethrows an error including the file name and original error details for upstream handling.
Dependency bump `server/package.json`	Bumps `@xyne/vespa-ts` from `1.0.10` to `1.0.11`.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Uploader/Caller
  participant FP as FileProcessor
  participant OCR as chunkByOCR
  participant API as Layout Parsing API

  U->>FP: process(file)
  FP->>OCR: callLayoutParsingApi(file)
  OCR->>API: POST v2/models/layout-parsing/infer
  API-->>OCR: 200 OK (layout)
  OCR->>OCR: Note over OCR: Log "Layout parsing API request succeeded, parsing response"
  OCR-->>FP: parsed layout/chunks
  FP-->>U: chunks

  alt API error or parsing failure
    API-->>OCR: Error/Timeout
    OCR->>OCR: Note over OCR: Log error with fileName, fileSize, apiUrl
    OCR-->>FP: throw Error("Layout parsing failed for <file>")
    FP-->>U: throw Error("Processing <file> failed: <cause>")
  end

sequenceDiagram
  autonumber
  participant C as ContextBuilder
  participant CF as Collection Files
  participant IMG as Image Chunks/Matches

  C->>CF: fetch file details + chunk maps
  C->>IMG: request image chunk info (image_chunks, image_chunks_pos, image_chunks_summary)
  alt matchfeatures present
    IMG-->>C: getSortedScoredImageChunks(matchfeatures, positions, summaryStrings)
  else no matchfeatures
    IMG-->>C: build placeholder image chunks from docId and positions
  end
  C->>C: limit to maxImageChunks (<=5), sort by original indices
  C-->>C: assemble imageContent and append "Image File Names" when image_chunks_summary exists
  C-->>CF: return final summary including image content (when present)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Feat/paddle #1019: Overlaps OCR/chunking changes and introduces chunks_map/image_chunks_map structures used by the new image chunk handling.
feat(chunk): Sort and filter chunks based on score #356: Adds sorted/scored chunk selection logic for text chunks; conceptually parallels the newly added image chunk selection.
Context fixes #664: Adjusts image filename extraction/propagation in the RAG/context pipeline, related to the appended "Image File Names" behavior.

Suggested reviewers

zereraz
shivamashtikar
kalpadhwaryu
devesh-juspay

Poem

I nibble bytes and hop through lines of code,
Image crumbs gathered, trimmed to a bounded load.
Timeouts stretch, logs gleam bright and clear,
Errors now bounce up for handlers to hear.
I twitch my whiskers — filenames appear! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title Check	⚠️ Warning	The title “Feat/paddle” is derived from the branch name and does not describe any of the substantive changes in the pull request, such as adding image chunk handling to the AI context builder or improving error logging in the OCR and file processing services. It fails to convey the primary feature or intent of the changeset to reviewers scanning the history.	Rename the title to a concise, descriptive summary of the main feature, for example “Add image chunk support to constructCollectionFileContext and improve OCR error logging,” so that it clearly reflects the key changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch feat/paddle

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc20598 and b4baf23.

📒 Files selected for processing (1)

server/lib/chunkByOCR.ts (3 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

server/lib/chunkByOCR.ts

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist · 2025-10-01T14:16:31Z

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces enhancements to the AI context generation by incorporating image-related information, alongside significant improvements to the robustness and error handling of the document layout parsing process. It also refines the file processing error strategy to ensure that failures are explicitly propagated for better system-wide error management.

Highlights

Image Chunk Integration: The AI context generation now includes image chunk summaries, allowing the AI to leverage visual information from documents for a more comprehensive understanding.
Improved Layout Parsing Robustness: The layout parsing API timeout has been extended from 2 to 5 minutes, and error logging and re-throwing have been enhanced for better debugging and reliability during document processing.
Refined File Processing Error Handling: File processing failures now explicitly re-throw errors with detailed messages instead of generating basic fallback chunks, enabling more robust upstream error management and clearer failure signals.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces functionality for handling image chunks, likely from an OCR process, and improves error handling in file processing. I've found a critical bug in server/lib/chunkByOCR.ts where an API URL is constructed incorrectly, which will cause API calls to fail. Additionally, I've provided a couple of refactoring suggestions for the code in server/ai/context.ts to enhance its clarity and maintainability. The error handling improvements in server/services/fileProcessor.ts and server/lib/chunkByOCR.ts are positive changes.

server/lib/chunkByOCR.ts

server/ai/context.ts

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (3)

server/lib/chunkByOCR.ts (1)
378-391: Consider preserving the original error stack trace when rethrowing.

The current implementation wraps the error in a new Error object, which may lose the original stack trace. While the error message is preserved, the stack trace is valuable for debugging.

Apply this diff to preserve the original error while enhancing the message:
   } catch (error) {
     // Log the layout parsing API failure with context
     Logger.error(error, `Layout parsing API call failed for file: ${fileName}`, {
       fileName,
       fileSize: buffer.length,
       apiUrl,
     })
     
     // Re-throw with enhanced error message for better debugging
-    if (error instanceof Error) {
-      throw new Error(`Layout parsing API failed for "${fileName}": ${error.message}`)
-    } else {
-      throw new Error(`Layout parsing API failed for "${fileName}": Unknown error occurred`)
-    }
+    if (error instanceof Error) {
+      error.message = `Layout parsing API failed for "${fileName}": ${error.message}`
+      throw error
+    } else {
+      throw new Error(`Layout parsing API failed for "${fileName}": Unknown error occurred`)
+    }
   } finally {
server/ai/context.ts (2)
629-679: Consider adding validation for image chunk data before processing.

The code uses type assertions (as number[], as string[]) without validating that image_chunks_pos_summary and image_chunks_summary are defined and have matching lengths. If these arrays are out of sync or undefined, it could lead to runtime errors or incorrect chunk construction.

Apply this diff to add defensive checks:
   let imageChunks: ScoredChunk[] = []
   const maxImageChunks =
     fields.image_chunks_summary?.length &&
     fields.image_chunks_summary?.length < 5
       ? fields.image_chunks_summary?.length
       : 5
 
 
   if (fields.matchfeatures) {
 
     const summaryStrings = fields.image_chunks_summary?.map((c) =>
       typeof c === "string" ? c : c.chunk,
     ) || []
     
+    // Only process if we have both summary and position data
+    if (summaryStrings.length > 0 && fields.image_chunks_pos_summary) {
       imageChunks = getSortedScoredImageChunks(
         fields.matchfeatures,
         fields.image_chunks_pos_summary as number[],
         summaryStrings as string[],
         fields.docId,
       )
+    }
   } else {
-    const imageChunksPos = fields.image_chunks_pos_summary as number[]
+    const imageChunksPos = fields.image_chunks_pos_summary as number[] | undefined
   
     imageChunks =
-      fields.image_chunks_summary?.map((chunk, idx) => {
+      (fields.image_chunks_summary && imageChunksPos)?.map((chunk, idx) => {
         const result = {
           chunk: `${fields.docId}_${imageChunksPos[idx]}`,
           index: idx,
           score: 0,
         }
         return result
       }) || []
   }
629-679: Consider extracting image chunk processing into a reusable helper function.

The image chunk processing logic in constructCollectionFileContext (lines 629-679) is nearly identical to the logic in constructDataSourceFileContext (lines 508-545). This duplication increases maintenance burden and the risk of inconsistencies.

Consider extracting a helper function like:
function processImageChunks(
  fields: {
    matchfeatures?: unknown;
    image_chunks_summary?: unknown[];
    image_chunks_pos_summary?: number[];
    docId: string;
  },
  isSelectedFiles?: boolean,
  maxImageChunks: number = 5
): { chunks: ScoredChunk[]; content: string } {
  // Extracted logic here
}
This would centralize the logic and make it easier to maintain and test.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23102ee and 65ed7ba.

📒 Files selected for processing (3)

server/ai/context.ts (2 hunks)
server/lib/chunkByOCR.ts (4 hunks)
server/services/fileProcessor.ts (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/ai/context.ts (1)

server/utils.ts (1)

getRelativeTime (109-133)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (3)

server/lib/chunkByOCR.ts (2)

301-301: Verify that the 5-minute timeout doesn't exceed upstream limits.

The timeout has been increased from 2 minutes to 5 minutes. Ensure that any API gateways, load balancers, or proxy servers in the request path have timeouts configured to at least 5 minutes, or they may terminate the connection before the OCR processing completes.

355-355: LGTM!

Adding a success log improves observability and helps track the processing pipeline.

server/services/fileProcessor.ts (1)

146-153: Running context inspection for queue processor call site...

server/lib/chunkByOCR.ts

Aayushjshah added 2 commits October 1, 2025 19:44

feat(metadataPaddle): resolving commentsD

780beac

Merge branch 'main' into feat/paddle

65ed7ba

Aayushjshah requested review from devesh-juspay, junaid-shirur, kalpadhwaryu, shivamashtikar and zereraz as code owners October 1, 2025 14:16

gemini-code-assist bot reviewed Oct 1, 2025

View reviewed changes

server/lib/chunkByOCR.ts Outdated Show resolved Hide resolved

server/ai/context.ts Show resolved Hide resolved

server/ai/context.ts Outdated Show resolved Hide resolved

coderabbitai bot reviewed Oct 1, 2025

View reviewed changes

server/lib/chunkByOCR.ts Outdated Show resolved Hide resolved

feat(paddle): removing selectedFiles

7cace8d

junaid-shirur previously approved these changes Oct 1, 2025

View reviewed changes

Update vespa package

fc20598

junaid-shirur dismissed their stale review via fc20598 October 1, 2025 14:26

feat(paddle): resolving comments

b4baf23

shivamashtikar approved these changes Oct 1, 2025

View reviewed changes

shivamashtikar merged commit d7cbbe6 into main Oct 1, 2025
3 of 4 checks passed

coderabbitai bot mentioned this pull request Oct 6, 2025

Fix/ocr chunking pages and images #1049

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Feat/paddle #1026

Feat/paddle #1026

Uh oh!

Aayushjshah commented Oct 1, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 1, 2025 •

edited

Loading

Other AI code review bot(s) detected

Uh oh!

gemini-code-assist bot commented Oct 1, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Feat/paddle #1026

Feat/paddle #1026

Uh oh!

Conversation

Aayushjshah commented Oct 1, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Testing

Additional Notes

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Other AI code review bot(s) detected

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

Poem

Pre-merge checks and finishing touches

Uh oh!

gemini-code-assist bot commented Oct 1, 2025

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Aayushjshah commented Oct 1, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 1, 2025 •

edited

Loading