Skip to content

Conversation

Aayushjshah
Copy link
Contributor

@Aayushjshah Aayushjshah commented Oct 1, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • Adds image context to summaries when available, including prioritized image snippets and listed image file names.
  • Refactor

    • Processing errors are now surfaced (errors rethrown) instead of returning fallback results, enabling clearer failure handling and retries.
  • Chores

    • Increased layout parsing timeout, normalized API URL handling, improved parsing success/error logging and messages, and updated an internal dependency to a patched release.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds image-chunk selection and filename lines to collection file context, increases OCR layout parsing timeout and logging with improved error context, changes file processor to rethrow processing errors instead of returning fallback chunks, and bumps @xyne/vespa-ts dependency.

Changes

Cohort / File(s) Summary of Changes
Context construction for image chunks
server/ai/context.ts
Adds image chunk handling: computes maxImageChunks (<=5), selects image chunks via getSortedScoredImageChunks when matchfeatures exist or builds placeholders otherwise, limits and sorts by original indices, assembles imageContent, and appends an "Image File Names" line when image_chunks_summary is present.
OCR layout parsing client
server/lib/chunkByOCR.ts
Normalizes API URL composition (avoids duplicate slashes), increases layout parsing timeout (120000 → 300000 ms), logs success on 200, and enhances error handling with contextual logging (fileName, fileSize, apiUrl) and descriptive rethrows.
File processing error handling
server/services/fileProcessor.ts
Removes fallback chunk creation on processing errors; now rethrows an error including the file name and original error details for upstream handling.
Dependency bump
server/package.json
Bumps @xyne/vespa-ts from 1.0.10 to 1.0.11.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Uploader/Caller
  participant FP as FileProcessor
  participant OCR as chunkByOCR
  participant API as Layout Parsing API

  U->>FP: process(file)
  FP->>OCR: callLayoutParsingApi(file)
  OCR->>API: POST v2/models/layout-parsing/infer
  API-->>OCR: 200 OK (layout)
  OCR->>OCR: Note over OCR: Log "Layout parsing API request succeeded, parsing response"
  OCR-->>FP: parsed layout/chunks
  FP-->>U: chunks

  alt API error or parsing failure
    API-->>OCR: Error/Timeout
    OCR->>OCR: Note over OCR: Log error with fileName, fileSize, apiUrl
    OCR-->>FP: throw Error("Layout parsing failed for <file>")
    FP-->>U: throw Error("Processing <file> failed: <cause>")
  end
Loading
sequenceDiagram
  autonumber
  participant C as ContextBuilder
  participant CF as Collection Files
  participant IMG as Image Chunks/Matches

  C->>CF: fetch file details + chunk maps
  C->>IMG: request image chunk info (image_chunks, image_chunks_pos, image_chunks_summary)
  alt matchfeatures present
    IMG-->>C: getSortedScoredImageChunks(matchfeatures, positions, summaryStrings)
  else no matchfeatures
    IMG-->>C: build placeholder image chunks from docId and positions
  end
  C->>C: limit to maxImageChunks (<=5), sort by original indices
  C-->>C: assemble imageContent and append "Image File Names" when image_chunks_summary exists
  C-->>CF: return final summary including image content (when present)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Feat/paddle #1019: Overlaps OCR/chunking changes and introduces chunks_map/image_chunks_map structures used by the new image chunk handling.
  • feat(chunk): Sort and filter chunks based on score #356: Adds sorted/scored chunk selection logic for text chunks; conceptually parallels the newly added image chunk selection.
  • Context fixes #664: Adjusts image filename extraction/propagation in the RAG/context pipeline, related to the appended "Image File Names" behavior.

Suggested reviewers

  • zereraz
  • shivamashtikar
  • kalpadhwaryu
  • devesh-juspay

Poem

I nibble bytes and hop through lines of code,
Image crumbs gathered, trimmed to a bounded load.
Timeouts stretch, logs gleam bright and clear,
Errors now bounce up for handlers to hear.
I twitch my whiskers — filenames appear! 🐇✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Title Check ⚠️ Warning The title “Feat/paddle” is derived from the branch name and does not describe any of the substantive changes in the pull request, such as adding image chunk handling to the AI context builder or improving error logging in the OCR and file processing services. It fails to convey the primary feature or intent of the changeset to reviewers scanning the history. Rename the title to a concise, descriptive summary of the main feature, for example “Add image chunk support to constructCollectionFileContext and improve OCR error logging,” so that it clearly reflects the key changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feat/paddle

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fc20598 and b4baf23.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/lib/chunkByOCR.ts
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces enhancements to the AI context generation by incorporating image-related information, alongside significant improvements to the robustness and error handling of the document layout parsing process. It also refines the file processing error strategy to ensure that failures are explicitly propagated for better system-wide error management.

Highlights

  • Image Chunk Integration: The AI context generation now includes image chunk summaries, allowing the AI to leverage visual information from documents for a more comprehensive understanding.
  • Improved Layout Parsing Robustness: The layout parsing API timeout has been extended from 2 to 5 minutes, and error logging and re-throwing have been enhanced for better debugging and reliability during document processing.
  • Refined File Processing Error Handling: File processing failures now explicitly re-throw errors with detailed messages instead of generating basic fallback chunks, enabling more robust upstream error management and clearer failure signals.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces functionality for handling image chunks, likely from an OCR process, and improves error handling in file processing. I've found a critical bug in server/lib/chunkByOCR.ts where an API URL is constructed incorrectly, which will cause API calls to fail. Additionally, I've provided a couple of refactoring suggestions for the code in server/ai/context.ts to enhance its clarity and maintainability. The error handling improvements in server/services/fileProcessor.ts and server/lib/chunkByOCR.ts are positive changes.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
server/lib/chunkByOCR.ts (1)

378-391: Consider preserving the original error stack trace when rethrowing.

The current implementation wraps the error in a new Error object, which may lose the original stack trace. While the error message is preserved, the stack trace is valuable for debugging.

Apply this diff to preserve the original error while enhancing the message:

   } catch (error) {
     // Log the layout parsing API failure with context
     Logger.error(error, `Layout parsing API call failed for file: ${fileName}`, {
       fileName,
       fileSize: buffer.length,
       apiUrl,
     })
     
     // Re-throw with enhanced error message for better debugging
-    if (error instanceof Error) {
-      throw new Error(`Layout parsing API failed for "${fileName}": ${error.message}`)
-    } else {
-      throw new Error(`Layout parsing API failed for "${fileName}": Unknown error occurred`)
-    }
+    if (error instanceof Error) {
+      error.message = `Layout parsing API failed for "${fileName}": ${error.message}`
+      throw error
+    } else {
+      throw new Error(`Layout parsing API failed for "${fileName}": Unknown error occurred`)
+    }
   } finally {
server/ai/context.ts (2)

629-679: Consider adding validation for image chunk data before processing.

The code uses type assertions (as number[], as string[]) without validating that image_chunks_pos_summary and image_chunks_summary are defined and have matching lengths. If these arrays are out of sync or undefined, it could lead to runtime errors or incorrect chunk construction.

Apply this diff to add defensive checks:

   let imageChunks: ScoredChunk[] = []
   const maxImageChunks =
     fields.image_chunks_summary?.length &&
     fields.image_chunks_summary?.length < 5
       ? fields.image_chunks_summary?.length
       : 5
 
 
   if (fields.matchfeatures) {
 
     const summaryStrings = fields.image_chunks_summary?.map((c) =>
       typeof c === "string" ? c : c.chunk,
     ) || []
     
+    // Only process if we have both summary and position data
+    if (summaryStrings.length > 0 && fields.image_chunks_pos_summary) {
       imageChunks = getSortedScoredImageChunks(
         fields.matchfeatures,
         fields.image_chunks_pos_summary as number[],
         summaryStrings as string[],
         fields.docId,
       )
+    }
   } else {
-    const imageChunksPos = fields.image_chunks_pos_summary as number[]
+    const imageChunksPos = fields.image_chunks_pos_summary as number[] | undefined
   
     imageChunks =
-      fields.image_chunks_summary?.map((chunk, idx) => {
+      (fields.image_chunks_summary && imageChunksPos)?.map((chunk, idx) => {
         const result = {
           chunk: `${fields.docId}_${imageChunksPos[idx]}`,
           index: idx,
           score: 0,
         }
         return result
       }) || []
   }

629-679: Consider extracting image chunk processing into a reusable helper function.

The image chunk processing logic in constructCollectionFileContext (lines 629-679) is nearly identical to the logic in constructDataSourceFileContext (lines 508-545). This duplication increases maintenance burden and the risk of inconsistencies.

Consider extracting a helper function like:

function processImageChunks(
  fields: {
    matchfeatures?: unknown;
    image_chunks_summary?: unknown[];
    image_chunks_pos_summary?: number[];
    docId: string;
  },
  isSelectedFiles?: boolean,
  maxImageChunks: number = 5
): { chunks: ScoredChunk[]; content: string } {
  // Extracted logic here
}

This would centralize the logic and make it easier to maintain and test.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 23102ee and 65ed7ba.

📒 Files selected for processing (3)
  • server/ai/context.ts (2 hunks)
  • server/lib/chunkByOCR.ts (4 hunks)
  • server/services/fileProcessor.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/ai/context.ts (1)
server/utils.ts (1)
  • getRelativeTime (109-133)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (3)
server/lib/chunkByOCR.ts (2)

301-301: Verify that the 5-minute timeout doesn't exceed upstream limits.

The timeout has been increased from 2 minutes to 5 minutes. Ensure that any API gateways, load balancers, or proxy servers in the request path have timeouts configured to at least 5 minutes, or they may terminate the connection before the OCR processing completes.


355-355: LGTM!

Adding a success log improves observability and helps track the processing pipeline.

server/services/fileProcessor.ts (1)

146-153: Running context inspection for queue processor call site...

junaid-shirur
junaid-shirur previously approved these changes Oct 1, 2025
@shivamashtikar shivamashtikar merged commit d7cbbe6 into main Oct 1, 2025
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants