-
Notifications
You must be signed in to change notification settings - Fork 56
Feat/paddle #1026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/paddle #1026
Conversation
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds image-chunk selection and filename lines to collection file context, increases OCR layout parsing timeout and logging with improved error context, changes file processor to rethrow processing errors instead of returning fallback chunks, and bumps Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant U as Uploader/Caller
participant FP as FileProcessor
participant OCR as chunkByOCR
participant API as Layout Parsing API
U->>FP: process(file)
FP->>OCR: callLayoutParsingApi(file)
OCR->>API: POST v2/models/layout-parsing/infer
API-->>OCR: 200 OK (layout)
OCR->>OCR: Note over OCR: Log "Layout parsing API request succeeded, parsing response"
OCR-->>FP: parsed layout/chunks
FP-->>U: chunks
alt API error or parsing failure
API-->>OCR: Error/Timeout
OCR->>OCR: Note over OCR: Log error with fileName, fileSize, apiUrl
OCR-->>FP: throw Error("Layout parsing failed for <file>")
FP-->>U: throw Error("Processing <file> failed: <cause>")
end
sequenceDiagram
autonumber
participant C as ContextBuilder
participant CF as Collection Files
participant IMG as Image Chunks/Matches
C->>CF: fetch file details + chunk maps
C->>IMG: request image chunk info (image_chunks, image_chunks_pos, image_chunks_summary)
alt matchfeatures present
IMG-->>C: getSortedScoredImageChunks(matchfeatures, positions, summaryStrings)
else no matchfeatures
IMG-->>C: build placeholder image chunks from docId and positions
end
C->>C: limit to maxImageChunks (<=5), sort by original indices
C-->>C: assemble imageContent and append "Image File Names" when image_chunks_summary exists
C-->>CF: return final summary including image content (when present)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @Aayushjshah, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces enhancements to the AI context generation by incorporating image-related information, alongside significant improvements to the robustness and error handling of the document layout parsing process. It also refines the file processing error strategy to ensure that failures are explicitly propagated for better system-wide error management. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request introduces functionality for handling image chunks, likely from an OCR process, and improves error handling in file processing. I've found a critical bug in server/lib/chunkByOCR.ts
where an API URL is constructed incorrectly, which will cause API calls to fail. Additionally, I've provided a couple of refactoring suggestions for the code in server/ai/context.ts
to enhance its clarity and maintainability. The error handling improvements in server/services/fileProcessor.ts
and server/lib/chunkByOCR.ts
are positive changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (3)
server/lib/chunkByOCR.ts (1)
378-391
: Consider preserving the original error stack trace when rethrowing.The current implementation wraps the error in a new Error object, which may lose the original stack trace. While the error message is preserved, the stack trace is valuable for debugging.
Apply this diff to preserve the original error while enhancing the message:
} catch (error) { // Log the layout parsing API failure with context Logger.error(error, `Layout parsing API call failed for file: ${fileName}`, { fileName, fileSize: buffer.length, apiUrl, }) // Re-throw with enhanced error message for better debugging - if (error instanceof Error) { - throw new Error(`Layout parsing API failed for "${fileName}": ${error.message}`) - } else { - throw new Error(`Layout parsing API failed for "${fileName}": Unknown error occurred`) - } + if (error instanceof Error) { + error.message = `Layout parsing API failed for "${fileName}": ${error.message}` + throw error + } else { + throw new Error(`Layout parsing API failed for "${fileName}": Unknown error occurred`) + } } finally {server/ai/context.ts (2)
629-679
: Consider adding validation for image chunk data before processing.The code uses type assertions (
as number[]
,as string[]
) without validating thatimage_chunks_pos_summary
andimage_chunks_summary
are defined and have matching lengths. If these arrays are out of sync or undefined, it could lead to runtime errors or incorrect chunk construction.Apply this diff to add defensive checks:
let imageChunks: ScoredChunk[] = [] const maxImageChunks = fields.image_chunks_summary?.length && fields.image_chunks_summary?.length < 5 ? fields.image_chunks_summary?.length : 5 if (fields.matchfeatures) { const summaryStrings = fields.image_chunks_summary?.map((c) => typeof c === "string" ? c : c.chunk, ) || [] + // Only process if we have both summary and position data + if (summaryStrings.length > 0 && fields.image_chunks_pos_summary) { imageChunks = getSortedScoredImageChunks( fields.matchfeatures, fields.image_chunks_pos_summary as number[], summaryStrings as string[], fields.docId, ) + } } else { - const imageChunksPos = fields.image_chunks_pos_summary as number[] + const imageChunksPos = fields.image_chunks_pos_summary as number[] | undefined imageChunks = - fields.image_chunks_summary?.map((chunk, idx) => { + (fields.image_chunks_summary && imageChunksPos)?.map((chunk, idx) => { const result = { chunk: `${fields.docId}_${imageChunksPos[idx]}`, index: idx, score: 0, } return result }) || [] }
629-679
: Consider extracting image chunk processing into a reusable helper function.The image chunk processing logic in
constructCollectionFileContext
(lines 629-679) is nearly identical to the logic inconstructDataSourceFileContext
(lines 508-545). This duplication increases maintenance burden and the risk of inconsistencies.Consider extracting a helper function like:
function processImageChunks( fields: { matchfeatures?: unknown; image_chunks_summary?: unknown[]; image_chunks_pos_summary?: number[]; docId: string; }, isSelectedFiles?: boolean, maxImageChunks: number = 5 ): { chunks: ScoredChunk[]; content: string } { // Extracted logic here }This would centralize the logic and make it easier to maintain and test.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
server/ai/context.ts
(2 hunks)server/lib/chunkByOCR.ts
(4 hunks)server/services/fileProcessor.ts
(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/ai/context.ts (1)
server/utils.ts (1)
getRelativeTime
(109-133)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (3)
server/lib/chunkByOCR.ts (2)
301-301
: Verify that the 5-minute timeout doesn't exceed upstream limits.The timeout has been increased from 2 minutes to 5 minutes. Ensure that any API gateways, load balancers, or proxy servers in the request path have timeouts configured to at least 5 minutes, or they may terminate the connection before the OCR processing completes.
355-355
: LGTM!Adding a success log improves observability and helps track the processing pipeline.
server/services/fileProcessor.ts (1)
146-153
: Running context inspection for queue processor call site...
Description
Testing
Additional Notes
Summary by CodeRabbit
New Features
Refactor
Chores