Fix/process complete file in one go #1038
Walkthrough
Introduces MIME type detection via file-type in the knowledgeBase upload flow, persists the detected MIME type, and adjusts related logging and validation. Simplifies OCR chunking by removing PDF batching and making a single layout-API call. Adds the file-type dependency to server/package.json. No public API signatures changed.
Sequence Diagram(s)
sequenceDiagram
autonumber
participant Client
participant UploadFilesApi
participant detectMimeType
participant FileTypeLib as file-type (magic bytes)
participant Storage as Disk/DB
Client->>UploadFilesApi: POST /upload (file, browserMimeType)
UploadFilesApi->>UploadFilesApi: Save file buffer to disk
UploadFilesApi->>detectMimeType: detect(fileName, buffer, browserMimeType)
detectMimeType->>FileTypeLib: fileTypeFromBuffer(buffer)
alt Magic bytes found
FileTypeLib-->>detectMimeType: mime
detectMimeType-->>UploadFilesApi: mime (magic)
else Not found / error
detectMimeType-->>UploadFilesApi: mime (ext or browser or octet-stream)
end
UploadFilesApi->>Storage: Persist record with detectedMimeType
UploadFilesApi-->>Client: 201 Created (metadata)
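The fallback order in the diagram above (magic bytes, then extension, then browser-supplied type, then octet-stream) can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `resolveMimeType`, `MagicDetector`, and the `EXTENSION_MIME` table are hypothetical names, and the magic-byte detector is injected so the sketch runs without dependencies. In the real flow, `fileTypeFromBuffer` from file-type would presumably be passed in that position.

```typescript
// Hypothetical sketch of the fallback chain; only detectMimeType and
// fileTypeFromBuffer are names taken from the diagram.

// Shape of a magic-byte detector such as file-type's fileTypeFromBuffer.
type MagicDetector = (buffer: Uint8Array) => Promise<{ mime: string } | undefined>;

// Small illustrative extension table (assumption; the PR's mapping is not shown).
const EXTENSION_MIME = new Map<string, string>([
  ["pdf", "application/pdf"],
  ["png", "image/png"],
  ["txt", "text/plain"],
  ["md", "text/markdown"],
]);

// Pure fallback logic: magic bytes win, then extension, then the browser's
// value, and finally application/octet-stream.
function resolveMimeType(
  magicMime: string | undefined,
  fileName: string,
  browserMimeType?: string,
): string {
  if (magicMime) return magicMime;
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";
  const fromExtension = EXTENSION_MIME.get(ext);
  if (fromExtension) return fromExtension;
  if (browserMimeType && browserMimeType.trim() !== "") return browserMimeType;
  return "application/octet-stream";
}

// Async wrapper: a failed or empty magic-byte lookup falls through to the
// other sources instead of rejecting the upload.
async function detectMimeType(
  fileName: string,
  buffer: Uint8Array,
  browserMimeType: string | undefined,
  detectMagic: MagicDetector,
): Promise<string> {
  let magicMime: string | undefined;
  try {
    magicMime = (await detectMagic(buffer))?.mime;
  } catch {
    magicMime = undefined; // "Not found / error" branch of the diagram
  }
  return resolveMimeType(magicMime, fileName, browserMimeType);
}
```

Keeping the fallback logic in a pure function separates it from the I/O-bound detection step, which makes the ordering easy to unit test.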
sequenceDiagram
autonumber
participant Caller as Caller
participant chunkByOCR
participant LayoutAPI as Layout Parsing API
Note over chunkByOCR: New flow (no PDF batching)
Caller->>chunkByOCR: process(buffer)
chunkByOCR->>LayoutAPI: parse(buffer)
LayoutAPI-->>chunkByOCR: layout result
chunkByOCR-->>Caller: chunks
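As a rough illustration of the new flow above, the sketch below sends the whole buffer to the layout API in one request instead of splitting a PDF into batches. The `LayoutResult` shape, the `ParseLayout` type, and `layoutToChunks` are assumptions made for this example; the actual layout-API response and chunking rules are not shown on this page.

```typescript
// Hypothetical shapes: the real layout-API response is not shown in this PR.
interface LayoutBlock {
  text: string;
  page: number;
}
interface LayoutResult {
  blocks: LayoutBlock[];
}

// Injected client for the layout parsing API (assumption).
type ParseLayout = (fileBuffer: Uint8Array) => Promise<LayoutResult>;

// Turn layout blocks into text chunks, dropping empty blocks (illustrative rule).
function layoutToChunks(result: LayoutResult): string[] {
  return result.blocks
    .map((block) => block.text.trim())
    .filter((text) => text.length > 0);
}

// New flow: exactly one layout-API call for the complete file, with no PDF batching.
async function chunkByOCR(
  fileBuffer: Uint8Array,
  parseLayout: ParseLayout,
): Promise<string[]> {
  const layout = await parseLayout(fileBuffer);
  return layoutToChunks(layout);
}
```

With a single call, any size or timeout limit of the layout API applies to the whole file at once, which is the trade-off the review below raises.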
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~25 minutes
Summary of Changes
Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request refactors the file processing logic to handle entire files as a single unit, specifically by removing the previous batching mechanism for large PDFs. Concurrently, it introduces a more sophisticated and reliable method for detecting file MIME types, leveraging both file extensions and magic byte analysis. These changes aim to streamline file ingestion and improve the accuracy of file type identification within the system.
Code Review
This pull request introduces more robust MIME type detection for file uploads by using magic byte analysis, which is a significant improvement. It also refactors the OCR processing to handle each file in a single request, removing the previous PDF batching logic. While this simplifies the code, I have raised a high-severity concern about the removal of batching, as it could lead to performance issues or failures with large files that were previously handled. I've also included a medium-severity suggestion to simplify some buffer handling logic and make the code more concise.