@Aayushjshah Aayushjshah commented Oct 3, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • More accurate file type detection during uploads, using content-based checks with safe fallbacks for unknown types.
  • Performance

    • Streamlined OCR processing for PDFs with a single-call flow, reducing overhead and improving reliability.
  • Refactor

    • Simplified control flow in OCR processing and adjusted logging.
    • Minor input validation and indexing logic updates without changing behavior.
  • Chores

    • Added a new dependency to support file type detection.

coderabbitai bot commented Oct 3, 2025

Walkthrough

Introduces MIME type detection via file-type in knowledgeBase upload flow, persists detected MIME, and adjusts related logging/validation. Simplifies OCR chunking by removing PDF batching and making a single layout-API call. Adds file-type dependency to server/package.json. No public API signatures changed.

Changes

Cohort / File(s) | Summary of Changes
MIME detection integration (server/api/knowledgeBase.ts) | Added EXTENSION_MIME_MAP and detectMimeType using file-type magic bytes with fallback order (magic > extension map > browser > application/octet-stream). Integrated into UploadFilesApi to persist detectedMimeType. Minor control-flow/logging updates in GetChunkContentApi and PollCollectionsStatusApi.
OCR flow simplification (server/lib/chunkByOCR.ts) | Removed PDF batching logic; now always makes a single call to the layout parsing API. Simplified error handling and logs accordingly; post-processing unchanged.
Dependency addition (server/package.json) | Added dependency: "file-type" ^21.0.0.
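
The fallback order listed for detectMimeType (magic bytes, then extension map, then browser-supplied type, then application/octet-stream) can be sketched as a small pure function. This is a minimal illustration, not the code in server/api/knowledgeBase.ts: pickMimeType, extensionOf, and the map entries are assumptions made for the example, and magicMime stands in for the mime field that file-type's fileTypeFromBuffer would return (or undefined when nothing matched).

```typescript
// Hypothetical extension map; the PR's real EXTENSION_MIME_MAP is not shown in this thread.
const EXTENSION_MIME_MAP: Record<string, string> = {
  ".pdf": "application/pdf",
  ".md": "text/markdown",
  ".csv": "text/csv",
  ".txt": "text/plain",
};

const FALLBACK_MIME = "application/octet-stream";

// Returns the lowercase extension including the dot, or "" when there is none.
function extensionOf(fileName: string): string {
  const dot = fileName.lastIndexOf(".");
  return dot > 0 ? fileName.slice(dot).toLowerCase() : "";
}

// Applies the priority order: magic bytes > extension map > browser > octet-stream.
function pickMimeType(
  fileName: string,
  magicMime: string | undefined,
  browserMime: string | undefined,
): string {
  if (magicMime) return magicMime;
  const fromExtension = EXTENSION_MIME_MAP[extensionOf(fileName)];
  if (fromExtension) return fromExtension;
  if (browserMime) return browserMime;
  return FALLBACK_MIME;
}
```

Keeping the decision separate from the asynchronous sniffing call makes the priority order easy to unit test; the caller would await the sniffer, catch any error, and pass undefined for magicMime on failure.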

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant UploadFilesApi
  participant detectMimeType
  participant FileTypeLib as file-type (magic bytes)
  participant Storage as Disk/DB

  Client->>UploadFilesApi: POST /upload (file, browserMimeType)
  UploadFilesApi->>UploadFilesApi: Save file buffer to disk
  UploadFilesApi->>detectMimeType: detect(fileName, buffer, browserMimeType)
  detectMimeType->>FileTypeLib: fileTypeFromBuffer(buffer)
  alt Magic bytes found
    FileTypeLib-->>detectMimeType: mime
    detectMimeType-->>UploadFilesApi: mime (magic)
  else Not found / error
    detectMimeType-->>UploadFilesApi: mime (ext or browser or octet-stream)
  end
  UploadFilesApi->>Storage: Persist record with detectedMimeType
  UploadFilesApi-->>Client: 201 Created (metadata)
sequenceDiagram
  autonumber
  participant Caller as Caller
  participant chunkByOCR
  participant LayoutAPI as Layout Parsing API

  Note over chunkByOCR: New flow (no PDF batching)
  Caller->>chunkByOCR: process(buffer)
  chunkByOCR->>LayoutAPI: parse(buffer)
  LayoutAPI-->>chunkByOCR: layout result
  chunkByOCR-->>Caller: chunks

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Feat/paddle #1019 — Also modifies server/lib/chunkByOCR.ts, server/api/knowledgeBase.ts, and server/package.json around PDF processing and MIME/processing logic.

Suggested reviewers

  • zereraz
  • shivamashtikar
  • kalpadhwaryu
  • junaid-shirur
  • devesh-juspay

Poem

I sniffed the bytes—sniff sniff, hop!
Found the MIME, let guesswork stop.
One leap for OCR, no batchy maze,
Straight to layout, swift as rays.
New deps nibbled, logs aligned—
Carrots compiled, reviews refined. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The title highlights the change to process files in one go, which matches the removal of PDF batching in chunkByOCR, but it omits the substantial MIME detection improvements added to the upload flow and knowledgeBase module.
📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e51ebaa and 60d4026.

📒 Files selected for processing (3)
  • server/api/knowledgeBase.ts (6 hunks)
  • server/lib/chunkByOCR.ts (1 hunks)
  • server/package.json (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/api/knowledgeBase.ts (2)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/types.ts (1)
  • ChunkMetadata (608-612)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (1)
server/lib/chunkByOCR.ts (1)

557-558: Confirm layout API can handle full-document uploads

We now always send the entire PDF in one shot. The layout service used to need 30-page batching due to payload/time limits; without that guard, very large files could time out or blow past request-size caps (base64 grows ~33%). Please confirm the upstream API was hardened for this scenario, or consider keeping a fallback batch path for oversized documents.
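
To make the size concern concrete, the sketch below estimates the base64 payload for a PDF and falls back to page batches only when the estimate exceeds a cap. It is an illustration under stated assumptions, not the removed code from server/lib/chunkByOCR.ts: planPageBatches and base64Length are hypothetical names, the 20 MiB cap is a placeholder because the layout API's real limit is not stated here, and the 30-page batch size is taken from the comment above.

```typescript
// Base64 emits 4 output characters for every 3 input bytes (padded), hence ~33% growth.
function base64Length(byteLength: number): number {
  return Math.ceil(byteLength / 3) * 4;
}

// Placeholder cap; the actual request-size limit of the layout API is an assumption.
const MAX_REQUEST_BYTES = 20 * 1024 * 1024;
// Batch size mentioned in the review comment.
const PAGES_PER_BATCH = 30;

interface PageRange {
  start: number; // 1-based, inclusive
  end: number; // 1-based, inclusive
}

// Returns one range covering the whole document when the encoded payload fits,
// otherwise consecutive ranges of at most pagesPerBatch pages.
function planPageBatches(
  totalPages: number,
  pdfBytes: number,
  maxRequestBytes: number = MAX_REQUEST_BYTES,
  pagesPerBatch: number = PAGES_PER_BATCH,
): PageRange[] {
  if (totalPages <= 0) return [];
  if (base64Length(pdfBytes) <= maxRequestBytes) {
    return [{ start: 1, end: totalPages }];
  }
  const batches: PageRange[] = [];
  for (let start = 1; start <= totalPages; start += pagesPerBatch) {
    batches.push({ start, end: Math.min(start + pagesPerBatch - 1, totalPages) });
  }
  return batches;
}
```

With these placeholder numbers, a 30,000,000-byte PDF encodes to 40,000,000 base64 characters, which exceeds the cap, so a 70-page document would be split into pages 1-30, 31-60, and 61-70. A small document keeps the single-call path the PR introduces.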


@gemini-code-assist

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the file processing logic to handle entire files as a single unit, specifically by removing the previous batching mechanism for large PDFs. Concurrently, it introduces a more sophisticated and reliable method for detecting file MIME types, leveraging both file extensions and magic byte analysis. These changes aim to streamline file ingestion and improve the accuracy of file type identification within the system.

Highlights

  • Unified File Processing: The system will now process entire files, including large PDFs, in a single operation, removing previous batching logic that split PDFs into smaller chunks.
  • Enhanced MIME Type Detection: A more robust MIME type detection mechanism has been implemented, leveraging magic bytes (via the file-type library) and a comprehensive extension map to improve file classification accuracy during uploads.
  • Dependency Update: The file-type library has been added as a new dependency to support the advanced MIME type detection capabilities.
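
As a rough illustration of what "magic bytes" means in the highlights above, the sketch below compares the first bytes of a buffer against a few well-known file signatures. It is a deliberately tiny stand-in and makes no claim about how the file-type library works internally; sniffMime and the SIGNATURES table are names invented for this example, and a real detector covers far more formats.

```typescript
// A few well-known leading byte signatures.
const SIGNATURES: Array<{ mime: string; bytes: number[] }> = [
  { mime: "application/pdf", bytes: [0x25, 0x50, 0x44, 0x46, 0x2d] }, // "%PDF-"
  { mime: "image/png", bytes: [0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a] },
  { mime: "application/zip", bytes: [0x50, 0x4b, 0x03, 0x04] }, // "PK\x03\x04"
];

// Returns the MIME type of the first signature matching the start of the buffer,
// or undefined when nothing matches (for example, plain text).
function sniffMime(buffer: Uint8Array): string | undefined {
  for (const { mime, bytes } of SIGNATURES) {
    const matches =
      buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b);
    if (matches) return mime;
  }
  return undefined;
}
```

Because the check reads the file's content rather than its name, a PDF uploaded as report.bin is still recognized, while content with no recognizable signature returns undefined and leaves the decision to the extension and browser fallbacks.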
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a more robust MIME type detection for file uploads by using magic byte analysis, which is a significant improvement. It also refactors the OCR processing to handle files in a single request, removing the previous PDF batching logic. While this simplifies the code, I have raised a high-severity concern about the removal of this batching, as it could lead to performance issues or failures with large files that were previously handled. I've also included a medium-severity suggestion to simplify some buffer handling logic for better code conciseness.

@shivamashtikar shivamashtikar merged commit 2808ab4 into main Oct 3, 2025
3 of 4 checks passed