Fix/process complete file in one go #1038

Aayushjshah · 2025-10-03T14:14:17Z

Description

Testing

Additional Notes

Summary by CodeRabbit

New Features
- More accurate file type detection during uploads, using content-based checks with safe fallbacks for unknown types.
Performance
- Streamlined OCR processing for PDFs with a single-call flow, reducing overhead and improving reliability.
Refactor
- Simplified control flow in OCR processing and adjusted logging.
- Minor input validation and indexing logic updates without changing behavior.
Chores
- Added a new dependency to support file type detection.

coderabbitai · 2025-10-03T14:14:28Z

Walkthrough

Introduces MIME type detection via file-type in knowledgeBase upload flow, persists detected MIME, and adjusts related logging/validation. Simplifies OCR chunking by removing PDF batching and making a single layout-API call. Adds file-type dependency to server/package.json. No public API signatures changed.

Changes

| Cohort / File(s) | Summary of Changes |
|---|---|
| MIME detection integration: `server/api/knowledgeBase.ts` | Added EXTENSION_MIME_MAP and detectMimeType using file-type magic bytes with fallback order (magic > extension map > browser > application/octet-stream). Integrated into UploadFilesApi to persist detectedMimeType. Minor control-flow/logging updates in GetChunkContentApi and PollCollectionsStatusApi. |
| OCR flow simplification: `server/lib/chunkByOCR.ts` | Removed PDF batching logic; now always makes a single call to the layout parsing API. Simplified error handling and logs accordingly; post-processing unchanged. |
| Dependency addition: `server/package.json` | Added dependency: "file-type" ^21.0.0. |
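
The fallback order listed above (magic > extension map > browser > application/octet-stream) can be sketched as a small pure function. This is a minimal sketch under stated assumptions, not the PR's actual `detectMimeType`: the name `resolveMimeType`, its parameters, and the map entries below are hypothetical. The `magicMime` argument stands in for the `mime` field of the result of `fileTypeFromBuffer` from `file-type`, which resolves to `undefined` when no known signature matches.

```typescript
// Hypothetical subset of an extension map; the PR's EXTENSION_MIME_MAP is not shown in this thread.
const EXTENSION_MIME_MAP: Record<string, string> = {
  ".pdf": "application/pdf",
  ".md": "text/markdown",
  ".csv": "text/csv",
  ".txt": "text/plain",
}

// Resolve a MIME type using the order: magic bytes > extension map > browser > octet-stream.
// `magicMime` is assumed to be the already-sniffed result (or undefined if sniffing failed).
function resolveMimeType(
  magicMime: string | undefined,
  fileName: string,
  browserMimeType?: string,
): string {
  // 1. Trust content-based detection first.
  if (magicMime) return magicMime

  // 2. Fall back to the file extension (case-insensitive).
  const dot = fileName.lastIndexOf(".")
  const ext = dot >= 0 ? fileName.slice(dot).toLowerCase() : ""
  const fromExtension = EXTENSION_MIME_MAP[ext]
  if (fromExtension) return fromExtension

  // 3. Then whatever the browser reported, if anything.
  if (browserMimeType && browserMimeType.trim() !== "") return browserMimeType

  // 4. Safe default for unknown content.
  return "application/octet-stream"
}
```

Keeping the ordering in a pure function like this makes it easy to test without touching the filesystem; text formats such as Markdown or CSV have no magic bytes, which is why the extension step matters.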

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Client
  participant UploadFilesApi
  participant detectMimeType
  participant FileTypeLib as file-type (magic bytes)
  participant Storage as Disk/DB

  Client->>UploadFilesApi: POST /upload (file, browserMimeType)
  UploadFilesApi->>UploadFilesApi: Save file buffer to disk
  UploadFilesApi->>detectMimeType: detect(fileName, buffer, browserMimeType)
  detectMimeType->>FileTypeLib: fileTypeFromBuffer(buffer)
  alt Magic bytes found
    FileTypeLib-->>detectMimeType: mime
    detectMimeType-->>UploadFilesApi: mime (magic)
  else Not found / error
    detectMimeType-->>UploadFilesApi: mime (ext or browser or octet-stream)
  end
  UploadFilesApi->>Storage: Persist record with detectedMimeType
  UploadFilesApi-->>Client: 201 Created (metadata)

sequenceDiagram
  autonumber
  participant Caller as Caller
  participant chunkByOCR
  participant LayoutAPI as Layout Parsing API

  Note over chunkByOCR: New flow (no PDF batching)
  Caller->>chunkByOCR: process(buffer)
  chunkByOCR->>LayoutAPI: parse(buffer)
  LayoutAPI-->>chunkByOCR: layout result
  chunkByOCR-->>Caller: chunks

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Feat/paddle #1019 — Also modifies server/lib/chunkByOCR.ts, server/api/knowledgeBase.ts, and server/package.json around PDF processing and MIME/processing logic.

Suggested reviewers

zereraz
shivamashtikar
kalpadhwaryu
junaid-shirur
devesh-juspay

Poem

I sniffed the bytes—sniff sniff, hop!
Found the MIME, let guesswork stop.
One leap for OCR, no batchy maze,
Straight to layout, swift as rays.
New deps nibbled, logs aligned—
Carrots compiled, reviews refined. 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 33.33%, which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title highlights the change to process files in one go, which matches the removal of PDF batching in chunkByOCR, but it omits the substantial MIME detection improvements added to the upload flow and knowledgeBase module. |

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e51ebaa and 60d4026.

📒 Files selected for processing (3)

server/api/knowledgeBase.ts (6 hunks)
server/lib/chunkByOCR.ts (1 hunks)
server/package.json (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (1)

server/api/knowledgeBase.ts (2)

server/utils.ts (1)

getErrorMessage (103-106)

server/types.ts (1)

ChunkMetadata (608-612)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: build

🔇 Additional comments (1)

server/lib/chunkByOCR.ts (1)

557-558: Confirm layout API can handle full-document uploads

We now always send the entire PDF in one shot. The layout service used to need 30-page batching due to payload/time limits; without that guard, very large files could time out or blow past request-size caps (base64 grows ~33%). Please confirm the upstream API was hardened for this scenario, or consider keeping a fallback batch path for oversized documents.
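
The size concern above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical request-size cap (`maxRequestBytes`) and hypothetical helper names; it is not code from the PR. It computes the exact padded base64 length for a payload and shows how 30-page ranges could be produced if a fallback batch path were kept.

```typescript
// Exact length of standard padded base64 output for `byteLength` input bytes:
// every 3 input bytes become 4 output characters (about 33% growth).
function base64EncodedLength(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3)
}

// Hypothetical guard: true when the encoded PDF would exceed the assumed cap.
function needsBatching(pdfByteLength: number, maxRequestBytes: number): boolean {
  return base64EncodedLength(pdfByteLength) > maxRequestBytes
}

// Split pages 1..totalPages into inclusive [start, end] ranges of at most `batchSize` pages.
function pageBatches(totalPages: number, batchSize = 30): Array<[number, number]> {
  const batches: Array<[number, number]> = []
  for (let start = 1; start <= totalPages; start += batchSize) {
    batches.push([start, Math.min(start + batchSize - 1, totalPages)])
  }
  return batches
}
```

For example, a 65-page document with the default batch size yields the ranges 1-30, 31-60, and 61-65. Whether the layout API actually enforces a size cap, and what it is, is not stated in this thread.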

gemini-code-assist · 2025-10-03T14:14:34Z

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the file processing logic to handle entire files as a single unit, specifically by removing the previous batching mechanism for large PDFs. Concurrently, it introduces a more sophisticated and reliable method for detecting file MIME types, leveraging both file extensions and magic byte analysis. These changes aim to streamline file ingestion and improve the accuracy of file type identification within the system.

Highlights

- Unified File Processing: The system will now process entire files, including large PDFs, in a single operation, removing previous batching logic that split PDFs into smaller chunks.
- Enhanced MIME Type Detection: A more robust MIME type detection mechanism has been implemented, leveraging magic bytes (via the file-type library) and a comprehensive extension map to improve file classification accuracy during uploads.
- Dependency Update: The file-type library has been added as a new dependency to support the advanced MIME type detection capabilities.
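
To illustrate what "magic bytes" means here: many binary formats begin with a fixed signature, and a detector compares the first bytes of the file against known signatures. The sketch below checks only the PDF header (`%PDF-`) at offset 0. It is an illustration of the idea, not how the `file-type` library is implemented, and the function name `looksLikePdf` is hypothetical.

```typescript
// ASCII bytes for "%PDF-", the header that PDF files start with.
const PDF_SIGNATURE = [0x25, 0x50, 0x44, 0x46, 0x2d]

// Returns true when the buffer starts with the PDF signature at offset 0.
// Real detectors cover many formats and are more lenient about edge cases.
function looksLikePdf(bytes: Uint8Array): boolean {
  if (bytes.length < PDF_SIGNATURE.length) return false
  return PDF_SIGNATURE.every((expected, i) => bytes[i] === expected)
}
```

A check like this depends on the file's content rather than its name, so a PDF uploaded as `report.txt` would still be recognized.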

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces more robust MIME type detection for file uploads by using magic byte analysis, which is a significant improvement. It also refactors the OCR processing to handle files in a single request, removing the previous PDF batching logic. While this simplifies the code, I have raised a high-severity concern about the removal of this batching, as it could lead to performance issues or failures with large files that were previously handled. I've also included a medium-severity suggestion to simplify some buffer handling logic to make the code more concise.

server/lib/chunkByOCR.ts

server/api/knowledgeBase.ts

Aayushjshah added 5 commits October 3, 2025 19:08

fix(KBmimType): adding mimeType detection

8a65fd1

fix(processCompleteFileInOneGo): removing splitting of pdf logic

9a629ff

fix(KBmimType): adding mimeType detection

2912eea

Merge branch 'fix/KBmimeType' into fix/processCompleteFileInOneGo

f0053ac

Merge branch 'main' into fix/processCompleteFileInOneGo

60d4026

Aayushjshah requested review from devesh-juspay, junaid-shirur, kalpadhwaryu, shivamashtikar and zereraz as code owners October 3, 2025 14:14

gemini-code-assist bot reviewed Oct 3, 2025

server/lib/chunkByOCR.ts (resolved)

server/api/knowledgeBase.ts (resolved)

shivamashtikar approved these changes Oct 3, 2025

shivamashtikar merged commit 2808ab4 into main Oct 3, 2025
3 of 4 checks passed

coderabbitai bot mentioned this pull request Oct 3, 2025

feat(pdf-batch-upload-pdf-worker): Knowledgebase-Vespa-Ingestion-Thre… #1037

Merged

2 participants
