
Conversation

@Aayushjshah (Contributor) commented Sep 30, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • OCR-based PDF ingestion with page-aware text and image extraction, batching, and robust processing.
    • Rich per-chunk metadata for text and images (index, page, labels) stored with items to enable improved previews, filtering, and navigation.
  • Improvements

    • More consistent filename quoting for downloads/attachments.
  • Refactor / Chores

    • Minor formatting and logging refinements; updated dependency version.

@coderabbitai bot (Contributor) commented Sep 30, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds an OCR-based document chunking pipeline and a new ChunkMetadata type; persists chunk metadata maps (chunks_map, image_chunks_map) through file processing, queueing, the APIs, and the Vespa schema; and applies minor import, logging, and formatting updates across related files.

Changes

  • OCR chunking pipeline (new), server/lib/chunkByOCR.ts: New OCR-driven chunker: layout API calls, batching, image extraction/saving, block normalization, text/image chunking, and assembly of chunks, image_chunks, and metadata maps; exports ProcessingResult, chunkByOCRFromBuffer, and chunkByOCR.
  • File processing service, server/services/fileProcessor.ts: Replaces Gemini PDF extraction with the OCR flow via chunkByOCRFromBuffer; adds a module logger; extends ProcessingResult to include chunks_map and image_chunks_map; builds default maps for non-PDF paths; returns OCR results for PDFs.
  • Queue → Vespa integration, server/queue/fileProcessor.ts: Adds chunks_map and image_chunks_map to Vespa document payloads for file/collection/folder processing; initializes empty arrays for collection/folder docs; insertion flow otherwise unchanged.
  • API adjustments (knowledge base), server/api/knowledgeBase.ts: Accepts both the legacy number[] and the new ChunkMetadata[] chunk-position formats; import and logging tweaks; queue invocation reformatted (no semantic change).
  • Files API / attachments, server/api/files.ts: Adds chunks_map and image_chunks_map to attachment Vespa docs for non-image files; consistently quotes filenames in Content-Disposition when serving attachments; error extraction unchanged.
  • Shared types, server/types.ts: Adds the exported ChunkMetadata type: { chunk_index: number; page_number: number; block_labels: string[] } (see the sketch after this list).
  • Vespa schema, server/vespa/schemas/kb_items.sd: Adds a chunk_meta struct and new document fields chunks_map and image_chunks_map (array<chunk_meta>) with attribute/summary mappings for chunk_index and page_number.
  • Misc formatting & dependency, server/api/knowledgeBase.ts, server/docxChunks.ts, server/package.json: Minor import/spacing/logging changes and multi-line queue invocation formatting; bumps @xyne/vespa-ts from ^1.0.8 to 1.0.10.
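
For reference, a restatement of the new type exactly as the entry for server/types.ts describes it (the comments are explanatory additions):

// Shape of per-chunk metadata (server/types.ts), per the summary above.
export type ChunkMetadata = {
  chunk_index: number // position of the chunk within the document
  page_number: number // source page (0 for non-paginated inputs)
  block_labels: string[] // OCR layout labels, e.g. ["text"] or ["image"]
}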

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Uploader
  participant FPS as FileProcessorService
  participant OCR as Layout/OCR API
  participant FS as Disk
  participant Q as FileProcessingQueue
  participant V as Vespa

  Uploader->>FPS: process(fileBuffer, fileName, docId)
  alt PDF / multi-page
    FPS->>OCR: POST /parse (base64 payload)
    OCR-->>FPS: OcrResponse (pages, blocks, images)
    FPS->>FPS: chunkByOCR(...) ⇒ chunks, image_chunks, chunks_map, image_chunks_map
    FPS->>FS: save extracted images (document dir)
  else Non-PDF or fallback
    FPS->>FPS: fallback chunking ⇒ chunks + default maps
  end
  FPS-->>Q: enqueue vespaDoc {chunks, image_chunks, chunks_map, image_chunks_map}
  Q->>V: insert/update `kb_items` (includes chunk maps)
  V-->>Q: ack
  Q-->>Uploader: processing complete
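A minimal TypeScript sketch of the enqueue step from the diagram, using the field names from the walkthrough; the exact vespaDoc shape and import paths are assumptions, not the repository's literal code:

import { boss, FileProcessingQueue } from "@/queue/api-server-queue"
import type { ChunkMetadata } from "@/types"

// Hypothetical enqueue helper: the chunk maps ride along with the chunks
// so the queue worker can persist them into the kb_items Vespa document.
async function enqueueForIndexing(
  docId: string,
  chunks: string[],
  image_chunks: string[],
  chunks_map: ChunkMetadata[],
  image_chunks_map: ChunkMetadata[],
): Promise<void> {
  await boss.send(FileProcessingQueue, {
    docId,
    chunks,
    image_chunks,
    chunks_map, // per-text-chunk metadata
    image_chunks_map, // per-image-chunk metadata
  })
}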

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • zereraz
  • shivamashtikar
  • kalpadhwaryu
  • junaid-shirur
  • devesh-juspay

Poem

I nibble bytes and hop through text,
I map each chunk, each image indexed.
Pages parsed, maps tucked in a row,
Vespa stores what I bestow.
Hop, queue, ingest — a rabbit's glow 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Title Check (⚠️ Warning): The title “Feat/paddle” does not relate to the significant changes in this pull request, which center on introducing an OCR-based chunking pipeline, adding chunk metadata maps throughout the processing flow, and updating schemas to support those maps; there is no “paddle” feature or code in the diff, making the title misleading. Resolution: rename the pull request to reflect its primary change, for example “Add OCR-based document chunking and chunk metadata maps.”

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage (✅ Passed): No functions found in the changes, so the docstring coverage check was skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e847037 and 601945d.

📒 Files selected for processing (1)
  • server/package.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/package.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build


@gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the document processing pipeline by integrating a robust OCR-based chunking solution. The core change involves replacing the existing PDF extraction method with a new module that leverages an external layout parsing API to extract text and images from documents, along with rich metadata such as page numbers and block types. This enhancement aims to improve the accuracy and detail of document chunking, particularly for complex PDF layouts, and ensures that this granular information is available throughout the system for better knowledge base management and retrieval.

Highlights

  • New OCR-based Document Chunking: A new module (server/lib/chunkByOCR.ts) has been introduced to process documents, particularly PDFs, using an external layout parsing API. This system extracts both text and images, generating detailed metadata for each chunk, including page numbers and block labels.
  • PDF Processing Rerouted: The FileProcessorService now directs PDF file processing to the newly implemented OCR-based chunking mechanism, replacing the previous Gemini-based extraction method for improved accuracy and detail.
  • Enhanced Chunk Metadata: A new ChunkMetadata type has been defined and integrated, allowing for richer information (chunk index, page number, block labels) to be stored with both text and image chunks, providing more granular document representation.
  • System Integration and Compatibility: The knowledgeBase API and fileProcessor queue have been updated to seamlessly handle and persist the new ChunkMetadata structure. Additionally, logic was added to ensure backward compatibility for existing documents and to provide default metadata for non-OCR processed file types.
  • Comprehensive Testing: Multiple new test scripts (server/scripts/testOCRFromProcessFile.ts, server/test-chunkByOCR-enhanced.ts, server/test-chunkByOCR.ts) have been added to thoroughly validate the new OCR chunking functionality, including mocking external API interactions.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new OCR-based file processing pipeline, primarily for PDFs, replacing the previous Gemini-based implementation. It adds a new chunkByOCR service and integrates it into the FileProcessorService. The data structures for chunking results have been expanded to include more detailed metadata (chunks_map, image_chunks_map), and these changes are propagated through the file processing and Vespa indexing logic. Several new test scripts have been added to validate the new OCR functionality. My review focuses on some critical issues in the new implementation and tests, including a syntax error, incorrect environment variable parsing, hardcoded local paths in tests that will break CI/CD, and broken test logic. There are also several opportunities to improve code quality by using a structured logger instead of console.log and improving type safety.

@coderabbitai bot left a comment

Actionable comments posted: 16

🧹 Nitpick comments (5)
server/services/fileProcessor.ts (1)

3-3: Remove commented-out import.

The commented import on line 3 appears to be dead code from the old PDF processing approach. Clean it up to reduce clutter.

-// import { extractTextAndImagesWithChunksFromPDF } from "@/pdf
server/scripts/testOCRFromProcessFile.ts (1)

8-8: Empty default PDF path may cause confusing errors.

If no PDF path is provided via CLI or environment, DEFAULT_TEST_PDF is an empty string, which will produce a cryptic "ENOENT" error when the script attempts to read it.

Consider providing a clear error message:

-const DEFAULT_TEST_PDF = ""
+const DEFAULT_TEST_PDF = ""

 function resolvePdfPath(args: string[], envs: EnvMap): string {
   // Priority: CLI arg -> TEST_PDF_PATH -> PDF_PATH
   const cli = args[0]
   const fromEnv = envs["TEST_PDF_PATH"] || envs["PDF_PATH"] || process.env.TEST_PDF_PATH || process.env.PDF_PATH
   const p = cli || fromEnv || DEFAULT_TEST_PDF
+  if (!p) {
+    throw new Error("No PDF path provided. Set TEST_PDF_PATH env var or pass as CLI argument.")
+  }
   return path.resolve(p)
 }
server/test-chunkByOCR-enhanced.ts (1)

223-224: Use proper typing instead of @ts-ignore.

The @ts-ignore directive suppresses type errors but reduces type safety. Consider using proper typing for the mock.

Replace with a typed mock:

-  // @ts-ignore
-  global.fetch = async (url: string, options: any) => {
+  global.fetch = async (url: string | URL | Request, options?: RequestInit): Promise<Response> => {
     console.log("🌐 Mock API called:", { url, method: options?.method })
server/lib/chunkByOCR.ts (2)

246-253: Remove or uncomment dead code.

The table handling logic is commented out. If it's needed for future use, move it to a tracked TODO or issue. Otherwise, remove it to keep the codebase clean.


567-574: Clarify commented code and filename strategy.

The complex filename derivation logic is commented out (lines 567-573) in favor of a simple sequential pattern (line 574). Either remove the commented code or document why it's preserved for future use.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc545d0 and ac03796.

📒 Files selected for processing (9)
  • server/api/knowledgeBase.ts (3 hunks)
  • server/docxChunks.ts (1 hunks)
  • server/lib/chunkByOCR.ts (1 hunks)
  • server/queue/fileProcessor.ts (3 hunks)
  • server/scripts/testOCRFromProcessFile.ts (1 hunks)
  • server/services/fileProcessor.ts (4 hunks)
  • server/test-chunkByOCR-enhanced.ts (1 hunks)
  • server/test-chunkByOCR.ts (1 hunks)
  • server/types.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
server/services/fileProcessor.ts (2)
server/types.ts (1)
  • ChunkMetadata (605-609)
server/lib/chunkByOCR.ts (1)
  • chunkByOCRFromBuffer (447-477)
server/lib/chunkByOCR.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/services/fileProcessor.ts (1)
  • ProcessingResult (19-26)
server/types.ts (1)
  • ChunkMetadata (605-609)
server/test-chunkByOCR.ts (1)
server/lib/chunkByOCR.ts (2)
  • chunkByOCR (479-653)
  • chunkByOCRFromBuffer (447-477)
server/test-chunkByOCR-enhanced.ts (1)
server/lib/chunkByOCR.ts (2)
  • chunkByOCR (479-653)
  • chunkByOCRFromBuffer (447-477)
server/scripts/testOCRFromProcessFile.ts (1)
server/lib/chunkByOCR.ts (1)
  • chunkByOCRFromBuffer (447-477)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (20)
server/types.ts (1)

604-609: LGTM! Clean type definition for chunk metadata.

The ChunkMetadata type is well-structured and clearly documents the metadata associated with each chunk in the OCR-based processing pipeline. The fields are appropriately typed and self-explanatory.

server/docxChunks.ts (1)

2627-2627: LGTM! Formatting improvement.

Minor style fix adding space before the opening parenthesis, aligning with standard JavaScript/TypeScript conventions.

server/queue/fileProcessor.ts (2)

220-221: LGTM! Chunk metadata propagation for files.

Correctly populates chunks_map and image_chunks_map from the processing result for file entities. This aligns with the new OCR-based processing workflow.


330-331: LGTM! Consistent metadata initialization for collections and folders.

Appropriately initializes chunks_map and image_chunks_map as empty arrays for collection and folder entities, which do not have content to chunk.

Also applies to: 451-452

server/api/knowledgeBase.ts (3)

47-47: LGTM! Import style change.

Switching from default to namespace import (import * as crypto) is a valid style choice with no functional impact.


755-759: LGTM! Cleaner iteration pattern.

The switch from for-of with .entries() to .forEach() is more idiomatic for this cleanup scenario and maintains the same functionality.


1579-1598: LGTM! Robust dual-format chunk index lookup.

The enhanced lookup logic correctly handles both legacy number[] and new ChunkMetadata[] formats:

  • Type checking discriminates between formats
  • Warning logged for unexpected object formats (good defensive programming)
  • Maintains backward compatibility
  • Existing 404 behavior preserved when no match is found
server/services/fileProcessor.ts (1)

164-175: LGTM! Sensible default metadata for non-PDF files.

The default chunk metadata generation for non-PDF processing paths is well-implemented:

  • Correct index mapping
  • Reasonable defaults (page_number: 0, block_labels: ["text"]/["image"])
  • Image chunks continue indexing after text chunks to maintain uniqueness
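
A minimal sketch of those defaults under a hypothetical helper name (the mapping rules are exactly the ones listed above):

import type { ChunkMetadata } from "@/types"

// Hypothetical helper: default metadata for non-PDF chunking paths.
function buildDefaultChunkMaps(chunks: string[], imageChunks: string[]) {
  const chunks_map: ChunkMetadata[] = chunks.map((_, i) => ({
    chunk_index: i,
    page_number: 0, // non-PDF paths carry no page information
    block_labels: ["text"],
  }))
  const image_chunks_map: ChunkMetadata[] = imageChunks.map((_, i) => ({
    chunk_index: chunks.length + i, // continue numbering after text chunks
    page_number: 0,
    block_labels: ["image"],
  }))
  return { chunks_map, image_chunks_map }
}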
server/scripts/testOCRFromProcessFile.ts (1)

79-119: LGTM!

The result extraction and logging correctly uses the ProcessingResult fields (chunks, chunks_map, image_chunks, image_chunks_map) and provides comprehensive output for debugging.

server/test-chunkByOCR-enhanced.ts (1)

258-424: LGTM!

The test function is well-structured with proper setup/teardown, comprehensive logging, and correct usage of the ProcessingResult fields. The try/finally ensures mock restoration even on failure.

server/lib/chunkByOCR.ts (10)

125-149: LGTM!

The magic byte detection for common image formats is correctly implemented with a sensible fallback.
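
As a sketch of the pattern under review, assuming PNG/JPEG/GIF signatures (which formats detectImageExtension actually covers is not shown in this thread):

// Magic-byte sniffing with a conservative fallback.
function detectImageExtensionSketch(buf: Buffer): string {
  const pngSig = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])
  if (buf.length >= 8 && buf.subarray(0, 8).equals(pngSig)) return "png"
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff)
    return "jpg" // JPEG files start with FF D8 FF
  if (buf.length >= 4 && buf.subarray(0, 4).toString("ascii") === "GIF8")
    return "gif" // covers GIF87a and GIF89a
  return "png" // fallback when no known signature matches
}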


151-173: LGTM!

File name sanitization and uniqueness logic is correctly implemented.


175-200: LGTM!

Bounding box normalization and parsing logic includes appropriate validation and error handling.


202-222: LGTM!

Image lookup construction correctly deduplicates by bounding box key.
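
A rough sketch of such deduplication, with the bbox normalization and image shape assumed for illustration:

// Hypothetical lookup keyed by a normalized bounding-box string;
// the first image seen for a given bbox wins.
type ExtractedImage = { bbox: number[]; buffer: Buffer }

function buildImageLookup(images: ExtractedImage[]): Map<string, Buffer> {
  const lookup = new Map<string, Buffer>()
  for (const img of images) {
    const key = img.bbox.map((v) => Math.round(v)).join(",")
    if (!lookup.has(key)) lookup.set(key, img.buffer)
  }
  return lookup
}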


224-238: LGTM!

Block content transformation applies appropriate markdown formatting based on label type.
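
For illustration, a sketch of label-driven formatting; the label names ("title", "list") are assumptions, since the real label set is not shown in this thread:

// Hypothetical mapping from OCR block labels to markdown.
function formatBlockContent(label: string, text: string): string {
  switch (label) {
    case "title":
      return `# ${text}` // promote titles to headings
    case "list":
      return `- ${text}` // render list blocks as bullets
    default:
      return text // leave plain text blocks untouched
  }
}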


262-291: LGTM!

Image filename derivation handles multiple scenarios with appropriate fallbacks.


293-372: LGTM!

The Layout Parsing API call is well-structured with proper timeout handling, error messages, and cleanup. The Triton request format is correctly implemented.
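
A compressed sketch of that call pattern, assuming a placeholder endpoint and timeout (only the POST of a base64 payload is confirmed by the sequence diagram; the Triton body shape is omitted here):

// Hypothetical layout-API call with timeout and guaranteed cleanup.
async function callLayoutApiSketch(pdf: Buffer): Promise<unknown> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), 60_000)
  try {
    const res = await fetch("http://layout-api.local/parse", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ data: pdf.toString("base64") }),
      signal: controller.signal, // abort if the timer fires
    })
    if (!res.ok)
      throw new Error(`Layout API failed: ${res.status} ${res.statusText}`)
    return await res.json()
  } finally {
    clearTimeout(timer) // cleanup on success and failure alike
  }
}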


498-500: LGTM!

Directory creation with recursive: true ensures the path exists before image writes.


504-531: LGTM!

The addChunk closure correctly handles splitting oversized text buffers into multiple chunks with continuation markers, assuming individual chunks don't exceed the limit.


543-606: LGTM!

Image block processing correctly handles buffer lookup, metadata association, file saving with error handling, and chunk map creation.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/api/knowledgeBase.ts (1)

1593-1610: Type the callback parameter explicitly for better clarity.

The chunk position handling correctly supports both legacy number[] and new ChunkMetadata[] formats. However, the pos parameter in the findIndex callback relies on type inference. For better type safety and clarity, consider explicitly typing it as number | ChunkMetadata.

Apply this diff for explicit typing:

-    const index = resp.fields.chunks_pos.findIndex((pos: number | ChunkMetadata) => {
+    const index = resp.fields.chunks_pos.findIndex((pos: number | ChunkMetadata): boolean => {
       // If it's a number (legacy format), compare directly
       if (typeof pos === "number") {
         return pos === chunkIndex
       }
       // If it's a ChunkMetadata object, compare the index field
       if (typeof pos === "object") {
         if (pos.chunk_index !== undefined) {
           return pos.chunk_index === chunkIndex
         } else {
           loggerWithChild({ email: userEmail }).warn(
             `Unexpected chunk position object format: ${JSON.stringify(pos)}`,
           )
         }
       }
       return false
     })

Note: The callback logic is correct and the warning for unexpected formats is a good defensive practice.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4599f2 and be9d7b9.

📒 Files selected for processing (1)
  • server/api/knowledgeBase.ts (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/api/knowledgeBase.ts (2)
server/queue/api-server-queue.ts (2)
  • boss (9-12)
  • FileProcessingQueue (14-14)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (3)
server/api/knowledgeBase.ts (3)

193-203: LGTM! Improved readability.

The reformatting of boss.send calls to multi-line style improves readability without changing semantics.

Also applies to: 686-696, 836-846, 1213-1220


1208-1208: LGTM! Clearer status message.

The updated status message clearly indicates that the file has been uploaded and is awaiting processing.


1251-1254: LGTM! Better error logging.

The reformatted error logging follows structured logging best practices by passing the error object as the first parameter.
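
For context, the pattern being praised, sketched with a pino-style logger (whether the project's getLogger wraps pino is an assumption):

import pino from "pino"

const logger = pino()

try {
  throw new Error("boom")
} catch (error) {
  // Error object first, message second: the stack trace and structured
  // fields survive, instead of being flattened into the message string.
  logger.error(error, "Failed to process file")
}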

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/services/fileProcessor.ts (1)

24-25: Verify all consumers handle the new required fields.

The ProcessingResult interface now includes chunks_map and image_chunks_map as required fields. Ensure all consumers (particularly in server/queue/fileProcessor.ts and Vespa integration) correctly handle these new fields.

🧹 Nitpick comments (2)
server/services/fileProcessor.ts (2)

29-29: Logger instance defined but not used in this file.

The logger is instantiated but never referenced within FileProcessorService. The past review flagged console.log statements in the PDF processing path (lines 48-53), but those logs are not visible in the current code. If logging was removed or moved to chunkByOCRFromBuffer, consider removing this unused field. Otherwise, use it for error logging at line 144.


152-163: Clarify metadata creation scope.

The comment states "For non-PDF files" but this code executes for ALL files, including PDFs that encounter errors in the OCR path (lines 143-150). If PDFs should always produce metadata via OCR, consider moving this logic inside conditional branches for specific file types, or update the comment to reflect that it serves as a fallback for any processing failure.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be9d7b9 and 81ea622.

📒 Files selected for processing (1)
  • server/services/fileProcessor.ts (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/services/fileProcessor.ts (2)
server/lib/chunkByOCR.ts (2)
  • ProcessingResult (14-21)
  • chunkByOCRFromBuffer (447-477)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (4)
server/services/fileProcessor.ts (4)

7-8: LGTM!

The new imports are correctly structured and align with the OCR processing enhancements.


17-17: LGTM!

The logger import enables structured logging as recommended in past reviews.


47-53: LGTM! PDF processing correctly uses OCR.

The PDF processing path now directly returns the result from chunkByOCRFromBuffer, which already includes chunks_map and image_chunks_map (as shown in the relevant code snippet from server/lib/chunkByOCR.ts). This correctly bypasses the fallback metadata creation at lines 152-163.
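
A minimal sketch of that routing, assuming the argument list from the sequence diagram (process(fileBuffer, fileName, docId)); only the delegation to chunkByOCRFromBuffer is confirmed here:

import { chunkByOCRFromBuffer } from "@/lib/chunkByOCR"

// Hypothetical outline of the PDF branch in FileProcessorService.
async function processFileSketch(buffer: Buffer, fileName: string, docId: string) {
  if (fileName.toLowerCase().endsWith(".pdf")) {
    // The OCR result already carries chunks_map and image_chunks_map,
    // so the fallback metadata path is skipped for PDFs.
    return await chunkByOCRFromBuffer(buffer, fileName, docId)
  }
  throw new Error("non-PDF handling elided in this sketch")
}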


165-172: LGTM! Return statement now includes all required fields.

The return statement correctly includes chunks_map and image_chunks_map, addressing the issue flagged in the previous review.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
server/lib/chunkByOCR.ts (4)

713-721: Remove commented-out code or use the more descriptive naming.

The deriveImageFileName function is commented out in favor of a simple sequential numbering scheme. If the commented code is not needed, remove it to improve clarity. If the more descriptive naming is valuable, uncomment and use it.

Apply this diff if removing:

-        // const fileName = deriveImageFileName(
-        //   metadata.fileName,
-        //   metadata.bboxKey ?? normalizeBBox(block.block_bbox),
-        //   imageBuffer,
-        //   block.image_index,
-        //   metadata.pageIndex ?? pageNumber,
-        // )
         const extension = detectImageExtension(imageBuffer)
         const fileName = String(globalSeq.value) + "." + extension

Based on past review comments.


16-23: Remove duplicate type definition and import from fileProcessor.

The ProcessingResult interface is already defined in server/services/fileProcessor.ts (as shown in relevant code snippets). This duplication violates DRY and creates a maintenance burden—changes must be synchronized across files.

Apply this diff:

-export interface ProcessingResult {
-  chunks: string[]
-  chunks_pos: number[]
-  image_chunks: string[]
-  image_chunks_pos: number[]
-  chunks_map: ChunkMetadata[]
-  image_chunks_map: ChunkMetadata[]
-}
+import type { ProcessingResult } from "../services/fileProcessor"

Based on past review comments.


99-125: Handle edge case where single sentence exceeds maxBytes.

If a single sentence is larger than maxBytes, the current logic adds it to chunks as-is (line 112), potentially violating the size constraint. This can cause downstream issues with storage or processing systems that enforce strict size limits.

Consider splitting oversized sentences at word or character boundaries to ensure no chunk exceeds maxBytes. For example:

function splitText(text: string, maxBytes: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+|\S+/g) ?? []
  const chunks: string[] = []
  let currentChunk: string[] = []
  let currentBytes = 0

  for (const sentence of sentences) {
    const sentenceBytes = getByteLength(sentence) + 1

    // If single sentence exceeds limit, split it further
    if (sentenceBytes > maxBytes) {
      if (currentChunk.length > 0) {
        chunks.push(currentChunk.join(" "))
        currentChunk = []
        currentBytes = 0
      }
      
      // Split oversized sentence by words
      const words = sentence.split(/\s+/)
      let wordBuffer = ""
      
      for (const word of words) {
        const wordBytes = getByteLength(word) + 1
        if (getByteLength(wordBuffer) + wordBytes > maxBytes) {
          if (wordBuffer) chunks.push(wordBuffer.trim())
          wordBuffer = word
        } else {
          wordBuffer += (wordBuffer ? " " : "") + word
        }
      }
      if (wordBuffer) chunks.push(wordBuffer.trim())
      continue
    }

    if (currentBytes + sentenceBytes > maxBytes) {
      if (currentChunk.length > 0) {
        chunks.push(currentChunk.join(" "))
      }
      currentChunk = [sentence]
      currentBytes = sentenceBytes
    } else {
      currentChunk.push(sentence)
      currentBytes += sentenceBytes
    }
  }

  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(" "))
  }

  return chunks
}

Based on past review comments.


631-638: Fix parseInt radix argument.

The second argument to parseInt must be the radix (2-36), but 1024 is passed. This causes parseInt to return NaN, and the fallback logic on lines 635-638 accidentally works around it. However, this is incorrect and fragile.

Apply this diff:

   const maxChunkBytes = Number.parseInt(
     process.env.OCR_MAX_CHUNK_BYTES ?? "",
-    1024,
+    10,
   )

Based on past review comments.
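
A quick illustration of the radix behavior (an ECMAScript guarantee: any radix outside 2–36 makes parseInt return NaN):

Number.parseInt("4096", 1024) // NaN  (1024 is not a valid radix)
Number.parseInt("4096", 10)   // 4096 (base 10, as intended)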

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81ea622 and f16ec65.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/lib/chunkByOCR.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/services/fileProcessor.ts (1)
  • ProcessingResult (19-26)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

@coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f16ec65 and ccfb5ef.

📒 Files selected for processing (2)
  • server/api/knowledgeBase.ts (9 hunks)
  • server/lib/chunkByOCR.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/lib/chunkByOCR.ts
🧰 Additional context used
🧬 Code graph analysis (1)
server/api/knowledgeBase.ts (2)
server/queue/api-server-queue.ts (2)
  • boss (9-12)
  • FileProcessingQueue (14-14)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (6)
server/api/knowledgeBase.ts (6)

109-126: LGTM - Return type annotation added.

The explicit : string return type annotation improves type safety and code clarity.


192-202: LGTM - Queue send calls reformatted for readability.

The multi-line format for boss.send() calls improves readability without changing semantics. All calls use consistent retry/expiry configuration.

Also applies to: 685-695, 835-845, 1212-1219


760-767: LGTM - Session cleanup formatting.

The session cleanup logic correctly removes sessions older than 1 hour. The formatting changes maintain the same behavior.


1207-1207: LGTM - Status message added for file upload.

Adding the status message "File uploaded successfully, queued for processing" provides clear feedback about the file's processing state and aligns with the UploadStatus.PENDING status set later.


1250-1253: LGTM - Error logging reformatted.

Multi-line format improves readability without changing behavior.


47-47: LGTM - Explicit crypto namespace import.

The explicit namespace import import * as crypto from "crypto" is the recommended pattern for Node.js built-in modules.

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/api/files.ts (1)

267-292: Preserve the metadata returned by FileProcessorService.

The OCR pipeline already supplies chunks_map and image_chunks_map. Rebuilding placeholder maps here wipes page numbers and block labels, so Vespa never sees the enriched metadata. Destructure the maps from processingResult and pass them through unchanged (the service already backfills sensible defaults for non-OCR formats).

-          const { chunks, chunks_pos, image_chunks, image_chunks_pos } =
-            processingResult
+          const {
+            chunks,
+            chunks_pos,
+            image_chunks,
+            image_chunks_pos,
+            chunks_map,
+            image_chunks_map,
+          } = processingResult
...
-            chunks_map: chunks.map((_, index) => ({
-              chunk_index: index,
-              page_number: 0,
-              block_labels: [],
-            })),
-            image_chunks_map: image_chunks.map((_, index) => ({
-              chunk_index: index,
-              page_number: 0,
-              block_labels: [],
-            })),
+            chunks_map,
+            image_chunks_map,
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ccfb5ef and 34ae08f.

📒 Files selected for processing (2)
  • server/api/files.ts (3 hunks)
  • server/services/fileProcessor.ts (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/api/files.ts (1)
server/integrations/dataSource/errors.ts (1)
  • isDataSourceError (137-139)
server/services/fileProcessor.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/types.ts (1)
  • ChunkMetadata (605-609)
server/lib/chunkByOCR.ts (1)
  • chunkByOCRFromBuffer (536-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

@Aayushjshah merged commit 3822113 into main on Oct 1, 2025
3 of 4 checks passed