
Conversation

@Aayushjshah (Contributor) commented Sep 30, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • OCR-based PDF ingestion with page-aware text and image extraction, batching, and robust processing.
    • Rich per-chunk metadata for text and images (index, page, labels) stored with items to enable improved previews, filtering, and navigation.
  • Improvements

    • More consistent filename quoting for downloads/attachments.
  • Refactor / Chores

    • Minor formatting and logging refinements; updated dependency version.

@coderabbitai bot (Contributor) commented Sep 30, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds an OCR-based document chunking pipeline and a new ChunkMetadata type; persists chunk metadata maps (chunks_map, image_chunks_map) through file processing, queueing, the APIs, and the Vespa schema; and applies minor import, logging, and formatting updates across related files.

Changes

  • OCR chunking pipeline (new), server/lib/chunkByOCR.ts: New OCR-driven chunker: layout API calls, batching, image extraction/saving, block normalization, text/image chunking, and assembly of chunks, image_chunks, and metadata maps; exports ProcessingResult, chunkByOCRFromBuffer, and chunkByOCR.
  • File processing service, server/services/fileProcessor.ts: Replaces Gemini PDF extraction with the OCR flow via chunkByOCRFromBuffer; adds a module logger; extends ProcessingResult to include chunks_map and image_chunks_map; builds default maps for non-PDF paths; returns OCR results for PDFs.
  • Queue → Vespa integration, server/queue/fileProcessor.ts: Adds chunks_map and image_chunks_map to Vespa document payloads for file/collection/folder processing; initializes empty arrays for collection/folder docs; insertion flow otherwise unchanged.
  • API adjustments (knowledge base), server/api/knowledgeBase.ts: Accepts both the legacy number[] and the new ChunkMetadata[] chunk-position formats; import and logging tweaks; queue invocation reformatted (no semantic change).
  • Files API / attachments, server/api/files.ts: Adds chunks_map and image_chunks_map to attachment Vespa docs for non-image files; consistently quotes filenames in Content-Disposition when serving attachments; error extraction unchanged.
  • Shared types, server/types.ts: Adds the exported ChunkMetadata type: { chunk_index: number; page_number: number; block_labels: string[] } (see the sketch after this list).
  • Vespa schema, server/vespa/schemas/kb_items.sd: Adds a chunk_meta struct and new document fields chunks_map and image_chunks_map (array<chunk_meta>) with attribute/summary mappings for chunk_index and page_number.
  • Misc formatting & dependency, server/api/knowledgeBase.ts, server/docxChunks.ts, server/package.json: Minor import/spacing/logging changes and multi-line queue invocation formatting; bumps @xyne/vespa-ts from ^1.0.8 to 1.0.10.
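
For reference, a restatement of the new type exactly as the entry for server/types.ts describes it (the comments are explanatory additions):

// Shape of per-chunk metadata (server/types.ts), per the summary above.
export type ChunkMetadata = {
  chunk_index: number // position of the chunk within the document
  page_number: number // source page (0 for non-paginated inputs)
  block_labels: string[] // OCR layout labels, e.g. ["text"] or ["image"]
}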

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Uploader
  participant FPS as FileProcessorService
  participant OCR as Layout/OCR API
  participant FS as Disk
  participant Q as FileProcessingQueue
  participant V as Vespa

  Uploader->>FPS: process(fileBuffer, fileName, docId)
  alt PDF / multi-page
    FPS->>OCR: POST /parse (base64 payload)
    OCR-->>FPS: OcrResponse (pages, blocks, images)
    FPS->>FPS: chunkByOCR(...) ⇒ chunks, image_chunks, chunks_map, image_chunks_map
    FPS->>FS: save extracted images (document dir)
  else Non-PDF or fallback
    FPS->>FPS: fallback chunking ⇒ chunks + default maps
  end
  FPS-->>Q: enqueue vespaDoc {chunks, image_chunks, chunks_map, image_chunks_map}
  Q->>V: insert/update `kb_items` (includes chunk maps)
  V-->>Q: ack
  Q-->>Uploader: processing complete
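A minimal TypeScript sketch of the enqueue step from the diagram, using the field names from the walkthrough; the exact vespaDoc shape and import paths are assumptions, not the repository's literal code:

import { boss, FileProcessingQueue } from "@/queue/api-server-queue"
import type { ChunkMetadata } from "@/types"

// Hypothetical enqueue helper: the chunk maps ride along with the chunks
// so the queue worker can persist them into the kb_items Vespa document.
async function enqueueForIndexing(
  docId: string,
  chunks: string[],
  image_chunks: string[],
  chunks_map: ChunkMetadata[],
  image_chunks_map: ChunkMetadata[],
): Promise<void> {
  await boss.send(FileProcessingQueue, {
    docId,
    chunks,
    image_chunks,
    chunks_map, // per-text-chunk metadata
    image_chunks_map, // per-image-chunk metadata
  })
}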

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • zereraz
  • shivamashtikar
  • kalpadhwaryu
  • junaid-shirur
  • devesh-juspay

Poem

I nibble bytes and hop through text,
I map each chunk, each image indexed.
Pages parsed, maps tucked in a row,
Vespa stores what I bestow.
Hop, queue, ingest — a rabbit's glow 🐇✨

Pre-merge checks

❌ Failed checks (1 warning)

  • Title Check (⚠️ Warning): The title “Feat/paddle” does not relate to the significant changes in this pull request, which center on introducing an OCR-based chunking pipeline, adding chunk metadata maps throughout the processing flow, and updating schemas to support those maps; there is no “paddle” feature or code in the diff, making the title misleading. Resolution: rename the pull request to reflect its primary change, for example “Add OCR-based document chunking and chunk metadata maps.”

✅ Passed checks (2 passed)

  • Description Check (✅ Passed): Check skipped; CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage (✅ Passed): No functions found in the changes, so the docstring coverage check was skipped.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e847037 and 601945d.

📒 Files selected for processing (1)
  • server/package.json (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/package.json
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build


@gemini-code-assist bot (Contributor) commented

Summary of Changes

Hello @Aayushjshah, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the document processing pipeline by integrating a robust OCR-based chunking solution. The core change involves replacing the existing PDF extraction method with a new module that leverages an external layout parsing API to extract text and images from documents, along with rich metadata such as page numbers and block types. This enhancement aims to improve the accuracy and detail of document chunking, particularly for complex PDF layouts, and ensures that this granular information is available throughout the system for better knowledge base management and retrieval.

Highlights

  • New OCR-based Document Chunking: A new module (server/lib/chunkByOCR.ts) has been introduced to process documents, particularly PDFs, using an external layout parsing API. This system extracts both text and images, generating detailed metadata for each chunk, including page numbers and block labels.
  • PDF Processing Rerouted: The FileProcessorService now directs PDF file processing to the newly implemented OCR-based chunking mechanism, replacing the previous Gemini-based extraction method for improved accuracy and detail.
  • Enhanced Chunk Metadata: A new ChunkMetadata type has been defined and integrated, allowing for richer information (chunk index, page number, block labels) to be stored with both text and image chunks, providing more granular document representation.
  • System Integration and Compatibility: The knowledgeBase API and fileProcessor queue have been updated to seamlessly handle and persist the new ChunkMetadata structure. Additionally, logic was added to ensure backward compatibility for existing documents and to provide default metadata for non-OCR processed file types.
  • Comprehensive Testing: Multiple new test scripts (server/scripts/testOCRFromProcessFile.ts, server/test-chunkByOCR-enhanced.ts, server/test-chunkByOCR.ts) have been added to thoroughly validate the new OCR chunking functionality, including mocking external API interactions.

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a new OCR-based file processing pipeline, primarily for PDFs, replacing the previous Gemini-based implementation. It adds a new chunkByOCR service and integrates it into the FileProcessorService. The data structures for chunking results have been expanded to include more detailed metadata (chunks_map, image_chunks_map), and these changes are propagated through the file processing and Vespa indexing logic. Several new test scripts have been added to validate the new OCR functionality. My review focuses on some critical issues in the new implementation and tests, including a syntax error, incorrect environment variable parsing, hardcoded local paths in tests that will break CI/CD, and broken test logic. There are also several opportunities to improve code quality by using a structured logger instead of console.log and improving type safety.

@coderabbitai bot left a comment

Actionable comments posted: 16

🧹 Nitpick comments (5)
server/services/fileProcessor.ts (1)

3-3: Remove commented-out import.

The commented import on line 3 appears to be dead code from the old PDF processing approach. Clean it up to reduce clutter.

-// import { extractTextAndImagesWithChunksFromPDF } from "@/pdf
server/scripts/testOCRFromProcessFile.ts (1)

8-8: Empty default PDF path may cause confusing errors.

If no PDF path is provided via CLI or environment, DEFAULT_TEST_PDF is an empty string, which will produce a cryptic "ENOENT" error when the script attempts to read it.

Consider providing a clear error message:

-const DEFAULT_TEST_PDF = ""
+const DEFAULT_TEST_PDF = ""

 function resolvePdfPath(args: string[], envs: EnvMap): string {
   // Priority: CLI arg -> TEST_PDF_PATH -> PDF_PATH
   const cli = args[0]
   const fromEnv = envs["TEST_PDF_PATH"] || envs["PDF_PATH"] || process.env.TEST_PDF_PATH || process.env.PDF_PATH
   const p = cli || fromEnv || DEFAULT_TEST_PDF
+  if (!p) {
+    throw new Error("No PDF path provided. Set TEST_PDF_PATH env var or pass as CLI argument.")
+  }
   return path.resolve(p)
 }
server/test-chunkByOCR-enhanced.ts (1)

223-224: Use proper typing instead of @ts-ignore.

The @ts-ignore directive suppresses type errors but reduces type safety. Consider using proper typing for the mock.

Replace with a typed mock:

-  // @ts-ignore
-  global.fetch = async (url: string, options: any) => {
+  global.fetch = async (url: string | URL | Request, options?: RequestInit): Promise<Response> => {
     console.log("🌐 Mock API called:", { url, method: options?.method })
server/lib/chunkByOCR.ts (2)

246-253: Remove or uncomment dead code.

The table handling logic is commented out. If it's needed for future use, move it to a tracked TODO or issue. Otherwise, remove it to keep the codebase clean.


567-574: Clarify commented code and filename strategy.

The complex filename derivation logic is commented out (lines 567-573) in favor of a simple sequential pattern (line 574). Either remove the commented code or document why it's preserved for future use.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc545d0 and ac03796.

📒 Files selected for processing (9)
  • server/api/knowledgeBase.ts (3 hunks)
  • server/docxChunks.ts (1 hunks)
  • server/lib/chunkByOCR.ts (1 hunks)
  • server/queue/fileProcessor.ts (3 hunks)
  • server/scripts/testOCRFromProcessFile.ts (1 hunks)
  • server/services/fileProcessor.ts (4 hunks)
  • server/test-chunkByOCR-enhanced.ts (1 hunks)
  • server/test-chunkByOCR.ts (1 hunks)
  • server/types.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
server/services/fileProcessor.ts (2)
server/types.ts (1)
  • ChunkMetadata (605-609)
server/lib/chunkByOCR.ts (1)
  • chunkByOCRFromBuffer (447-477)
server/lib/chunkByOCR.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/services/fileProcessor.ts (1)
  • ProcessingResult (19-26)
server/types.ts (1)
  • ChunkMetadata (605-609)
server/test-chunkByOCR.ts (1)
server/lib/chunkByOCR.ts (2)
  • chunkByOCR (479-653)
  • chunkByOCRFromBuffer (447-477)
server/test-chunkByOCR-enhanced.ts (1)
server/lib/chunkByOCR.ts (2)
  • chunkByOCR (479-653)
  • chunkByOCRFromBuffer (447-477)
server/scripts/testOCRFromProcessFile.ts (1)
server/lib/chunkByOCR.ts (1)
  • chunkByOCRFromBuffer (447-477)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (20)
server/types.ts (1)

604-609: LGTM! Clean type definition for chunk metadata.

The ChunkMetadata type is well-structured and clearly documents the metadata associated with each chunk in the OCR-based processing pipeline. The fields are appropriately typed and self-explanatory.

server/docxChunks.ts (1)

2627-2627: LGTM! Formatting improvement.

Minor style fix adding space before the opening parenthesis, aligning with standard JavaScript/TypeScript conventions.

server/queue/fileProcessor.ts (2)

220-221: LGTM! Chunk metadata propagation for files.

Correctly populates chunks_map and image_chunks_map from the processing result for file entities. This aligns with the new OCR-based processing workflow.


330-331: LGTM! Consistent metadata initialization for collections and folders.

Appropriately initializes chunks_map and image_chunks_map as empty arrays for collection and folder entities, which do not have content to chunk.

Also applies to: 451-452

server/api/knowledgeBase.ts (3)

47-47: LGTM! Import style change.

Switching from default to namespace import (import * as crypto) is a valid style choice with no functional impact.


755-759: LGTM! Cleaner iteration pattern.

The switch from for-of with .entries() to .forEach() is more idiomatic for this cleanup scenario and maintains the same functionality.


1579-1598: LGTM! Robust dual-format chunk index lookup.

The enhanced lookup logic correctly handles both legacy number[] and new ChunkMetadata[] formats:

  • Type checking discriminates between formats
  • Warning logged for unexpected object formats (good defensive programming)
  • Maintains backward compatibility
  • Existing 404 behavior preserved when no match is found
server/services/fileProcessor.ts (1)

164-175: LGTM! Sensible default metadata for non-PDF files.

The default chunk metadata generation for non-PDF processing paths is well-implemented:

  • Correct index mapping
  • Reasonable defaults (page_number: 0, block_labels: ["text"]/["image"])
  • Image chunks continue indexing after text chunks to maintain uniqueness
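
A minimal sketch of those defaults under a hypothetical helper name (the mapping rules are exactly the ones listed above):

import type { ChunkMetadata } from "@/types"

// Hypothetical helper: default metadata for non-PDF chunking paths.
function buildDefaultChunkMaps(chunks: string[], imageChunks: string[]) {
  const chunks_map: ChunkMetadata[] = chunks.map((_, i) => ({
    chunk_index: i,
    page_number: 0, // non-PDF paths carry no page information
    block_labels: ["text"],
  }))
  const image_chunks_map: ChunkMetadata[] = imageChunks.map((_, i) => ({
    chunk_index: chunks.length + i, // continue numbering after text chunks
    page_number: 0,
    block_labels: ["image"],
  }))
  return { chunks_map, image_chunks_map }
}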
server/scripts/testOCRFromProcessFile.ts (1)

79-119: LGTM!

The result extraction and logging correctly uses the ProcessingResult fields (chunks, chunks_map, image_chunks, image_chunks_map) and provides comprehensive output for debugging.

server/test-chunkByOCR-enhanced.ts (1)

258-424: LGTM!

The test function is well-structured with proper setup/teardown, comprehensive logging, and correct usage of the ProcessingResult fields. The try/finally ensures mock restoration even on failure.

server/lib/chunkByOCR.ts (10)

125-149: LGTM!

The magic byte detection for common image formats is correctly implemented with a sensible fallback.
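
As a sketch of the pattern under review, assuming PNG/JPEG/GIF signatures (which formats detectImageExtension actually covers is not shown in this thread):

// Magic-byte sniffing with a conservative fallback.
function detectImageExtensionSketch(buf: Buffer): string {
  const pngSig = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])
  if (buf.length >= 8 && buf.subarray(0, 8).equals(pngSig)) return "png"
  if (buf.length >= 3 && buf[0] === 0xff && buf[1] === 0xd8 && buf[2] === 0xff)
    return "jpg" // JPEG files start with FF D8 FF
  if (buf.length >= 4 && buf.subarray(0, 4).toString("ascii") === "GIF8")
    return "gif" // covers GIF87a and GIF89a
  return "png" // fallback when no known signature matches
}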


151-173: LGTM!

File name sanitization and uniqueness logic is correctly implemented.


175-200: LGTM!

Bounding box normalization and parsing logic includes appropriate validation and error handling.


202-222: LGTM!

Image lookup construction correctly deduplicates by bounding box key.
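
A rough sketch of such deduplication, with the bbox normalization and image shape assumed for illustration:

// Hypothetical lookup keyed by a normalized bounding-box string;
// the first image seen for a given bbox wins.
type ExtractedImage = { bbox: number[]; buffer: Buffer }

function buildImageLookup(images: ExtractedImage[]): Map<string, Buffer> {
  const lookup = new Map<string, Buffer>()
  for (const img of images) {
    const key = img.bbox.map((v) => Math.round(v)).join(",")
    if (!lookup.has(key)) lookup.set(key, img.buffer)
  }
  return lookup
}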


224-238: LGTM!

Block content transformation applies appropriate markdown formatting based on label type.
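
For illustration, a sketch of label-driven formatting; the label names ("title", "list") are assumptions, since the real label set is not shown in this thread:

// Hypothetical mapping from OCR block labels to markdown.
function formatBlockContent(label: string, text: string): string {
  switch (label) {
    case "title":
      return `# ${text}` // promote titles to headings
    case "list":
      return `- ${text}` // render list blocks as bullets
    default:
      return text // leave plain text blocks untouched
  }
}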


262-291: LGTM!

Image filename derivation handles multiple scenarios with appropriate fallbacks.


293-372: LGTM!

The Layout Parsing API call is well-structured with proper timeout handling, error messages, and cleanup. The Triton request format is correctly implemented.
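
A compressed sketch of that call pattern, assuming a placeholder endpoint and timeout (only the POST of a base64 payload is confirmed by the sequence diagram; the Triton body shape is omitted here):

// Hypothetical layout-API call with timeout and guaranteed cleanup.
async function callLayoutApiSketch(pdf: Buffer): Promise<unknown> {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), 60_000)
  try {
    const res = await fetch("http://layout-api.local/parse", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ data: pdf.toString("base64") }),
      signal: controller.signal, // abort if the timer fires
    })
    if (!res.ok)
      throw new Error(`Layout API failed: ${res.status} ${res.statusText}`)
    return await res.json()
  } finally {
    clearTimeout(timer) // cleanup on success and failure alike
  }
}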


498-500: LGTM!

Directory creation with recursive: true ensures the path exists before image writes.


504-531: LGTM!

The addChunk closure correctly handles splitting oversized text buffers into multiple chunks with continuation markers, assuming individual chunks don't exceed the limit.


543-606: LGTM!

Image block processing correctly handles buffer lookup, metadata association, file saving with error handling, and chunk map creation.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/api/knowledgeBase.ts (1)

1593-1610: Type the callback parameter explicitly for better clarity.

The chunk position handling correctly supports both legacy number[] and new ChunkMetadata[] formats. However, the pos parameter in the findIndex callback relies on type inference. For better type safety and clarity, consider explicitly typing it as number | ChunkMetadata.

Apply this diff for explicit typing:

-    const index = resp.fields.chunks_pos.findIndex((pos: number | ChunkMetadata) => {
+    const index = resp.fields.chunks_pos.findIndex((pos: number | ChunkMetadata): boolean => {
       // If it's a number (legacy format), compare directly
       if (typeof pos === "number") {
         return pos === chunkIndex
       }
       // If it's a ChunkMetadata object, compare the index field
       if (typeof pos === "object") {
         if (pos.chunk_index !== undefined) {
           return pos.chunk_index === chunkIndex
         } else {
           loggerWithChild({ email: userEmail }).warn(
             `Unexpected chunk position object format: ${JSON.stringify(pos)}`,
           )
         }
       }
       return false
     })

Note: The callback logic is correct and the warning for unexpected formats is a good defensive practice.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a4599f2 and be9d7b9.

📒 Files selected for processing (1)
  • server/api/knowledgeBase.ts (9 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/api/knowledgeBase.ts (2)
server/queue/api-server-queue.ts (2)
  • boss (9-12)
  • FileProcessingQueue (14-14)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (3)
server/api/knowledgeBase.ts (3)

193-203: LGTM! Improved readability.

The reformatting of boss.send calls to multi-line style improves readability without changing semantics.

Also applies to: 686-696, 836-846, 1213-1220


1208-1208: LGTM! Clearer status message.

The updated status message clearly indicates that the file has been uploaded and is awaiting processing.


1251-1254: LGTM! Better error logging.

The reformatted error logging follows structured logging best practices by passing the error object as the first parameter.
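
For context, the pattern being praised, sketched with a pino-style logger (whether the project's getLogger wraps pino is an assumption):

import pino from "pino"

const logger = pino()

try {
  throw new Error("boom")
} catch (error) {
  // Error object first, message second: the stack trace and structured
  // fields survive, instead of being flattened into the message string.
  logger.error(error, "Failed to process file")
}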

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
server/services/fileProcessor.ts (1)

24-25: Verify all consumers handle the new required fields.

The ProcessingResult interface now includes chunks_map and image_chunks_map as required fields. Ensure all consumers (particularly in server/queue/fileProcessor.ts and Vespa integration) correctly handle these new fields.

🧹 Nitpick comments (2)
server/services/fileProcessor.ts (2)

29-29: Logger instance defined but not used in this file.

The logger is instantiated but never referenced within FileProcessorService. The past review flagged console.log statements in the PDF processing path (lines 48-53), but those logs are not visible in the current code. If logging was removed or moved to chunkByOCRFromBuffer, consider removing this unused field. Otherwise, use it for error logging at line 144.


152-163: Clarify metadata creation scope.

The comment states "For non-PDF files" but this code executes for ALL files, including PDFs that encounter errors in the OCR path (lines 143-150). If PDFs should always produce metadata via OCR, consider moving this logic inside conditional branches for specific file types, or update the comment to reflect that it serves as a fallback for any processing failure.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between be9d7b9 and 81ea622.

📒 Files selected for processing (1)
  • server/services/fileProcessor.ts (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/services/fileProcessor.ts (2)
server/lib/chunkByOCR.ts (2)
  • ProcessingResult (14-21)
  • chunkByOCRFromBuffer (447-477)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (4)
server/services/fileProcessor.ts (4)

7-8: LGTM!

The new imports are correctly structured and align with the OCR processing enhancements.


17-17: LGTM!

The logger import enables structured logging as recommended in past reviews.


47-53: LGTM! PDF processing correctly uses OCR.

The PDF processing path now directly returns the result from chunkByOCRFromBuffer, which already includes chunks_map and image_chunks_map (as shown in the relevant code snippet from server/lib/chunkByOCR.ts). This correctly bypasses the fallback metadata creation at lines 152-163.
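
A minimal sketch of that routing, assuming the argument list from the sequence diagram (process(fileBuffer, fileName, docId)); only the delegation to chunkByOCRFromBuffer is confirmed here:

import { chunkByOCRFromBuffer } from "@/lib/chunkByOCR"

// Hypothetical outline of the PDF branch in FileProcessorService.
async function processFileSketch(buffer: Buffer, fileName: string, docId: string) {
  if (fileName.toLowerCase().endsWith(".pdf")) {
    // The OCR result already carries chunks_map and image_chunks_map,
    // so the fallback metadata path is skipped for PDFs.
    return await chunkByOCRFromBuffer(buffer, fileName, docId)
  }
  throw new Error("non-PDF handling elided in this sketch")
}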


165-172: LGTM! Return statement now includes all required fields.

The return statement correctly includes chunks_map and image_chunks_map, addressing the issue flagged in the previous review.

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (4)
server/lib/chunkByOCR.ts (4)

713-721: Remove commented-out code or use the more descriptive naming.

The deriveImageFileName function is commented out in favor of a simple sequential numbering scheme. If the commented code is not needed, remove it to improve clarity. If the more descriptive naming is valuable, uncomment and use it.

Apply this diff if removing:

-        // const fileName = deriveImageFileName(
-        //   metadata.fileName,
-        //   metadata.bboxKey ?? normalizeBBox(block.block_bbox),
-        //   imageBuffer,
-        //   block.image_index,
-        //   metadata.pageIndex ?? pageNumber,
-        // )
         const extension = detectImageExtension(imageBuffer)
         const fileName = String(globalSeq.value) + "." + extension

Based on past review comments.


16-23: Remove duplicate type definition and import from fileProcessor.

The ProcessingResult interface is already defined in server/services/fileProcessor.ts (as shown in relevant code snippets). This duplication violates DRY and creates a maintenance burden—changes must be synchronized across files.

Apply this diff:

-export interface ProcessingResult {
-  chunks: string[]
-  chunks_pos: number[]
-  image_chunks: string[]
-  image_chunks_pos: number[]
-  chunks_map: ChunkMetadata[]
-  image_chunks_map: ChunkMetadata[]
-}
+import type { ProcessingResult } from "../services/fileProcessor"

Based on past review comments.


99-125: Handle edge case where single sentence exceeds maxBytes.

If a single sentence is larger than maxBytes, the current logic adds it to chunks as-is (line 112), potentially violating the size constraint. This can cause downstream issues with storage or processing systems that enforce strict size limits.

Consider splitting oversized sentences at word or character boundaries to ensure no chunk exceeds maxBytes. For example:

function splitText(text: string, maxBytes: number): string[] {
  const sentences = text.match(/[^.!?]+[.!?]+|\S+/g) ?? []
  const chunks: string[] = []
  let currentChunk: string[] = []
  let currentBytes = 0

  for (const sentence of sentences) {
    const sentenceBytes = getByteLength(sentence) + 1

    // If single sentence exceeds limit, split it further
    if (sentenceBytes > maxBytes) {
      if (currentChunk.length > 0) {
        chunks.push(currentChunk.join(" "))
        currentChunk = []
        currentBytes = 0
      }
      
      // Split oversized sentence by words
      const words = sentence.split(/\s+/)
      let wordBuffer = ""
      
      for (const word of words) {
        const wordBytes = getByteLength(word) + 1
        if (getByteLength(wordBuffer) + wordBytes > maxBytes) {
          if (wordBuffer) chunks.push(wordBuffer.trim())
          wordBuffer = word
        } else {
          wordBuffer += (wordBuffer ? " " : "") + word
        }
      }
      if (wordBuffer) chunks.push(wordBuffer.trim())
      continue
    }

    if (currentBytes + sentenceBytes > maxBytes) {
      if (currentChunk.length > 0) {
        chunks.push(currentChunk.join(" "))
      }
      currentChunk = [sentence]
      currentBytes = sentenceBytes
    } else {
      currentChunk.push(sentence)
      currentBytes += sentenceBytes
    }
  }

  if (currentChunk.length > 0) {
    chunks.push(currentChunk.join(" "))
  }

  return chunks
}

Based on past review comments.


631-638: Fix parseInt radix argument.

The second argument to parseInt must be the radix (2-36), but 1024 is passed. This causes parseInt to return NaN, and the fallback logic on lines 635-638 accidentally works around it. However, this is incorrect and fragile.

Apply this diff:

   const maxChunkBytes = Number.parseInt(
     process.env.OCR_MAX_CHUNK_BYTES ?? "",
-    1024,
+    10,
   )

Based on past review comments.
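
A quick illustration of the radix behavior (an ECMAScript guarantee: any radix outside 2–36 makes parseInt return NaN):

Number.parseInt("4096", 1024) // NaN  (1024 is not a valid radix)
Number.parseInt("4096", 10)   // 4096 (base 10, as intended)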

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81ea622 and f16ec65.

📒 Files selected for processing (1)
  • server/lib/chunkByOCR.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/lib/chunkByOCR.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/services/fileProcessor.ts (1)
  • ProcessingResult (19-26)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

@coderabbitai bot left a comment

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f16ec65 and ccfb5ef.

📒 Files selected for processing (2)
  • server/api/knowledgeBase.ts (9 hunks)
  • server/lib/chunkByOCR.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/lib/chunkByOCR.ts
🧰 Additional context used
🧬 Code graph analysis (1)
server/api/knowledgeBase.ts (2)
server/queue/api-server-queue.ts (2)
  • boss (9-12)
  • FileProcessingQueue (14-14)
server/types.ts (1)
  • ChunkMetadata (605-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (6)
server/api/knowledgeBase.ts (6)

109-126: LGTM - Return type annotation added.

The explicit : string return type annotation improves type safety and code clarity.


192-202: LGTM - Queue send calls reformatted for readability.

The multi-line format for boss.send() calls improves readability without changing semantics. All calls use consistent retry/expiry configuration.

Also applies to: 685-695, 835-845, 1212-1219


760-767: LGTM - Session cleanup formatting.

The session cleanup logic correctly removes sessions older than 1 hour. The formatting changes maintain the same behavior.


1207-1207: LGTM - Status message added for file upload.

Adding the status message "File uploaded successfully, queued for processing" provides clear feedback about the file's processing state and aligns with the UploadStatus.PENDING status set later.


1250-1253: LGTM - Error logging reformatted.

Multi-line format improves readability without changing behavior.


47-47: LGTM - Explicit crypto namespace import.

The explicit namespace import import * as crypto from "crypto" is the recommended pattern for Node.js built-in modules.

@coderabbitai bot left a comment

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/api/files.ts (1)

267-292: Preserve the metadata returned by FileProcessorService.

The OCR pipeline already supplies chunks_map and image_chunks_map. Rebuilding placeholder maps here wipes page numbers and block labels, so Vespa never sees the enriched metadata. Destructure the maps from processingResult and pass them through unchanged (the service already backfills sensible defaults for non-OCR formats).

-          const { chunks, chunks_pos, image_chunks, image_chunks_pos } =
-            processingResult
+          const {
+            chunks,
+            chunks_pos,
+            image_chunks,
+            image_chunks_pos,
+            chunks_map,
+            image_chunks_map,
+          } = processingResult
...
-            chunks_map: chunks.map((_, index) => ({
-              chunk_index: index,
-              page_number: 0,
-              block_labels: [],
-            })),
-            image_chunks_map: image_chunks.map((_, index) => ({
-              chunk_index: index,
-              page_number: 0,
-              block_labels: [],
-            })),
+            chunks_map,
+            image_chunks_map,
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ccfb5ef and 34ae08f.

📒 Files selected for processing (2)
  • server/api/files.ts (3 hunks)
  • server/services/fileProcessor.ts (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/api/files.ts (1)
server/integrations/dataSource/errors.ts (1)
  • isDataSourceError (137-139)
server/services/fileProcessor.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/types.ts (1)
  • ChunkMetadata (605-609)
server/lib/chunkByOCR.ts (1)
  • chunkByOCRFromBuffer (536-609)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

@Aayushjshah merged commit 3822113 into main on Oct 1, 2025
3 of 4 checks passed