
Conversation

Himanshvarma
Contributor

@Himanshvarma Himanshvarma commented Oct 6, 2025

Description

Improved sheet ingestion.
Added DuckDB for performing RAG on sheets.

Testing

Tested locally.

Summary by CodeRabbit

  • New Features

    • Run SQL-style queries against spreadsheet attachments and return structured results.
    • Public helper to build indexed context from fragments.
  • Improvements

    • Header-preserving sheet chunking and per-sheet ingestion (each worksheet stored separately).
    • Context generation is async and can incorporate a user query for richer answers.
    • Unified spreadsheet size limits and support for multiple processing results per upload.
    • Added SQL validation to improve safety of sheet queries.
  • Bug Fixes

    • Granular deletion/cleanup of sheet-derived items (per-sheet IDs).

@coderabbitai
Contributor

coderabbitai bot commented Oct 6, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds header-aware spreadsheet chunking and per-sheet ingestion, threads an optional user query through async context-building, introduces DuckDB-backed sheet querying with LLM-driven SQL generation/validation, and surfaces new async flows and per-sheet Vespa doc handling plus size-based spreadsheet limits.

Changes

Cohort / File(s) Summary
Spreadsheet chunking core
server/sheetChunk.ts
New header-aware chunker chunkSheetWithHeaders with header inference, normalization, row_id augmentation, adaptive chunk creation, and public export.
AI context & sheet query
server/ai/context.ts
answerContextMap made async and accepts optional query; adds header extraction, processSheetQuery integration using querySheetChunks, updates chunk summaries/matchfeatures, and exports answerContextMapFromFragments.
DuckDB & SQL inference/validation
server/lib/duckdb.ts, server/lib/sqlInference.ts, server/lib/sqlValidator.ts, server/types.ts, server/package.json
New querySheetChunks (DuckDB runner), analyzeQueryAndGenerateSQL (LLM-driven SQL gen), AST-driven SQLValidator, DuckDB types, and added deps duckdb-async & node-sql-parser (see the validation sketch after this table).
Agent/chat async propagation
server/api/chat/agents.ts, server/api/chat/chat.ts, server/api/chat/tools.ts
Converted fragment generation to async: vespaResultToMinimalAgentFragment and buildContext are async and accept query; callers updated to await fragments; added expandSheetIds.
Per-sheet file processing & ingestion
server/services/fileProcessor.ts, server/queue/fileProcessor.ts, server/api/files.ts, server/integrations/ribbie/index.ts
processFile now returns an array of (sheet) results; queue and APIs ingest multiple per-sheet Vespa docs, create per-result docIds/fileNames, and update metadata/vespaId accordingly.
Per-sheet delete handling
server/api/knowledgeBase.ts
Delete flow expands vespaDocId via expandSheetIds and iterates to delete/log each derived per-sheet id.
Data source ingestion updates
server/integrations/dataSource/index.ts, server/integrations/dataSource/config.ts, server/integrations/dataSource/errors.ts
Replaced row/text limits with numeric size checks (checkFileSize(size, ...)); added MAX_SPREADSHEET_FILE_SIZE_MB; adopted chunkSheetWithHeaders; removed legacy chunkSheetRows.
Google integrations
server/integrations/google/index.ts, server/integrations/google/config.ts, server/integrations/google/worker-utils.ts
Consolidated to single sheet-size limits, added early size guards, adopted chunkSheetWithHeaders, and updated listFiles / getSheetsListFromOneSpreadsheet signatures.
Microsoft & Gmail integrations
server/integrations/microsoft/attachment-utils.ts, server/integrations/google/worker-utils.ts
Replaced per-row/text guards with size caps and chunkSheetWithHeaders; centralized size checks via checkFileSize.
Tools & async callers
server/api/chat/tools.ts, other callers
Converted many fragment/context assembly points to async (Promise.all) to await answerContextMap.
Config & small API surface
server/config.ts, server/types.ts
Added sqlInferenceModel config export and new DuckDB-related types DuckDBQuery / DuckDBResult.
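
For the AST-driven SQLValidator noted in the table, here is a minimal sketch of what SELECT-only validation with node-sql-parser can look like. The function name and exact rules are assumptions, not the contents of server/lib/sqlValidator.ts:

import { Parser } from "node-sql-parser"

// Reject anything that is not a single SELECT statement. A sketch only; the real
// validator may enforce more rules (allowed tables/functions, LIMIT caps, etc.).
export function isSafeSelect(sql: string): boolean {
  const parser = new Parser()
  try {
    const ast = parser.astify(sql)
    const statements = Array.isArray(ast) ? ast : [ast]
    return statements.length === 1 && statements[0].type === "select"
  } catch {
    // unparseable SQL is rejected outright
    return false
  }
}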

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as User
  participant API as Chat API
  participant Agents as agents.ts
  participant Ctx as ai/context.ts
  participant Chunk as sheetChunkWithHeaders
  participant Duck as lib/duckdb.ts
  participant SQL as sqlInference/sqlValidator
  Note over U,API: User sends message (optional query)
  U->>API: message (query)
  API->>Agents: vespa results + query
  Agents->>Ctx: answerContextMap(..., query)
  alt Spreadsheet-like attachments
    Ctx->>Chunk: extract header-aware chunks
    opt Query provided
      Ctx->>Duck: querySheetChunks(chunks, query)
      Duck->>SQL: analyzeQueryAndGenerateSQL(schema, samples)
      SQL-->>Duck: SQL or null
      Duck-->>Ctx: rows/metadata or null
    end
  end
  Ctx-->>Agents: Context content (async)
  Agents-->>API: Minimal fragments
  API-->>U: Answer with context
sequenceDiagram
  autonumber
  participant U as Uploader
  participant FP as FileProcessorService
  participant Q as queue/fileProcessor
  participant Vespa as KB (kb_items)
  U->>FP: processFile(file)
  FP-->>Q: processingResults[] (per-sheet)
  loop per result
    Q->>Vespa: insert per-sheet vespaDoc (docId/fileName)
  end
  Note over Q,Vespa: If sheets → base vespaDocId gains sheet-index suffixes
  U->>Vespa: Delete item
  Vespa->>Vespa: expandSheetIds -> delete all derived IDs
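
The query path in the first diagram, sketched roughly in TypeScript. This is an illustration of the described flow, not the PR's querySheetChunks implementation; the single-table layout, VARCHAR typing, and parameter handling are assumptions:

import { Database } from "duckdb-async"

type SheetRow = Record<string, string | number | null>

// Load header-aware chunk rows into an in-memory DuckDB table, then run the
// LLM-generated (and validated) SQL against it and return plain row objects.
async function runSheetQuery(columns: string[], rows: SheetRow[], sql: string) {
  const db = await Database.create(":memory:")
  try {
    // all columns as text here; a real runner would infer numeric types or cast in SQL
    const colDefs = columns.map((c) => `"${c}" VARCHAR`).join(", ")
    await db.run(`CREATE TABLE sheet (${colDefs})`)
    const placeholders = columns.map(() => "?").join(", ")
    for (const row of rows) {
      await db.run(
        `INSERT INTO sheet VALUES (${placeholders})`,
        ...columns.map((c) => (row[c] == null ? null : String(row[c]))),
      )
    }
    return await db.all(sql) // e.g. SELECT "region", SUM(CAST("revenue" AS DOUBLE)) ...
  } finally {
    await db.close()
  }
}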

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • zereraz
  • junaid-shirur
  • kalpadhwaryu
  • devesh-juspay
  • shivamashtikar

Poem

I hop through headers, rows, and light,
Nibble chunks by soft moonlight.
DuckDB hums beneath my paws,
SQL sprinkles sensible laws.
Per-sheet docs — a rabbit's delight! 🐇📊

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Title Check ⚠️ Warning: The title “Sheet chunking” is overly minimal and does not clearly convey the full scope of improvements, such as DuckDB integration for RAG on sheets and the related API changes. While it references a real aspect of the change, it isn’t a complete sentence and lacks context on the purpose and breadth of the update. Resolution: consider renaming the PR to a concise sentence that highlights both sheet chunking and the new DuckDB-based query support, for example “Add spreadsheet chunking and DuckDB-based RAG query support.”
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit’s high-level summary is enabled.
  • Docstring Coverage ✅ Passed: No functions found in the changes; docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch sheetChunking

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Contributor

Summary of Changes

Hello @Himanshvarma, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly upgrades the system's capability to interact with and extract information from spreadsheet files. By integrating DuckDB, the system can now perform advanced analytical queries on tabular data, enhancing the quality of responses for metric-related questions. The underlying sheet ingestion and chunking mechanisms have been completely overhauled to intelligently structure spreadsheet data, ensuring that headers are correctly identified and preserved, and that multi-sheet files are handled as individual, queryable entities. These changes collectively lead to a more robust and intelligent handling of spreadsheet-based information within the AI context.

Highlights

  • DuckDB Integration for Sheet RAG: Introduced DuckDB to enable Retrieval Augmented Generation (RAG) directly on spreadsheet content, allowing for more complex and analytical queries against tabular data.
  • Improved Sheet Ingestion and Chunking: Refactored the sheet ingestion process with a new chunkSheetWithHeaders utility that intelligently detects and preserves headers, flattens multi-row headers, adds row IDs, and handles merged cells for more accurate data representation.
  • Multi-Sheet File Handling: Enhanced file processing to treat each sheet within a multi-sheet spreadsheet as a distinct document in Vespa, improving granularity and retrieval accuracy for complex workbooks (see the sketch after this list).
  • Dynamic Context Mapping for Spreadsheets: Updated answerContextMap to dynamically process spreadsheet content, extract headers, and leverage DuckDB for query-specific data retrieval, falling back to traditional chunking if a metric query is not detected.
  • Asynchronous Context Building: Converted context building functions (buildContext, vespaResultToMinimalAgentFragment) to be asynchronous to accommodate the potentially long-running DuckDB queries and other data processing.
  • Expanded File ID Handling for Sheets: Implemented an expandSheetIds utility to generate all relevant Vespa document IDs for a multi-sheet file, ensuring comprehensive deletion and retrieval operations.
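
To make the multi-sheet handling above concrete, here is a minimal sketch of splitting a workbook into per-sheet entries with the 0-based docId_sheet_<index> suffix used in this PR. The helper name and return shape are illustrative, not the code in server/services/fileProcessor.ts:

import * as XLSX from "xlsx"

// Split a workbook buffer into one entry per worksheet, each with its own docId.
function perSheetDocs(buffer: Buffer, baseDocId: string) {
  const workbook = XLSX.read(buffer, { type: "buffer" })
  return workbook.SheetNames.map((sheetName, sheetIndex) => ({
    docId: `${baseDocId}_sheet_${sheetIndex}`, // 0-based suffix, one Vespa doc per sheet
    sheetName,
    sheetIndex,
    totalSheets: workbook.SheetNames.length,
    // header: 1 yields an array of row arrays, ready for header-aware chunking
    rows: XLSX.utils.sheet_to_json(workbook.Sheets[sheetName], { header: 1 }) as unknown[][],
  }))
}
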
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant new feature for performing RAG on spreadsheet data using DuckDB, along with a major refactoring of the sheet chunking and ingestion process. The changes are extensive and well-structured. My review focuses on the robustness of the new DuckDB integration, the clarity of the new sheet expansion logic, and the maintainability of the new data processing flows. I've identified a critical issue with error handling in the DuckDB logic that could lead to incorrect results, and several medium-severity issues related to code clarity, configuration, and async best practices.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 14

🧹 Nitpick comments (13)
server/lib/sqlInference.ts (2)

6-6: Avoid provider-specific AWS Message typing here

This module uses Vertex/Anthropic via your provider. Importing Bedrock’s Message couples types incorrectly and can break typings. Prefer a provider-agnostic message type from your own abstraction (or a local type).


97-99: Fix logger.error argument order to capture stack consistently

Use error object as the first arg and message as the second for Pino. Current calls may lose stack/fields.

Apply this diff:

-      Logger.error("Failed to parse cleaned LLM response as JSON", { cleaned });
+      Logger.error({ cleaned, err: e }, "Failed to parse cleaned LLM response as JSON");
...
-  } catch (error) {
-    Logger.error("Failed to analyze query and generate SQL:", error);
+  } catch (error) {
+    Logger.error(error, "Failed to analyze query and generate SQL");

Also applies to: 117-119

server/queue/fileProcessor.ts (1)

219-227: Path concat can drop/add slashes; use posix join.

targetPath.substring(1) + file.fileName assumes a trailing slash; use path.posix.join for correctness.

+import path from "node:path"
@@
-      const reconstructedFilePath = targetPath === "/" 
-        ? file.fileName 
-        : targetPath.substring(1) + file.fileName // Remove leading "/" and add filename
+      const reconstructedFilePath = targetPath === "/"
+        ? file.fileName
+        : path.posix.join(targetPath.substring(1), file.fileName)
@@
-      let vespaFileName =
-        targetPath === "/"
-          ? file.collectionName + targetPath + reconstructedFilePath
-          : file.collectionName + targetPath + file.fileName
+      let vespaFileName =
+        targetPath === "/"
+          ? path.posix.join(file.collectionName, reconstructedFilePath)
+          : path.posix.join(file.collectionName, targetPath, file.fileName)
server/services/fileProcessor.ts (1)

95-139: Sheet handling: solid per-sheet results; minor resilience improvement.

Current flow skips empty sheets and throws if none have content. Good. Consider passing config to chunkSheetWithHeaders if you need stricter size/row caps.

Expose optional chunk config at call site to tune row and chunk size limits for large sheets.
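
A hypothetical call-site for such tuning; the option names below are assumptions and should be checked against the config type actually accepted by chunkSheetWithHeaders:

const chunks = chunkSheetWithHeaders(rows, {
  maxChunkBytes: 8 * 1024, // assumed option: byte budget per chunk
  maxRowsPerChunk: 200,    // assumed option: hard row cap per chunk
})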

server/sheetChunk.ts (3)

100-116: Header width may truncate columns when multi-row headers differ in length.

buildHeaders seeds width from rows[0].length. If subsequent header rows have more columns, they’re lost.

-function buildHeaders(rows: any[][], headerRows = 1): { header: string[], dataRows: any[][] } {
+function buildHeaders(rows: any[][], headerRows = 1): { header: string[], dataRows: any[][] } {
@@
-  const header = rows.slice(0, headerRows)
-    .reduce((acc, row) =>
-      acc.map((prev, i) => `${prev}_${(row[i] ?? "").toString().trim()}`), 
-      new Array(rows[0].length).fill("")
-    )
+  const width = Math.max(...rows.slice(0, headerRows).map(r => r.length))
+  const header = rows.slice(0, headerRows)
+    .reduce(
+      (acc, row) => acc.map((prev, i) => `${prev}_${(row[i] ?? "").toString().trim()}`),
+      new Array(width).fill("")
+    )

311-319: isHeaderRowValid too strict for real-world sheets.

Requiring all cells non-empty may force dummy headers frequently. Consider allowing some empties (e.g., ≥50% non-empty).

-  return row.every(cell => {
+  const filled = row.filter(cell => {
     if (cell === undefined || cell === null) return false
     const cellStr = cell.toString().trim()
     return cellStr.length > 0
-  })
+  }).length
+  const total = row.length
+  return total > 0 && filled / total >= 0.5

205-217: Prefer logger over console.error.

Use consistent logging for observability.

-  } catch (error) {
-    console.error("Error converting sheet to JSON:", error)
+  } catch (error) {
+    // eslint-disable-next-line no-console
+    // console.error("Error converting sheet to JSON:", error)
+    // Prefer structured logging:
+    // Logger.error(error, "Error converting sheet to JSON")
     return { headerRow: [], dataRows: [] }
   }
server/api/chat/chat.ts (4)

873-889: Deduplicate expanded IDs before deletion; avoid redundant DeleteDocument calls.

Same fileId can repeat across messages and expand to duplicate IDs.

Apply:

-            const vespaIds = expandSheetIds(fileId)
-            for (const id of vespaIds) {
+            const vespaIds = Array.from(new Set(expandSheetIds(fileId)))
+            for (const id of vespaIds) {
               try {
                 await DeleteDocument(id, KbItemsSchema)

Also consider performing these external deletions outside the DB transaction to avoid holding DB locks during network calls.


1222-1242: Async buildContext looks good; add fast‑path for empty results.

Preempt work when no results.

Apply:

 export async function buildContext(
   results: VespaSearchResult[],
@@
 ): Promise<string> {
-  const contextPromises = results?.map(
+  if (!results || results.length === 0) return ""
+  const contextPromises = results.map(
     async (v, i) =>
       `Index ${i + startIndex} \n ${await answerContextMap(

4211-4213: Deduplicate expanded non‑image attachment IDs.

Avoid duplicate fileIds and expanded IDs before search/RAG.

Apply:

-    const nonImageAttachmentFileIds = attachmentMetadata
-      .filter((m) => !m.isImage)
-      .flatMap((m) => expandSheetIds(m.fileId))
+    const nonImageAttachmentFileIds = Array.from(
+      new Set(
+        attachmentMetadata
+          .filter((m) => !m.isImage)
+          .flatMap((m) => expandSheetIds(m.fileId)),
+      ),
+    )

4267-4271: KB file ID expansion: deduplicate to remove repeats.

Prevent inflating fileIds with duplicates after expansion.

Apply:

-        fileIds = resp
-          .flatMap((file) => expandSheetIds(file.vespaDocId || ""))
-          .filter((id) => id !== "")
+        fileIds = Array.from(
+          new Set(
+            resp
+              .flatMap((file) => expandSheetIds(file.vespaDocId || ""))
+              .filter((id) => id !== ""),
+          ),
+        )
server/ai/context.ts (2)

123-180: Cap maxSummaryChunks from DuckDB to avoid oversized contexts.

Large result sets can balloon context. Cap to a sane limit (e.g., 100).

Apply:

   return {
     chunks_summary: newChunksSummary,
     matchfeatures: newMatchfeatures,
-    maxSummaryChunks: allChunks.length
+    maxSummaryChunks: Math.min(allChunks.length, 100)
   }

Optionally make the cap configurable.


886-924: MIME detection can miss common spreadsheet types; broaden for robustness.

Include xlsm and TSV at minimum.

Apply:

-    if(mimeType === "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ||
-      mimeType === "application/vnd.ms-excel" ||
-      mimeType === "text/csv") {
+    if (
+      mimeType === "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" || // xlsx
+      mimeType === "application/vnd.ms-excel" || // xls
+      mimeType === "application/vnd.ms-excel.sheet.macroEnabled.12" || // xlsm
+      mimeType === "text/csv" ||
+      mimeType === "text/tab-separated-values" // tsv
+    ) {

Also consider gating DuckDB calls to top‑N spreadsheet results to manage latency.
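
One way to gate the DuckDB path to the top-N spreadsheet hits; isSpreadsheetMime is a hypothetical helper and relevance stands in for whatever ranking signal the surrounding code exposes:

const TOP_N_SHEETS = 3 // assumed cap; tune for latency vs. coverage
const sheetResults = results
  .filter((r) => isSpreadsheetMime(r.fields?.mimeType)) // hypothetical helper
  .sort((a, b) => (b.relevance ?? 0) - (a.relevance ?? 0))
  .slice(0, TOP_N_SHEETS)
// only sheetResults go through querySheetChunks; the rest keep normal chunk summaries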

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45295a0 and ac2899c.

📒 Files selected for processing (19)
  • server/ai/context.ts (3 hunks)
  • server/api/chat/agents.ts (9 hunks)
  • server/api/chat/chat.ts (14 hunks)
  • server/api/files.ts (4 hunks)
  • server/api/knowledgeBase.ts (2 hunks)
  • server/integrations/dataSource/config.ts (1 hunks)
  • server/integrations/dataSource/index.ts (3 hunks)
  • server/integrations/google/config.ts (2 hunks)
  • server/integrations/google/index.ts (5 hunks)
  • server/integrations/google/worker-utils.ts (3 hunks)
  • server/integrations/microsoft/attachment-utils.ts (3 hunks)
  • server/integrations/ribbie/index.ts (3 hunks)
  • server/lib/duckdb.ts (1 hunks)
  • server/lib/sqlInference.ts (1 hunks)
  • server/package.json (1 hunks)
  • server/queue/fileProcessor.ts (2 hunks)
  • server/services/fileProcessor.ts (6 hunks)
  • server/sheetChunk.ts (1 hunks)
  • server/types.ts (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-28T10:47:41.020Z
Learnt from: naSim087
PR: xynehq/xyne#484
File: server/integrations/google/sync.ts:222-222
Timestamp: 2025-05-28T10:47:41.020Z
Learning: The functions `handleGoogleDriveChange` and `getDriveChanges` in `server/integrations/google/sync.ts` are intentionally exported for future changes, even though they are not currently being imported by other modules.

Applied to files:

  • server/integrations/google/index.ts
📚 Learning: 2025-06-17T08:55:40.153Z
Learnt from: oindrila-b
PR: xynehq/xyne#557
File: server/integrations/google/gmail-worker.ts:313-317
Timestamp: 2025-06-17T08:55:40.153Z
Learning: In server/integrations/google/gmail-worker.ts, the log message intentionally shows totalMails (cumulative count) while sendStatsUpdate uses insertedMessagesInBatch (batch count). This is by design - logs show total progress for human readability while stats updates track batch progress for system functionality.

Applied to files:

  • server/integrations/google/index.ts
🧬 Code graph analysis (15)
server/lib/sqlInference.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/types.ts (1)
  • DuckDBQuery (615-618)
server/ai/provider/index.ts (1)
  • getProviderByModel (368-418)
server/integrations/microsoft/attachment-utils.ts (2)
server/integrations/google/config.ts (1)
  • MAX_ATTACHMENT_SHEET_SIZE (26-26)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/api/knowledgeBase.ts (3)
server/api/chat/chat.ts (1)
  • expandSheetIds (220-238)
server/search/vespa.ts (1)
  • DeleteDocument (23-23)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/lib/duckdb.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/types.ts (1)
  • DuckDBResult (620-637)
server/lib/sqlInference.ts (1)
  • analyzeQueryAndGenerateSQL (20-121)
server/ai/context.ts (2)
server/lib/duckdb.ts (1)
  • querySheetChunks (14-202)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/integrations/dataSource/index.ts (2)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/integrations/dataSource/config.ts (1)
  • DATASOURCE_CONFIG (2-66)
server/api/chat/chat.ts (3)
server/search/vespa.ts (1)
  • DeleteDocument (23-23)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/ai/context.ts (2)
  • answerContextMap (887-985)
  • cleanContext (1004-1006)
server/integrations/google/worker-utils.ts (2)
server/integrations/google/config.ts (1)
  • MAX_ATTACHMENT_SHEET_SIZE (26-26)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/api/files.ts (3)
server/services/fileProcessor.ts (2)
  • FileProcessorService (40-193)
  • SheetProcessingResult (33-38)
server/integrations/dataSource/config.ts (1)
  • getBaseMimeType (69-71)
server/search/vespa.ts (1)
  • insert (19-19)
server/queue/fileProcessor.ts (3)
server/services/fileProcessor.ts (2)
  • FileProcessorService (40-193)
  • SheetProcessingResult (33-38)
server/integrations/dataSource/config.ts (1)
  • getBaseMimeType (69-71)
server/db/schema/knowledgeBase.ts (1)
  • collectionItems (69-140)
server/services/fileProcessor.ts (1)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/integrations/google/index.ts (2)
server/integrations/google/config.ts (1)
  • MAX_GD_SHEET_SIZE (15-15)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/integrations/ribbie/index.ts (1)
server/services/fileProcessor.ts (1)
  • FileProcessorService (40-193)
server/sheetChunk.ts (1)
server/pdfChunks.ts (1)
  • cleanText (102-146)
server/api/chat/agents.ts (4)
server/types.ts (1)
  • UserMetadataType (605-605)
server/api/chat/types.ts (1)
  • MinimalAgentFragment (88-94)
server/ai/context.ts (1)
  • answerContextMap (887-985)
server/api/chat/chat.ts (1)
  • expandSheetIds (220-238)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (13)
server/integrations/google/worker-utils.ts (2)

325-331: Good guard: size-based skip for large spreadsheets

The MB cap avoids heavy processing and aligns with config.


421-421: Header-aware chunking is a solid simplification

Using chunkSheetWithHeaders(worksheet) improves consistency and memory usage.

server/integrations/google/config.ts (1)

15-15: Unified size caps for sheets look good

MB-based limits are clearer and consistent with usage.

Also applies to: 26-26

server/queue/fileProcessor.ts (2)

242-283: Per-result Vespa doc construction looks correct.

  • Properly sets per-result docId, fileName suffixes for sheets/indexed results, and aggregates totalChunksCount.
  • Metadata includes sheet fields when present.

Also applies to: 286-296


206-213: Confirm newVespaDocId suffix interpretation
No code found parsing _sheet_<n> suffix—please manually verify whether downstream consumers expect this suffix to denote the total sheet count or a zero-based sheet index to avoid off-by-one errors.

server/services/fileProcessor.ts (2)

184-192: Return shape for non-sheet types LGTM.

Wrapping single ProcessingResult in an array maintains uniformity.


10-17: Ensure xlsx version ≥ 0.20.x
I didn’t find any xlsx entries in your package.json or lockfiles—please verify the dependency declaration and upgrade from 0.18.5 to a secure 0.20.x release.

server/sheetChunk.ts (2)

426-436: Reasonable defaults and adaptive row limit. LGTM.

Header normalization and adaptive rows per chunk are sensible.


341-380: Chunking logic is robust.

Byte-size aware chunking with truncation guard looks good.

server/api/chat/agents.ts (3)

534-551: No changes needed: answerContextMap returns string AiContext is defined as string, so answerContextMap(...) already returns a string matching MinimalAgentFragment.content; ignore this suggestion.

Likely an incorrect or invalid review comment.


855-857: Approve changes: expandSheetIds transforms docId_sheet_<N> into [docId_sheet_0…docId_sheet_<N-1>] as intended.


1604-1622: Async Promise.all usage and types align

vespaResultToMinimalAgentFragment returns MinimalAgentFragment, matching the MinimalAgentFragment[] signature of answerContextMapFromFragments; no changes needed.

server/ai/context.ts (1)

722-724: LGTM: simpler maxSummaryChunks initialization for KB context.

Dropping isMsgWithSources from init reads clearly and keeps behavior predictable.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/api/files.ts (1)

261-338: Use a real ingested docId for metadata (vespaId is currently invalid).

FileProcessorService.processFile emits sheet results whose docId is ${fileId}_sheet_${sheetIndex}. Here you overwrite vespaId with ${fileId}_sheet_${totalSheets}, a suffix that never matches any inserted document (sheet indices are 0-based, totalSheets is a count). The returned metadata now advertises /api/v1/attachments/${vespaId} and fileId: vespaId, but no Vespa document or on-disk resource exists at that identifier. Downstream consumers expecting fileId to resolve to an actual sheet chunk record will fail.

Keep vespaId aligned with an existing docId—e.g., leave it as the base fileId, or choose one of the real sheetResult.docId values—and make the URL/fileId coherent with that choice.
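
A minimal sketch of that alignment; the variable names follow the review's description rather than the exact code in server/api/files.ts:

// Prefer a docId that was actually ingested; fall back to the base fileId.
const vespaId = sheetResults[0]?.docId ?? fileId
const attachmentUrl = `/api/v1/attachments/${vespaId}` // now resolves to a real document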

♻️ Duplicate comments (2)
server/integrations/google/index.ts (1)

2189-2198: Critical: Unit mismatch causes spreadsheet rejection bug.

spreadsheet.size returns bytes but MAX_GD_SHEET_SIZE is 10 (for 10 MB per config comment). Line 2191 compares bytes directly to 10, causing spreadsheets larger than 10 bytes to be rejected. Line 2193 also incorrectly divides the MB value by 1024 * 1024 again, showing 0.0000095 MB in logs instead of 10 MB.

Apply this diff to convert the MB config value to bytes before comparison:

   // Early size check before fetching spreadsheet data
   const sizeInBytes = spreadsheet.size ? parseInt(spreadsheet.size, 10) : 0
-  if (!isNaN(sizeInBytes) && sizeInBytes > MAX_GD_SHEET_SIZE) {
+  const maxSheetSizeInBytes = MAX_GD_SHEET_SIZE * 1024 * 1024
+  if (!isNaN(sizeInBytes) && sizeInBytes > maxSheetSizeInBytes) {
     const sizeInMB = (sizeInBytes / (1024 * 1024)).toFixed(2)
-    const maxSizeInMB = (MAX_GD_SHEET_SIZE / (1024 * 1024)).toFixed(2)
+    const maxSizeInMB = MAX_GD_SHEET_SIZE.toFixed(2)
     loggerWithChild({ email: userEmail }).warn(
       `Ignoring ${spreadsheet.name} as its size (${sizeInMB} MB) exceeds the limit of ${maxSizeInMB} MB`,
     )

Based on configuration at server/integrations/google/config.ts.

server/api/chat/chat.ts (1)

220-236: expandSheetIds still skips the requested sheet ID

Line 234’s for (let i = 0; i < upper; i++) stops one iteration early, so _sheet_0 returns an empty array and _sheet_N omits _sheet_N. As a result, ChatDeleteApi never deletes the actual sheet document and flatMap on attachments silently drops the final sheet. Include the upper bound (and keep 0 inclusive) so every sheet ID is returned.

-  const upper = Number.isFinite(sheetNumber) ? sheetNumber : 1
-  for (let i = 0; i < upper; i++) {
+  const upper = Number.isFinite(sheetNumber) ? Math.max(sheetNumber, 0) : 0
+  for (let i = 0; i <= upper; i++) {
     expandedIds.push(`${docId}_sheet_${i}`)
   }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ac2899c and b16d852.

📒 Files selected for processing (8)
  • server/ai/context.ts (3 hunks)
  • server/api/chat/chat.ts (14 hunks)
  • server/api/files.ts (5 hunks)
  • server/integrations/google/index.ts (5 hunks)
  • server/lib/duckdb.ts (1 hunks)
  • server/lib/sqlInference.ts (1 hunks)
  • server/lib/sqlValidator.ts (1 hunks)
  • server/package.json (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/lib/sqlInference.ts
  • server/lib/duckdb.ts
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-28T10:47:41.020Z
Learnt from: naSim087
PR: xynehq/xyne#484
File: server/integrations/google/sync.ts:222-222
Timestamp: 2025-05-28T10:47:41.020Z
Learning: The functions `handleGoogleDriveChange` and `getDriveChanges` in `server/integrations/google/sync.ts` are intentionally exported for future changes, even though they are not currently being imported by other modules.

Applied to files:

  • server/integrations/google/index.ts
📚 Learning: 2025-06-17T08:55:40.153Z
Learnt from: oindrila-b
PR: xynehq/xyne#557
File: server/integrations/google/gmail-worker.ts:313-317
Timestamp: 2025-06-17T08:55:40.153Z
Learning: In server/integrations/google/gmail-worker.ts, the log message intentionally shows totalMails (cumulative count) while sendStatsUpdate uses insertedMessagesInBatch (batch count). This is by design - logs show total progress for human readability while stats updates track batch progress for system functionality.

Applied to files:

  • server/integrations/google/index.ts
🧬 Code graph analysis (5)
server/ai/context.ts (2)
server/lib/duckdb.ts (1)
  • querySheetChunks (15-223)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/api/files.ts (3)
server/services/fileProcessor.ts (2)
  • FileProcessorService (40-193)
  • SheetProcessingResult (33-38)
server/integrations/dataSource/config.ts (1)
  • getBaseMimeType (69-71)
server/search/vespa.ts (1)
  • insert (19-19)
server/api/chat/chat.ts (4)
server/search/vespa.ts (1)
  • DeleteDocument (23-23)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/types.ts (1)
  • UserMetadataType (605-605)
server/ai/context.ts (2)
  • answerContextMap (887-985)
  • cleanContext (1004-1006)
server/integrations/google/index.ts (2)
server/integrations/google/config.ts (1)
  • MAX_GD_SHEET_SIZE (15-15)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/lib/sqlValidator.ts (1)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (3)
server/integrations/google/index.ts (3)

103-103: LGTM: Import consolidation improves maintainability.

The shift from multiple constants (MAX_GD_SHEET_ROWS, MAX_GD_SHEET_TEXT_LEN) to a single MAX_GD_SHEET_SIZE simplifies configuration. The import of chunkSheetWithHeaders centralizes sheet-processing logic with header awareness, which is a good architectural improvement.

Also applies to: 1076-1076


2244-2244: LGTM: Centralized chunking with header awareness.

The call to chunkSheetWithHeaders(finalRows) correctly replaces inline chunking logic. The finalRows type (string[][]) matches the function signature, and omitting the optional config parameter is fine—the function will apply sensible defaults for chunk size and rows per chunk.


2919-2919: LGTM: Query precedence issue resolved.

The addition of the optional q parameter and the parentheses wrapping in the query construction ((${q}) and trashed = false) correctly handles OR operators. This prevents the AND precedence bug where trashed files could slip through when q contains OR clauses. The change is backward-compatible since existing callers don't pass q.

Well done addressing the previous review feedback on query precedence!

Also applies to: 2925-2925

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
server/ai/context.ts (1)

97-116: Fix chunk index shifting after header injection

You still leave every data chunk at its original index even though you insert the header at index 0 and shift chunk_scores by +1. Consumers that sort or join on index now see duplicate 0s and misaligned matchfeatures, so the wrong rows end up ahead/behind the header—exactly the bug called out earlier. Please bump each object chunk’s index when a header is present.

-        processedChunks.push({
-          chunk: dataContent,
-          score: originalChunk.score,
-          index: originalChunk.index
-        });
+        const baseIndex = originalChunk.index ?? i
+        processedChunks.push({
+          chunk: dataContent,
+          score: originalChunk.score,
+          index: headerChunk ? baseIndex + 1 : baseIndex,
+        });
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b85d835 and e73bd01.

📒 Files selected for processing (5)
  • server/ai/context.ts (2 hunks)
  • server/api/knowledgeBase.ts (2 hunks)
  • server/package.json (2 hunks)
  • server/services/fileProcessor.ts (6 hunks)
  • server/types.ts (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • server/types.ts
  • server/api/knowledgeBase.ts
  • server/package.json
🧰 Additional context used
🧬 Code graph analysis (2)
server/services/fileProcessor.ts (1)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/ai/context.ts (2)
server/lib/duckdb.ts (1)
  • querySheetChunks (15-223)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
server/api/chat/agents.ts (1)

1544-1571: Past review issue remains unresolved: String concatenation on AiContext object.

This is the same issue flagged in the previous review. The code calls answerContextMap(...) which returns an AiContext object (not a string), then concatenates it as a string at line 1566 and 1569. This will produce "[object Object]" at runtime.

The previous review's suggested fix (using vespaResultToMinimalAgentFragment) won't resolve this because that helper function has the same underlying issue (see my comment on lines 525-542).

Apply this corrected fix:

             const contextPromises = results?.root?.children?.map(
               async (v, i) => {
-                let content = await answerContextMap(
+                const aiContext = await answerContextMap(
                   v as VespaSearchResults,
                   userMetadata,
                   0,
                   true,
                   undefined,
                   message,
                 )
+                // Extract textual content from AiContext
+                let content = typeof aiContext === 'string'
+                  ? aiContext
+                  : (aiContext.chunks?.join('\n') || '')
+                
                 const chatContainerFields =
                   isChatContainerFields(v.fields) &&
                   v.fields.sddocname === chatContainerSchema
                     ? v.fields
                     : undefined

                 if (chatContainerFields?.creator) {
                   const creator = await getDocumentOrNull(
                     chatUserSchema,
                     chatContainerFields.creator,
                   )
                   if (creator && isChatUserFields(creator.fields)) {
                     content += `\nCreator: ${creator.fields.name}`
                   }
                 }
                 return `Index ${i + 1} \n ${content}`
               },
             )
server/api/chat/chat.ts (1)

220-239: expandSheetIds off-by-one (omits last sheet) and lacks a safety cap

Loop builds 0..sheetNumber-1; docId_sheet_N is missing. Add inclusive upper bound and a sane cap to avoid pathological loops.

Apply:

 export function expandSheetIds(fileId: string): string[] {
   // Check if the fileId matches the pattern docId_sheet_number
   const sheetMatch = fileId.match(/^(.+)_sheet_(\d+)$/)
 
   if (!sheetMatch) {
     // Not a sheet ID, return as is
     return [fileId]
   }
 
   const [, docId, sheetNumberStr] = sheetMatch
   const sheetNumber = parseInt(sheetNumberStr, 10)
   // Generate IDs from docId_sheet_0 to docId_sheet_number
   const expandedIds: string[] = []
-  const upper = Number.isFinite(sheetNumber) ? sheetNumber : 1
-  for (let i = 0; i < upper; i++) {
+  // Cap to prevent accidental huge loops; adjust if needed
+  const upper = Number.isFinite(sheetNumber)
+    ? Math.min(Math.max(sheetNumber, 0), 1000)
+    : 0
+  for (let i = 0; i <= upper; i++) {
     expandedIds.push(`${docId}_sheet_${i}`)
   }
 
   return expandedIds
 }

Optional: move this helper to a shared utils module for reuse.

🧹 Nitpick comments (3)
server/api/chat/chat.ts (3)

3357-3357: Avoid building full context solely for span attributes (heavy and noisy)

These lines compute full context only to attach to spans. This adds latency and large attributes. Prefer IDs/counts or a truncated preview.

Example:

-iterationSpan?.setAttribute(`context`, await buildContext(items, 20, userMetadata, 0, input))
+iterationSpan?.setAttribute('context_ids', items.map((v:any) => v.fields?.docId).filter(Boolean))
+iterationSpan?.setAttribute('context_count', items.length)

Also applies to: 3680-3680


877-890: Per‑ID Vespa deletion: handle 404 as warn and keep others as errors

Inner catch logs all failures as errors and bypasses the outer 404 handler. Downgrade 404s to warn here and keep loop going.

-            for (const id of vespaIds) {
-              try {
-                await DeleteDocument(id, KbItemsSchema)
-                loggerWithChild({ email }).info(
-                  `Successfully deleted non-image attachment ${id} from Vespa kb_items schema`,
-                )
-              } catch (error) {
-                loggerWithChild({ email }).error(
-                  `Failed to delete non-image attachment ${id} from Vespa kb_items schema`,
-                  { error: getErrorMessage(error) }
-                )
-              }
-            }
+            for (const id of vespaIds) {
+              try {
+                await DeleteDocument(id, KbItemsSchema)
+                loggerWithChild({ email }).info(
+                  `Successfully deleted non-image attachment ${id} from Vespa kb_items schema`,
+                )
+              } catch (error) {
+                const msg = getErrorMessage(error)
+                if (msg.includes("404")) {
+                  loggerWithChild({ email }).warn(
+                    `Non-image attachment ${id} not found in Vespa kb_items schema (may have been already deleted)`,
+                  )
+                } else {
+                  loggerWithChild({ email }).error(
+                    `Failed to delete non-image attachment ${id} from Vespa kb_items schema`,
+                    { error: msg },
+                  )
+                }
+              }
+            }

Optional: use Promise.allSettled for small batches to speed up, with rate limiting if needed.
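
A sketch of that optional variant, reusing the DeleteDocument/KbItemsSchema calls and logging helpers from the loop above (batching and rate limiting are left to the caller):

const outcomes = await Promise.allSettled(
  vespaIds.map((id) => DeleteDocument(id, KbItemsSchema)),
)
outcomes.forEach((outcome, i) => {
  if (outcome.status === "rejected") {
    loggerWithChild({ email }).warn(
      `Failed to delete non-image attachment ${vespaIds[i]}: ${getErrorMessage(outcome.reason)}`,
    )
  }
})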


4269-4269: Guard falsy IDs and dedupe when expanding attachment IDs

If any attachment lacks fileId, expandSheetIds("") yields [""], which can leak empty IDs into fileIds. Also dedupe after expansion.

-    const nonImageAttachmentFileIds = attachmentMetadata
-      .filter((m) => !m.isImage)
-      .flatMap((m) => expandSheetIds(m.fileId))
+    const nonImageAttachmentFileIds = Array.from(
+      new Set(
+        attachmentMetadata
+          .filter((m) => !m.isImage && !!m.fileId)
+          .flatMap((m) => expandSheetIds(m.fileId!))
+      )
+    )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e73bd01 and 3c7534b.

📒 Files selected for processing (4)
  • server/api/chat/agents.ts (8 hunks)
  • server/api/chat/chat.ts (14 hunks)
  • server/api/knowledgeBase.ts (2 hunks)
  • server/package.json (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • server/package.json
  • server/api/knowledgeBase.ts
🧰 Additional context used
🧬 Code graph analysis (2)
server/api/chat/chat.ts (3)
server/search/vespa.ts (1)
  • DeleteDocument (24-24)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/ai/context.ts (2)
  • answerContextMap (887-985)
  • cleanContext (1004-1006)
server/api/chat/agents.ts (4)
server/types.ts (1)
  • UserMetadataType (605-605)
server/api/chat/types.ts (1)
  • MinimalAgentFragment (88-94)
server/ai/context.ts (1)
  • answerContextMap (887-985)
server/api/chat/chat.ts (1)
  • expandSheetIds (220-239)
🔇 Additional comments (8)
server/api/chat/agents.ts (5)

846-849: LGTM: Correct usage of expandSheetIds for sheet attachments.

The code correctly uses flatMap with expandSheetIds to expand sheet file IDs (matching pattern docId_sheet_N) into arrays containing all sheet IDs from 0 to N. Non-sheet file IDs pass through unchanged. This ensures that when a user attaches a spreadsheet, all individual sheets are available for retrieval and context building.


1622-1641: Async pattern is correct but inherits upstream type mismatch.

The async/await usage and Promise.all pattern are correctly implemented here:

  • Properly awaits the async vespaResultToMinimalAgentFragment function
  • Correctly passes all required parameters including the message as query
  • Uses Promise.all for efficient parallel execution

However, this code will inherit the critical type mismatch bug from vespaResultToMinimalAgentFragment (lines 525-542) where the content field receives an AiContext object instead of a string. Once that upstream function is fixed, this code will work correctly.


2912-2914: Consistent async pattern across multiple endpoints.

The async fragment generation follows the same correct pattern in both AgentMessageApiRagOff (lines 2912-2914) and the non-streaming path (lines 3182-3184):

  • Properly awaited Promise.all
  • Correct parameter passing including message as query
  • Efficient parallel execution

These implementations are consistent with the pattern at lines 1622-1641. All will inherit the type mismatch issue from vespaResultToMinimalAgentFragment and will work correctly once that upstream function is fixed.

Also applies to: 3182-3184


3396-3399: LGTM: Consistent sheet ID expansion.

The usage of expandSheetIds here is identical to line 848 and correctly expands sheet attachment IDs. The implementation is consistent across different API endpoints.


124-124: LGTM: Import added for sheet ID expansion.

The expandSheetIds import is correctly added to support the new sheet attachment handling functionality used at lines 848 and 3398.

server/api/chat/chat.ts (3)

1223-1243: Async buildContext threading query into answerContextMap looks good

Signature change and propagation of builtUserQuery are consistent; join/clean behavior preserved.


2217-2224: Query‑aware answerContextMap is correctly wired

Passing message as the optional query enables sheet‑aware processing. Looks correct.


4325-4326: KB file ID expansion LGTM

Flattening per‑sheet IDs and filtering empties is correct for selectedKbItems.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
server/api/chat/tools.ts (2)

449-472: Consider error handling for concurrent fragment processing.

The Promise.all pattern with async map correctly implements concurrent fragment processing. However, it will fail fast if any single answerContextMap call throws an error, potentially losing all successfully processed fragments.

Consider adding individual error handling within the map callback to make the operation more resilient:

 const fragments: MinimalAgentFragment[] = await Promise.all(children.map(async (r) => {
+  try {
     if (r.fields.sddocname === dataSourceFileSchema) {
       const fields = r.fields as VespaDataSourceFile
       return {
         id: `${fields.docId}`,
         content: await answerContextMap(r, userMetadata, maxDefaultSummary),
         source: {
           docId: fields.docId,
           title: fields.fileName || "Untitled",
           url: `/dataSource/${(fields as VespaDataSourceFile).docId}`,
           app: fields.app || Apps.DataSource,
           entity: DataSourceEntity.DataSourceFile,
         },
         confidence: r.relevance || 0.7,
       }
     }
     const citation = searchToCitation(r)
     return {
       id: `${citation.docId}`,
       content: await answerContextMap(r, userMetadata, maxDefaultSummary),
       source: citation,
       confidence: r.relevance || 0.7,
     }
+  } catch (error) {
+    Logger.error(error, `Failed to process fragment: ${r.id}`)
+    const citation = searchToCitation(r)
+    return {
+      id: `${citation.docId}`,
+      content: "Content unavailable due to processing error",
+      source: citation,
+      confidence: 0.5,
+    }
+  }
 }))

This pattern should be applied consistently across all similar implementations in this file (lines 1037-1053, 1251-1267, 1491-1507, 1828-1844, 2035-2051).


1828-1844: Consider using more robust fragment ID generation.

The fragment IDs use Date.now() and Math.random() which could theoretically collide under high concurrency or rapid execution.

Consider using a more robust ID generation approach:

// At the top of the file, if not already imported
import { randomUUID } from 'crypto'

// In the fragment creation
id: `${citation.docId}-${randomUUID()}`

Or use a counter-based approach:

id: `${citation.docId}-${Date.now()}-${index}`

However, given that citation.docId is likely already unique and the current approach includes timestamp and random components, the collision risk is very low in practice.

Also applies to: 2035-2051

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3c7534b and 2fbf577.

📒 Files selected for processing (1)
  • server/api/chat/tools.ts (12 hunks)
🔇 Additional comments (4)
server/api/chat/tools.ts (4)

1037-1053: LGTM: Async fragment processing correctly implemented.

The conversion to async fragment processing using Promise.all with async map is correctly implemented. The pattern properly awaits answerContextMap and maintains the expected return type.

Note: Consider applying the error handling suggestion from the earlier comment for improved resilience.


1251-1267: LGTM: Consistent async pattern applied.

The async fragment processing pattern is consistently applied here as well.


1491-1507: LGTM: Async pattern correctly applied.

The async fragment processing is correctly implemented with proper typing.


262-262: Verify hardcoded timezone for user metadata.

The userMetadata uses a hardcoded timezone "Asia/Kolkata" which seems incorrect for a variable named userMetadata. This should likely be dynamically determined from the actual user's settings or context.

Please verify whether this hardcoded value is:

  1. A temporary placeholder that should be replaced with actual user timezone
  2. A default fallback that should be documented as such
  3. Or if there's a specific reason for this hardcoding
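
If it is meant as a fallback, one option is to derive the runtime timezone and keep the hardcoded string only as a last resort; whether userMetadata should carry a per-user timezone field is a design question for the maintainers, and user?.timeZone below is hypothetical:

// Intl.DateTimeFormat().resolvedOptions().timeZone is standard in Node
const timeZone =
  user?.timeZone ?? Intl.DateTimeFormat().resolvedOptions().timeZone ?? "Asia/Kolkata"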

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/services/fileProcessor.ts (1)

91-99: Add defensive validation for workbook.

After XLSX.read or XLSX.readFile, validate that workbook is defined before accessing workbook.SheetNames. While these methods typically throw on failure, defensive checks improve resilience against unexpected edge cases.

Apply this diff:

     if (!storagePath) {
       workbook = XLSX.read(buffer, { type: "buffer" })
     } else {
       workbook = XLSX.readFile(storagePath)
     }

-    if (!workbook.SheetNames || workbook.SheetNames.length === 0) {
+    if (!workbook || !workbook.SheetNames || workbook.SheetNames.length === 0) {
       throw new Error("No worksheets found in spreadsheet")
     }
♻️ Duplicate comments (2)
server/api/knowledgeBase.ts (1)

1604-1617: Fix expandSheetIds or sheet docs stay behind

expandSheetIds(itemToDelete.vespaDocId) still omits the _sheet_N document because the helper stops at i < sheetNumber. The new loop therefore deletes _sheet_0..N-1 yet leaves the original _sheet_N doc in Vespa, so sheet files are never fully purged.

Please make the helper iterate inclusively:

--- a/server/api/chat/chat.ts
+++ b/server/api/chat/chat.ts
@@
-  const upper = Number.isFinite(sheetNumber) ? sheetNumber : 1
-  for (let i = 0; i < upper; i++) {
+  const upper = Number.isFinite(sheetNumber) ? Math.max(sheetNumber, 0) : 0
+  for (let i = 0; i <= upper; i++) {
     expandedIds.push(`${docId}_sheet_${i}`)
   }
server/lib/sqlInference.ts (1)

86-92: Address the previous review comment about storing model configuration.

A previous review comment suggested storing the model name as a constant for easier future updates. This has not been addressed; the model is currently retrieved from config at runtime.

To address the previous feedback, consider either:

  1. Defining a dedicated constant for SQL inference model defaults in a shared constants file, or
  2. Documenting why the current approach (reading from config.sqlInferenceModel) is preferred over constants

The current approach of reading from config is actually more flexible than hardcoded constants, as it allows runtime configuration via environment variables (once the suggestion in config.ts is implemented). If this is the intended design, please respond to the previous comment explaining this rationale.

🧹 Nitpick comments (6)
server/config.ts (1)

78-78: Consider allowing SQL inference model override via environment variable.

The SQL inference model is hardcoded to Claude_Sonnet_4 for AWS and Vertex_Claude_Sonnet_4 for Vertex AI. Consider allowing users to override this choice via an environment variable (e.g., SQL_INFERENCE_MODEL) for flexibility in testing different models or managing costs.

Apply this diff to add environment variable override support:

 if (process.env["AWS_ACCESS_KEY"] && process.env["AWS_SECRET_KEY"]) {
   AwsAccessKey = process.env["AWS_ACCESS_KEY"]
   AwsSecretKey = process.env["AWS_SECRET_KEY"]
   defaultFastModel = Models.Claude_3_5_Haiku
   defaultBestModel = Models.Claude_Sonnet_4
-  sqlInferenceModel = Models.Claude_Sonnet_4
+  sqlInferenceModel = process.env["SQL_INFERENCE_MODEL"] 
+    ? (process.env["SQL_INFERENCE_MODEL"] as Models)
+    : Models.Claude_Sonnet_4

Apply similar changes at line 141 for Vertex AI.

server/lib/sqlInference.ts (5)

27-31: Improve the warning message to indicate which providers support SQL inference.

The warning "SQL inference model not set" doesn't inform users which providers support this feature. Given that only AWS and Vertex AI currently have SQL inference models configured (see config.ts), users might be confused why the feature isn't working.

Apply this diff to improve the warning:

   if (!model) {
-    Logger.warn("SQL inference model not set, returning null");
+    Logger.warn("SQL inference model not configured for current provider. SQL inference is only available for AWS (Claude) and Vertex AI providers.");
     return null;
   }

34-43: The stripNoise function could be more robust.

The current implementation uses simple indexOf/lastIndexOf to extract JSON, which could fail with nested objects or arrays that have multiple braces. While this might work for the expected response format, it's fragile.

Consider using a more robust approach that handles nested structures:

const stripNoise = (s: string) => {
  let t = s.trim();
  // remove code fences
  t = t.replace(/```(?:json)?/gi, "").replace(/```/g, "");
  
  // Find the first valid JSON object by tracking brace depth
  let depth = 0;
  let start = -1;
  let end = -1;
  
  for (let i = 0; i < t.length; i++) {
    if (t[i] === '{') {
      if (depth === 0) start = i;
      depth++;
    } else if (t[i] === '}') {
      depth--;
      if (depth === 0 && start !== -1) {
        end = i;
        break;
      }
    }
  }
  
  if (start !== -1 && end !== -1) {
    return t.slice(start, end + 1).trim();
  }
  
  return t.trim();
};

89-89: Consider increasing max_new_tokens for complex queries.

The current limit of 512 tokens may be insufficient for complex SQL queries involving multiple JOINs, aggregations, or wide tables with many columns. Complex analytical queries can easily exceed this limit.

Consider increasing to 1024 or 2048:

     const modelParams = {
       modelId: model,
       temperature: 0.1,
-      max_new_tokens: 512,
+      max_new_tokens: 1024,
       stream: false,
       systemPrompt: "You are a helpful assistant that analyzes queries and generates SQL when appropriate."
     }

94-94: Add timeout handling for the LLM call.

The provider.converse() call has no explicit timeout, which could cause the request to hang indefinitely if the LLM service is unresponsive. This is particularly important for user-facing features.

Add a timeout wrapper:

const timeout = 30000; // 30 seconds
const timeoutPromise = new Promise((_, reject) => 
  setTimeout(() => reject(new Error('LLM request timed out')), timeout)
);

const response = await Promise.race([
  provider.converse(messages, modelParams),
  timeoutPromise
]) as Awaited<ReturnType<typeof provider.converse>>;

123-127: Consider more granular error handling.

The current catch-all error handler logs and returns null for any error. While this is a safe fallback, more specific error handling could provide better diagnostics and potentially allow recovery for some error types (e.g., retrying on timeout, falling back to a simpler query on token limit exceeded).

Consider categorizing errors:

  } catch (error) {
    if (error instanceof SyntaxError) {
      Logger.error("Failed to parse LLM response as valid JSON", { error });
    } else if (error.message?.includes('timeout')) {
      Logger.error("LLM request timed out", { error });
    } else if (error.message?.includes('token')) {
      Logger.error("Token limit exceeded in LLM request", { error });
    } else {
      Logger.error("Failed to analyze query and generate SQL", { error });
    }
    return null;
  }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2fbf577 and 931abde.

📒 Files selected for processing (9)
  • server/api/knowledgeBase.ts (4 hunks)
  • server/config.ts (4 hunks)
  • server/integrations/dataSource/errors.ts (1 hunks)
  • server/integrations/dataSource/index.ts (5 hunks)
  • server/integrations/google/index.ts (5 hunks)
  • server/integrations/google/worker-utils.ts (7 hunks)
  • server/integrations/microsoft/attachment-utils.ts (7 hunks)
  • server/lib/sqlInference.ts (1 hunks)
  • server/services/fileProcessor.ts (6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-07T05:35:17.369Z
Learnt from: Himanshvarma
PR: xynehq/xyne#1046
File: server/services/fileProcessor.ts:118-131
Timestamp: 2025-10-07T05:35:17.369Z
Learning: In server/services/fileProcessor.ts, SheetProcessingResult objects intentionally use empty chunks_map and image_chunks_map arrays because each sheet is ingested as a separate Vespa document with sheet-level metadata (sheetName, sheetIndex, totalSheets, docId), making per-chunk metadata redundant for sheet files.

Applied to files:

  • server/services/fileProcessor.ts
🧬 Code graph analysis (7)
server/integrations/dataSource/index.ts (3)
server/integrations/dataSource/errors.ts (1)
  • createFileSizeError (94-100)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/integrations/dataSource/config.ts (5)
  • DATASOURCE_CONFIG (2-66)
  • isDocxFile (89-92)
  • isPptxFile (94-97)
  • isSheetFile (79-82)
  • isTextFile (74-77)
server/integrations/google/index.ts (3)
server/integrations/dataSource/index.ts (1)
  • checkFileSize (94-99)
server/integrations/google/config.ts (1)
  • MAX_GD_SHEET_SIZE (15-15)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/lib/sqlInference.ts (3)
server/logger/index.ts (2)
  • getLogger (36-93)
  • Subsystem (15-15)
server/types.ts (1)
  • DuckDBQuery (615-618)
server/ai/provider/index.ts (1)
  • getProviderByModel (368-418)
server/api/knowledgeBase.ts (4)
server/integrations/dataSource/index.ts (1)
  • checkFileSize (94-99)
server/api/chat/chat.ts (1)
  • expandSheetIds (220-239)
server/search/vespa.ts (1)
  • DeleteDocument (24-24)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/integrations/microsoft/attachment-utils.ts (3)
server/integrations/dataSource/index.ts (1)
  • checkFileSize (94-99)
server/integrations/google/config.ts (5)
  • MAX_ATTACHMENT_PDF_SIZE (22-22)
  • MAX_ATTACHMENT_DOCX_SIZE (24-24)
  • MAX_ATTACHMENT_PPTX_SIZE (25-25)
  • MAX_ATTACHMENT_TEXT_SIZE (23-23)
  • MAX_ATTACHMENT_SHEET_SIZE (26-26)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/services/fileProcessor.ts (1)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
server/integrations/google/worker-utils.ts (3)
server/integrations/dataSource/index.ts (1)
  • checkFileSize (94-99)
server/integrations/google/config.ts (1)
  • MAX_ATTACHMENT_SHEET_SIZE (26-26)
server/sheetChunk.ts (1)
  • chunkSheetWithHeaders (408-465)
🔇 Additional comments (7)
server/services/fileProcessor.ts (5)

33-52: Interface design looks good.

The SheetProcessingResult interface appropriately extends ProcessingResult with sheet-specific metadata (sheetName, sheetIndex, totalSheets, docId), and the array return type enables per-sheet document handling. The type alias ProcessingResultArray improves readability.

Note: A previous reviewer already flagged the ergonomics of the array return type for single-result files.


60-63: LGTM!

Wrapping the PDF processing result in an array maintains consistency with the new return signature.


103-140: Per-sheet processing logic is sound.

The implementation correctly:

  • Iterates each sheet and generates unique docId values (${vespaDocId}_sheet_${sheetIndex})
  • Uses chunkSheetWithHeaders for header-preserving chunking
  • Filters out empty content and skips sheets with no valid data
  • Aggregates results into SheetProcessingResult objects
  • Errors if no sheets yield content

Note: The empty chunks_map and image_chunks_map arrays are intentional, as each sheet is ingested as a separate Vespa document with sheet-level metadata. Based on learnings.


164-171: LGTM!

The enhanced error handling provides detailed logging with context (filename, mimeType, buffer size) and re-throws with a descriptive message, enabling proper error propagation and upstream handling.


173-194: LGTM!

The backward compatibility handling correctly creates default chunk metadata for non-PDF/non-sheet files and wraps the result in an array for consistency with the new signature.

server/config.ts (1)

193-193: LGTM!

The export of sqlInferenceModel is consistent with the pattern used for other configuration values.

server/lib/sqlInference.ts (1)

1-11: LGTM!

Imports and logger setup follow the established patterns in the codebase.

@junaid-shirur junaid-shirur merged commit 218f626 into main Oct 8, 2025
4 checks passed
@junaid-shirur junaid-shirur deleted the sheetChunking branch October 8, 2025 13:22