Sheet chunking #1046
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds header-aware spreadsheet chunking and per-sheet ingestion, threads an optional user query through async context-building, introduces DuckDB-backed sheet querying with LLM-driven SQL generation/validation, and surfaces new async flows and per-sheet Vespa doc handling plus size-based spreadsheet limits.
Sequence Diagram(s)

sequenceDiagram
autonumber
participant U as User
participant API as Chat API
participant Agents as agents.ts
participant Ctx as ai/context.ts
participant Chunk as sheetChunkWithHeaders
participant Duck as lib/duckdb.ts
participant SQL as sqlInference/sqlValidator
Note over U,API: User sends message (optional query)
U->>API: message (query)
API->>Agents: vespa results + query
Agents->>Ctx: answerContextMap(..., query)
alt Spreadsheet-like attachments
Ctx->>Chunk: extract header-aware chunks
opt Query provided
Ctx->>Duck: querySheetChunks(chunks, query)
Duck->>SQL: analyzeQueryAndGenerateSQL(schema, samples)
SQL-->>Duck: SQL or null
Duck-->>Ctx: rows/metadata or null
end
end
Ctx-->>Agents: Context content (async)
Agents-->>API: Minimal fragments
API-->>U: Answer with context
sequenceDiagram
autonumber
participant U as Uploader
participant FP as FileProcessorService
participant Q as queue/fileProcessor
participant Vespa as KB (kb_items)
U->>FP: processFile(file)
FP-->>Q: processingResults[] (per-sheet)
loop per result
Q->>Vespa: insert per-sheet vespaDoc (docId/fileName)
end
Note over Q,Vespa: If sheets → base vespaDocId gains sheet-index suffixes
U->>Vespa: Delete item
Vespa->>Vespa: expandSheetIds -> delete all derived IDs
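The ID-expansion step in the second diagram can be pictured as a helper along these lines. This is a minimal sketch of the `docId_sheet_<n>` convention only; the actual `expandSheetIds` in server/api/chat/chat.ts differs in its loop bounds, which several review comments below call out.

```typescript
// Hypothetical sketch: derive all per-sheet Vespa doc IDs from a base attachment ID.
// Assumes the `docId_sheet_<n>` suffix convention described in this PR.
export function expandSheetIdsSketch(fileId: string): string[] {
  const match = fileId.match(/^(.+)_sheet_(\d+)$/)
  if (!match) return [fileId] // not a sheet ID, nothing to expand

  const [, baseDocId, indexStr] = match
  const lastIndex = Number.parseInt(indexStr, 10)
  if (!Number.isFinite(lastIndex) || lastIndex < 0) return [fileId]

  // Inclusive upper bound so `docId_sheet_N` itself is also returned.
  return Array.from({ length: lastIndex + 1 }, (_, i) => `${baseDocId}_sheet_${i}`)
}
```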
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Summary of Changes

Hello @Himanshvarma, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly upgrades the system's capability to interact with and extract information from spreadsheet files. By integrating DuckDB, the system can now perform advanced analytical queries on tabular data, enhancing the quality of responses for metric-related questions. The underlying sheet ingestion and chunking mechanisms have been completely overhauled to intelligently structure spreadsheet data, ensuring that headers are correctly identified and preserved, and that multi-sheet files are handled as individual, queryable entities. These changes collectively lead to a more robust and intelligent handling of spreadsheet-based information within the AI context.
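As a rough sketch of the flow described above: function names such as `chunkSheetWithHeaders` and `querySheetChunks` come from this PR, but their exact signatures, import paths, and the surrounding glue here are assumptions, not the actual implementation.

```typescript
// Hypothetical glue around the PR's sheet pipeline; signatures and paths are assumed.
import { chunkSheetWithHeaders } from "./sheetChunk"
import { querySheetChunks } from "./lib/duckdb"

export async function buildSheetContext(
  rows: string[][],      // raw worksheet rows, header row(s) first
  userQuery?: string,    // optional user question threaded through context building
): Promise<string[]> {
  // Header-aware chunking keeps column names attached to every data chunk.
  const chunks = chunkSheetWithHeaders(rows)

  if (userQuery) {
    // DuckDB path: LLM-generated SQL over the chunk rows; a null result means
    // SQL generation/validation failed, and we fall back to plain chunks.
    const result = await querySheetChunks(chunks, userQuery)
    if (result) {
      return [JSON.stringify(result)]
    }
  }
  return chunks
}
```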
Code Review
This pull request introduces a significant new feature for performing RAG on spreadsheet data using DuckDB, along with a major refactoring of the sheet chunking and ingestion process. The changes are extensive and well-structured. My review focuses on the robustness of the new DuckDB integration, the clarity of the new sheet expansion logic, and the maintainability of the new data processing flows. I've identified a critical issue with error handling in the DuckDB logic that could lead to incorrect results, and several medium-severity issues related to code clarity, configuration, and async best practices.
Actionable comments posted: 14
🧹 Nitpick comments (13)
server/lib/sqlInference.ts (2)
6-6: Avoid provider-specific AWSMessage typing here

This module uses Vertex/Anthropic via your provider. Importing Bedrock's Message couples types incorrectly and can break typings. Prefer a provider-agnostic message type from your own abstraction (or a local type).
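A minimal sketch of what a provider-agnostic local type could look like (hypothetical names, not the repo's existing abstraction):

```typescript
// Hypothetical provider-agnostic message type; names are illustrative only.
type ChatRole = "system" | "user" | "assistant"

export interface ChatMessage {
  role: ChatRole
  content: string
}

// The SQL-inference helper would accept ChatMessage[] and let each provider
// adapter (Bedrock, Vertex, ...) map it onto its own SDK-specific shape.
```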
97-99: Fix logger.error argument order to capture stack consistently

Use the error object as the first argument and the message as the second for Pino; the current calls may lose stack/fields.

Apply this diff:

- Logger.error("Failed to parse cleaned LLM response as JSON", { cleaned });
+ Logger.error({ cleaned, err: e }, "Failed to parse cleaned LLM response as JSON");
  ...
- } catch (error) {
-   Logger.error("Failed to analyze query and generate SQL:", error);
+ } catch (error) {
+   Logger.error(error, "Failed to analyze query and generate SQL");

Also applies to: 117-119
server/queue/fileProcessor.ts (1)
219-227: Path concat can drop/add slashes; use posix join.

targetPath.substring(1) + file.fileName assumes a trailing slash; use path.posix.join for correctness.

+import path from "node:path"
@@
- const reconstructedFilePath = targetPath === "/"
-   ? file.fileName
-   : targetPath.substring(1) + file.fileName // Remove leading "/" and add filename
+ const reconstructedFilePath = targetPath === "/"
+   ? file.fileName
+   : path.posix.join(targetPath.substring(1), file.fileName)
@@
- let vespaFileName =
-   targetPath === "/"
-     ? file.collectionName + targetPath + reconstructedFilePath
-     : file.collectionName + targetPath + file.fileName
+ let vespaFileName =
+   targetPath === "/"
+     ? path.posix.join(file.collectionName, reconstructedFilePath)
+     : path.posix.join(file.collectionName, targetPath, file.fileName)

server/services/fileProcessor.ts (1)
95-139: Sheet handling: solid per-sheet results; minor resilience improvement.

The current flow skips empty sheets and throws if none have content. Good. Consider passing config to chunkSheetWithHeaders if you need stricter size/row caps: expose an optional chunk config at the call site to tune row and chunk size limits for large sheets, as sketched below.
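A sketch of what an explicit config at the call site might look like. The config field names and the exact chunkSheetWithHeaders signature are assumptions, not the PR's actual API.

```typescript
// Hypothetical shape for the optional chunking config; field names are assumptions.
interface SheetChunkConfig {
  maxRowsPerChunk?: number // cap rows per chunk for very long sheets
  maxChunkBytes?: number   // cap serialized chunk size
}

// Assumed signature of the PR's chunker, declared here only to keep the sketch self-contained.
declare function chunkSheetWithHeaders(
  rows: string[][],
  config?: SheetChunkConfig,
): string[]

declare const sheetRows: string[][]

// Call-site sketch: tighter caps for large sheets, library defaults otherwise.
const chunks = chunkSheetWithHeaders(sheetRows, {
  maxRowsPerChunk: 200,
  maxChunkBytes: 8 * 1024,
})
```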
server/sheetChunk.ts (3)
100-116: Header width may truncate columns when multi-row headers differ in length.

buildHeaders seeds width from rows[0].length. If subsequent header rows have more columns, they're lost.

-function buildHeaders(rows: any[][], headerRows = 1): { header: string[], dataRows: any[][] } {
+function buildHeaders(rows: any[][], headerRows = 1): { header: string[], dataRows: any[][] } {
 @@
-  const header = rows.slice(0, headerRows)
-    .reduce((acc, row) =>
-      acc.map((prev, i) => `${prev}_${(row[i] ?? "").toString().trim()}`),
-      new Array(rows[0].length).fill("")
-    )
+  const width = Math.max(...rows.slice(0, headerRows).map(r => r.length))
+  const header = rows.slice(0, headerRows)
+    .reduce(
+      (acc, row) => acc.map((prev, i) => `${prev}_${(row[i] ?? "").toString().trim()}`),
+      new Array(width).fill("")
+    )
311-319: isHeaderRowValid is too strict for real-world sheets.

Requiring all cells to be non-empty may force dummy headers frequently. Consider allowing some empties (e.g., ≥50% non-empty).

-  return row.every(cell => {
+  const filled = row.filter(cell => {
     if (cell === undefined || cell === null) return false
     const cellStr = cell.toString().trim()
     return cellStr.length > 0
-  })
+  }).length
+  const total = row.length
+  return total > 0 && filled / total >= 0.5
205-217: Prefer logger over console.error.

Use consistent logging for observability.

-  } catch (error) {
-    console.error("Error converting sheet to JSON:", error)
+  } catch (error) {
+    // eslint-disable-next-line no-console
+    // console.error("Error converting sheet to JSON:", error)
+    // Prefer structured logging:
+    // Logger.error(error, "Error converting sheet to JSON")
     return { headerRow: [], dataRows: [] }
   }

server/api/chat/chat.ts (4)
873-889: Deduplicate expanded IDs before deletion; avoid redundant DeleteDocument calls.

The same fileId can repeat across messages and expand to duplicate IDs.

Apply:

-      const vespaIds = expandSheetIds(fileId)
-      for (const id of vespaIds) {
+      const vespaIds = Array.from(new Set(expandSheetIds(fileId)))
+      for (const id of vespaIds) {
         try {
           await DeleteDocument(id, KbItemsSchema)

Also consider performing these external deletions outside the DB transaction to avoid holding DB locks during network calls, for example as sketched below.
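A sketch of that ordering, with `db.transaction`, `DeleteDocument`, and `expandSheetIds` declared as stand-ins for the project's actual helpers:

```typescript
// Sketch only: collect the Vespa IDs inside the DB transaction, but perform the
// network deletions after commit so DB locks aren't held during external calls.
declare const db: { transaction<T>(fn: () => Promise<T>): Promise<T> }
declare function DeleteDocument(id: string, schema: string): Promise<void>
declare function expandSheetIds(fileId: string): string[]

async function deleteAttachmentDocs(fileIds: string[], schema: string): Promise<void> {
  const idsToDelete = await db.transaction(async () => {
    // ...delete the corresponding DB rows here...
    return Array.from(new Set(fileIds.flatMap(expandSheetIds)))
  })

  // External deletions happen outside the transaction; allSettled keeps one
  // failure from aborting the rest.
  await Promise.allSettled(idsToDelete.map((id) => DeleteDocument(id, schema)))
}
```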
1222-1242: Async buildContext looks good; add a fast path for empty results.

Preempt work when there are no results.

Apply:

 export async function buildContext(
   results: VespaSearchResult[],
 @@
 ): Promise<string> {
-  const contextPromises = results?.map(
+  if (!results || results.length === 0) return ""
+  const contextPromises = results.map(
     async (v, i) =>
       `Index ${i + startIndex} \n ${await answerContextMap(
4211-4213: Deduplicate expanded non-image attachment IDs.

Avoid duplicate fileIds and expanded IDs before search/RAG.

Apply:

-    const nonImageAttachmentFileIds = attachmentMetadata
-      .filter((m) => !m.isImage)
-      .flatMap((m) => expandSheetIds(m.fileId))
+    const nonImageAttachmentFileIds = Array.from(
+      new Set(
+        attachmentMetadata
+          .filter((m) => !m.isImage)
+          .flatMap((m) => expandSheetIds(m.fileId)),
+      ),
+    )
4267-4271: KB file ID expansion: deduplicate to remove repeats.

Prevent inflating fileIds with duplicates after expansion.

Apply:

-    fileIds = resp
-      .flatMap((file) => expandSheetIds(file.vespaDocId || ""))
-      .filter((id) => id !== "")
+    fileIds = Array.from(
+      new Set(
+        resp
+          .flatMap((file) => expandSheetIds(file.vespaDocId || ""))
+          .filter((id) => id !== ""),
+      ),
+    )

server/ai/context.ts (2)
123-180: Cap maxSummaryChunks from DuckDB to avoid oversized contexts.

Large result sets can balloon the context. Cap to a sane limit (e.g., 100).

Apply:

   return {
     chunks_summary: newChunksSummary,
     matchfeatures: newMatchfeatures,
-    maxSummaryChunks: allChunks.length
+    maxSummaryChunks: Math.min(allChunks.length, 100)
   }

Optionally make the cap configurable.
886-924: MIME detection can miss common spreadsheet types; broaden for robustness.

Include xlsm and TSV at minimum.

Apply:

- if(mimeType === "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" ||
-    mimeType === "application/vnd.ms-excel" ||
-    mimeType === "text/csv") {
+ if (
+   mimeType === "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet" || // xlsx
+   mimeType === "application/vnd.ms-excel" || // xls
+   mimeType === "application/vnd.ms-excel.sheet.macroEnabled.12" || // xlsm
+   mimeType === "text/csv" ||
+   mimeType === "text/tab-separated-values" // tsv
+ ) {

Also consider gating DuckDB calls to the top-N spreadsheet results to manage latency, as sketched below.
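A sketch of such gating; `isSpreadsheetMime`, the result shape, and the `querySheetChunks` signature are hypothetical stand-ins for the PR's helpers.

```typescript
// Sketch: only run the DuckDB/SQL path for the top-N spreadsheet results to bound latency.
const TOP_N_SHEETS = 3

declare function isSpreadsheetMime(mimeType: string): boolean
declare function querySheetChunks(chunks: string[], query: string): Promise<unknown | null>

async function queryTopSheets(
  results: Array<{ mimeType: string; chunks: string[] }>,
  query: string,
): Promise<Array<unknown | null>> {
  const sheets = results
    .filter((r) => isSpreadsheetMime(r.mimeType))
    .slice(0, TOP_N_SHEETS) // cap the expensive path
  return Promise.all(sheets.map((s) => querySheetChunks(s.chunks, query)))
}
```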
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
server/ai/context.ts (3 hunks)
server/api/chat/agents.ts (9 hunks)
server/api/chat/chat.ts (14 hunks)
server/api/files.ts (4 hunks)
server/api/knowledgeBase.ts (2 hunks)
server/integrations/dataSource/config.ts (1 hunks)
server/integrations/dataSource/index.ts (3 hunks)
server/integrations/google/config.ts (2 hunks)
server/integrations/google/index.ts (5 hunks)
server/integrations/google/worker-utils.ts (3 hunks)
server/integrations/microsoft/attachment-utils.ts (3 hunks)
server/integrations/ribbie/index.ts (3 hunks)
server/lib/duckdb.ts (1 hunks)
server/lib/sqlInference.ts (1 hunks)
server/package.json (1 hunks)
server/queue/fileProcessor.ts (2 hunks)
server/services/fileProcessor.ts (6 hunks)
server/sheetChunk.ts (1 hunks)
server/types.ts (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-28T10:47:41.020Z
Learnt from: naSim087
PR: xynehq/xyne#484
File: server/integrations/google/sync.ts:222-222
Timestamp: 2025-05-28T10:47:41.020Z
Learning: The functions `handleGoogleDriveChange` and `getDriveChanges` in `server/integrations/google/sync.ts` are intentionally exported for future changes, even though they are not currently being imported by other modules.
Applied to files:
server/integrations/google/index.ts
📚 Learning: 2025-06-17T08:55:40.153Z
Learnt from: oindrila-b
PR: xynehq/xyne#557
File: server/integrations/google/gmail-worker.ts:313-317
Timestamp: 2025-06-17T08:55:40.153Z
Learning: In server/integrations/google/gmail-worker.ts, the log message intentionally shows totalMails (cumulative count) while sendStatsUpdate uses insertedMessagesInBatch (batch count). This is by design - logs show total progress for human readability while stats updates track batch progress for system functionality.
Applied to files:
server/integrations/google/index.ts
🧬 Code graph analysis (15)
server/lib/sqlInference.ts (3)
server/logger/index.ts (2)
getLogger
(36-93)Subsystem
(15-15)server/types.ts (1)
DuckDBQuery
(615-618)server/ai/provider/index.ts (1)
getProviderByModel
(368-418)
server/integrations/microsoft/attachment-utils.ts (2)
server/integrations/google/config.ts (1)
MAX_ATTACHMENT_SHEET_SIZE
(26-26)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/api/knowledgeBase.ts (3)
server/api/chat/chat.ts (1)
expandSheetIds
(220-238)server/search/vespa.ts (1)
DeleteDocument
(23-23)server/utils.ts (1)
getErrorMessage
(103-106)
server/lib/duckdb.ts (3)
server/logger/index.ts (2)
getLogger
(36-93)Subsystem
(15-15)server/types.ts (1)
DuckDBResult
(620-637)server/lib/sqlInference.ts (1)
analyzeQueryAndGenerateSQL
(20-121)
server/ai/context.ts (2)
server/lib/duckdb.ts (1)
querySheetChunks
(14-202)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/integrations/dataSource/index.ts (2)
server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)server/integrations/dataSource/config.ts (1)
DATASOURCE_CONFIG
(2-66)
server/api/chat/chat.ts (3)
server/search/vespa.ts (1)
DeleteDocument
(23-23)server/utils.ts (1)
getErrorMessage
(103-106)server/ai/context.ts (2)
answerContextMap
(887-985)cleanContext
(1004-1006)
server/integrations/google/worker-utils.ts (2)
server/integrations/google/config.ts (1)
MAX_ATTACHMENT_SHEET_SIZE
(26-26)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/api/files.ts (3)
server/services/fileProcessor.ts (2)
FileProcessorService
(40-193)SheetProcessingResult
(33-38)server/integrations/dataSource/config.ts (1)
getBaseMimeType
(69-71)server/search/vespa.ts (1)
insert
(19-19)
server/queue/fileProcessor.ts (3)
server/services/fileProcessor.ts (2)
FileProcessorService
(40-193)SheetProcessingResult
(33-38)server/integrations/dataSource/config.ts (1)
getBaseMimeType
(69-71)server/db/schema/knowledgeBase.ts (1)
collectionItems
(69-140)
server/services/fileProcessor.ts (1)
server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/integrations/google/index.ts (2)
server/integrations/google/config.ts (1)
MAX_GD_SHEET_SIZE
(15-15)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/integrations/ribbie/index.ts (1)
server/services/fileProcessor.ts (1)
FileProcessorService
(40-193)
server/sheetChunk.ts (1)
server/pdfChunks.ts (1)
cleanText
(102-146)
server/api/chat/agents.ts (4)
server/types.ts (1)
UserMetadataType
(605-605)server/api/chat/types.ts (1)
MinimalAgentFragment
(88-94)server/ai/context.ts (1)
answerContextMap
(887-985)server/api/chat/chat.ts (1)
expandSheetIds
(220-238)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (13)
server/integrations/google/worker-utils.ts (2)
325-331: Good guard: size-based skip for large spreadsheets

The MB cap avoids heavy processing and aligns with config.

421-421: Header-aware chunking is a solid simplification

Using chunkSheetWithHeaders(worksheet) improves consistency and memory usage.

server/integrations/google/config.ts (1)
15-15: Unified size caps for sheets look good

MB-based limits are clearer and consistent with usage.

Also applies to: 26-26
server/queue/fileProcessor.ts (2)
242-283
: Per-result Vespa doc construction looks correct.
- Properly sets per-result docId, fileName suffixes for sheets/indexed results, and aggregates totalChunksCount.
- Metadata includes sheet fields when present.
Also applies to: 286-296
206-213: Confirm newVespaDocId suffix interpretation

No code was found parsing the _sheet_<n> suffix—please manually verify whether downstream consumers expect this suffix to denote the total sheet count or a zero-based sheet index, to avoid off-by-one errors.

server/services/fileProcessor.ts (2)
184-192: Return shape for non-sheet types LGTM.

Wrapping a single ProcessingResult in an array maintains uniformity.

10-17: Ensure xlsx version ≥ 0.20.x

I didn't find any xlsx entries in your package.json or lockfiles—please verify the dependency declaration and upgrade from 0.18.5 to a secure 0.20.x release.

server/sheetChunk.ts (2)
426-436: Reasonable defaults and adaptive row limit. LGTM.

Header normalization and adaptive rows per chunk are sensible.

341-380: Chunking logic is robust.

Byte-size-aware chunking with a truncation guard looks good.
server/api/chat/agents.ts (3)
534-551: No changes needed: answerContextMap returns string

AiContext is defined as string, so answerContextMap(...) already returns a string matching MinimalAgentFragment.content; ignore this suggestion.

Likely an incorrect or invalid review comment.
855-857: Approve changes: expandSheetIds transforms docId_sheet_<N> into [docId_sheet_0 … docId_sheet_<N-1>] as intended.
1604-1622: Async Promise.all usage and types align

vespaResultToMinimalAgentFragment returns MinimalAgentFragment, matching the MinimalAgentFragment[] signature of answerContextMapFromFragments; no changes needed.

server/ai/context.ts (1)
722-724: LGTM: simpler maxSummaryChunks initialization for KB context.

Dropping isMsgWithSources from the initialization reads clearly and keeps behavior predictable.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/api/files.ts (1)
261-338: Use a real ingested docId for metadata (vespaId is currently invalid).

FileProcessorService.processFile emits sheet results whose docId is ${fileId}_sheet_${sheetIndex}. Here you overwrite vespaId with ${fileId}_sheet_${totalSheets}, a suffix that never matches any inserted document (sheet indices are 0-based, totalSheets is a count). The returned metadata now advertises /api/v1/attachments/${vespaId} and fileId: vespaId, but no Vespa document or on-disk resource exists at that identifier. Downstream consumers expecting fileId to resolve to an actual sheet chunk record will fail.

Keep vespaId aligned with an existing docId—e.g., leave it as the base fileId, or choose one of the real sheetResult.docId values—and make the URL/fileId coherent with that choice, along the lines sketched below.
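A sketch of one way to keep the advertised ID coherent. Field names follow this comment's description and are assumptions, not the exact implementation.

```typescript
// Sketch: keep the advertised vespaId pointing at a document that actually exists.
interface SheetResultLike {
  docId: string      // real ingested IDs look like `${fileId}_sheet_0`, `${fileId}_sheet_1`, ...
  sheetIndex: number
}

function pickVespaId(fileId: string, sheetResults: SheetResultLike[]): string {
  // Option A: advertise the base fileId (stable, expandable later via expandSheetIds).
  // Option B: advertise a real per-sheet docId, e.g. the first ingested sheet (shown here).
  return sheetResults[0]?.docId ?? fileId
}

// The returned metadata would then use the same value consistently, e.g.
// url: `/api/v1/attachments/${pickVespaId(fileId, sheetResults)}` and fileId set to that value.
```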
♻️ Duplicate comments (2)
server/integrations/google/index.ts (1)
2189-2198
: Critical: Unit mismatch causes spreadsheet rejection bug.
spreadsheet.size
returns bytes butMAX_GD_SHEET_SIZE
is 10 (for 10 MB per config comment). Line 2191 compares bytes directly to 10, causing spreadsheets larger than 10 bytes to be rejected. Line 2193 also incorrectly divides the MB value by1024 * 1024
again, showing 0.0000095 MB in logs instead of 10 MB.Apply this diff to convert the MB config value to bytes before comparison:
// Early size check before fetching spreadsheet data const sizeInBytes = spreadsheet.size ? parseInt(spreadsheet.size, 10) : 0 - if (!isNaN(sizeInBytes) && sizeInBytes > MAX_GD_SHEET_SIZE) { + const maxSheetSizeInBytes = MAX_GD_SHEET_SIZE * 1024 * 1024 + if (!isNaN(sizeInBytes) && sizeInBytes > maxSheetSizeInBytes) { const sizeInMB = (sizeInBytes / (1024 * 1024)).toFixed(2) - const maxSizeInMB = (MAX_GD_SHEET_SIZE / (1024 * 1024)).toFixed(2) + const maxSizeInMB = MAX_GD_SHEET_SIZE.toFixed(2) loggerWithChild({ email: userEmail }).warn( `Ignoring ${spreadsheet.name} as its size (${sizeInMB} MB) exceeds the limit of ${maxSizeInMB} MB`, )Based on configuration at
server/integrations/google/config.ts
.server/api/chat/chat.ts (1)
220-236
: expandSheetIds still skips the requested sheet IDLine 234’s
for (let i = 0; i < upper; i++)
stops one iteration early, so_sheet_0
returns an empty array and_sheet_N
omits_sheet_N
. As a result,ChatDeleteApi
never deletes the actual sheet document andflatMap
on attachments silently drops the final sheet. Include the upper bound (and keep 0 inclusive) so every sheet ID is returned.- const upper = Number.isFinite(sheetNumber) ? sheetNumber : 1 - for (let i = 0; i < upper; i++) { + const upper = Number.isFinite(sheetNumber) ? Math.max(sheetNumber, 0) : 0 + for (let i = 0; i <= upper; i++) { expandedIds.push(`${docId}_sheet_${i}`) }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
server/ai/context.ts
(3 hunks)server/api/chat/chat.ts
(14 hunks)server/api/files.ts
(5 hunks)server/integrations/google/index.ts
(5 hunks)server/lib/duckdb.ts
(1 hunks)server/lib/sqlInference.ts
(1 hunks)server/lib/sqlValidator.ts
(1 hunks)server/package.json
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- server/lib/sqlInference.ts
- server/lib/duckdb.ts
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-28T10:47:41.020Z
Learnt from: naSim087
PR: xynehq/xyne#484
File: server/integrations/google/sync.ts:222-222
Timestamp: 2025-05-28T10:47:41.020Z
Learning: The functions `handleGoogleDriveChange` and `getDriveChanges` in `server/integrations/google/sync.ts` are intentionally exported for future changes, even though they are not currently being imported by other modules.
Applied to files:
server/integrations/google/index.ts
📚 Learning: 2025-06-17T08:55:40.153Z
Learnt from: oindrila-b
PR: xynehq/xyne#557
File: server/integrations/google/gmail-worker.ts:313-317
Timestamp: 2025-06-17T08:55:40.153Z
Learning: In server/integrations/google/gmail-worker.ts, the log message intentionally shows totalMails (cumulative count) while sendStatsUpdate uses insertedMessagesInBatch (batch count). This is by design - logs show total progress for human readability while stats updates track batch progress for system functionality.
Applied to files:
server/integrations/google/index.ts
🧬 Code graph analysis (5)
server/ai/context.ts (2)
server/lib/duckdb.ts (1)
querySheetChunks
(15-223)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/api/files.ts (3)
server/services/fileProcessor.ts (2)
FileProcessorService
(40-193)SheetProcessingResult
(33-38)server/integrations/dataSource/config.ts (1)
getBaseMimeType
(69-71)server/search/vespa.ts (1)
insert
(19-19)
server/api/chat/chat.ts (4)
server/search/vespa.ts (1)
DeleteDocument
(23-23)server/utils.ts (1)
getErrorMessage
(103-106)server/types.ts (1)
UserMetadataType
(605-605)server/ai/context.ts (2)
answerContextMap
(887-985)cleanContext
(1004-1006)
server/integrations/google/index.ts (2)
server/integrations/google/config.ts (1)
MAX_GD_SHEET_SIZE
(15-15)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/lib/sqlValidator.ts (1)
server/logger/index.ts (2)
getLogger
(36-93)Subsystem
(15-15)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (3)
server/integrations/google/index.ts (3)
103-103
: LGTM: Import consolidation improves maintainability.The shift from multiple constants (
MAX_GD_SHEET_ROWS
,MAX_GD_SHEET_TEXT_LEN
) to a singleMAX_GD_SHEET_SIZE
simplifies configuration. The import ofchunkSheetWithHeaders
centralizes sheet-processing logic with header awareness, which is a good architectural improvement.Also applies to: 1076-1076
2244-2244
: LGTM: Centralized chunking with header awareness.The call to
chunkSheetWithHeaders(finalRows)
correctly replaces inline chunking logic. ThefinalRows
type (string[][]
) matches the function signature, and omitting the optionalconfig
parameter is fine—the function will apply sensible defaults for chunk size and rows per chunk.
2919-2919
: LGTM: Query precedence issue resolved.The addition of the optional
q
parameter and the parentheses wrapping in the query construction ((${q}) and trashed = false
) correctly handles OR operators. This prevents the AND precedence bug where trashed files could slip through whenq
contains OR clauses. The change is backward-compatible since existing callers don't passq
.Well done addressing the previous review feedback on query precedence!
Also applies to: 2925-2925
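For illustration, the precedence issue this comment describes, based on the review's statement that `and` binds tighter than `or` in the Drive query language:

```typescript
// Why the parentheses matter when the caller-supplied query contains `or`.
const q = "name contains 'budget' or name contains 'forecast'"

const unsafe = `${q} and trashed = false`
// reads as: budget OR (forecast AND not trashed) — trashed "budget" files slip through

const safe = `(${q}) and trashed = false`
// reads as: (budget OR forecast) AND not trashed — trashed files are always excluded
```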
Actionable comments posted: 1
♻️ Duplicate comments (1)
server/ai/context.ts (1)
97-116
: Fix chunk index shifting after header injectionYou still leave every data chunk at its original index even though you insert the header at index 0 and shift
chunk_scores
by +1. Consumers that sort or join onindex
now see duplicate0
s and misalignedmatchfeatures
, so the wrong rows end up ahead/behind the header—exactly the bug called out earlier. Please bump each object chunk’s index when a header is present.- processedChunks.push({ - chunk: dataContent, - score: originalChunk.score, - index: originalChunk.index - }); + const baseIndex = originalChunk.index ?? i + processedChunks.push({ + chunk: dataContent, + score: originalChunk.score, + index: headerChunk ? baseIndex + 1 : baseIndex, + });
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
server/ai/context.ts
(2 hunks)server/api/knowledgeBase.ts
(2 hunks)server/package.json
(2 hunks)server/services/fileProcessor.ts
(6 hunks)server/types.ts
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- server/types.ts
- server/api/knowledgeBase.ts
- server/package.json
🧰 Additional context used
🧬 Code graph analysis (2)
server/services/fileProcessor.ts (1)
server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/ai/context.ts (2)
server/lib/duckdb.ts (1)
querySheetChunks
(15-223)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
Actionable comments posted: 2
♻️ Duplicate comments (2)
server/api/chat/agents.ts (1)
1544-1571
: Past review issue remains unresolved: String concatenation on AiContext object.This is the same issue flagged in the previous review. The code calls
answerContextMap(...)
which returns anAiContext
object (not a string), then concatenates it as a string at line 1566 and 1569. This will produce "[object Object]" at runtime.The previous review's suggested fix (using
vespaResultToMinimalAgentFragment
) won't resolve this because that helper function has the same underlying issue (see my comment on lines 525-542).Apply this corrected fix:
const contextPromises = results?.root?.children?.map( async (v, i) => { - let content = await answerContextMap( + const aiContext = await answerContextMap( v as VespaSearchResults, userMetadata, 0, true, undefined, message, ) + // Extract textual content from AiContext + let content = typeof aiContext === 'string' + ? aiContext + : (aiContext.chunks?.join('\n') || '') + const chatContainerFields = isChatContainerFields(v.fields) && v.fields.sddocname === chatContainerSchema ? v.fields : undefined if (chatContainerFields?.creator) { const creator = await getDocumentOrNull( chatUserSchema, chatContainerFields.creator, ) if (creator && isChatUserFields(creator.fields)) { content += `\nCreator: ${creator.fields.name}` } } return `Index ${i + 1} \n ${content}` }, )server/api/chat/chat.ts (1)
220-239
: expandSheetIds off-by-one (omits last sheet) and lacks a safety capLoop builds 0..sheetNumber-1; docId_sheet_N is missing. Add inclusive upper bound and a sane cap to avoid pathological loops.
Apply:
export function expandSheetIds(fileId: string): string[] { // Check if the fileId matches the pattern docId_sheet_number const sheetMatch = fileId.match(/^(.+)_sheet_(\d+)$/) if (!sheetMatch) { // Not a sheet ID, return as is return [fileId] } const [, docId, sheetNumberStr] = sheetMatch const sheetNumber = parseInt(sheetNumberStr, 10) // Generate IDs from docId_sheet_0 to docId_sheet_number const expandedIds: string[] = [] - const upper = Number.isFinite(sheetNumber) ? sheetNumber : 1 - for (let i = 0; i < upper; i++) { + // Cap to prevent accidental huge loops; adjust if needed + const upper = Number.isFinite(sheetNumber) + ? Math.min(Math.max(sheetNumber, 0), 1000) + : 0 + for (let i = 0; i <= upper; i++) { expandedIds.push(`${docId}_sheet_${i}`) } return expandedIds }Optional: move this helper to a shared utils module for reuse.
🧹 Nitpick comments (3)
server/api/chat/chat.ts (3)
3357-3357
: Avoid building full context solely for span attributes (heavy and noisy)These lines compute full context only to attach to spans. This adds latency and large attributes. Prefer IDs/counts or a truncated preview.
Example:
-iterationSpan?.setAttribute(`context`, await buildContext(items, 20, userMetadata, 0, input)) +iterationSpan?.setAttribute('context_ids', items.map((v:any) => v.fields?.docId).filter(Boolean)) +iterationSpan?.setAttribute('context_count', items.length)Also applies to: 3680-3680
877-890
: Per‑ID Vespa deletion: handle 404 as warn and keep others as errorsInner catch logs all failures as errors and bypasses the outer 404 handler. Downgrade 404s to warn here and keep loop going.
- for (const id of vespaIds) { - try { - await DeleteDocument(id, KbItemsSchema) - loggerWithChild({ email }).info( - `Successfully deleted non-image attachment ${id} from Vespa kb_items schema`, - ) - } catch (error) { - loggerWithChild({ email }).error( - `Failed to delete non-image attachment ${id} from Vespa kb_items schema`, - { error: getErrorMessage(error) } - ) - } - } + for (const id of vespaIds) { + try { + await DeleteDocument(id, KbItemsSchema) + loggerWithChild({ email }).info( + `Successfully deleted non-image attachment ${id} from Vespa kb_items schema`, + ) + } catch (error) { + const msg = getErrorMessage(error) + if (msg.includes("404")) { + loggerWithChild({ email }).warn( + `Non-image attachment ${id} not found in Vespa kb_items schema (may have been already deleted)`, + ) + } else { + loggerWithChild({ email }).error( + `Failed to delete non-image attachment ${id} from Vespa kb_items schema`, + { error: msg }, + ) + } + } + }Optional: use Promise.allSettled for small batches to speed up, with rate limiting if needed.
4269-4269
: Guard falsy IDs and dedupe when expanding attachment IDsIf any attachment lacks fileId, expandSheetIds("") yields [""], which can leak empty IDs into fileIds. Also dedupe after expansion.
- const nonImageAttachmentFileIds = attachmentMetadata - .filter((m) => !m.isImage) - .flatMap((m) => expandSheetIds(m.fileId)) + const nonImageAttachmentFileIds = Array.from( + new Set( + attachmentMetadata + .filter((m) => !m.isImage && !!m.fileId) + .flatMap((m) => expandSheetIds(m.fileId!)) + ) + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
server/api/chat/agents.ts
(8 hunks)server/api/chat/chat.ts
(14 hunks)server/api/knowledgeBase.ts
(2 hunks)server/package.json
(2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- server/package.json
- server/api/knowledgeBase.ts
🧰 Additional context used
🧬 Code graph analysis (2)
server/api/chat/chat.ts (3)
server/search/vespa.ts (1)
DeleteDocument
(24-24)server/utils.ts (1)
getErrorMessage
(103-106)server/ai/context.ts (2)
answerContextMap
(887-985)cleanContext
(1004-1006)
server/api/chat/agents.ts (4)
server/types.ts (1)
UserMetadataType
(605-605)server/api/chat/types.ts (1)
MinimalAgentFragment
(88-94)server/ai/context.ts (1)
answerContextMap
(887-985)server/api/chat/chat.ts (1)
expandSheetIds
(220-239)
🔇 Additional comments (8)
server/api/chat/agents.ts (5)
846-849
: LGTM: Correct usage of expandSheetIds for sheet attachments.The code correctly uses
flatMap
withexpandSheetIds
to expand sheet file IDs (matching patterndocId_sheet_N
) into arrays containing all sheet IDs from 0 to N. Non-sheet file IDs pass through unchanged. This ensures that when a user attaches a spreadsheet, all individual sheets are available for retrieval and context building.
1622-1641
: Async pattern is correct but inherits upstream type mismatch.The async/await usage and Promise.all pattern are correctly implemented here:
- Properly awaits the async
vespaResultToMinimalAgentFragment
function- Correctly passes all required parameters including the
message
as query- Uses Promise.all for efficient parallel execution
However, this code will inherit the critical type mismatch bug from
vespaResultToMinimalAgentFragment
(lines 525-542) where thecontent
field receives an AiContext object instead of a string. Once that upstream function is fixed, this code will work correctly.
2912-2914
: Consistent async pattern across multiple endpoints.The async fragment generation follows the same correct pattern in both
AgentMessageApiRagOff
(lines 2912-2914) and the non-streaming path (lines 3182-3184):
- Properly awaited Promise.all
- Correct parameter passing including message as query
- Efficient parallel execution
These implementations are consistent with the pattern at lines 1622-1641. All will inherit the type mismatch issue from
vespaResultToMinimalAgentFragment
and will work correctly once that upstream function is fixed.Also applies to: 3182-3184
3396-3399
: LGTM: Consistent sheet ID expansion.The usage of
expandSheetIds
here is identical to line 848 and correctly expands sheet attachment IDs. The implementation is consistent across different API endpoints.
124-124
: LGTM: Import added for sheet ID expansion.The
expandSheetIds
import is correctly added to support the new sheet attachment handling functionality used at lines 848 and 3398.server/api/chat/chat.ts (3)
1223-1243
: Async buildContext threading query into answerContextMap looks goodSignature change and propagation of builtUserQuery are consistent; join/clean behavior preserved.
2217-2224
: Query‑aware answerContextMap is correctly wiredPassing message as the optional query enables sheet‑aware processing. Looks correct.
4325-4326
: KB file ID expansion LGTMFlattening per‑sheet IDs and filtering empties is correct for selectedKbItems.
Actionable comments posted: 0
🧹 Nitpick comments (2)
server/api/chat/tools.ts (2)
449-472
: Consider error handling for concurrent fragment processing.The
Promise.all
pattern withasync map
correctly implements concurrent fragment processing. However, it will fail fast if any singleanswerContextMap
call throws an error, potentially losing all successfully processed fragments.Consider adding individual error handling within the map callback to make the operation more resilient:
const fragments: MinimalAgentFragment[] = await Promise.all(children.map(async (r) => { + try { if (r.fields.sddocname === dataSourceFileSchema) { const fields = r.fields as VespaDataSourceFile return { id: `${fields.docId}`, content: await answerContextMap(r, userMetadata, maxDefaultSummary), source: { docId: fields.docId, title: fields.fileName || "Untitled", url: `/dataSource/${(fields as VespaDataSourceFile).docId}`, app: fields.app || Apps.DataSource, entity: DataSourceEntity.DataSourceFile, }, confidence: r.relevance || 0.7, } } const citation = searchToCitation(r) return { id: `${citation.docId}`, content: await answerContextMap(r, userMetadata, maxDefaultSummary), source: citation, confidence: r.relevance || 0.7, } + } catch (error) { + Logger.error(error, `Failed to process fragment: ${r.id}`) + const citation = searchToCitation(r) + return { + id: `${citation.docId}`, + content: "Content unavailable due to processing error", + source: citation, + confidence: 0.5, + } + } }))This pattern should be applied consistently across all similar implementations in this file (lines 1037-1053, 1251-1267, 1491-1507, 1828-1844, 2035-2051).
1828-1844
: Consider using more robust fragment ID generation.The fragment IDs use
Date.now()
andMath.random()
which could theoretically collide under high concurrency or rapid execution.Consider using a more robust ID generation approach:
// At the top of the file, if not already imported import { randomUUID } from 'crypto' // In the fragment creation id: `${citation.docId}-${randomUUID()}`Or use a counter-based approach:
id: `${citation.docId}-${Date.now()}-${index}`
However, given that
citation.docId
is likely already unique and the current approach includes timestamp and random components, the collision risk is very low in practice.Also applies to: 2035-2051
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/api/chat/tools.ts
(12 hunks)
🔇 Additional comments (4)
server/api/chat/tools.ts (4)
1037-1053
: LGTM: Async fragment processing correctly implemented.The conversion to async fragment processing using
Promise.all
withasync map
is correctly implemented. The pattern properly awaitsanswerContextMap
and maintains the expected return type.Note: Consider applying the error handling suggestion from the earlier comment for improved resilience.
1251-1267
: LGTM: Consistent async pattern applied.The async fragment processing pattern is consistently applied here as well.
1491-1507
: LGTM: Async pattern correctly applied.The async fragment processing is correctly implemented with proper typing.
262-262
: Verify hardcoded timezone for user metadata.The
userMetadata
uses a hardcoded timezone"Asia/Kolkata"
which seems incorrect for a variable nameduserMetadata
. This should likely be dynamically determined from the actual user's settings or context.Please verify whether this hardcoded value is:
- A temporary placeholder that should be replaced with actual user timezone
- A default fallback that should be documented as such
- Or if there's a specific reason for this hardcoding
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/services/fileProcessor.ts (1)
91-99
: Add defensive validation for workbook.After
XLSX.read
orXLSX.readFile
, validate thatworkbook
is defined before accessingworkbook.SheetNames
. While these methods typically throw on failure, defensive checks improve resilience against unexpected edge cases.Apply this diff:
if (!storagePath) { workbook = XLSX.read(buffer, { type: "buffer" }) } else { workbook = XLSX.readFile(storagePath) } - if (!workbook.SheetNames || workbook.SheetNames.length === 0) { + if (!workbook || !workbook.SheetNames || workbook.SheetNames.length === 0) { throw new Error("No worksheets found in spreadsheet") }
♻️ Duplicate comments (2)
server/api/knowledgeBase.ts (1)
1604-1617
: FixexpandSheetIds
or sheet docs stay behind
expandSheetIds(itemToDelete.vespaDocId)
still omits the_sheet_N
document because the helper stops ati < sheetNumber
. The new loop therefore deletes_sheet_0..N-1
yet leaves the original_sheet_N
doc in Vespa, so sheet files are never fully purged.Please make the helper iterate inclusively:
--- a/server/api/chat/chat.ts +++ b/server/api/chat/chat.ts @@ - const upper = Number.isFinite(sheetNumber) ? sheetNumber : 1 - for (let i = 0; i < upper; i++) { + const upper = Number.isFinite(sheetNumber) ? Math.max(sheetNumber, 0) : 0 + for (let i = 0; i <= upper; i++) { expandedIds.push(`${docId}_sheet_${i}`) }server/lib/sqlInference.ts (1)
86-92
: Address the previous review comment about storing model configuration.A previous review comment (visible in past_review_comments) suggested storing the model name as a constant for easier future updates. This has not been addressed. The model is currently retrieved from config at runtime.
To address the previous feedback, consider either:
- Defining a dedicated constant for SQL inference model defaults in a shared constants file, or
- Documenting why the current approach (reading from config.sqlInferenceModel) is preferred over constants
The current approach of reading from config is actually more flexible than hardcoded constants, as it allows runtime configuration via environment variables (once the suggestion in config.ts is implemented). If this is the intended design, please respond to the previous comment explaining this rationale.
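If the constant route is chosen, a minimal sketch could look like the following. Names are hypothetical, and in the repo the Models enum members would be used instead of raw strings.

```typescript
// Hypothetical shared constants for SQL-inference model defaults, still overridable via config/env.
export const SQL_INFERENCE_MODEL_DEFAULTS = {
  aws: "Claude_Sonnet_4",
  vertex: "Vertex_Claude_Sonnet_4",
} as const

// config.ts could then resolve the model as:
// sqlInferenceModel = process.env["SQL_INFERENCE_MODEL"] ?? SQL_INFERENCE_MODEL_DEFAULTS.aws
```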
🧹 Nitpick comments (6)
server/config.ts (1)
78-78
: Consider allowing SQL inference model override via environment variable.The SQL inference model is hardcoded to
Claude_Sonnet_4
for AWS andVertex_Claude_Sonnet_4
for Vertex AI. Consider allowing users to override this choice via an environment variable (e.g.,SQL_INFERENCE_MODEL
) for flexibility in testing different models or managing costs.Apply this diff to add environment variable override support:
if (process.env["AWS_ACCESS_KEY"] && process.env["AWS_SECRET_KEY"]) { AwsAccessKey = process.env["AWS_ACCESS_KEY"] AwsSecretKey = process.env["AWS_SECRET_KEY"] defaultFastModel = Models.Claude_3_5_Haiku defaultBestModel = Models.Claude_Sonnet_4 - sqlInferenceModel = Models.Claude_Sonnet_4 + sqlInferenceModel = process.env["SQL_INFERENCE_MODEL"] + ? (process.env["SQL_INFERENCE_MODEL"] as Models) + : Models.Claude_Sonnet_4Apply similar changes at line 141 for Vertex AI.
server/lib/sqlInference.ts (5)
27-31
: Improve the warning message to indicate which providers support SQL inference.The warning "SQL inference model not set" doesn't inform users which providers support this feature. Given that only AWS and Vertex AI currently have SQL inference models configured (see config.ts), users might be confused why the feature isn't working.
Apply this diff to improve the warning:
if (!model) { - Logger.warn("SQL inference model not set, returning null"); + Logger.warn("SQL inference model not configured for current provider. SQL inference is only available for AWS (Claude) and Vertex AI providers."); return null; }
34-43
: The stripNoise function could be more robust.The current implementation uses simple indexOf/lastIndexOf to extract JSON, which could fail with nested objects or arrays that have multiple braces. While this might work for the expected response format, it's fragile.
Consider using a more robust approach that handles nested structures:
const stripNoise = (s: string) => { let t = s.trim(); // remove code fences t = t.replace(/```(?:json)?/gi, "").replace(/```/g, ""); // Find the first valid JSON object by tracking brace depth let depth = 0; let start = -1; let end = -1; for (let i = 0; i < t.length; i++) { if (t[i] === '{') { if (depth === 0) start = i; depth++; } else if (t[i] === '}') { depth--; if (depth === 0 && start !== -1) { end = i; break; } } } if (start !== -1 && end !== -1) { return t.slice(start, end + 1).trim(); } return t.trim(); };
89-89
: Consider increasing max_new_tokens for complex queries.The current limit of 512 tokens may be insufficient for complex SQL queries involving multiple JOINs, aggregations, or wide tables with many columns. Complex analytical queries can easily exceed this limit.
Consider increasing to 1024 or 2048:
const modelParams = { modelId: model, temperature: 0.1, - max_new_tokens: 512, + max_new_tokens: 1024, stream: false, systemPrompt: "You are a helpful assistant that analyzes queries and generates SQL when appropriate." }
94-94
: Add timeout handling for the LLM call.The
provider.converse()
call has no explicit timeout, which could cause the request to hang indefinitely if the LLM service is unresponsive. This is particularly important for user-facing features.Add a timeout wrapper:
const timeout = 30000; // 30 seconds const timeoutPromise = new Promise((_, reject) => setTimeout(() => reject(new Error('LLM request timed out')), timeout) ); const response = await Promise.race([ provider.converse(messages, modelParams), timeoutPromise ]) as Awaited<ReturnType<typeof provider.converse>>;
123-127
: Consider more granular error handling.The current catch-all error handler logs and returns null for any error. While this is a safe fallback, more specific error handling could provide better diagnostics and potentially allow recovery for some error types (e.g., retrying on timeout, falling back to a simpler query on token limit exceeded).
Consider categorizing errors:
} catch (error) { if (error instanceof SyntaxError) { Logger.error("Failed to parse LLM response as valid JSON", { error }); } else if (error.message?.includes('timeout')) { Logger.error("LLM request timed out", { error }); } else if (error.message?.includes('token')) { Logger.error("Token limit exceeded in LLM request", { error }); } else { Logger.error("Failed to analyze query and generate SQL", { error }); } return null; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
server/api/knowledgeBase.ts
(4 hunks)server/config.ts
(4 hunks)server/integrations/dataSource/errors.ts
(1 hunks)server/integrations/dataSource/index.ts
(5 hunks)server/integrations/google/index.ts
(5 hunks)server/integrations/google/worker-utils.ts
(7 hunks)server/integrations/microsoft/attachment-utils.ts
(7 hunks)server/lib/sqlInference.ts
(1 hunks)server/services/fileProcessor.ts
(6 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-07T05:35:17.369Z
Learnt from: Himanshvarma
PR: xynehq/xyne#1046
File: server/services/fileProcessor.ts:118-131
Timestamp: 2025-10-07T05:35:17.369Z
Learning: In server/services/fileProcessor.ts, SheetProcessingResult objects intentionally use empty chunks_map and image_chunks_map arrays because each sheet is ingested as a separate Vespa document with sheet-level metadata (sheetName, sheetIndex, totalSheets, docId), making per-chunk metadata redundant for sheet files.
Applied to files:
server/services/fileProcessor.ts
🧬 Code graph analysis (7)
server/integrations/dataSource/index.ts (3)
server/integrations/dataSource/errors.ts (1)
createFileSizeError
(94-100)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)server/integrations/dataSource/config.ts (5)
DATASOURCE_CONFIG
(2-66)isDocxFile
(89-92)isPptxFile
(94-97)isSheetFile
(79-82)isTextFile
(74-77)
server/integrations/google/index.ts (3)
server/integrations/dataSource/index.ts (1)
checkFileSize
(94-99)server/integrations/google/config.ts (1)
MAX_GD_SHEET_SIZE
(15-15)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/lib/sqlInference.ts (3)
server/logger/index.ts (2)
getLogger
(36-93)Subsystem
(15-15)server/types.ts (1)
DuckDBQuery
(615-618)server/ai/provider/index.ts (1)
getProviderByModel
(368-418)
server/api/knowledgeBase.ts (4)
server/integrations/dataSource/index.ts (1)
checkFileSize
(94-99)server/api/chat/chat.ts (1)
expandSheetIds
(220-239)server/search/vespa.ts (1)
DeleteDocument
(24-24)server/utils.ts (1)
getErrorMessage
(103-106)
server/integrations/microsoft/attachment-utils.ts (3)
server/integrations/dataSource/index.ts (1)
checkFileSize
(94-99)server/integrations/google/config.ts (5)
MAX_ATTACHMENT_PDF_SIZE
(22-22)MAX_ATTACHMENT_DOCX_SIZE
(24-24)MAX_ATTACHMENT_PPTX_SIZE
(25-25)MAX_ATTACHMENT_TEXT_SIZE
(23-23)MAX_ATTACHMENT_SHEET_SIZE
(26-26)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/services/fileProcessor.ts (1)
server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
server/integrations/google/worker-utils.ts (3)
server/integrations/dataSource/index.ts (1)
checkFileSize
(94-99)server/integrations/google/config.ts (1)
MAX_ATTACHMENT_SHEET_SIZE
(26-26)server/sheetChunk.ts (1)
chunkSheetWithHeaders
(408-465)
🔇 Additional comments (7)
server/services/fileProcessor.ts (5)
33-52
: Interface design looks good.The
SheetProcessingResult
interface appropriately extendsProcessingResult
with sheet-specific metadata (sheetName, sheetIndex, totalSheets, docId), and the array return type enables per-sheet document handling. The type aliasProcessingResultArray
improves readability.Note: A previous reviewer already flagged the ergonomics of the array return type for single-result files.
60-63
: LGTM!Wrapping the PDF processing result in an array maintains consistency with the new return signature.
103-140
: Per-sheet processing logic is sound.The implementation correctly:
- Iterates each sheet and generates unique
docId
values (${vespaDocId}_sheet_${sheetIndex}
)- Uses
chunkSheetWithHeaders
for header-preserving chunking- Filters out empty content and skips sheets with no valid data
- Aggregates results into
SheetProcessingResult
objects- Errors if no sheets yield content
Note: The empty
chunks_map
andimage_chunks_map
arrays are intentional, as each sheet is ingested as a separate Vespa document with sheet-level metadata. Based on learnings.
164-171
: LGTM!The enhanced error handling provides detailed logging with context (filename, mimeType, buffer size) and re-throws with a descriptive message, enabling proper error propagation and upstream handling.
173-194
: LGTM!The backward compatibility handling correctly creates default chunk metadata for non-PDF/non-sheet files and wraps the result in an array for consistency with the new signature.
server/config.ts (1)
193-193
: LGTM!The export of
sqlInferenceModel
is consistent with the pattern used for other configuration values.server/lib/sqlInference.ts (1)
1-11
: LGTM!Imports and logger setup follow the established patterns in the codebase.
Description

- Improved sheet ingestion
- Added DuckDB for performing RAG on sheets

### Testing

Tested locally
Summary by CodeRabbit
New Features
Improvements
Bug Fixes