-
Notifications
You must be signed in to change notification settings - Fork 56
fix(fileProcessing-status): updated backend for upload-status #1024
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds a POST /cl/poll-status endpoint to fetch non-deleted items for provided collections, new DB helpers to compute and roll up child upload statuses (exported updateParentStatus, markParentAsProcessing, and getCollectionItemsStatusByCollections), updates file processing/worker flows to defer parent completion and propagate status via updateParentStatus, and raises file-processing concurrency defaults. Note: new API handler and some helpers appear duplicated within files. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
actor Client
participant Server as API Server
participant DB as Database
Client->>Server: POST /cl/poll-status { collectionIds[] }
Server->>Server: Authenticate & validate input
Server->>DB: getCollectionItemsStatusByCollections(collectionIds, userId)
DB-->>Server: items[]
Server-->>Client: 200 OK { items }
sequenceDiagram
autonumber
participant Worker
participant Processor as FileProcessor
participant DB as Database
Worker->>Processor: processJob(job)
Processor->>DB: internal processing (mark PROCESSING, do work)
alt success
Processor->>DB: persist results
Processor->>DB: updateParentStatus(parentId, isCollection=false) or updateParentStatus(collectionId, true)
else failure (terminal)
Processor->>Processor: handleRetryFailure(err, parentId?, collectionId?)
Processor->>DB: updateParentStatus(...)\n(if applicable for FILE)
end
note right of Processor: Folders/collections are left pending until updateParentStatus\nmarks them COMPLETED when all children are terminal
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested reviewers
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Summary of ChangesHello @SahilKumar000, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refines the backend's handling of file upload and processing statuses, particularly for hierarchical structures like collections and folders. It introduces a new API endpoint for clients to efficiently poll the processing status of items within multiple collections. A key improvement is the implementation of a robust parent status update mechanism, ensuring that collections and folders accurately reflect their overall processing state only after all their contained items have been processed, whether successfully or with failures. Additionally, the default worker thread configuration for file processing has been increased to improve throughput. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request refactors the file processing status update logic to correctly handle hierarchical statuses for collections and folders. A new API endpoint is introduced for polling these statuses. The changes correctly address a race condition and improve the overall robustness of the file processing workflow. My review includes suggestions to refactor duplicated code for better maintainability, use more idiomatic ORM features, and simplify a conditional check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/config.ts (1)
14-15: DATABASE_URL fallback is invalid ("$").Returning "$" will break DB init whenever DATABASE_URL is unset. Provide a real fallback or throw with a clear error.
Apply one of:
-function getDatabaseUrl(): string { - return process.env.DATABASE_URL || `$` -} +function getDatabaseUrl(): string { + const url = process.env.DATABASE_URL?.trim() + if (url) return url + // Construct from pieces if provided, otherwise fail loudly + const host = process.env.DATABASE_HOST || "0.0.0.0" + const port = process.env.DATABASE_PORT || "5432" + const user = process.env.DATABASE_USER || "postgres" + const pass = process.env.DATABASE_PASSWORD || "postgres" + const db = process.env.DATABASE_NAME || "postgres" + if (!host || !db) { + throw new Error("DATABASE_URL not set and insufficient parts to construct it") + } + return `postgresql://${encodeURIComponent(user)}:${encodeURIComponent(pass)}@${host}:${port}/${db}` +}
🧹 Nitpick comments (4)
server/config.ts (1)
223-224: Use configured vespaPort instead of hardcoded 8080.- vespaEndpoint: `http://${vespaBaseHost}:8080`, + vespaEndpoint: `http://${vespaBaseHost}:${vespaPort}`,server/db/knowledgeBase.ts (2)
832-949: Parent status marked COMPLETED even when some children FAILED.This can mislead UIs/automations. Prefer FAILED if any child failed; COMPLETED only when all completed.
Apply:
- await trx - .update(collections) - .set({ - uploadStatus: UploadStatus.COMPLETED, - statusMessage: `Successfully processed collection: ${completedCount} completed, ${failedCount} failed`, - updatedAt: sql`NOW()`, - }) + await trx + .update(collections) + .set({ + uploadStatus: failedCount > 0 ? UploadStatus.FAILED : UploadStatus.COMPLETED, + statusMessage: `Processed collection: ${completedCount} completed, ${failedCount} failed`, + updatedAt: sql`NOW()`, + }) .where(eq(collections.id, parentId))And for folders:
- await trx - .update(collectionItems) - .set({ - uploadStatus: UploadStatus.COMPLETED, - statusMessage: `Successfully processed folder: ${completedCount} completed, ${failedCount} failed`, - updatedAt: sql`NOW()`, - }) + await trx + .update(collectionItems) + .set({ + uploadStatus: failedCount > 0 ? UploadStatus.FAILED : UploadStatus.COMPLETED, + statusMessage: `Processed folder: ${completedCount} completed, ${failedCount} failed`, + updatedAt: sql`NOW()`, + }) .where(eq(collectionItems.id, parentId))If “partial success” needs its own state, introduce UploadStatus.PARTIAL.
842-852: Reduce DB round‑trips: compute counts in SQL.Instead of fetching every child row and counting in JS, select total/completed/failed via SUM(CASE …) in a single query for each parent.
Example pattern (Postgres + Drizzle):
const [agg] = await trx .select({ total: sql<number>`count(*)`, completed: sql<number>`sum(case when ${collectionItems.uploadStatus} = ${UploadStatus.COMPLETED} then 1 else 0 end)`, failed: sql<number>`sum(case when ${collectionItems.uploadStatus} = ${UploadStatus.FAILED} then 1 else 0 end)`, }) .from(collectionItems) .where(and(/* same filters as current */))Then use agg.total/agg.completed/agg.failed for decisions.
Also applies to: 887-896
server/server.ts (1)
1118-1118: Add body validation for poll-status.Guard against empty/huge payloads and bad IDs with Zod on the route (consistent with other endpoints).
Apply:
- .post("/cl/poll-status", PollCollectionsStatusApi) + // Validate collectionIds payload + .post( + "/cl/poll-status", + zValidator( + "json", + z.object({ + collectionIds: z.array(z.string().uuid()).min(1).max(100), + }), + ), + PollCollectionsStatusApi, + )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
server/api/knowledgeBase.ts(1 hunks)server/config.ts(1 hunks)server/db/knowledgeBase.ts(2 hunks)server/queue/fileProcessor.ts(7 hunks)server/server.ts(2 hunks)server/worker.ts(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
server/api/knowledgeBase.ts (4)
server/db/schema/users.ts (1)
users(27-57)server/db/user.ts (1)
getUserByEmail(147-156)server/db/schema/knowledgeBase.ts (1)
collectionItems(69-140)server/utils.ts (1)
getErrorMessage(103-106)
server/worker.ts (1)
server/queue/fileProcessor.ts (2)
processJob(100-122)ProcessingJob(32-35)
server/server.ts (1)
server/api/knowledgeBase.ts (1)
PollCollectionsStatusApi(1649-1700)
server/queue/fileProcessor.ts (1)
server/db/knowledgeBase.ts (1)
updateParentStatus(833-949)
server/db/knowledgeBase.ts (2)
server/types.ts (1)
TxnOrClient(316-316)server/db/schema/knowledgeBase.ts (2)
collectionItems(69-140)collections(20-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
♻️ Duplicate comments (1)
server/db/knowledgeBase.ts (1)
832-913: Consider refactoring to reduce duplication.As noted in a previous review, the logic for handling collections vs folders is duplicated (fetching children, counting statuses, updating parent). Extracting common logic would improve maintainability.
Based on learnings: A past review suggested abstracting the common pattern for both branches. While not blocking, this refactor would make future changes (e.g., adding retry logic or additional status types) easier to maintain consistently across both code paths.
🧹 Nitpick comments (1)
server/db/knowledgeBase.ts (1)
895-910: Clarify recursive propagation logic.Line 907 uses
folder.parentId || folder.collectionIdto determine the next parent, but both values may be non-null simultaneously. If a folder has a parent folder,folder.parentIdwill be truthy andfolder.collectionIdis ignored, which is correct. However, the fallback relies on JavaScript's||short-circuit evaluation, which is implicit.The logic is correct but would benefit from an explicit comment or restructuring for clarity.
Consider making the intent explicit:
if (folder) { - // Recursively update parent (either another folder or the collection) - await updateParentStatus( - trx, - folder.parentId || folder.collectionId, - !folder.parentId, // isCollection = true if no parentId - ) + // Recursively update parent folder or fall back to collection + if (folder.parentId) { + await updateParentStatus(trx, folder.parentId, false) + } else { + await updateParentStatus(trx, folder.collectionId, true) + } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
server/db/knowledgeBase.ts(2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/db/knowledgeBase.ts (2)
server/types.ts (1)
TxnOrClient(316-316)server/db/schema/knowledgeBase.ts (2)
collectionItems(69-140)collections(20-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
♻️ Duplicate comments (2)
server/db/knowledgeBase.ts (2)
942-942: Replacesql\true`withundefined` to omit the condition.Drizzle ORM's
and()function will omitundefinedconditions from the query, which is cleaner than generating a literaltruein SQL.Apply this diff:
and( isCollection ? eq(collectionItems.collectionId, parentId) : eq(collectionItems.parentId, parentId), - isCollection ? isNull(collectionItems.parentId) : sql`true`, + isCollection ? isNull(collectionItems.parentId) : undefined, isNull(collectionItems.deletedAt), ),
966-1005: Race condition: Multiple concurrent completions may cause inconsistent parent status.When multiple child files complete simultaneously, each triggers
updateParentStatus. These concurrent calls can race, leading to stalestatusMessagecounts or redundant recursive updates.Consider using PostgreSQL advisory locks (e.g.,
pg_advisory_xact_lock) keyed on the parentId to serialize status checks per parent, or add idempotency checks to skip recursive calls if the parent is already COMPLETED.
🧹 Nitpick comments (1)
server/api/knowledgeBase.ts (1)
1650-1689: Access control is enforced at the DB layer.The handler correctly passes
user.idtogetCollectionItemsStatusByCollections, which filters items to only those from collections owned by the user. Unauthorized collection IDs are silently excluded from the result rather than raising 403, which is acceptable for a polling endpoint.Consider adding a clarifying comment at line 1669 to document this silent filtering behavior for future maintainers.
Apply this diff to add a clarifying comment:
+ // Note: The DB helper filters items by userId, so unauthorized collection IDs + // are silently excluded from the result (no 403 error). // Fetch items only for collections owned by the user (enforced in DB function) const items = await getCollectionItemsStatusByCollections(
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
server/api/knowledgeBase.ts(2 hunks)server/db/knowledgeBase.ts(5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/db/knowledgeBase.ts (2)
server/types.ts (1)
TxnOrClient(316-316)server/db/schema/knowledgeBase.ts (2)
collectionItems(69-140)collections(20-66)
server/api/knowledgeBase.ts (3)
server/db/user.ts (1)
getUserByEmail(147-156)server/db/knowledgeBase.ts (1)
getCollectionItemsStatusByCollections(755-785)server/utils.ts (1)
getErrorMessage(103-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (6)
server/api/knowledgeBase.ts (1)
38-39: LGTM!The new imports support the PollCollectionsStatusApi handler and are correctly sourced from the DB module.
server/db/knowledgeBase.ts (5)
15-15: LGTM!The import is necessary for the new status management helpers.
450-455: LGTM!The logic correctly marks the parent (folder or collection) as PROCESSING when a new folder is created, ensuring status propagates up the hierarchy.
539-544: LGTM!The logic correctly marks the parent (folder or collection) as PROCESSING when a new file is uploaded, ensuring status propagates up the hierarchy.
754-785: LGTM!The function correctly enforces access control by filtering collections to those owned by the user (line 778), preventing unauthorized access to items in other users' collections.
879-923: LGTM!The recursive propagation correctly marks all ancestors up to the collection as PROCESSING, ensuring the entire hierarchy reflects ongoing processing.
Description
Testing
Additional Notes
Summary by CodeRabbit
New Features
Bug Fixes
Chores