Skip to content

Conversation

@SahilKumar000
Copy link
Contributor

@SahilKumar000 SahilKumar000 commented Oct 1, 2025

Description

Testing

Additional Notes

Summary by CodeRabbit

  • New Features

    • Added a POST endpoint to poll collection processing status and return aggregated, up-to-date results.
  • Bug Fixes

    • Parent (folder/collection) statuses now wait for all children before marking complete.
    • Failures and retries correctly propagate to parent statuses so overall processing state is more reliable.
    • Parent status updates are triggered after individual file processing completes or fails.
  • Chores

    • Increased default file-processing parallelism (worker threads and team size) from 1 to 4.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Oct 1, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

Adds a POST /cl/poll-status endpoint to fetch non-deleted items for provided collections, new DB helpers to compute and roll up child upload statuses (exported updateParentStatus, markParentAsProcessing, and getCollectionItemsStatusByCollections), updates file processing/worker flows to defer parent completion and propagate status via updateParentStatus, and raises file-processing concurrency defaults. Note: new API handler and some helpers appear duplicated within files.

Changes

Cohort / File(s) Summary
API: Poll collections status
server/api/knowledgeBase.ts
Adds exported PollCollectionsStatusApi handler: authenticates user, validates collectionIds from POST body, calls getCollectionItemsStatusByCollections, and returns aggregated items with error handling. (Handler appears duplicated in file.) Imports getCollectionItemsStatusByCollections and markParentAsProcessing.
Server routing
server/server.ts
Registers new POST /cl/poll-status route wired to PollCollectionsStatusApi.
DB: Parent status & queries
server/db/knowledgeBase.ts
Adds getCollectionItemsStatusByCollections(trx, collectionIds, userId) and helpers markParentAsProcessing(trx, parentId, isCollection) and updateParentStatus(trx, parentId, isCollection) to set/propagate PROCESSING and COMPLETED states and compute child terminal counts. Imports UploadStatus. (Helpers appear duplicated in file in places.) Exports markParentAsProcessing and updateParentStatus.
Queue processor
server/queue/fileProcessor.ts
Propagates parent status updates: calls updateParentStatus on file success/failure, updates handleRetryFailure signature to accept parentId? and collectionId?, removes premature COMPLETED marks for collections/folders, and adjusts logs to indicate waiting for children. Imports updateParentStatus.
Worker simplification
server/worker.ts
Removes per-job DB status updates in worker; relies on processJob/processors to manage PROCESSING/COMPLETED and to call updateParentStatus; simplifies job control flow and reduces DB interactions.
Config defaults
server/config.ts
Changes defaults: fileProcessingWorkerThreads and fileProcessingTeamSize from 1 to 4.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  actor Client
  participant Server as API Server
  participant DB as Database

  Client->>Server: POST /cl/poll-status { collectionIds[] }
  Server->>Server: Authenticate & validate input
  Server->>DB: getCollectionItemsStatusByCollections(collectionIds, userId)
  DB-->>Server: items[]
  Server-->>Client: 200 OK { items }
Loading
sequenceDiagram
  autonumber
  participant Worker
  participant Processor as FileProcessor
  participant DB as Database

  Worker->>Processor: processJob(job)
  Processor->>DB: internal processing (mark PROCESSING, do work)
  alt success
    Processor->>DB: persist results
    Processor->>DB: updateParentStatus(parentId, isCollection=false) or updateParentStatus(collectionId, true)
  else failure (terminal)
    Processor->>Processor: handleRetryFailure(err, parentId?, collectionId?)
    Processor->>DB: updateParentStatus(...)\n(if applicable for FILE)
  end
  note right of Processor: Folders/collections are left pending until updateParentStatus\nmarks them COMPLETED when all children are terminal
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • shivamashtikar
  • devesh-juspay
  • junaid-shirur
  • zereraz

Poem

Thump—jobs hop in, the burrow hums bright,
Parents wait patient through day and night.
Threads quadrupled, the kits all play,
Statuses roll up, neat as hay.
I nibble logs and whisper: “Done—hooray!” 🥕🐇

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
Check name Status Explanation Resolution
Title Check ❓ Inconclusive The title correctly identifies that the pull request relates to file processing and upload status, but it remains too generic and does not highlight the primary changes such as the addition of a status polling API, the propagation of parent status updates, or the refactoring across multiple backend modules. This lack of specificity makes it harder for reviewers to quickly understand the key updates made. Consider revising the title to clearly reflect the main changes, for example by mentioning the new collection status polling endpoint and the parent status propagation logic, so that the scope and impact of the update are immediately apparent.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage ✅ Passed No functions found in the changes. Docstring coverage check skipped.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/fileProcessing-status

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @SahilKumar000, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the backend's handling of file upload and processing statuses, particularly for hierarchical structures like collections and folders. It introduces a new API endpoint for clients to efficiently poll the processing status of items within multiple collections. A key improvement is the implementation of a robust parent status update mechanism, ensuring that collections and folders accurately reflect their overall processing state only after all their contained items have been processed, whether successfully or with failures. Additionally, the default worker thread configuration for file processing has been increased to improve throughput.

Highlights

  • New API Endpoint for Status Polling: A new API endpoint, /cl/poll-status, has been introduced to allow clients to efficiently query the upload and processing status of items within multiple collections.
  • Enhanced Hierarchical Status Tracking: A robust mechanism has been implemented to update the uploadStatus of parent entities (collections and folders) based on the completion or failure of their child items. This ensures that parent statuses accurately reflect the overall processing state of their contents.
  • Increased File Processing Concurrency: The default number of worker threads and team size for file processing has been increased from 1 to 4, potentially improving the throughput and efficiency of file processing operations.
  • Centralized Status Management: Status updates for file processing jobs have been centralized within the processJob function, removing redundant logic from the worker initialization and ensuring consistent status handling across the system.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the file processing status update logic to correctly handle hierarchical statuses for collections and folders. A new API endpoint is introduced for polling these statuses. The changes correctly address a race condition and improve the overall robustness of the file processing workflow. My review includes suggestions to refactor duplicated code for better maintainability, use more idiomatic ORM features, and simplify a conditional check.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
server/config.ts (1)

14-15: DATABASE_URL fallback is invalid ("$").

Returning "$" will break DB init whenever DATABASE_URL is unset. Provide a real fallback or throw with a clear error.

Apply one of:

-function getDatabaseUrl(): string {
-    return process.env.DATABASE_URL || `$`
-}
+function getDatabaseUrl(): string {
+  const url = process.env.DATABASE_URL?.trim()
+  if (url) return url
+  // Construct from pieces if provided, otherwise fail loudly
+  const host = process.env.DATABASE_HOST || "0.0.0.0"
+  const port = process.env.DATABASE_PORT || "5432"
+  const user = process.env.DATABASE_USER || "postgres"
+  const pass = process.env.DATABASE_PASSWORD || "postgres"
+  const db   = process.env.DATABASE_NAME || "postgres"
+  if (!host || !db) {
+    throw new Error("DATABASE_URL not set and insufficient parts to construct it")
+  }
+  return `postgresql://${encodeURIComponent(user)}:${encodeURIComponent(pass)}@${host}:${port}/${db}`
+}
🧹 Nitpick comments (4)
server/config.ts (1)

223-224: Use configured vespaPort instead of hardcoded 8080.

-  vespaEndpoint: `http://${vespaBaseHost}:8080`,
+  vespaEndpoint: `http://${vespaBaseHost}:${vespaPort}`,
server/db/knowledgeBase.ts (2)

832-949: Parent status marked COMPLETED even when some children FAILED.

This can mislead UIs/automations. Prefer FAILED if any child failed; COMPLETED only when all completed.

Apply:

-      await trx
-        .update(collections)
-        .set({
-          uploadStatus: UploadStatus.COMPLETED,
-          statusMessage: `Successfully processed collection: ${completedCount} completed, ${failedCount} failed`,
-          updatedAt: sql`NOW()`,
-        })
+      await trx
+        .update(collections)
+        .set({
+          uploadStatus: failedCount > 0 ? UploadStatus.FAILED : UploadStatus.COMPLETED,
+          statusMessage: `Processed collection: ${completedCount} completed, ${failedCount} failed`,
+          updatedAt: sql`NOW()`,
+        })
         .where(eq(collections.id, parentId))

And for folders:

-      await trx
-        .update(collectionItems)
-        .set({
-          uploadStatus: UploadStatus.COMPLETED,
-          statusMessage: `Successfully processed folder: ${completedCount} completed, ${failedCount} failed`,
-          updatedAt: sql`NOW()`,
-        })
+      await trx
+        .update(collectionItems)
+        .set({
+          uploadStatus: failedCount > 0 ? UploadStatus.FAILED : UploadStatus.COMPLETED,
+          statusMessage: `Processed folder: ${completedCount} completed, ${failedCount} failed`,
+          updatedAt: sql`NOW()`,
+        })
         .where(eq(collectionItems.id, parentId))

If “partial success” needs its own state, introduce UploadStatus.PARTIAL.


842-852: Reduce DB round‑trips: compute counts in SQL.

Instead of fetching every child row and counting in JS, select total/completed/failed via SUM(CASE …) in a single query for each parent.

Example pattern (Postgres + Drizzle):

const [agg] = await trx
  .select({
    total: sql<number>`count(*)`,
    completed: sql<number>`sum(case when ${collectionItems.uploadStatus} = ${UploadStatus.COMPLETED} then 1 else 0 end)`,
    failed: sql<number>`sum(case when ${collectionItems.uploadStatus} = ${UploadStatus.FAILED} then 1 else 0 end)`,
  })
  .from(collectionItems)
  .where(and(/* same filters as current */))

Then use agg.total/agg.completed/agg.failed for decisions.

Also applies to: 887-896

server/server.ts (1)

1118-1118: Add body validation for poll-status.

Guard against empty/huge payloads and bad IDs with Zod on the route (consistent with other endpoints).

Apply:

-  .post("/cl/poll-status", PollCollectionsStatusApi)
+  // Validate collectionIds payload
+  .post(
+    "/cl/poll-status",
+    zValidator(
+      "json",
+      z.object({
+        collectionIds: z.array(z.string().uuid()).min(1).max(100),
+      }),
+    ),
+    PollCollectionsStatusApi,
+  )
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d654815 and 9ad4028.

📒 Files selected for processing (6)
  • server/api/knowledgeBase.ts (1 hunks)
  • server/config.ts (1 hunks)
  • server/db/knowledgeBase.ts (2 hunks)
  • server/queue/fileProcessor.ts (7 hunks)
  • server/server.ts (2 hunks)
  • server/worker.ts (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (5)
server/api/knowledgeBase.ts (4)
server/db/schema/users.ts (1)
  • users (27-57)
server/db/user.ts (1)
  • getUserByEmail (147-156)
server/db/schema/knowledgeBase.ts (1)
  • collectionItems (69-140)
server/utils.ts (1)
  • getErrorMessage (103-106)
server/worker.ts (1)
server/queue/fileProcessor.ts (2)
  • processJob (100-122)
  • ProcessingJob (32-35)
server/server.ts (1)
server/api/knowledgeBase.ts (1)
  • PollCollectionsStatusApi (1649-1700)
server/queue/fileProcessor.ts (1)
server/db/knowledgeBase.ts (1)
  • updateParentStatus (833-949)
server/db/knowledgeBase.ts (2)
server/types.ts (1)
  • TxnOrClient (316-316)
server/db/schema/knowledgeBase.ts (2)
  • collectionItems (69-140)
  • collections (20-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (1)
server/db/knowledgeBase.ts (1)

832-913: Consider refactoring to reduce duplication.

As noted in a previous review, the logic for handling collections vs folders is duplicated (fetching children, counting statuses, updating parent). Extracting common logic would improve maintainability.

Based on learnings: A past review suggested abstracting the common pattern for both branches. While not blocking, this refactor would make future changes (e.g., adding retry logic or additional status types) easier to maintain consistently across both code paths.

🧹 Nitpick comments (1)
server/db/knowledgeBase.ts (1)

895-910: Clarify recursive propagation logic.

Line 907 uses folder.parentId || folder.collectionId to determine the next parent, but both values may be non-null simultaneously. If a folder has a parent folder, folder.parentId will be truthy and folder.collectionId is ignored, which is correct. However, the fallback relies on JavaScript's || short-circuit evaluation, which is implicit.

The logic is correct but would benefit from an explicit comment or restructuring for clarity.

Consider making the intent explicit:

       if (folder) {
-        // Recursively update parent (either another folder or the collection)
-        await updateParentStatus(
-          trx,
-          folder.parentId || folder.collectionId,
-          !folder.parentId, // isCollection = true if no parentId
-        )
+        // Recursively update parent folder or fall back to collection
+        if (folder.parentId) {
+          await updateParentStatus(trx, folder.parentId, false)
+        } else {
+          await updateParentStatus(trx, folder.collectionId, true)
+        }
       }
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9ad4028 and ee61d83.

📒 Files selected for processing (1)
  • server/db/knowledgeBase.ts (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
server/db/knowledgeBase.ts (2)
server/types.ts (1)
  • TxnOrClient (316-316)
server/db/schema/knowledgeBase.ts (2)
  • collectionItems (69-140)
  • collections (20-66)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
server/db/knowledgeBase.ts (2)

942-942: Replace sql\true`withundefined` to omit the condition.

Drizzle ORM's and() function will omit undefined conditions from the query, which is cleaner than generating a literal true in SQL.

Apply this diff:

       and(
         isCollection
           ? eq(collectionItems.collectionId, parentId)
           : eq(collectionItems.parentId, parentId),
-        isCollection ? isNull(collectionItems.parentId) : sql`true`,
+        isCollection ? isNull(collectionItems.parentId) : undefined,
         isNull(collectionItems.deletedAt),
       ),

966-1005: Race condition: Multiple concurrent completions may cause inconsistent parent status.

When multiple child files complete simultaneously, each triggers updateParentStatus. These concurrent calls can race, leading to stale statusMessage counts or redundant recursive updates.

Consider using PostgreSQL advisory locks (e.g., pg_advisory_xact_lock) keyed on the parentId to serialize status checks per parent, or add idempotency checks to skip recursive calls if the parent is already COMPLETED.

🧹 Nitpick comments (1)
server/api/knowledgeBase.ts (1)

1650-1689: Access control is enforced at the DB layer.

The handler correctly passes user.id to getCollectionItemsStatusByCollections, which filters items to only those from collections owned by the user. Unauthorized collection IDs are silently excluded from the result rather than raising 403, which is acceptable for a polling endpoint.

Consider adding a clarifying comment at line 1669 to document this silent filtering behavior for future maintainers.

Apply this diff to add a clarifying comment:

+    // Note: The DB helper filters items by userId, so unauthorized collection IDs
+    // are silently excluded from the result (no 403 error).
     // Fetch items only for collections owned by the user (enforced in DB function)
     const items = await getCollectionItemsStatusByCollections(
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8048c31 and 840cc3f.

📒 Files selected for processing (2)
  • server/api/knowledgeBase.ts (2 hunks)
  • server/db/knowledgeBase.ts (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
server/db/knowledgeBase.ts (2)
server/types.ts (1)
  • TxnOrClient (316-316)
server/db/schema/knowledgeBase.ts (2)
  • collectionItems (69-140)
  • collections (20-66)
server/api/knowledgeBase.ts (3)
server/db/user.ts (1)
  • getUserByEmail (147-156)
server/db/knowledgeBase.ts (1)
  • getCollectionItemsStatusByCollections (755-785)
server/utils.ts (1)
  • getErrorMessage (103-106)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (6)
server/api/knowledgeBase.ts (1)

38-39: LGTM!

The new imports support the PollCollectionsStatusApi handler and are correctly sourced from the DB module.

server/db/knowledgeBase.ts (5)

15-15: LGTM!

The import is necessary for the new status management helpers.


450-455: LGTM!

The logic correctly marks the parent (folder or collection) as PROCESSING when a new folder is created, ensuring status propagates up the hierarchy.


539-544: LGTM!

The logic correctly marks the parent (folder or collection) as PROCESSING when a new file is uploaded, ensuring status propagates up the hierarchy.


754-785: LGTM!

The function correctly enforces access control by filtering collections to those owned by the user (line 778), preventing unauthorized access to items in other users' collections.


879-923: LGTM!

The recursive propagation correctly marks all ancestors up to the collection as PROCESSING, ensuring the entire hierarchy reflects ongoing processing.

@shivamashtikar shivamashtikar merged commit 53a9f78 into main Oct 1, 2025
3 of 4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants