Enhanced Batching and Progress Reporting in SearchIndexingBufferedSender #33469

Open
@farzad528

Description

Is your feature request related to a problem? Please describe.
When using SearchIndexingBufferedSender to upload a large number of documents (such as 1 billion vectors with text), the current implementation doesn't automatically respect the per-batch limits of 64 MB of payload and 32,000 documents. Users have to intervene manually to make sure every document is indexed, which is particularly frustrating for long-running indexing jobs that may run overnight.

Describe the solution you'd like
I propose the following enhancements to the SearchIndexingBufferedSender to make the experience of indexing large datasets more seamless and robust:

  • Automatic Handling of Batch Size: Implement an internal mechanism that calculates the cumulative size of the document payload and the count, and automatically flushes the batch when either limit is approached. This should prevent users from encountering issues related to payload size or document count limits.
  • Customizable Batch Size: Expose a parameter that allows users to specify their desired batch size or payload size limit. This would give users more control over the batching process and could be used to adjust performance based on their specific use case.
  • Progress Reporting: Introduce built-in progress reporting for the upload operation, similar to the tqdm library in Python, which provides a visual progress bar. This feature would be extremely helpful for long-running indexing jobs, giving users a clear indication of the operation's progress.
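
To illustrate the first two bullets, here is a minimal sketch of size-aware and count-aware batching done outside the SDK. It is not how SearchIndexingBufferedSender is implemented: `iter_batches`, `max_docs`, and `max_bytes` are hypothetical names, the default limits are simply the ones quoted in this issue, and the JSON-encoded length is only an approximation of the real request payload (the request envelope adds some overhead, so a safety margin is probably wise).

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Limits as quoted in this issue; treat them as assumptions and verify
# them against the current service documentation.
MAX_BATCH_BYTES = 64 * 1024 * 1024
MAX_BATCH_DOCS = 32_000


def iter_batches(
    documents: Iterable[Dict[str, Any]],
    max_docs: int = MAX_BATCH_DOCS,
    max_bytes: int = MAX_BATCH_BYTES,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield lists of documents that stay within both limits (hypothetical helper)."""
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in documents:
        # Approximate the wire size by the UTF-8 length of the JSON encoding.
        doc_bytes = len(json.dumps(doc).encode("utf-8"))
        if doc_bytes > max_bytes:
            raise ValueError("a single document exceeds the payload limit")
        # Flush before adding a document that would cross either limit.
        if batch and (len(batch) >= max_docs or batch_bytes + doc_bytes > max_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += doc_bytes
    # Emit whatever is left over once the input is exhausted.
    if batch:
        yield batch
```

Each yielded batch could then be handed to whatever upload call is in use. Lowering `max_docs` or `max_bytes` is the kind of tuning knob the second bullet asks for.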

Describe alternatives you've considered
An alternative approach would be to manually chunk the data and implement custom logic for progress reporting and error handling. However, this adds complexity and requires additional development effort, which could be avoided if the SDK provided these capabilities out of the box.
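
For reference, the manual workaround looks roughly like the sketch below: fixed-size chunking, a plain-text progress line on stderr, and collection of failed batches for a later retry. `upload_with_progress` and the `upload_batch` callable are hypothetical stand-ins for whatever upload call is used, and the `batch_size` default is arbitrary.

```python
import sys
from itertools import islice
from typing import Any, Callable, Dict, Iterable, List, Tuple


def upload_with_progress(
    documents: Iterable[Dict[str, Any]],
    upload_batch: Callable[[List[Dict[str, Any]]], None],  # hypothetical uploader
    total: int,
    batch_size: int = 1000,
) -> Tuple[int, List[List[Dict[str, Any]]]]:
    """Upload in fixed-size chunks; return (documents sent, failed batches)."""
    iterator = iter(documents)
    sent = 0
    failed: List[List[Dict[str, Any]]] = []
    while True:
        # Pull the next chunk lazily so the full dataset never sits in memory.
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        try:
            upload_batch(batch)
            sent += len(batch)
        except Exception:  # sketch only; catch narrower exceptions in real code
            failed.append(batch)
        # Report how many documents have been attempted so far.
        done = sent + sum(len(b) for b in failed)
        percent = 100 * done / total if total else 100.0
        print(f"\r{done}/{total} processed ({percent:.1f}%)", end="", file=sys.stderr)
    print(file=sys.stderr)
    return sent, failed
```

Even this small version has to make decisions about retries, partial failures, and progress output, which is the extra effort the request hopes the SDK can absorb.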

Additional context
The goal is to enable users to confidently run a single function to index a massive number of documents without needing to worry about the internal limitations of the service. Enhancing the SearchIndexingBufferedSender with the above features would significantly improve the developer experience and the reliability of the indexing process for large datasets.

Metadata

Assignees

No one assigned

    Labels

      • Client: This issue points to a problem in the data-plane of the library.
      • Search
      • needs-team-attention: Workflow: This issue needs attention from Azure service team or SDK team
      • question: The issue doesn't require a change to the product in order to be resolved. Most issues start as that
