Some questions about RemoteBulkInsert #41263
Replies: 1 comment 1 reply
-
RemoteBulkWriter is a handy tool in pymilvus: it generates data files that can be imported into Milvus through the bulk_import() interface.
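For illustration, here is a minimal sketch of how the writer is typically used; the endpoint, credentials, bucket name, and schema fields below are placeholder assumptions, not values from this thread:

```python
from pymilvus import CollectionSchema, FieldSchema, DataType
from pymilvus.bulk_writer import RemoteBulkWriter, BulkFileType

# Assumed example schema: an INT64 primary key plus a 768-dim float vector.
schema = CollectionSchema(fields=[
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=768),
])

# Connection parameters for the MinIO/S3 bucket that Milvus can read from.
conn = RemoteBulkWriter.S3ConnectParam(
    endpoint="localhost:9000",   # placeholder MinIO endpoint
    access_key="minioadmin",     # placeholder credentials
    secret_key="minioadmin",
    bucket_name="a-bucket",      # placeholder bucket
)

writer = RemoteBulkWriter(
    schema=schema,
    remote_path="bulk_data",     # prefix inside the bucket
    connect_param=conn,
    file_type=BulkFileType.PARQUET,
)

# Append rows; the writer flushes them into one or more remote data files.
for i in range(10000):
    writer.append_row({"id": i, "vector": [0.0] * 768})
writer.commit()

# Each element of batch_files is a list of file paths forming one import task.
print(writer.batch_files)
```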
bulk_import() is a wrapper around this RESTful API: https://milvus.io/api-reference/restful/v2.5.x/v2/Import%20(v2)/Create.md

Note: after the entire import process is done, the data files generated by RemoteBulkWriter are not deleted; you can delete them manually yourself.

Once bulk_import() is called, each data file becomes an import task, and Milvus automatically assigns each new import task to an idle data node. More data nodes can process more import tasks in parallel. If you have only one data file to import, one data node is enough. If you have 1000 data files to import, should you create 1000 data nodes? I don't think so; it is a trade-off. Once all data files are imported, the data nodes become idle and waste your resources.
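A hedged sketch of submitting an import job and checking its progress, assuming the cluster is reachable at the URL below and the collection already exists (URL, collection name, and file paths are placeholders):

```python
from pymilvus.bulk_writer import bulk_import, get_import_progress

# Submit one import job; 'files' is a list of file lists, one import task per
# inner list (e.g. the batch_files produced by RemoteBulkWriter above).
resp = bulk_import(
    url="http://localhost:19530",
    collection_name="my_collection",
    files=[["bulk_data/1.parquet"]],
)
job_id = resp.json()["data"]["jobId"]

# Poll the job state; note the source data files in the bucket are NOT
# deleted automatically after the job completes.
progress = get_import_progress(url="http://localhost:19530", job_id=job_id)
print(progress.json())
```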
-
Hi Team! I've deployed a 4-node Milvus 2.5.6 cluster. I want to insert a 200 GB dataset and have some questions about remote bulk import.
1. How will the data be distributed among the 4 MinIO nodes while the bulk writer is appending rows and committing (using the MinIO service started with Milvus)?
2. What is the recommended bulk writer segment_size for inserting 200 GB of data? Does this matter?
3. What does bulk_import do? Will it create a replica of the data already in the MinIO bucket?
4. Should I adjust the sizing-tool config, which includes only one data node (8 CPU + 32 GB)?
Any help or explanation would be much appreciated, thanks! :)