Why is the MinIO storage requirement so high for the HNSW index type? #42529
-
So far as I know, we don't have an index type named "HNSW32". The available index types are listed here: https://milvus.io/docs/index.md?tab=floating#In-memory-Index

In cluster mode, minio/kafka/pulsar/etcd are supposed to be distributed services, and the minio service has its own replication machinery. In our milvus-helm deployment, 4 pods are deployed for minio: https://github.com/zilliztech/milvus-helm/blob/1c053e888b855f72583b896b32eb5d2e457938f7/charts/minio/values.yaml#L101

Pulsar/Kafka acts as a WAL component for Milvus. All DML operations (insert/upsert/delete) are first written to Pulsar/Kafka, so once you have inserted 100 GB of data into Milvus, there is also 100 GB of data in Pulsar/Kafka. For 100M vectors of 512 dimensions, the raw size is 100M * 512 * 4 bytes = 200 GB.

Milvus manages data in segments. As you continually insert data, small segments are generated, and Milvus internally triggers compaction to merge them into larger segments; here is an older article about this mechanism: https://milvus.io/blog/2022-2-21-compact.md After compaction, the small segments are marked as "soft-deleted" and wait for garbage collection. GC is triggered at an interval of a few hours, so the small segments still occupy disk space until GC deletes them.

Based on these points, the actual disk usage can be much higher than the original data size, so it is recommended to assign 3x ~ 5x the data size in disk space to avoid unexpected out-of-disk errors. The calculator's magnification ratio is a relatively conservative estimate: it assumes 3-5x storage amplification under normal circumstances, covering logs, raw data, and indexes. Since disk space is cheap nowadays, it is recommended to assign more.
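A minimal sketch of this estimate in Python (the WAL and pre-GC multipliers are illustrative assumptions taken from the explanation above, not constants from Milvus itself):

```python
# Back-of-the-envelope disk estimate for 100M x 512-dim float vectors.
NUM_VECTORS = 100_000_000
DIM = 512
BYTES_PER_FLOAT = 4

raw_gb = NUM_VECTORS * DIM * BYTES_PER_FLOAT / 1e9      # ~204.8 GB of raw vectors

# DML goes through the WAL (Pulsar/Kafka) first, so roughly one more copy lives there.
wal_gb = raw_gb

# Compacted-but-not-yet-GC'd segments can hold up to another transient copy (assumption).
pending_gc_gb = raw_gb

print(f"raw data              : {raw_gb:7.1f} GB")
print(f"WAL copy              : {wal_gb:7.1f} GB")
print(f"pre-GC segments (max) : {pending_gc_gb:7.1f} GB")
print(f"3x rule of thumb      : {3 * raw_gb:7.1f} GB")
print(f"5x rule of thumb      : {5 * raw_gb:7.1f} GB")
```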
-
The recommended sizing is fairly conservative. Remember that MinIO needs at least 2 replicas, and we would recommend leaving a 20-30% buffer. But yes, I believe 4x300 GB of disk space might work for your case as well.
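Working through that guideline with the numbers from this thread (illustrative arithmetic only; the 212.8 GB figure comes from the calculation in the original question below):

```python
# Rough check of the "2 replicas + 20-30% buffer" guideline.
data_plus_index_gb = 212.8                # raw vectors + HNSW index (from the question)
replicated_gb = 2 * data_plus_index_gb    # ~2x overhead from MinIO erasure coding
with_buffer_gb = replicated_gb * 1.3      # upper end of the 20-30% buffer

print(f"after 2x replication : {replicated_gb:.1f} GB")   # 425.6 GB
print(f"with 30% buffer      : {with_buffer_gb:.1f} GB")  # ~553 GB
print(f"4 x 300 GB of disk   : {4 * 300} GB raw, ~600 GB usable at 2x overhead")
```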
-
@yhmo: Sorry, I meant HNSW with M=32.
Well, I'm choosing a simple HNSW index type with M=32. According to the Faiss documentation, the index overhead per vector is … According to the Milvus documentation, the index overhead per vector is …
@xiaofan-luan: Can you please explain again why you think this is conservative? Let me present the sizing calculations again:

- Raw data size: 100M vectors of 512 dimensions = 200 GB
- HNSW index size (M=32): 100M * 32 * 4 bytes = 12.8 GB
- So, total size = 212.8 GB

Even with a MinIO replication factor of 2, the total size would be 2 x 212.8 = 425.6 GB, whereas the recommended size is 4 x 440 = 1760 GB. Please let me know if I'm making any calculation mistakes. Thanks!
-
Hi,
According to the Milvus sizing tool, I would need 4x440 GB of MinIO storage for ingesting 100M vectors of 512 dimensions with an HNSW32 index built.
The MinIO storage requirement does not align with the size of raw and index data.
Each vector is 512*4 = 2KB in size. So 100M vectors would be 200GB in size.
According to this article, the index file size for an HNSW32 index should be 100M * 32 * 4 = 12.8 GB.
So, overall, the raw data + index file size is 212.8 GB.
With 4 pods of MinIO, erasure coding of 2 essentially means a replication factor of 2, so the total size would be 2 x 212.8 GB = 425.6 GB. However, MinIO asks for 4x440 GB.
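For reference, here is the same per-vector arithmetic as a small script (the only assumption is that the HNSW graph costs roughly M * 4 bytes of links per vector, as in the 100M * 32 * 4 estimate above):

```python
num_vectors = 100_000_000
dim = 512
M = 32  # HNSW graph degree

raw_bytes = num_vectors * dim * 4    # 2048 bytes (~2 KB) per vector
graph_bytes = num_vectors * M * 4    # graph links, ~128 bytes per vector

print(f"raw data  : {raw_bytes / 1e9:6.1f} GB")    # 204.8 GB (rounded to 200 GB above)
print(f"HNSW graph: {graph_bytes / 1e9:6.1f} GB")  # 12.8 GB
print(f"total     : {(raw_bytes + graph_bytes) / 1e9:6.1f} GB")
```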
Why are MinIO requirements so high?
Thanks!