When would a Lance V2 page have more than one buffer? #3679
In the lance v2 doc, it says: https://github.com/lancedb/lance/blob/main/protos/file2.proto#L42
And when I try to go through the code, I find the:
However, I found most of
Thanks!
Replies: 1 comment 5 replies
When performing random access we don't read the entire page. This means we need to know where the different buffers are located so that we can read into them appropriately. For example, with string data, we first read the offsets (first buffer) and then use those offsets to read into the string data (second buffer). In 2.1 we introduce an "initialize" step where we load (and cache) various small metadata (e.g. dictionaries, chunk sizes, etc.), and we need a separate buffer for this metadata.
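To make the two-step read concrete, here is a minimal sketch of random access into a string page with two buffers. Everything in it is an assumption for illustration, not Lance's actual on-disk layout or API: the offsets are taken to be n+1 little-endian u64 values (Arrow-style), the page's buffers are described by hypothetical `(position, size)` pairs, and `build_page`, `read_range`, and `take_string` are made-up helper names.

```python
import io
import struct


def build_page(strings):
    """Build a toy file: buffer 0 holds offsets, buffer 1 holds the string bytes.

    Assumed layout (not Lance's real encoding): n+1 little-endian u64 offsets.
    """
    encoded = [s.encode("utf-8") for s in strings]
    offsets = [0]
    for e in encoded:
        offsets.append(offsets[-1] + len(e))
    offsets_buf = struct.pack(f"<{len(offsets)}Q", *offsets)
    data_buf = b"".join(encoded)
    header = b"\x00" * 16  # stand-in for whatever precedes the page in the file
    file_bytes = header + offsets_buf + data_buf
    # Hypothetical page metadata: one (position, size) pair per buffer.
    buffers = [
        (len(header), len(offsets_buf)),
        (len(header) + len(offsets_buf), len(data_buf)),
    ]
    return file_bytes, buffers


def read_range(f, position, size):
    """Read exactly one byte range, standing in for a single small I/O request."""
    f.seek(position)
    return f.read(size)


def take_string(f, buffers, row):
    """Fetch one row without reading the whole page."""
    offsets_pos, _ = buffers[0]
    data_pos, _ = buffers[1]
    # Step 1: read only the two offsets that bracket this row (16 bytes).
    start, end = struct.unpack("<2Q", read_range(f, offsets_pos + 8 * row, 16))
    # Step 2: use them to read only this row's bytes from the data buffer.
    return read_range(f, data_pos + start, end - start).decode("utf-8")


file_bytes, buffers = build_page(["lance", "", "page", "buffers"])
f = io.BytesIO(file_bytes)
print(take_string(f, buffers, 2))  # page
```

The point of the sketch is that neither read could be issued without knowing where each buffer starts, which is why the page records the buffers separately.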
I think in 2.0 we use multiple buffers for binary / string data. List data is encoded as two different columns, so we only need a single buffer there. In 2.1 things are a bit more complex. Small types are encoded using a "mini block" encoding that is more similar to Parquet (chunks of data with small read amplification). We always have a tiny "block metadata" buffer that tells us the size of each block (2 bytes per block), followed by the data buffer itself (in this case there is only one data buffer, no matter the data type or compression). Large types are encoded using a "full zip" encoding that will have at least one buffer (e.g. this is how vector embeddings are encoded). There will be two buffers if it is a variable-length type (one of them for something called the "repetition index"). In both "mini block" and "full zip" we may have additional buffers for metadata such as dictionaries. In the future I think we might combine all these extra metadata buffers into a single metadata buffer (to reduce the IOPS on initialization), but this isn't too critical.
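As a rough illustration of why the "block metadata" buffer is useful, the sketch below turns per-block sizes into byte ranges within the data buffer. It assumes each 2-byte entry is simply a little-endian u16 byte count; the real 2.1 format may pack more than that into those 2 bytes, and `block_ranges` is a hypothetical helper, not a Lance function.

```python
import struct
from itertools import accumulate


def block_ranges(meta_buf):
    """Convert 2-byte-per-block sizes into (start, end) ranges in the data buffer.

    Assumption: each entry is a plain little-endian u16 size in bytes.
    """
    count = len(meta_buf) // 2
    sizes = struct.unpack(f"<{count}H", meta_buf)
    ends = list(accumulate(sizes))   # running total gives each block's end
    starts = [0] + ends[:-1]         # each block starts where the previous ended
    return list(zip(starts, ends))


# Three toy blocks of different sizes, concatenated into one data buffer.
blocks = [b"a" * 10, b"b" * 300, b"c" * 7]
meta_buf = struct.pack("<3H", *(len(b) for b in blocks))
data_buf = b"".join(blocks)

ranges = block_ranges(meta_buf)
start, end = ranges[1]
# Only the second block needs to be read to serve a row that lives in it.
assert data_buf[start:end] == blocks[1]
```

Because the metadata is tiny (6 bytes for three blocks here), it can be loaded once and cached, after which any single block can be fetched with one ranged read.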