Existing Parquet with uint8 vectors to Lance - size grows 10x? #3705
Unanswered
liquidcarbon
asked this question in
Q&A
Replies: 1 comment 7 replies
-
I imagine this is where it's particularly slow: pa.array([item for vec in table[vector_column].to_pylist() for item in vec], type=pa.uint8()). You are converting from PyArrow array -> Python list -> PyArrow array. I would recommend instead using the
This is surprising to us. We haven't tested much with uint8 vectors, so we will look more into this soon when we have some time.
-
Hi, I'm trying to convert a large Parquet dataset containing 1024-dim uint8 vectors to Lance.
To create an index, Lance wants a FixedSizeListArray, so...
This conversion has two problems: 1) it is very slow, and 2) the resulting files are almost 10x the size of the original ones (files of ~200K rows were ~34MB in Snappy-compressed Parquet and become ~300MB in Lance).
What am I doing wrong?
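As a baseline for the sizes quoted above, it can help to work out the raw, uncompressed payload of one file. This is only arithmetic on the numbers in the question (about 200K rows, 1024 uint8 values per row); it says nothing about how either format encodes the data.

```python
rows = 200_000          # approximate rows per file, from the question
dim = 1024              # vector dimension
bytes_per_value = 1     # uint8

raw_mb = rows * dim * bytes_per_value / 1e6   # raw vector payload in MB
parquet_mb = 34         # reported Snappy Parquet size
lance_mb = 300          # reported Lance size

print(f"raw payload:        {raw_mb:.1f} MB")
print(f"raw / Parquet size: {raw_mb / parquet_mb:.1f}x")
print(f"Lance size / raw:   {lance_mb / raw_mb:.1f}x")
```

On these figures the raw vectors come to roughly 205MB, so the Parquet file is about 6x smaller than the raw data, while the Lance file is about 1.5x larger than it.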