I am experimenting with using Ibis in some ETL pipelines where I want to reduce memory usage as much as possible by using Arrow data structures and libraries.
For context, what I am actually trying to do is incrementally process "append blobs" in Azure that contain security logs as JSON objects, perform transformations, and append the resulting events to a Delta table.
The reason why it is so convoluted will become clear below.
import pyarrow as pa

# Download a chunk of the blob directly into a pre-allocated Arrow buffer.
buffer = pa.allocate_buffer(length)
output = pa.output_stream(buffer)
download = client.download_blob(
    offset=current_offset,
    length=current_chunk_size,
    progress_hook=progress_callback,
)
bytes_read = download.readinto(output)
jsonl_stream, updated_offset = get_jsonl_stream(input_stream)
# I do this here because duckdb.read_json accepts a file-like buffer,
# while Ibis only accepts file paths.
table_name = ibis.util.gen_name("read_json")
con = duckdb.connect()
con.read_json(
    jsonl_stream,
    format="newline_delimited",
    ignore_errors=True,
    columns=from_pyarrow(schema),
).create(table_name)
events = (
    ibis.duckdb.from_connection(con)
    .table(table_name)
    .select(
        _event_uuid=uuid(),
        _ingested_at=_.time,
        _processed_at=ibis.now(),
        properties=_.properties,
    )
    .unpack("properties")
    .mutate(p_date=ibis.date(_.Timestamp))
)
events.to_delta(
    table_uri,
    mode="append",
    engine="rust",
    commit_properties=_commit_properties,
    storage_options=storage_options,
    partition_by=["p_date"],
    schema_mode="merge",
)
However, it is not clear to me how to properly pass Arrow tables between Ibis and DuckDB so that copying is avoided. In fact, any pointers on how to correctly instrument memory usage in Python programs that use Arrow would be super helpful!
OK, so to illustrate the issue I have, I will provide a simplified example:
Let's say I want to parse a JSONL file and then use DuckDB and Ibis to perform some transformations before writing to Parquet.
If I now call ibis.memtable(events), what I get is an InMemoryTable, but one that is using the PyArrow table backend (?). When using the DuckDB package directly, you can do this:
but the only way I have managed to do the same with DuckDB and Ibis is to set up the connection first and then call register to create a view; at that point we have the Ibis table 🌟
However, what if I want to go back to Arrow again? If I call to_pyarrow(), it seems to perform a copy? Whereas if I use DuckDB directly, it uses less memory (???)
Is there a better way to achieve this?