I am experimenting with using Ibis in some ETL pipelines where I want to reduce memory usage as much as possible by using Arrow data structures and libraries.
For context, what I am actually trying to do is incrementally process "append blobs" in Azure that contain security logs as JSON objects, perform transformations, and append the resulting events to a Delta table.
The reason why it is so convoluted will become clear below.
import pyarrow as pa

# Download a chunk of the blob directly into a pre-allocated Arrow buffer.
buffer = pa.allocate_buffer(length)
output = pa.output_stream(buffer)
download = client.download_blob(
    offset=current_offset,
    length=current_chunk_size,
    progress_hook=progress_callback,
)
bytes_read = download.readinto(output)
jsonl_stream, updated_offset = get_jsonl_stream(input_stream)
# I do this here because duckdb.read_json accepts a file-like buffer,
# while Ibis only accepts file paths.
table_name = ibis.util.gen_name("read_json")
con = duckdb.connect()
con.read_json(
    jsonl_stream,
    format="newline_delimited",
    ignore_errors=True,
    columns=from_pyarrow(schema),
).create(table_name)
events = (
    ibis.duckdb.from_connection(con)
    .table(table_name)
    .select(
        _event_uuid=uuid(),
        _ingested_at=_.time,
        _processed_at=ibis.now(),
        properties=_.properties,
    )
    .unpack("properties")
    .mutate(p_date=ibis.date(_.Timestamp))
)
events.to_delta(
    table_uri,
    mode="append",
    engine="rust",
    commit_properties=_commit_properties,
    storage_options=storage_options,
    partition_by=["p_date"],
    schema_mode="merge",
)
However, it is not clear to me how to properly pass Arrow tables between Ibis and DuckDB so that copying is avoided. In fact, any pointers on how to correctly instrument memory usage in Python programs that use Arrow would be super helpful!
OK, so to illustrate the issue I have, I will provide a simplified example:
Let's say I want to parse a JSONL file and then use DuckDB and Ibis to perform some transformations before writing to Parquet.
If I now call ibis.memtable(events), what I get is an InMemoryTable, but one that is using the PyArrow table backend (?). When using the DuckDB package directly, you can do this:
but the only way I have managed to do the same with DuckDB and Ibis is to set up the connection first and then call register to create a view; at that point we have the Ibis table 🌟
However, what if I want to go back to Arrow again? If I call to_pyarrow(), it seems to perform a copy? Whereas if I use DuckDB directly, it uses less memory (???)
Is there a better way to achieve this?