
Potential memory issue when using COPY with PARTITIONED BY #11042

@hveiga

Describe the bug

Memory does not get freed after executing multiple COPY ... TO ... PARTITIONED BY ... queries. I have not been able to identify what is causing this behavior.

To Reproduce

The behavior can be observed using datafusion-cli. I have been monitoring the memory usage through Activity Monitor.

  1. Download test parquet file (120MB): https://file.io/eKiHwu4waHVN
  2. Run datafusion-cli
  3. Create an external table:
CREATE EXTERNAL TABLE my_table
        (
            col1 VARCHAR NOT NULL,
            timestamp BIGINT NOT NULL,
            col2 VARCHAR NOT NULL,
            col3 VARCHAR NOT NULL,
            col4 VARCHAR NOT NULL,
            col5 VARCHAR NOT NULL,
            col6 VARCHAR NOT NULL,
            col7 VARCHAR NOT NULL,
            col8 VARCHAR NOT NULL,
            col9 VARCHAR NOT NULL,
            col10 VARCHAR NOT NULL,
            col11 VARCHAR NOT NULL,
            col12 DOUBLE
        )
        WITH ORDER (col1 ASC, timestamp ASC) STORED AS PARQUET LOCATION 'test_file.parquet';
  4. Execute COPY ... PARTITIONED BY query:
COPY (SELECT col1, timestamp, col10, col12 FROM my_table ORDER BY col1 ASC, timestamp ASC)
TO './output' STORED AS PARQUET PARTITIONED BY (col1) OPTIONS (compression 'uncompressed');
  5. Monitor memory usage.
  6. Repeat execution of the COPY ... PARTITIONED BY query and continue monitoring memory usage.
  7. Observation: memory does not get released.
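
The steps above could also be scripted instead of watched by hand in Activity Monitor. The following is only a rough sketch, not something from the original report: it assumes `datafusion-cli` is on the PATH and accepts `-f <file>` to run a SQL file in a single session, that `test_file.parquet` is in the working directory, and that `ps -o rss= -p <pid>` reports resident memory in KiB (true on macOS and Linux as far as I know). The helper names `build_script`, `rss_kib` and `run_and_sample` are made up for illustration.

```python
import shutil
import subprocess
import tempfile
import time

CLI = "datafusion-cli"  # assumption: the binary is on PATH

# Statements copied from the reproduction steps.
CREATE_TABLE = """CREATE EXTERNAL TABLE my_table
(
    col1 VARCHAR NOT NULL,
    timestamp BIGINT NOT NULL,
    col2 VARCHAR NOT NULL,
    col3 VARCHAR NOT NULL,
    col4 VARCHAR NOT NULL,
    col5 VARCHAR NOT NULL,
    col6 VARCHAR NOT NULL,
    col7 VARCHAR NOT NULL,
    col8 VARCHAR NOT NULL,
    col9 VARCHAR NOT NULL,
    col10 VARCHAR NOT NULL,
    col11 VARCHAR NOT NULL,
    col12 DOUBLE
)
WITH ORDER (col1 ASC, timestamp ASC) STORED AS PARQUET LOCATION 'test_file.parquet';"""

COPY_QUERY = """COPY (SELECT col1, timestamp, col10, col12 FROM my_table ORDER BY col1 ASC, timestamp ASC)
TO './output' STORED AS PARQUET PARTITIONED BY (col1) OPTIONS (compression 'uncompressed');"""


def build_script(setup_sql, copy_sql, repeats):
    """Return one SQL script: the setup statement, then the COPY repeated."""
    return "\n".join([setup_sql.strip()] + [copy_sql.strip()] * repeats) + "\n"


def rss_kib(pid):
    """Resident set size of `pid` in KiB via `ps`, or None if it has exited."""
    out = subprocess.run(
        ["ps", "-o", "rss=", "-p", str(pid)], capture_output=True, text=True
    ).stdout.strip()
    return int(out) if out else None


def run_and_sample(sql_path, interval_s=1.0):
    """Run the script in a single CLI session and sample its RSS until it exits."""
    proc = subprocess.Popen([CLI, "-f", sql_path], stdout=subprocess.DEVNULL)
    samples = []
    while proc.poll() is None:
        rss = rss_kib(proc.pid)
        if rss is not None:
            samples.append(rss)
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    if shutil.which(CLI) is None:
        print(f"{CLI} not found on PATH; nothing to measure")
    else:
        with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as f:
            f.write(build_script(CREATE_TABLE, COPY_QUERY, repeats=5))
        for i, rss in enumerate(run_and_sample(f.name)):
            print(f"t={i}s rss={rss / 1024:.0f} MiB")
```

All COPY statements run in the same process, so if memory were released between queries the samples should level off rather than climb with every repetition.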

Expected behavior

I expect to be able to run the COPY command multiple times without memory usage increasing each time.

Additional context

More context on what I am trying to do is in Discord: https://discord.com/channels/885562378132000778/1166447479609376850/1253419900043526236

I am also experiencing the same behavior when running my application in Kubernetes. K8s terminates my pod once it exceeds the pod memory limits:

[Image: Screenshot 2024-06-20 at 8.46.20 PM]

Labels: bug
