@danielbeach

Add a new method called read_deltalake(). It opens the Delta Lake table and calls dt.files(), which returns the list of parquet files for that table, so we can reuse ParquetDataSet.
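For readers who want to see the shape of the idea, here is a minimal sketch, not the code from this PR. The helper name `delta_parquet_files` and the `open_table` parameter are hypothetical and exist only to keep the example testable without a real table. It assumes the `deltalake` package's `DeltaTable`, and it prefers `file_uris()` (absolute URIs) when the installed version has it, falling back to the `files()` call mentioned above.

```python
from typing import Callable, List, Optional


def delta_parquet_files(table_uri: str,
                        open_table: Optional[Callable] = None) -> List[str]:
    """Return the parquet files in the current snapshot of a Delta table.

    `open_table` is a hypothetical hook: any callable that takes the table
    URI and returns an object with `file_uris()` or `files()`.
    """
    if open_table is None:
        # Imported lazily so the dependency is only needed for Delta tables.
        from deltalake import DeltaTable
        open_table = DeltaTable

    dt = open_table(table_uri)

    # Assumption: file_uris() yields absolute URIs, while files() yields
    # paths relative to the table root. Use whichever is available.
    if hasattr(dt, "file_uris"):
        return list(dt.file_uris())
    return list(dt.files())
```

The resulting list could then be handed to whatever already builds a ParquetDataSet from explicit paths; how that hand-off looks depends on this project's API and is not shown here.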

@wangrunji0408
Collaborator

As far as I know, Delta Lake is not simply a list of parquet files, right? I think we need a dedicated DeltaLakeDataset to handle them properly.
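To illustrate why the table is more than its parquet files, the sketch below replays the JSON commits in `_delta_log` to work out which files are still live. It is a simplified illustration under stated assumptions: it reads only plain numbered JSON commits on a local filesystem and ignores checkpoints, deletion vectors, and protocol features, so it is not a substitute for a real Delta reader. The function name `active_files_from_log` is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def active_files_from_log(table_path: str) -> List[str]:
    """Replay add/remove actions from a Delta table's JSON commit files."""
    log_dir = Path(table_path) / "_delta_log"
    active = set()

    # Commit file names are zero-padded version numbers, so sorting the
    # names lexically also sorts them by version.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())

    for commit in commits:
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                # A removed file may still exist on disk but is no longer
                # part of the table.
                active.discard(action["remove"]["path"])

    return sorted(active)
```

A directory listing of `*.parquet` would include files that later commits removed, which is why the file list has to come from the log (or from a library that reads it) rather than from the filesystem.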

@danielbeach
Author

> As far as I know, Delta Lake is not simply a list of parquet files, right? I think we need a dedicated DeltaLakeDataset to handle them properly.

Yes, but we are using the deltalake package, specifically a DeltaTable instance, whose files() method returns the list of parquet files that make up the current version of the Delta Lake table. In theory we could add another dependency such as polars or daft to read the Delta Lake table as a DataFrame and then dump it to an Arrow table, but that seems more indirect.
