Skip to content

refactor: TableScan file plan generation now implemented purely in streams rather than channels #1486

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

sdd
Copy link
Contributor

@sdd sdd commented Jul 4, 2025

The current implementation of scan planning leaves a lot to be desired. The channel-based approach is hard-to-follow, error-prone, and does not properly support backpressure.

Not only that, but in real-life usage I've been experiencing intermittent deadlock with it, and haven't been able to track down the cause.

There's a source of deadlock present that I have spotted in the existing implementation too. fetch_manifest_and_stream_manifest_entries attempts to push to bounded channels, and so will block if the channel it is pushing to is full. If the head-of-line data file context is awaiting on the DeleteFileIndex but the delete files that it depends upon are in a manifest that is yet to enter the delete file manifest context channel, then the pipeline is in deadlock.

@Xuanwo's refactor of the Arrow reader showed that a stream-based approach can be more elegant, address the lack of backpressure, and be more reliable. This refactor brings those same advantages to the plan phase.

@sdd sdd force-pushed the refactor-scan-plan branch 2 times, most recently from 45099ba to 9946641 Compare July 4, 2025 08:05
@sdd sdd force-pushed the refactor-scan-plan branch from 9946641 to b1a4447 Compare July 4, 2025 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant