refactor: TableScan file plan generation now implemented purely in streams rather than channels #1486
+190
−322
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The current implementation of scan planning leaves a lot to be desired. The channel-based approach is hard-to-follow, error-prone, and does not properly support backpressure.
Not only that, but in real-life usage I've been experiencing intermittent deadlock with it, and haven't been able to track down the cause.
There's a source of deadlock present that I have spotted in the existing implementation too.
fetch_manifest_and_stream_manifest_entries
attempts to push to bounded channels, and so will block if the channel it is pushing to is full. If the head-of-line data file context is awaiting on the DeleteFileIndex but the delete files that it depends upon are in a manifest that is yet to enter the delete file manifest context channel, then the pipeline is in deadlock.@Xuanwo's refactor of the Arrow reader showed that a stream-based approach can be more elegant, address the lack of backpressure, and be more reliable. This refactor brings those same advantages to the plan phase.