Introduction
This ticket is a weekly-ish summary of interesting things happening in DataFusion. Note this is not a complete list (it is what I remember / can find). Please leave comments on this ticket about things that I may have missed or you think should get wider attention by the community. Follow on to #13760
Loosely inspired by https://this-week-in-rust.org/
Reminder, find new content (and please post some!) to
Community Highlights
- @jonahgao became a committer: https://lists.apache.org/thread/tmmg8g1lfxwkbsbxlyfl6n29h5bn4hky
- We had a meetup in Chicago 2024 Dec 18: https://lu.ma/eq5myc5i @adriangb @timsaucer
Performance
DataFusion's core value proposition is great performance without having to re-implement it yourself
- @XiangpengHao is working to complete the last parquet optimization: Experimental parquet decoder with first-class selection pushdown support arrow-rs#6921
- @adriangb and @korowa improved predicate pruning: fix: pruning by bloom filters for dictionary columns #13768 and replace CASE expressions in predicate pruning with boolean algebra #13795
- @adriangb also implemented very clever pruning for `LIKE` predicates: Implement predicate pruning for `like` expressions (prefix matching) #12978
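As a rough illustration of the idea behind prefix-based pruning (this is a sketch, not the code from #12978): a predicate like `col LIKE 'foo%'` can only match values in the range from `foo` up to, but not including, `fop`. A container (file, row group, page) whose min/max statistics fall entirely outside that range can be skipped. The function names below are hypothetical, and the sketch assumes a plain prefix pattern with no escape characters or interior wildcards.

```rust
/// Hypothetical helper: returns the smallest string that sorts after every
/// string starting with `prefix`, or `None` if no such bound can be built
/// (for example, an empty prefix).
/// Assumption: only ASCII bytes are incremented; trailing non-ASCII bytes are
/// dropped, which yields a looser but still valid upper bound.
fn prefix_upper_bound(prefix: &str) -> Option<String> {
    let mut bytes = prefix.as_bytes().to_vec();
    while let Some(&last) = bytes.last() {
        if last < 0x7F {
            // Bump the last byte: "foo" becomes "fop".
            *bytes.last_mut().unwrap() = last + 1;
            return String::from_utf8(bytes).ok();
        }
        // Cannot increment this byte safely; shorten the prefix instead.
        bytes.pop();
    }
    None
}

/// Hypothetical helper: returns `true` if a container with the given min/max
/// statistics might hold a value matching `col LIKE '<prefix>%'`.
/// A `false` result means the container can be skipped.
fn may_contain_prefix(min: &str, max: &str, prefix: &str) -> bool {
    // Every match is >= prefix, so a max below the prefix rules it out.
    if max < prefix {
        return false;
    }
    // Every match is < the upper bound, so a min at or above it rules it out.
    match prefix_upper_bound(prefix) {
        Some(upper) => min < upper.as_str(),
        None => true, // No usable bound: stay conservative and keep it.
    }
}
```

The check is deliberately conservative: it may keep a container that has no matching rows, but it never discards one that does.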
Materialized Views
@suremarc is cranking along with a materialized view implementation 🚀. See the https://github.com/datafusion-contrib/datafusion-materialized-views repo and PRs like
Features
- `SHOW FUNCTIONS` thanks to @goldmedal: Implement `SHOW FUNCTIONS` #13799
- @dhegberg did some great engineering and stuck with several PRs to get null regexes for CSV: Support 'NULL' as Null in csv parser. #13228
- @berkaysynnada and @ozankabak are looking into improving statistics representations Introduce a way to represent constrained statistics / bounds on values in Statistics #8078
- @alamb is trying to make it easier to write DataFusion based optimizers: Make it easier to make optimizers: Move join input swapping and related methods into PhysicalOperators #13910
- @zhuqi-lucas made sorted dataframe writing easier: Support (order by / sort) for DataFrameWriteOptions #13874
- @UBarney improved `generate_series`: Support 1 or 3 arg in generate_series() UDTF #13856
- @tlm365 made `initcap` work with unicode as well as ascii: Support unicode character for `initcap` function #13752
- @wiedld better supported schemas when writing parquet in parallel: Supporting writing schema metadata when writing Parquet in parallel #13866
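To make the `initcap` item concrete, here is a minimal sketch of a Unicode-aware `initcap` using only the Rust standard library. It is not the implementation from #13752. The function name is hypothetical, and it assumes the usual SQL semantics where a word is a run of alphanumeric characters and any other character acts as a separator.

```rust
/// Hypothetical sketch: uppercase the first character of each word and
/// lowercase the rest, using Unicode-aware character classification.
/// Assumption: words are runs of alphanumeric characters.
fn initcap_unicode(input: &str) -> String {
    let mut out = String::with_capacity(input.len());
    let mut start_of_word = true;
    for c in input.chars() {
        if c.is_alphanumeric() {
            if start_of_word {
                // `to_uppercase` yields an iterator because one character
                // may map to several characters in Unicode.
                out.extend(c.to_uppercase());
            } else {
                out.extend(c.to_lowercase());
            }
            start_of_word = false;
        } else {
            // Separators are copied through and start a new word.
            out.push(c);
            start_of_word = true;
        }
    }
    out
}
```

An ASCII-only version can work byte by byte, but it would leave a leading `é` unchanged. Iterating over `char`s handles both cases at some cost in speed.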
Substrait!
The substrait implementation is getting some significant upgrades thanks to @Blizzara @robtandy and @vbarua. For example:
- Minor: fix: Include FetchRel when producing LogicalPlan from Sort #13862
- Nice code reorg: feat(substrait): modular substrait consumer #13803
BTW it would be great if someone could fix the docs:
Sort improvements
- @gokselk Preserve constant values across union operations #13805 and Support n-ary monotonic functions in ordering equivalence #13841
- @jayzhan-synnada with Replace `execution_mode` with `emission_type` and `boundedness` #13823
Quality
- Faster CI: @Omega359 made some major quality of life improvements to CI. Thanks to his work we are now at around 25 minutes for each run: chore: temporarily disable Windows Rust flow #13833 and ci improvements, update protoc #13876
- Invariants: @wiedld added code to explicitly check logical plan invariants: Introduce LogicalPlan invariants, begin automatically checking them #13651
- @rluvaton added a dev container: chore: Create devcontainer.json #13520
`sqlite` test suite
Also, @Omega359 just landed a major project to start running the sqlite test suite #13936 which is huge
Documentation
- @zhuqi-lucas @takaebato @zjregee and others helped make the examples more accessible: Consolidate Example: dataframe_output.rs into dataframe.rs #13877, Consolidate example: simplify_udaf_expression.rs into advanced_udaf.rs #13905, Consolidate example dataframe_subquery.rs into dataframe.rs #13950
- Huge thanks to @Chen-Yuan-Lai @comphead and @delamarch3 for migrating function docs to macros, like doc-gen: migrate scalar functions (string) documentation 3/4 #13926, doc-gen: migrate scalar functions (datetime) documentation 1/2 #13920, Support multiple alternative syntaxes in the `user_doc` macro, port `trim` to use new macro #13952
Bugs / testing
DataFusion is in the "we are finding all the corner case bugs now" phase of its life and people
Huge shout out to everyone who helped test the `44.0.0` release, especially @andygrove @timsaucer and @shehabgamin from sail #13855
- @zhuqi-lucas fixed limit + pushdown in some cases fix: Limit together with pushdown_filters #13788
- @kylebarron fixed ScalarValue for Dense union: Fix `ScalarValue::to_array_of_size` for DenseUnion #13797
- @berkaysynnada helped improve test coverage with [minor] add missing slt tests for count(partitioned,aggregated, aggregated cube) #13790
- @findepi with array functions: Fix get_type for higher-order array functions #13756
- @cht42 fixed empty rows: Handle empty rows for `array_distinct` #13810
- @Blizzara ignoring empty files: fix: Ignore empty files in ListingTable when listing files with or without partition filters, as well as when inferring schema #13750
- @wiedld and @jdockerty hardening the code with Handle possible overflows in StringArrayBuilder / LargeStringArrayBuilder #13802 and fix: add `null_buffer` length check to `StringArrayBuilder` / `LargeStringArrayBuilder` #13758
- @blaginin fixed value normalization: Add configurable normalization for configuration options and preserve case for S3 paths #13576
- @richox: fix case_column_or_null with nullable when conditions #13886
- @Spaarsh Downloading IMDB dataset for benchmarks gives 404 Not Found #13896 ❤️
Easier to use remote / `async` catalogs
- @westonpace is working on making it easier to implement async catalogs: feat: add `AsyncCatalogProvider` helpers for asynchronous catalogs #13800
- I added an example showing how to do so: Add example of interacting with a remote catalog #13722
Releases
- 📣 Looking for help organizing an arrow release: Release arrow-rs / parquet minor version 53.4.0 (Jan 2025) arrow-rs-object-store#30
- DataFusion 44.0.0 was released: Release DataFusion `44.0.0` #13334 🎉
- sqlparser released: Release sqlparser-rs version `0.53.0` / sqlparser_derive `0.3.0` datafusion-sqlparser-rs#1517
- arrow-rs major release: Release arrow-rs / parquet major version `54.0.0` (December 2024) arrow-rs#6342
- DataFusion python release: chore: Upgrade to DataFusion 44 datafusion-python#972
- Release arrow-rs / parquet minor version 54.1.0 (Jan 2025) arrow-rs-object-store#27
Looking to get more involved? Please help review code! 🎣
DataFusion has a long history of community members contributing in all aspects of the project. Reviewing PRs is an especially great way to get introduced to the project, help the community and grow your own knowledge -- researching and understanding the code enough to review PRs also often inspires additional ideas for improvements.
We have docs about reviews. TLDR is: look for test coverage, check whether the change is understandable and well documented, and consider whether the code can be improved. When you think the PR looks good to merge, try `@` mentioning one of the committers.
Help wanted
- I would love to see the community offer additional help testing and triaging bugs, helping to make DataFusion a more stable foundation for building systems
Please feel free to leave your own comments on this ticket if you are looking for help
Community
- Weekly Call
- Slack/Discord: info links