Detect new functions in the Spark module when upgrading the DataFusion version #797

@davidlghellin

Description

We want to automatically detect when new functions appear in the Spark module (or when existing ones are removed or renamed) after a DataFusion version upgrade, so that we can update documentation, tests, and changelogs accordingly.

Currently, each mod.rs typically exposes a pub fn functions() -> Vec<...> returning a vec![ ... ] with registered functions. The idea is to inspect these mod.rs files and list the inner contents of that vec![].

Example in zsh

cd datafusion/spark
for f in **/mod.rs(.N); do
  # Only process files where pub fn functions() contains a non-empty vec![ ... ]
  if rg -nUP 'pub fn functions\(\)[\s\S]*?vec!\[\s*\S' "$f" >/dev/null; then
    echo "---- $f ----"
    rg -nUP 'pub fn functions\(\)[\s\S]*?vec!\[([\s\S]*?\S[\s\S]*?)\]' -or '$1' "$f"
    echo
  fi
done

Output

cd datafusion/spark
for f in **/mod.rs(.N); do
  # Does it have a non-empty vec![ ... ] inside pub fn functions()?
  if rg -nUP 'pub fn functions\(\)[\s\S]*?vec!\[\s*\S' "$f" >/dev/null; then
    echo "---- $f ----"
    # Print only the inner contents of the vec![ ... ]
    rg -nUP 'pub fn functions\(\)[\s\S]*?vec!\[([\s\S]*?\S[\s\S]*?)\]' -or '$1' "$f"
    echo
  fi
done
---- src/function/aggregate/mod.rs ----

---- src/function/array/mod.rs ----
32:array()

---- src/function/bitmap/mod.rs ----
36:bitmap_count()

---- src/function/bitwise/mod.rs ----
39:bit_get(), bit_count()

---- src/function/collection/mod.rs ----

---- src/function/conditional/mod.rs ----

---- src/function/conversion/mod.rs ----

---- src/function/csv/mod.rs ----

---- src/function/datetime/mod.rs ----
59:date_add(), date_sub(), last_day(), next_day()

---- src/function/generator/mod.rs ----

---- src/function/hash/mod.rs ----
39:crc32(), sha1(), sha2()

---- src/function/json/mod.rs ----

---- src/function/lambda/mod.rs ----

---- src/function/map/mod.rs ----

---- src/function/math/mod.rs ----
54:        expm1(),
55:        factorial(),
56:        hex(),
57:        modulus(),
58:        pmod(),
59:        rint(),
60:        

---- src/function/misc/mod.rs ----

---- src/function/predicate/mod.rs ----

---- src/function/string/mod.rs ----
64:ascii(), char(), ilike(), like(), luhn_check()

---- src/function/struct/mod.rs ----

---- src/function/table/mod.rs ----

---- src/function/url/mod.rs ----
32:parse_url()

---- src/function/window/mod.rs ----

---- src/function/xml/mod.rs ----
cat Cargo.toml | grep "Define DataFusion version" -A 1
# Define DataFusion version
version = "49.0.2"

With this list, we can decide which functions to migrate or add.
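The listing above mixes single-line entries (`39:bit_get(), bit_count()`) with multi-line ones (the math module), which makes it awkward to compare between versions. A minimal sketch of a normalization step, assuming the input has the same shape as the output shown above: it strips the `N:` line-number prefix and the `---- file ----` headers, and prints one function name per line. The helper name `normalize_functions` is hypothetical, not something that exists in the repository.

```shell
# Hypothetical helper: read the listing on stdin and print one sorted,
# de-duplicated function name per line.
normalize_functions() {
  # 1. drop the leading "N:" line-number prefix added by rg -n
  # 2. split comma-separated entries onto separate lines
  # 3. remove whitespace and the trailing "()"
  # 4. drop empty lines and the "---- file ----" headers
  sed -e 's/^[0-9]*://' \
    | tr ',' '\n' \
    | sed -e 's/[[:space:]]//g' -e 's/()$//' \
    | grep -v -e '^$' -e '^----' \
    | LC_ALL=C sort -u
}
```

For example, the output of the loop could be saved to a file and piped through `normalize_functions` to produce one snapshot per DataFusion version.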

P.S. Ideally, we should also have a complete overview of all the functions that are already implemented. I noticed there is something like expr::function, which might help us detect them systematically. With that, we could eventually build a bot that runs whenever DataFusion is updated and opens a new issue for every new implementation or refactor we need to handle.
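As a rough sketch of the comparison such a bot would need, the snippet below diffs two snapshots of function names (one name per line), one taken before and one after the version bump. The helper name `diff_functions` and the snapshot files are assumptions for illustration; how the snapshots are produced and how issues get opened is left out.

```shell
# Hypothetical helper: diff_functions OLD NEW
# Prints "+name" for functions only in NEW and "-name" for functions only in OLD.
diff_functions() {
  old_sorted=$(mktemp)
  new_sorted=$(mktemp)
  # comm requires both inputs to be sorted in the same collation order
  LC_ALL=C sort -u "$1" > "$old_sorted"
  LC_ALL=C sort -u "$2" > "$new_sorted"
  # lines unique to NEW: added functions
  LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+/'
  # lines unique to OLD: removed functions
  LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | sed 's/^/-/'
  rm -f "$old_sorted" "$new_sorted"
}
```

A renamed function shows up as one removal plus one addition, so renames would still need a manual look.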
