
Performance issue with Tables.columns fallback functions at 1st run with Tables.Schema = nothing and Base.IteratorSize = SizeUnknown() #273

@mathieu17g


Hello,

I don't know whether anyone implementing the Tables interface and relying on the fallback functions without a Schema has already hit a compilation-time performance issue, but here is a particular case we encountered in ArchGDAL.jl.

While using the Tables.jl interface on ArchGDAL.jl layer objects, @visr ran into a performance issue with the Tables.columns fallback functions on the first run, with Tables.Schema = nothing and Base.IteratorSize = SizeUnknown(). Profiling shows that the first-run bottleneck is in Tables.add_or_widen!, probably in the Tuple-based column allocation machinery.
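
As a rough illustration of why a Tuple-based column store might be expensive to compile, here is a minimal sketch in plain Julia. The helper names `replace_at` and `count_tuple_types` are hypothetical and this is not the Tables.jl source; it only mimics the shape of the `replacex` call visible in the profile. Replacing one column in a tuple of vectors yields a tuple of a different concrete type, so a store with N columns whose types are settled one at a time passes through about N distinct tuple types, each of which the compiler may have to specialize on.

```julia
# Simplified stand-in for a tuple-based column store (not the Tables.jl source).
# Return a copy of tuple `t` with element `i` replaced by `x`.
replace_at(t::Tuple, i::Int, x) = ntuple(j -> j == i ? x : t[j], length(t))

# Count the distinct concrete tuple types seen while the columns of an
# `n`-column store are replaced, one at a time, by a differently typed vector.
function count_tuple_types(n::Int)
    cols = ntuple(_ -> Nothing[], n)           # placeholder columns
    seen = Set{Any}([typeof(cols)])
    for i in 1:n
        cols = replace_at(cols, i, Float64[])  # column i gets its real type
        push!(seen, typeof(cols))
    end
    return length(seen)
end

count_tuple_types(5)   # 6: the initial type plus one new type per replaced column
```

With 295 columns the same walk would visit 296 tuple types, which is at least consistent with the compilation-dominated first run reported here, although this sketch does not prove it is the actual cause.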

There is no problem when detailed types are provided in the Schema. Unfortunately, we cannot reliably determine the tightest schema from a layer's feature definition.

The problem can be reproduced with the sequence below: about 170 s on the first run for a 10-row table with 295 columns.

Download BasinATLAS_v10_lev01.zip and decompress it into ./BasinATLAS_v10_lev01/

julia> import ArchGDAL as AG

julia> using DataFrames

julia> using Profile

julia> @time @profile DataFrame(AG.getlayer(AG.read("BasinATLAS_v10_lev01")))
172.879812 seconds (53.38 M allocations: 2.992 GiB, 0.44% gc time, 99.38% compilation time)
10×295 DataFrame
 Row │                            HYBAS_ID    NEXT_DOWN  NEXT_SINK   MAIN_BAS    DIST_SINK  DIST_MAIN  SUB_AREA   UP_AREA    PFAF_ID  EN 
     │ IGeometr                  Int64       Int64      Int64       Int64       Float64    Float64    Float64    Float64    Int64    In 
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Geometry: wkbMultiPolygon  1010000010          0  1010000010  1010000010        0.0        0.0  2.99531e7  2.99531e7        1     
   2 │ Geometry: wkbMultiPolygon  2010000010          0  2010000010  2010000010        0.0        0.0  1.78589e7  1.78589e7        2
   3 │ Geometry: wkbMultiPolygon  3010000010          0  3010000010  3010000010        0.0        0.0  1.29491e7  1.29491e7        3
   4 │ Geometry: wkbMultiPolygon  4010000010          0  4010000010  4010000010        0.0        0.0  2.08274e7  2.08274e7        4
   5 │ Geometry: wkbMultiPolygon  5010000010          0  5010000010  5010000010        0.0        0.0  1.08038e7  1.08038e7        5     
   6 │ Geometry: wkbMultiPolygon  6010000010          0  6010000010  6010000010        0.0        0.0  1.78535e7  1.78535e7        6
   7 │ Geometry: wkbMultiPolygon  7010000010          0  7010000010  7010000010        0.0        0.0  1.59146e7  1.59146e7        7
   8 │ Geometry: wkbMultiPolygon  8010000010          0  8010000010  8010000010        0.0        0.0  6.19709e6  6.19709e6        8
   9 │ Geometry: wkbMultiPolygon  8010020760          0  8010020760  8010020760        0.0        0.0  1.17002e5  1.17002e5        3     
  10 │ Geometry: wkbMultiPolygon  9010000010          0  9010000010  9010000010        0.0        0.0  2.14672e6  2.14672e6        9
                                                                                                                       285 columns omitted

julia> Profile.print(;mincount=5000)
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
      ╎123273 @Base/client.jl:495; _start()
      ╎ 123273 @Base/client.jl:309; exec_options(opts::Base.JLOptions)
      ╎  123273 @Base/client.jl:379; run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, hi...
      ╎   123273 @Base/essentials.jl:714; invokelatest
      ╎    123273 @Base/essentials.jl:716; #invokelatest#2
      ╎    ╎123273 @Base/client.jl:394; (::Base.var"#930#932"{Bool, Bool, Bool})(REPL::Module)
      ╎    ╎ 123273 ...hare/julia/stdlib/v1.7/REPL/src/REPL.jl:349; run_repl(repl::REPL.AbstractREPL, consumer::Any)
      ╎    ╎  123273 ...are/julia/stdlib/v1.7/REPL/src/REPL.jl:362; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on...
      ╎    ╎   123273 ...are/julia/stdlib/v1.7/REPL/src/REPL.jl:229; start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
      ╎    ╎    123273 ...re/julia/stdlib/v1.7/REPL/src/REPL.jl:244; repl_backend_loop(backend::REPL.REPLBackend)
      ╎    ╎     123273 ...re/julia/stdlib/v1.7/REPL/src/REPL.jl:150; eval_user_input(ast::Any, backend::REPL.REPLBackend)
      ╎    ╎    ╎ 123273 @Base/boot.jl:373; eval
      ╎    ╎    ╎  123273 REPL[4]:0; top-level scope
      ╎    ╎    ╎   123273 @Base/timing.jl:220; top-level scope
      ╎    ╎    ╎    123273 ...a/stdlib/v1.7/Profile/src/Profile.jl:28; macro expansion
      ╎    ╎    ╎     123273 @DataFrames/src/other/tables.jl:49; DataFrame
      ╎    ╎    ╎    ╎ 123073 @DataFrames/src/other/tables.jl:58; DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
   147╎    ╎    ╎    ╎  123073 @Tables/src/fallbacks.jl:253; columns
   192╎    ╎    ╎    ╎   122879 @Tables/src/fallbacks.jl:217; buildcolumns
      ╎    ╎    ╎    ╎    122029 @Tables/src/fallbacks.jl:187; _buildcolumns(rowitr::ArchGDAL.IFeatureLayer, row::A...
    58╎    ╎    ╎    ╎     122029 @Tables/src/utils.jl:132; eachcolumns
101464╎    ╎    ╎    ╎    ╎ 121929 @Tables/src/fallbacks.jl:153; add_or_widen!(val::ArchGDAL.IGeometry{ArchGDAL.wkbM...
      ╎    ╎    ╎    ╎    ╎  5732   @Base/compiler/typeinfer.jl:938; typeinf_ext_toplevel(mi::Core.MethodInstance, worl...
      ╎    ╎    ╎    ╎    ╎   5731   @Base/compiler/typeinfer.jl:942; typeinf_ext_toplevel(interp::Core.Compiler.Native...
      ╎    ╎    ╎    ╎    ╎    5721   @Base/compiler/typeinfer.jl:909; typeinf_ext(interp::Core.Compiler.NativeInterpret...
      ╎    ╎    ╎    ╎    ╎     5721   @Base/compiler/typeinfer.jl:209; typeinf(interp::Core.Compiler.NativeInterpreter,...
      ╎    ╎    ╎    ╎    ╎    ╎ 5695   @Base/compiler/typeinfer.jl:226; _typeinf(interp::Core.Compiler.NativeInterpreter...
      ╎    ╎    ╎    ╎    ╎    ╎  5695   ...iler/abstractinterpretation.jl:2014; typeinf_nocycle(interp::Core.Compiler.NativeIn...
      ╎    ╎    ╎    ╎    ╎    ╎   5693   ...ler/abstractinterpretation.jl:1918; typeinf_local(interp::Core.Compiler.NativeInte...
      ╎    ╎    ╎    ╎    ╎    ╎    5693   ...ler/abstractinterpretation.jl:1534; abstract_eval_statement(interp::Core.Compiler...
      ╎    ╎    ╎    ╎    ╎    ╎     5693   ...ler/abstractinterpretation.jl:1382; abstract_call(interp::Core.Compiler.NativeInt...
      ╎    ╎    ╎    ╎    ╎    ╎    ╎ 5693   ...er/abstractinterpretation.jl:1397; abstract_call(interp::Core.Compiler.NativeIn...
      ╎    ╎    ╎    ╎    ╎    ╎    ╎  5677   ...er/abstractinterpretation.jl:1342; abstract_call_known(interp::Core.Compiler.N...
      ╎    ╎    ╎    ╎    ╎    ╎    ╎   5421   ...er/abstractinterpretation.jl:105; abstract_call_gf_by_type(interp::Core.Compi...
      ╎    ╎    ╎    ╎    ╎    ╎    ╎    5419   ...r/abstractinterpretation.jl:504; abstract_call_method(interp::Core.Compiler....
      ╎    ╎    ╎    ╎    ╎    ╎    ╎     5408   @Base/compiler/typeinfer.jl:823; typeinf_edge
      ╎    ╎    ╎    ╎    ╎    ╎    ╎    ╎ 5408   @Base/compiler/typeinfer.jl:209; typeinf(interp::Core.Compiler.NativeInterp...
     1╎    ╎    ╎    ╎    ╎  14732  @Tables/src/fallbacks.jl:143; replacex(t::NTuple{295, Tables.EmptyVector}, col::...
      ╎    ╎    ╎    ╎    ╎   14731  @Base/ntuple.jl:19; ntuple
     8╎    ╎    ╎    ╎    ╎    14731  @Base/ntuple.jl:37; _ntuple(f::Tables.var"#41#42"{NTuple{295, Tables....
  6228╎    ╎    ╎    ╎    ╎     14717  @Base/array.jl:734; collect(itr::Base.Generator{UnitRange{Int64}, Ba...
   131╎    ╎    ╎    ╎    ╎    ╎ 7799   @Base/array.jl:760; collect_to_with_first!(dest::Vector{Vector{ArchG...
  6890╎    ╎    ╎    ╎    ╎    ╎  7653   @Base/array.jl:790; collect_to!(dest::Vector{Vector{ArchGDAL.IGeome...
Total snapshots: 125792

On the second run, performance is normal: about 1.5 s (roughly 100x faster than the first run).

julia> @time @profile DataFrame(AG.getlayer(AG.read("BasinATLAS_v10_lev01")))
  1.461657 seconds (246.65 k allocations: 210.206 MiB, 31.72% gc time, 1.47% compilation time)
10×295 DataFrame
 Row │                            HYBAS_ID    NEXT_DOWN  NEXT_SINK   MAIN_BAS    DIST_SINK  DIST_MAIN  SUB_AREA   UP_AREA    PFAF_ID  EN 
     │ IGeometr                  Int64       Int64      Int64       Int64       Float64    Float64    Float64    Float64    Int64    In 
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Geometry: wkbMultiPolygon  1010000010          0  1010000010  1010000010        0.0        0.0  2.99531e7  2.99531e7        1     
   2 │ Geometry: wkbMultiPolygon  2010000010          0  2010000010  2010000010        0.0        0.0  1.78589e7  1.78589e7        2
   3 │ Geometry: wkbMultiPolygon  3010000010          0  3010000010  3010000010        0.0        0.0  1.29491e7  1.29491e7        3
   4 │ Geometry: wkbMultiPolygon  4010000010          0  4010000010  4010000010        0.0        0.0  2.08274e7  2.08274e7        4
   5 │ Geometry: wkbMultiPolygon  5010000010          0  5010000010  5010000010        0.0        0.0  1.08038e7  1.08038e7        5     
   6 │ Geometry: wkbMultiPolygon  6010000010          0  6010000010  6010000010        0.0        0.0  1.78535e7  1.78535e7        6
   7 │ Geometry: wkbMultiPolygon  7010000010          0  7010000010  7010000010        0.0        0.0  1.59146e7  1.59146e7        7
   8 │ Geometry: wkbMultiPolygon  8010000010          0  8010000010  8010000010        0.0        0.0  6.19709e6  6.19709e6        8
   9 │ Geometry: wkbMultiPolygon  8010020760          0  8010020760  8010020760        0.0        0.0  1.17002e5  1.17002e5        3     
  10 │ Geometry: wkbMultiPolygon  9010000010          0  9010000010  9010000010        0.0        0.0  2.14672e6  2.14672e6        9
      

I worked around this issue in PR yeesian/ArchGDAL.jl#266, which implements Tables.columns directly, primarily to address other performance issues more closely tied to the constraints of GDAL's C interface.

Instead of a NamedTuple that is iteratively extended in both column length and column types, I used a mutable grid, Vector{Vector{T} where T}, with wide initial element types.

The column types are then tightened a posteriori.
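
The "wide first, tighten afterwards" strategy could be sketched roughly as follows in plain Julia. This is not the code from the PR: `collect_columns` is a hypothetical helper, rows are assumed to expose their fields through `getproperty`, and the narrowing step relies on `map(identity, ...)` picking the tightest element type it can from the values it sees.

```julia
# Hypothetical sketch of the "wide first, tighten afterwards" strategy.
function collect_columns(rows, names::Vector{Symbol})
    # Mutable grid: one untyped (Any) column per name, so pushing a value
    # never changes the type of the container that holds the columns.
    cols = Vector[Any[] for _ in names]        # a Vector{Vector{T} where T}
    for row in rows
        for (j, name) in enumerate(names)
            push!(cols[j], getproperty(row, name))
        end
    end
    # Tighten each column a posteriori to the narrowest element type found.
    for j in eachindex(cols)
        cols[j] = map(identity, cols[j])
    end
    return cols
end

rows = [(id = 1, name = "a", area = 2.5), (id = 2, name = "b", area = missing)]
cols = collect_columns(rows, [:id, :name, :area])
eltype.(cols)   # element types after tightening
```

Because the outer container keeps a single type throughout, the compiler sees one method specialization regardless of the number of columns; the cost moves to a final pass over each column.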

EDIT 1: Note that the iterator's size is advertised as SizeUnknown() because, for some sources, GDAL drivers may reset the iterator when length retrieval is forced. The length is therefore not forced, to avoid any side effects, and can remain unknown.
Forcing length retrieval was breaking the parsing done by Tables.buildcolumns with Schema = nothing for such GDAL drivers (e.g. the "GML" driver; see comment yeesian/ArchGDAL.jl#226 (comment) for further details if necessary).
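
To illustrate the point about not forcing length retrieval, here is a minimal sketch using only Base Julia. `FeatureStream` is a hypothetical stand-in for a driver-backed layer, not an ArchGDAL type. Because it declares `Base.SizeUnknown()`, a consumer such as `collect` grows its result with `push!` and never calls `length` on the source.

```julia
# Hypothetical stand-in for a layer whose length is costly or unsafe to query.
struct FeatureStream
    n::Int   # number of features the stream will yield
end

# Yield one NamedTuple per feature; the state is the index of the next feature.
Base.iterate(s::FeatureStream, i::Int = 1) = i > s.n ? nothing : ((id = i,), i + 1)

# Advertise that the size is unknown, so consumers do not ask for `length`.
Base.IteratorSize(::Type{FeatureStream}) = Base.SizeUnknown()

# `length` is deliberately left undefined for FeatureStream.
features = collect(FeatureStream(3))
```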
