Hello,
I don't know whether any user implementing the Tables interface and relying on the fallback functions with no Schema
has already run into a compilation performance issue, but here is a particular case encountered in ArchGDAL.jl.
Using the Tables.jl interface on ArchGDAL.jl layer objects, @visr stumbled on a performance issue with the Tables.columns fallback functions at first run, with Tables.Schema = nothing and Base.IteratorSize = SizeUnknown(). Profiling shows that the first-run bottleneck is located in Tables.add_or_widen!, probably around the column allocation machinery based on Tuple.
There is no problem when providing detailed types in the Schema. Unfortunately, we cannot know the tightest schema for sure from a layer feature definition.
The problem can be reproduced with the sequence below: ~170 s at first run for a 10-row table with nearly 300 columns (295).
Download BasinATLAS_v10_lev01.zip and decompress it in ./BasinATLAS_v10_lev01/
julia> import ArchGDAL as AG
julia> using DataFrames
julia> using Profile
julia> @time @profile DataFrame(AG.getlayer(AG.read("BasinATLAS_v10_lev01")))
172.879812 seconds (53.38 M allocations: 2.992 GiB, 0.44% gc time, 99.38% compilation time)
10×295 DataFrame
Row │ HYBAS_ID NEXT_DOWN NEXT_SINK MAIN_BAS DIST_SINK DIST_MAIN SUB_AREA UP_AREA PFAF_ID EN ⋯
│ IGeometr… Int64 Int64 Int64 Int64 Float64 Float64 Float64 Float64 Int64 In ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Geometry: wkbMultiPolygon 1010000010 0 1010000010 1010000010 0.0 0.0 2.99531e7 2.99531e7 1 ⋯
2 │ Geometry: wkbMultiPolygon 2010000010 0 2010000010 2010000010 0.0 0.0 1.78589e7 1.78589e7 2
3 │ Geometry: wkbMultiPolygon 3010000010 0 3010000010 3010000010 0.0 0.0 1.29491e7 1.29491e7 3
4 │ Geometry: wkbMultiPolygon 4010000010 0 4010000010 4010000010 0.0 0.0 2.08274e7 2.08274e7 4
5 │ Geometry: wkbMultiPolygon 5010000010 0 5010000010 5010000010 0.0 0.0 1.08038e7 1.08038e7 5 ⋯
6 │ Geometry: wkbMultiPolygon 6010000010 0 6010000010 6010000010 0.0 0.0 1.78535e7 1.78535e7 6
7 │ Geometry: wkbMultiPolygon 7010000010 0 7010000010 7010000010 0.0 0.0 1.59146e7 1.59146e7 7
8 │ Geometry: wkbMultiPolygon 8010000010 0 8010000010 8010000010 0.0 0.0 6.19709e6 6.19709e6 8
9 │ Geometry: wkbMultiPolygon 8010020760 0 8010020760 8010020760 0.0 0.0 1.17002e5 1.17002e5 3 ⋯
10 │ Geometry: wkbMultiPolygon 9010000010 0 9010000010 9010000010 0.0 0.0 2.14672e6 2.14672e6 9
285 columns omitted
julia> Profile.print(;mincount=5000)
Overhead ╎ [+additional indent] Count File:Line; Function
=========================================================
╎123273 @Base/client.jl:495; _start()
╎ 123273 @Base/client.jl:309; exec_options(opts::Base.JLOptions)
╎ 123273 @Base/client.jl:379; run_main_repl(interactive::Bool, quiet::Bool, banner::Bool, hi...
╎ 123273 @Base/essentials.jl:714; invokelatest
╎ 123273 @Base/essentials.jl:716; #invokelatest#2
╎ 123273 @Base/client.jl:394; (::Base.var"#930#932"{Bool, Bool, Bool})(REPL::Module)
╎ ╎ 123273 ...hare/julia/stdlib/v1.7/REPL/src/REPL.jl:349; run_repl(repl::REPL.AbstractREPL, consumer::Any)
╎ ╎ 123273 ...are/julia/stdlib/v1.7/REPL/src/REPL.jl:362; run_repl(repl::REPL.AbstractREPL, consumer::Any; backend_on...
╎ ╎ 123273 ...are/julia/stdlib/v1.7/REPL/src/REPL.jl:229; start_repl_backend(backend::REPL.REPLBackend, consumer::Any)
╎ ╎ 123273 ...re/julia/stdlib/v1.7/REPL/src/REPL.jl:244; repl_backend_loop(backend::REPL.REPLBackend)
╎ ╎ 123273 ...re/julia/stdlib/v1.7/REPL/src/REPL.jl:150; eval_user_input(ast::Any, backend::REPL.REPLBackend)
╎ ╎ ╎ 123273 @Base/boot.jl:373; eval
╎ ╎ ╎ 123273 REPL[4]:0; top-level scope
╎ ╎ ╎ 123273 @Base/timing.jl:220; top-level scope
╎ ╎ ╎ 123273 ...a/stdlib/v1.7/Profile/src/Profile.jl:28; macro expansion
╎ ╎ ╎ 123273 @DataFrames/src/other/tables.jl:49; DataFrame
╎ ╎ ╎ ╎ 123073 @DataFrames/src/other/tables.jl:58; DataFrame(x::ArchGDAL.IFeatureLayer; copycols::Nothing)
147╎ ╎ ╎ ╎ 123073 @Tables/src/fallbacks.jl:253; columns
192╎ ╎ ╎ ╎ 122879 @Tables/src/fallbacks.jl:217; buildcolumns
╎ ╎ ╎ ╎ 122029 @Tables/src/fallbacks.jl:187; _buildcolumns(rowitr::ArchGDAL.IFeatureLayer, row::A...
58╎ ╎ ╎ ╎ 122029 @Tables/src/utils.jl:132; eachcolumns
101464╎ ╎ ╎ ╎ ╎ 121929 @Tables/src/fallbacks.jl:153; add_or_widen!(val::ArchGDAL.IGeometry{ArchGDAL.wkbM...
╎ ╎ ╎ ╎ ╎ 5732 @Base/compiler/typeinfer.jl:938; typeinf_ext_toplevel(mi::Core.MethodInstance, worl...
╎ ╎ ╎ ╎ ╎ 5731 @Base/compiler/typeinfer.jl:942; typeinf_ext_toplevel(interp::Core.Compiler.Native...
╎ ╎ ╎ ╎ ╎ 5721 @Base/compiler/typeinfer.jl:909; typeinf_ext(interp::Core.Compiler.NativeInterpret...
╎ ╎ ╎ ╎ ╎ 5721 @Base/compiler/typeinfer.jl:209; typeinf(interp::Core.Compiler.NativeInterpreter,...
╎ ╎ ╎ ╎ ╎ ╎ 5695 @Base/compiler/typeinfer.jl:226; _typeinf(interp::Core.Compiler.NativeInterpreter...
╎ ╎ ╎ ╎ ╎ ╎ 5695 ...iler/abstractinterpretation.jl:2014; typeinf_nocycle(interp::Core.Compiler.NativeIn...
╎ ╎ ╎ ╎ ╎ ╎ 5693 ...ler/abstractinterpretation.jl:1918; typeinf_local(interp::Core.Compiler.NativeInte...
╎ ╎ ╎ ╎ ╎ ╎ 5693 ...ler/abstractinterpretation.jl:1534; abstract_eval_statement(interp::Core.Compiler...
╎ ╎ ╎ ╎ ╎ ╎ 5693 ...ler/abstractinterpretation.jl:1382; abstract_call(interp::Core.Compiler.NativeInt...
╎ ╎ ╎ ╎ ╎ ╎ ╎ 5693 ...er/abstractinterpretation.jl:1397; abstract_call(interp::Core.Compiler.NativeIn...
╎ ╎ ╎ ╎ ╎ ╎ ╎ 5677 ...er/abstractinterpretation.jl:1342; abstract_call_known(interp::Core.Compiler.N...
╎ ╎ ╎ ╎ ╎ ╎ ╎ 5421 ...er/abstractinterpretation.jl:105; abstract_call_gf_by_type(interp::Core.Compi...
╎ ╎ ╎ ╎ ╎ ╎ ╎ 5419 ...r/abstractinterpretation.jl:504; abstract_call_method(interp::Core.Compiler....
╎ ╎ ╎ ╎ ╎ ╎ ╎ 5408 @Base/compiler/typeinfer.jl:823; typeinf_edge
╎ ╎ ╎ ╎ ╎ ╎ ╎ ╎ 5408 @Base/compiler/typeinfer.jl:209; typeinf(interp::Core.Compiler.NativeInterp...
1╎ ╎ ╎ ╎ ╎ 14732 @Tables/src/fallbacks.jl:143; replacex(t::NTuple{295, Tables.EmptyVector}, col::...
╎ ╎ ╎ ╎ ╎ 14731 @Base/ntuple.jl:19; ntuple
8╎ ╎ ╎ ╎ ╎ 14731 @Base/ntuple.jl:37; _ntuple(f::Tables.var"#41#42"{NTuple{295, Tables....
6228╎ ╎ ╎ ╎ ╎ 14717 @Base/array.jl:734; collect(itr::Base.Generator{UnitRange{Int64}, Ba...
131╎ ╎ ╎ ╎ ╎ ╎ 7799 @Base/array.jl:760; collect_to_with_first!(dest::Vector{Vector{ArchG...
6890╎ ╎ ╎ ╎ ╎ ╎ 7653 @Base/array.jl:790; collect_to!(dest::Vector{Vector{ArchGDAL.IGeome...
Total snapshots: 125792
julia>
At the second run the performance is normal: about 1.5 s, roughly 100 times faster than the first run.
julia> @time @profile DataFrame(AG.getlayer(AG.read("BasinATLAS_v10_lev01")))
1.461657 seconds (246.65 k allocations: 210.206 MiB, 31.72% gc time, 1.47% compilation time)
10×295 DataFrame
Row │ HYBAS_ID NEXT_DOWN NEXT_SINK MAIN_BAS DIST_SINK DIST_MAIN SUB_AREA UP_AREA PFAF_ID EN ⋯
│ IGeometr… Int64 Int64 Int64 Int64 Float64 Float64 Float64 Float64 Int64 In ⋯
─────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
1 │ Geometry: wkbMultiPolygon 1010000010 0 1010000010 1010000010 0.0 0.0 2.99531e7 2.99531e7 1 ⋯
2 │ Geometry: wkbMultiPolygon 2010000010 0 2010000010 2010000010 0.0 0.0 1.78589e7 1.78589e7 2
3 │ Geometry: wkbMultiPolygon 3010000010 0 3010000010 3010000010 0.0 0.0 1.29491e7 1.29491e7 3
4 │ Geometry: wkbMultiPolygon 4010000010 0 4010000010 4010000010 0.0 0.0 2.08274e7 2.08274e7 4
5 │ Geometry: wkbMultiPolygon 5010000010 0 5010000010 5010000010 0.0 0.0 1.08038e7 1.08038e7 5 ⋯
6 │ Geometry: wkbMultiPolygon 6010000010 0 6010000010 6010000010 0.0 0.0 1.78535e7 1.78535e7 6
7 │ Geometry: wkbMultiPolygon 7010000010 0 7010000010 7010000010 0.0 0.0 1.59146e7 1.59146e7 7
8 │ Geometry: wkbMultiPolygon 8010000010 0 8010000010 8010000010 0.0 0.0 6.19709e6 6.19709e6 8
9 │ Geometry: wkbMultiPolygon 8010020760 0 8010020760 8010020760 0.0 0.0 1.17002e5 1.17002e5 3 ⋯
10 │ Geometry: wkbMultiPolygon 9010000010 0 9010000010 9010000010 0.0 0.0 2.14672e6 2.14672e6 9
I got around this issue in PR yeesian/ArchGDAL.jl#266, in which Tables.columns has been implemented primarily to address other performance issues more directly linked to GDAL-specific C interface constraints.
I used a mutable grid, Vector{Vector{T} where T}, with wide initial types, instead of a NamedTuple that is iteratively extended in column length and column types.
I tightened the column types a posteriori.
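The approach described above can be sketched roughly as follows. This is a minimal illustration using only Base Julia, not the code from the PR: the helper names wide_columns and tighten! are hypothetical, and rows are assumed to be indexable by column position.

```julia
# Hypothetical helper (not the PR's code): gather rows into wide columns.
function wide_columns(rows, ncols::Int)
    # The container type does not depend on the number of columns
    # or on their element types.
    cols = Vector{Vector{T} where T}(undef, ncols)
    for j in 1:ncols
        cols[j] = Any[]            # start with the widest element type
    end
    for row in rows                # no need to know the row count upfront
        for j in 1:ncols
            push!(cols[j], row[j])
        end
    end
    return cols
end

# Hypothetical helper: narrow each column to the tightest element type
# that fits the values actually collected.
function tighten!(cols)
    for j in eachindex(cols)
        cols[j] = map(identity, cols[j])
    end
    return cols
end

rows = [(1, 2.0, "a"), (2, missing, "b")]
cols = tighten!(wide_columns(rows, 3))
```

Because the grid is a plain vector of vectors, its type stays the same whether the table has 3 columns or 295, which is presumably what keeps the compiler from specializing on very large tuple types.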
EDIT 1: Note that the iterator's size is advertised as SizeUnknown() because, for some sources, GDAL drivers may reset the iterator if length retrieval is forced. Therefore the length is not forced, to avoid any side effect, and can stay unknown.
Forcing the length retrieval was breaking the parsing done by Tables.buildcolumns with Schema = nothing for such GDAL drivers (e.g. the "GML" driver; see the comment in yeesian/ArchGDAL.jl#226 for further details if necessary).
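To illustrate what advertising SizeUnknown() means for a consumer, here is a small Base-only sketch. FeatureSource is a hypothetical stand-in for a layer, not an ArchGDAL type; it deliberately defines no length method, so any consumer has to iterate without asking for the size.

```julia
# Hypothetical stand-in for a layer whose length must not be queried.
struct FeatureSource
    features::Vector{NamedTuple}
end

# Declare that the number of elements is not known in advance.
Base.IteratorSize(::Type{FeatureSource}) = Base.SizeUnknown()
Base.eltype(::Type{FeatureSource}) = NamedTuple

# Plain iteration protocol: return (item, next_state) or nothing at the end.
Base.iterate(s::FeatureSource, i::Int = 1) =
    i > length(s.features) ? nothing : (s.features[i], i + 1)

src = FeatureSource([(id = 1, area = 2.5), (id = 2, area = 3.0)])

# collect grows its result with push! instead of preallocating from length(src).
rows = collect(src)
```

With this trait, generic code such as collect never calls length on the source, which is the behaviour wanted for drivers that reset the iterator when the feature count is forced.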