Description
We have a global variable that counts malloc'd bytes and is updated on every malloc call. If multiple threads are doing mallocs, they contend on this counter, which has measurable overhead.
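A minimal sketch of the current pattern, assuming a global atomic counter bumped on every malloc (`MALLOC_BYTES` and `count_malloc` are illustrative names, not the actual MMTk symbols):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter of malloc'd bytes (illustrative name).
static MALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

// Called on every malloc: all mutator threads hit the same cache line,
// so with 8 threads allocating heavily this fetch_add becomes a contention point.
pub fn count_malloc(size: usize) {
    MALLOC_BYTES.fetch_add(size, Ordering::Relaxed);
}
```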
The following is measured with Julia GCBenchmarks, using the multithreaded benchmarks with 8 mutator threads. For a fair comparison, both builds return 0 in vm_live_bytes(), and the no-malloc-counter build omits the malloc counter update. The results show measurable overhead for some benchmarks, e.g. a 2% slowdown for mergesort_parallel.
```
MMTK_MIN_HSIZE=31650 MMTK_MAX_HSIZE=31650 /home/yilin/Code/julia_workspace/julia/julia-mmtk-immix-release-no-malloc-counter/usr/bin/julia --project=/home/yilin/Code/julia_workspace/GCBenchmarks /home/yilin/Code/julia_workspace/GCBenchmarks/run_benchmarks.jl multithreaded mergesort_parallel mergesort_parallel -n 1 --threads=8
```
| benchmark, build | total time | gc time | mutator time | total time error |
|---|---|---|---|---|
('multithreaded-big_arrays-issue-52937', 'julia-mmtk-immix(6.0x minheap,.multithreaded-8)') | 7328.7 | 0 | 7328.7 | 3.26144 |
('multithreaded-big_arrays-issue-52937', 'julia-mmtk-immix-no-malloc-counter(6.0x minheap,.multithreaded-8)') | 7345.78 | 0 | 7345.78 | 2.8509 |
('multithreaded-big_arrays-objarray', 'julia-mmtk-immix(6.0x minheap,.multithreaded-8)') | 7279.05 | 0 | 7279.05 | 7.97443 |
('multithreaded-big_arrays-objarray', 'julia-mmtk-immix-no-malloc-counter(6.0x minheap,.multithreaded-8)') | 7288.47 | 0 | 7288.47 | 6.95254 |
('multithreaded-binary_tree-tree_immutable', 'julia-mmtk-immix(6.0x minheap,.multithreaded-8)') | 2233.35 | 360.83 | 1872.52 | 3.61634 |
('multithreaded-binary_tree-tree_immutable', 'julia-mmtk-immix-no-malloc-counter(6.0x minheap,.multithreaded-8)') | 2231.79 | 360.56 | 1871.23 | 3.18454 |
('multithreaded-binary_tree-tree_mutable', 'julia-mmtk-immix(6.0x minheap,.multithreaded-8)') | 3130.31 | 640.23 | 2490.08 | 6.81284 |
('multithreaded-binary_tree-tree_mutable', 'julia-mmtk-immix-no-malloc-counter(6.0x minheap,.multithreaded-8)') | 3132.71 | 641.74 | 2490.97 | 6.62351 |
('multithreaded-mergesort_parallel-mergesort_parallel', 'julia-mmtk-immix(6.0x minheap,.multithreaded-8)') | 20202.5 | 0 | 20202.5 | 811.654 |
('multithreaded-mergesort_parallel-mergesort_parallel', 'julia-mmtk-immix-no-malloc-counter(6.0x minheap,.multithreaded-8)') | 20648 | 0 | 20648 | 608.926 |
('multithreaded-mm_divide_and_conquer-mm_divide_and_conquer', 'julia-mmtk-immix(6.0x minheap,.multithreaded-8)') | 791.47 | 0 | 791.47 | 1.83954 |
('multithreaded-mm_divide_and_conquer-mm_divide_and_conquer', 'julia-mmtk-immix-no-malloc-counter(6.0x minheap,.multithreaded-8)') | 797.59 | 0 | 797.59 | 1.93677 |
One way to mitigate this issue is to reduce the frequency of global counter updates: keep a thread-local counter of malloc'd bytes and only add it to the global counter once every X bytes have been allocated locally (X could be 16 KiB or so).
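A minimal sketch of this mitigation in Rust, using a thread-local tally that is flushed to the (hypothetical) global atomic only after 16 KiB has accumulated; the names and the threshold are illustrative:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter of malloc'd bytes (illustrative name).
static MALLOC_BYTES: AtomicUsize = AtomicUsize::new(0);

// Flush the thread-local tally to the global counter every 16 KiB.
const FLUSH_THRESHOLD: usize = 16 * 1024;

thread_local! {
    // Per-thread count of malloc'd bytes since the last flush.
    static LOCAL_MALLOC_BYTES: Cell<usize> = Cell::new(0);
}

// Called on every malloc: bump the thread-local counter, and only touch
// the shared atomic once the local tally reaches FLUSH_THRESHOLD.
pub fn count_malloc(size: usize) {
    LOCAL_MALLOC_BYTES.with(|local| {
        let total = local.get() + size;
        if total >= FLUSH_THRESHOLD {
            MALLOC_BYTES.fetch_add(total, Ordering::Relaxed);
            local.set(0);
        } else {
            local.set(total);
        }
    });
}
```

The trade-off is that the global counter can lag behind by up to X bytes per thread, so anything that reads it (e.g. heap-size or GC-triggering heuristics) sees a slightly stale but bounded value.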