|
1 |
| -# MMTk performace tracing |
| 1 | +# eBPF-based tracing tools |
2 | 2 |
|
3 | 3 | ## Notes for MMTk developers
|
| 4 | + |
4 | 5 | Please open pull requests if you develop new tools that others might find useful.
|
5 | 6 | When you add new tools, please update this documentation.
|
6 | 7 | If you change MMTk internals that the tracing tools depend on (such as the
|
7 | 8 | definition of `enum WorkBucketStage`), please update the scripts accordingly.
|
8 | 9 |
|
9 | 10 | ## Notes for MMTk users
|
| 11 | + |
10 | 12 | Since some of the tools depend on the MMTk internals, please use the tools
|
11 | 13 | shipped with the MMTk release you use.
|
12 | 14 |
|
13 | 15 | ## Tracepoints
|
14 |
| -Currently, the core provides the following tracepoints. |
15 |
| -- `mmtk:collection_initialized()`: GC is enabled |
16 |
| -- `mmtk:harness_begin()`: the timing iteration of a benchmark begins |
17 |
| -- `mmtk:harness_end()`: the timing iteration of a benchmark ends |
18 |
| -- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop |
19 |
| -- `mmtk:gcworker_run()`: a GC worker thread enters its work loop |
20 |
| -- `mmtk:gc_start()`: a collection epoch starts |
21 |
| -- `mmtk:gc_end()`: a collection epoch ends |
22 |
| -- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the |
23 |
| -`process_edges` method. The first argument is the number of edges to be processed, |
24 |
| -and the second argument is whether these edges are root edges. |
25 |
| -- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the |
26 |
| -numerical representation of `enum WorkBucketStage`. |
27 |
| -- `mmtk:work_poll()`: a work packet is to be polled. |
28 |
| -- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just |
29 |
| -executed. The first argument is points to the string of the Rust type name of |
30 |
| -the work packet, and the second argument is the length of the string. |
31 |
| -- `mmtk:alloc_slow_once_start()`: the allocation slow path starts. |
32 |
| -- `mmtk:alloc_slow_once_end()`: the allocation slow path ends. |
33 |
| - |
34 |
| -## Running tracing tools |
35 |
| -The tracing tools are to be invoked by a wrapper script `run.py`. |
36 |
| -``` |
37 |
| -usage: run.py [-h] [-b BPFTRACE] -m MMTK [-H] [-p] [-f {text,json}] tool |
38 | 16 |
|
39 |
| -positional arguments: |
40 |
| - tool Name of the bpftrace tool |
41 |
| -
|
42 |
| -optional arguments: |
43 |
| - -h, --help show this help message and exit |
44 |
| - -b BPFTRACE, --bpftrace BPFTRACE |
45 |
| - Path of the bpftrace executable |
46 |
| - -m MMTK, --mmtk MMTK Path of the MMTk binary |
47 |
| - -H, --harness Only collect data for the timing iteration (harness_begin/harness_end) |
48 |
| - -p, --print-script Print the content of the bpftrace script |
49 |
| - -f {text,json}, --format {text,json} |
50 |
| - bpftrace output format |
51 |
| -``` |
52 |
| - |
53 |
| -- `-b`: the path to the `bpftrace` executable. By default, it uses `bpftrace` |
54 |
| -executable in your `PATH`. We strongly recommend you use the latest statically |
55 |
| -complied `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases). |
56 |
| -You need to be able to have sudo permission for whichever `bpftrace` you want to use. |
57 |
| -- `-m`: the path to a MMTk binary that contains the tracepoints. |
58 |
| -This depends on the binding you use. |
59 |
| -For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under |
60 |
| -your build folder. |
61 |
| -To check whether the binary contains tracepoints, you can use `readelf -n`. |
62 |
| -You should see a bunch of `stapsdt` notes with `mmtk` as the provider. |
63 |
| -- `-H`: pass this flag is you want to only measure the timing iteration of a |
64 |
| -benchmark. |
65 |
| -By default, the tracing tools will measure the entire execution. |
66 |
| -- `-p`: print the entire tracing script before execution. |
67 |
| -This is mainly for debugging use. |
68 |
| -- `-f`: change the bpftrace output format. |
69 |
| -By default, it uses human-readable plain text output (`text`). |
70 |
| -You can set this to `json` for easy parsing. |
| 17 | +Currently, the core provides the following tracepoints. |
71 | 18 |
|
72 |
| -Please run the tracing tools **before** running the workload. |
73 |
| -If you use `-H`, the tracing tools will automatically end with `harness_end` is |
74 |
| -called. |
75 |
| -Otherwise, you will need to terminate the tools manually with `Ctrl-C`. |
76 |
| -These tools also have a timeout of 1200 seconds so not to stall unattended |
77 |
| -benchmark execution. |
| 19 | +- `mmtk:collection_initialized()`: GC is enabled |
| 20 | +- `mmtk:harness_begin()`: the timing iteration of a benchmark begins |
| 21 | +- `mmtk:harness_end()`: the timing iteration of a benchmark ends |
| 22 | +- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop |
| 23 | +- `mmtk:gcworker_run()`: a GC worker thread enters its work loop |
| 24 | +- `mmtk:gc_start()`: a collection epoch starts |
| 25 | +- `mmtk:gc_end()`: a collection epoch ends |
| 26 | +- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the `process_edges` |
| 27 | + method. The first argument is the number of edges to be processed, and the second argument is |
| 28 | + whether these edges are root edges. |
| 29 | +- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the numerical |
| 30 | + representation of `enum WorkBucketStage`. |
| 31 | +- `mmtk:work_poll()`: a work packet is to be polled. |
| 32 | +- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just executed. The first |
| 33 | + argument is points to the string of the Rust type name of the work packet, and the second |
| 34 | + argument is the length of the string. |
| 35 | +- `mmtk:alloc_slow_once_start()`: the allocation slow path starts. |
| 36 | +- `mmtk:alloc_slow_once_end()`: the allocation slow path ends. |
78 | 37 |
|
79 | 38 | ## Tracing tools
|
80 |
| -### Measuring the time spend in allocation slow path (`alloc_slow`) |
81 |
| -This tool measures the distribution of the allocation slow path time. |
82 |
| -The time unit is 400ns, so that we use the histogram bins with higher |
83 |
| -fidelity better. |
84 |
| - |
85 |
| -Sample output: |
86 |
| -``` |
87 |
| -@alloc_slow_hist: |
88 |
| -[4, 8) 304 |@ | |
89 |
| -[8, 16) 12603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| |
90 |
| -[16, 32) 8040 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
91 |
| -[32, 64) 941 |@@@ | |
92 |
| -[64, 128) 171 | | |
93 |
| -[128, 256) 13 | | |
94 |
| -[256, 512) 2 | | |
95 |
| -[512, 1K) 0 | | |
96 |
| -[1K, 2K) 0 | | |
97 |
| -[2K, 4K) 0 | | |
98 |
| -[4K, 8K) 0 | | |
99 |
| -[8K, 16K) 0 | | |
100 |
| -[16K, 32K) 14 | | |
101 |
| -[32K, 64K) 37 | | |
102 |
| -[64K, 128K) 19 | | |
103 |
| -[128K, 256K) 1 | | |
104 |
| -``` |
105 |
| - |
106 |
| -In the above output, we can see that most allocation slow paths finish between |
107 |
| -3.2us and 6.4us. |
108 |
| -However, there is a long tail, presumably due to GC pauses. |
109 |
| - |
110 |
| -### Measuring the time spend in different GC stages (`gc_stages`) |
111 |
| -This tool measures the time spent in different stages of GC: before `Closure`, |
112 |
| -during `Closure`, and after `Closure`. |
113 |
| -The time unit is ns. |
114 |
| - |
115 |
| -Sample output: |
116 |
| -``` |
117 |
| -@closure_time: 1405302743 |
118 |
| -@post_closure_time: 81432919 |
119 |
| -@pre_closure_time: 103886118 |
120 |
| -``` |
121 |
| - |
122 |
| -In the above output, overall, the execution spends 1.4s in the main transitive |
123 |
| -closure, 103ms before that, and 81ms after that (a total of around 1.5s). |
124 |
| - |
125 |
| -### Measuring the time spend in lock contended state for Rust `Mutex` (`lock_contend`) |
126 |
| -This tools measures the time spent in the lock contended state for Rust `Mutex`s. |
127 |
| -The Rust standard library implements `Mutex` using the fast-slow-path paradigm. |
128 |
| -Most lock operations take place in inlined fast paths, when there's no contention. |
129 |
| -However, when there's contention, |
130 |
| -`std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called. |
131 | 39 |
|
132 |
| -```rust |
133 |
| -#[inline] |
134 |
| -pub fn lock(&self) { |
135 |
| - if self.futex.compare_exchange(0, 1, Acquire, Relaxed).is_err() { |
136 |
| - self.lock_contended(); |
137 |
| - } |
138 |
| -} |
139 |
| - |
140 |
| -#[cold] |
141 |
| -fn lock_contended(&self) { |
142 |
| - <snip> |
143 |
| -} |
144 |
| -``` |
145 |
| - |
146 |
| - |
147 |
| -MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization, |
148 |
| -and this tool can be useful to measure the contention in these parts of code. |
149 |
| - |
150 |
| -The time unit is 256ns. |
151 |
| - |
152 |
| -Sample output: |
153 |
| -``` |
154 |
| -@lock_dist[140637228007056]: |
155 |
| -[1] 447 |@@@@ | |
156 |
| -[2, 4) 3836 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
157 |
| -[4, 8) 3505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
158 |
| -[8, 16) 1354 |@@@@@@@@@@@@@@ | |
159 |
| -[16, 32) 832 |@@@@@@@@ | |
160 |
| -[32, 64) 1077 |@@@@@@@@@@@ | |
161 |
| -[64, 128) 2991 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
162 |
| -[128, 256) 4846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
163 |
| -[256, 512) 5013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| |
164 |
| -[512, 1K) 1203 |@@@@@@@@@@@@ | |
165 |
| -[1K, 2K) 34 | | |
166 |
| -[2K, 4K) 15 | | |
167 |
| -``` |
| 40 | +Each sub-directory contains a set of scripts. |
168 | 41 |
|
169 |
| -In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90) |
170 |
| -roughly has a bimodal distribution in terms of the time spent in lock contended |
171 |
| -code path. |
172 |
| -The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us. |
173 |
| - |
174 |
| -If you can't tell which lock instance is for which lock in MMTk, you can trace |
175 |
| -the allocation of the Mutex and record the stack trace (note that you might want |
176 |
| -to compile MMTk with `force-frame-pointers` to obtain better stack traces). |
177 |
| - |
178 |
| -### Measuring the distribution of `process_edges` packet sizes (`packet_size`) |
179 |
| -Most of the GC time is spend in the transitive closure for tracing-based GCs, |
180 |
| -and MMTk performs transitive closure via work packets that calls the `process_edges` method. |
181 |
| -This tool measures the distribution of the sizes of these work packets, and also |
182 |
| -count root edges separately. |
183 |
| - |
184 |
| -Sample output: |
185 |
| -``` |
186 |
| -@process_edges_packet_size: |
187 |
| -[1] 238 |@@@@@ | |
188 |
| -[2, 4) 806 |@@@@@@@@@@@@@@@@@ | |
189 |
| -[4, 8) 1453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
190 |
| -[8, 16) 1105 |@@@@@@@@@@@@@@@@@@@@@@@ | |
191 |
| -[16, 32) 2410 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| |
192 |
| -[32, 64) 1317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
193 |
| -[64, 128) 1252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
194 |
| -[128, 256) 1131 |@@@@@@@@@@@@@@@@@@@@@@@@ | |
195 |
| -[256, 512) 2017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
196 |
| -[512, 1K) 1270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
197 |
| -[1K, 2K) 1028 |@@@@@@@@@@@@@@@@@@@@@@ | |
198 |
| -[2K, 4K) 874 |@@@@@@@@@@@@@@@@@@ | |
199 |
| -[4K, 8K) 1024 |@@@@@@@@@@@@@@@@@@@@@@ | |
200 |
| -[8K, 16K) 58 |@ | |
201 |
| -[16K, 32K) 5 | | |
202 |
| -
|
203 |
| -@process_edges_root_packet_size: |
204 |
| -[1] 71 |@@@@@@@ | |
205 |
| -[2, 4) 4 | | |
206 |
| -[4, 8) 276 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
207 |
| -[8, 16) 495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@| |
208 |
| -[16, 32) 477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
209 |
| -[32, 64) 344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
210 |
| -[64, 128) 242 |@@@@@@@@@@@@@@@@@@@@@@@@@ | |
211 |
| -[128, 256) 109 |@@@@@@@@@@@ | |
212 |
| -[256, 512) 31 |@@@ | |
213 |
| -[512, 1K) 33 |@@@ | |
214 |
| -[1K, 2K) 75 |@@@@@@@ | |
215 |
| -[2K, 4K) 75 |@@@@@@@ | |
216 |
| -[4K, 8K) 336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ | |
217 |
| -[8K, 16K) 56 |@@@@@ | |
218 |
| -[16K, 32K) 3 | | |
219 |
| -``` |
220 |
| - |
221 |
| -In the above output, we can see that overall, the sizes of the `process_edges` |
222 |
| -has a unimodal distribution with a peak around 16\~32 edges per packet. |
223 |
| -However, if we focus on root edges, the distribution is roughly bimodal, with a |
224 |
| -first peak around 8\~16 and a second peak around 4096\~8192. |
| 42 | +- `performance`: Print various GC-related statistics, such as the distribution of time spent in |
| 43 | + allocation slow path, the time spent in each GC stages, and the distribution of `process_edges` |
| 44 | + packet sizes. |
| 45 | +- `timeline`: Record the start and end time of each GC and each work packet, and visualize them on |
| 46 | + a timeline in Perfetto UI. |
225 | 47 |
|
226 | 48 | ## Attribution
|
| 49 | + |
227 | 50 | If used for research, please cite the following publication.
|
| 51 | + |
228 | 52 | ```bibtex
|
229 | 53 | @inproceedings{DBLP:conf/pppj/HuangBC23,
|
230 | 54 | author = {Claire Huang and
|
|
0 commit comments