Skip to content

Commit b66fa35

Browse files
wksclairexhuangcaizixian
authored
Timeline visualization tool (#1022)
This PR adds eBPF-based scripts for recording the start and end time of work packets and formatting the log for visualization with PerfettoUI. Co-authored-by: Claire Huang <claire.x.huang@gmail.com> Co-authored-by: Zixian Cai <u5937495@anu.edu.au>
1 parent e7143b9 commit b66fa35

17 files changed

+675
-207
lines changed

Cargo.toml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -152,6 +152,10 @@ malloc_counted_size = []
152152
# Count the size of all live objects in GC
153153
count_live_bytes_in_gc = []
154154

155+
# Workaround a problem where bpftrace scripts (see tools/tracing/timeline/capture.bt) cannot
156+
# capture the type names of work packets.
157+
bpftrace_workaround = []
158+
155159
# Do not modify the following line - ci-common.sh matches it
156160
# -- Mutally exclusive features --
157161
# Only one feature from each group can be provided. Otherwise build will fail.

src/scheduler/worker.rs

Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -384,6 +384,13 @@ impl<VM: VMBinding> GCWorker<VM> {
384384
// probe! expands to an empty block on unsupported platforms
385385
#[allow(unused_variables)]
386386
let typename = work.get_type_name();
387+
388+
#[cfg(feature = "bpftrace_workaround")]
389+
// Workaround a problem where bpftrace script cannot see the work packet names,
390+
// by force loading from the packet name.
391+
// See the "Known issues" section in `tools/tracing/timeline/README.md`
392+
std::hint::black_box(unsafe { *(typename.as_ptr()) });
393+
387394
probe!(mmtk, work, typename.as_ptr(), typename.len());
388395
work.do_work_with_stat(self, mmtk);
389396
}

tools/tracing/README.md

Lines changed: 30 additions & 206 deletions
Original file line numberDiff line numberDiff line change
@@ -1,230 +1,54 @@
1-
# MMTk performace tracing
1+
# eBPF-based tracing tools
22

33
## Notes for MMTk developers
4+
45
Please open pull requests if you develop new tools that others might find useful.
56
When you add new tools, please update this documentation.
67
If you change MMTk internals that the tracing tools depend on (such as the
78
definition of `enum WorkBucketStage`), please update the scripts accordingly.
89

910
## Notes for MMTk users
11+
1012
Since some of the tools depend on the MMTk internals, please use the tools
1113
shipped with the MMTk release you use.
1214

1315
## Tracepoints
14-
Currently, the core provides the following tracepoints.
15-
- `mmtk:collection_initialized()`: GC is enabled
16-
- `mmtk:harness_begin()`: the timing iteration of a benchmark begins
17-
- `mmtk:harness_end()`: the timing iteration of a benchmark ends
18-
- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop
19-
- `mmtk:gcworker_run()`: a GC worker thread enters its work loop
20-
- `mmtk:gc_start()`: a collection epoch starts
21-
- `mmtk:gc_end()`: a collection epoch ends
22-
- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the
23-
`process_edges` method. The first argument is the number of edges to be processed,
24-
and the second argument is whether these edges are root edges.
25-
- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the
26-
numerical representation of `enum WorkBucketStage`.
27-
- `mmtk:work_poll()`: a work packet is to be polled.
28-
- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just
29-
executed. The first argument is points to the string of the Rust type name of
30-
the work packet, and the second argument is the length of the string.
31-
- `mmtk:alloc_slow_once_start()`: the allocation slow path starts.
32-
- `mmtk:alloc_slow_once_end()`: the allocation slow path ends.
33-
34-
## Running tracing tools
35-
The tracing tools are to be invoked by a wrapper script `run.py`.
36-
```
37-
usage: run.py [-h] [-b BPFTRACE] -m MMTK [-H] [-p] [-f {text,json}] tool
3816

39-
positional arguments:
40-
tool Name of the bpftrace tool
41-
42-
optional arguments:
43-
-h, --help show this help message and exit
44-
-b BPFTRACE, --bpftrace BPFTRACE
45-
Path of the bpftrace executable
46-
-m MMTK, --mmtk MMTK Path of the MMTk binary
47-
-H, --harness Only collect data for the timing iteration (harness_begin/harness_end)
48-
-p, --print-script Print the content of the bpftrace script
49-
-f {text,json}, --format {text,json}
50-
bpftrace output format
51-
```
52-
53-
- `-b`: the path to the `bpftrace` executable. By default, it uses `bpftrace`
54-
executable in your `PATH`. We strongly recommend you use the latest statically
55-
complied `bpftrace` from [upstream](https://github.com/iovisor/bpftrace/releases).
56-
You need to be able to have sudo permission for whichever `bpftrace` you want to use.
57-
- `-m`: the path to a MMTk binary that contains the tracepoints.
58-
This depends on the binding you use.
59-
For the OpenJDK binding, it should be `jdk/lib/server/libmmtk_openjdk.so` under
60-
your build folder.
61-
To check whether the binary contains tracepoints, you can use `readelf -n`.
62-
You should see a bunch of `stapsdt` notes with `mmtk` as the provider.
63-
- `-H`: pass this flag is you want to only measure the timing iteration of a
64-
benchmark.
65-
By default, the tracing tools will measure the entire execution.
66-
- `-p`: print the entire tracing script before execution.
67-
This is mainly for debugging use.
68-
- `-f`: change the bpftrace output format.
69-
By default, it uses human-readable plain text output (`text`).
70-
You can set this to `json` for easy parsing.
17+
Currently, the core provides the following tracepoints.
7118

72-
Please run the tracing tools **before** running the workload.
73-
If you use `-H`, the tracing tools will automatically end with `harness_end` is
74-
called.
75-
Otherwise, you will need to terminate the tools manually with `Ctrl-C`.
76-
These tools also have a timeout of 1200 seconds so not to stall unattended
77-
benchmark execution.
19+
- `mmtk:collection_initialized()`: GC is enabled
20+
- `mmtk:harness_begin()`: the timing iteration of a benchmark begins
21+
- `mmtk:harness_end()`: the timing iteration of a benchmark ends
22+
- `mmtk:gccontroller_run()`: the GC controller thread enters its work loop
23+
- `mmtk:gcworker_run()`: a GC worker thread enters its work loop
24+
- `mmtk:gc_start()`: a collection epoch starts
25+
- `mmtk:gc_end()`: a collection epoch ends
26+
- `mmtk:process_edges(num_edges: int, is_roots: bool)`: a invocation of the `process_edges`
27+
method. The first argument is the number of edges to be processed, and the second argument is
28+
whether these edges are root edges.
29+
- `mmtk:bucket_opened(id: int)`: a work bucket opened. The first argument is the numerical
30+
representation of `enum WorkBucketStage`.
31+
- `mmtk:work_poll()`: a work packet is to be polled.
32+
- `mmtk:work(type_name: char *, type_name_len: int)`: a work packet was just executed. The first
33+
argument is points to the string of the Rust type name of the work packet, and the second
34+
argument is the length of the string.
35+
- `mmtk:alloc_slow_once_start()`: the allocation slow path starts.
36+
- `mmtk:alloc_slow_once_end()`: the allocation slow path ends.
7837

7938
## Tracing tools
80-
### Measuring the time spend in allocation slow path (`alloc_slow`)
81-
This tool measures the distribution of the allocation slow path time.
82-
The time unit is 400ns, so that we use the histogram bins with higher
83-
fidelity better.
84-
85-
Sample output:
86-
```
87-
@alloc_slow_hist:
88-
[4, 8) 304 |@ |
89-
[8, 16) 12603 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
90-
[16, 32) 8040 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
91-
[32, 64) 941 |@@@ |
92-
[64, 128) 171 | |
93-
[128, 256) 13 | |
94-
[256, 512) 2 | |
95-
[512, 1K) 0 | |
96-
[1K, 2K) 0 | |
97-
[2K, 4K) 0 | |
98-
[4K, 8K) 0 | |
99-
[8K, 16K) 0 | |
100-
[16K, 32K) 14 | |
101-
[32K, 64K) 37 | |
102-
[64K, 128K) 19 | |
103-
[128K, 256K) 1 | |
104-
```
105-
106-
In the above output, we can see that most allocation slow paths finish between
107-
3.2us and 6.4us.
108-
However, there is a long tail, presumably due to GC pauses.
109-
110-
### Measuring the time spend in different GC stages (`gc_stages`)
111-
This tool measures the time spent in different stages of GC: before `Closure`,
112-
during `Closure`, and after `Closure`.
113-
The time unit is ns.
114-
115-
Sample output:
116-
```
117-
@closure_time: 1405302743
118-
@post_closure_time: 81432919
119-
@pre_closure_time: 103886118
120-
```
121-
122-
In the above output, overall, the execution spends 1.4s in the main transitive
123-
closure, 103ms before that, and 81ms after that (a total of around 1.5s).
124-
125-
### Measuring the time spend in lock contended state for Rust `Mutex` (`lock_contend`)
126-
This tools measures the time spent in the lock contended state for Rust `Mutex`s.
127-
The Rust standard library implements `Mutex` using the fast-slow-path paradigm.
128-
Most lock operations take place in inlined fast paths, when there's no contention.
129-
However, when there's contention,
130-
`std::sys::unix::locks::futex_mutex::Mutex::lock_contended` is called.
13139

132-
```rust
133-
#[inline]
134-
pub fn lock(&self) {
135-
if self.futex.compare_exchange(0, 1, Acquire, Relaxed).is_err() {
136-
self.lock_contended();
137-
}
138-
}
139-
140-
#[cold]
141-
fn lock_contended(&self) {
142-
<snip>
143-
}
144-
```
145-
146-
147-
MMTk uses Rust `Mutex`, e.g., in allocation slow paths for synchronization,
148-
and this tool can be useful to measure the contention in these parts of code.
149-
150-
The time unit is 256ns.
151-
152-
Sample output:
153-
```
154-
@lock_dist[140637228007056]:
155-
[1] 447 |@@@@ |
156-
[2, 4) 3836 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
157-
[4, 8) 3505 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
158-
[8, 16) 1354 |@@@@@@@@@@@@@@ |
159-
[16, 32) 832 |@@@@@@@@ |
160-
[32, 64) 1077 |@@@@@@@@@@@ |
161-
[64, 128) 2991 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
162-
[128, 256) 4846 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
163-
[256, 512) 5013 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
164-
[512, 1K) 1203 |@@@@@@@@@@@@ |
165-
[1K, 2K) 34 | |
166-
[2K, 4K) 15 | |
167-
```
40+
Each sub-directory contains a set of scripts.
16841

169-
In the above output, we can see that the lock instance (140637228007056, or 0x7fe8a8047e90)
170-
roughly has a bimodal distribution in terms of the time spent in lock contended
171-
code path.
172-
The first peak is around 512ns\~1024ns, and the second peak is around 66us\~131us.
173-
174-
If you can't tell which lock instance is for which lock in MMTk, you can trace
175-
the allocation of the Mutex and record the stack trace (note that you might want
176-
to compile MMTk with `force-frame-pointers` to obtain better stack traces).
177-
178-
### Measuring the distribution of `process_edges` packet sizes (`packet_size`)
179-
Most of the GC time is spend in the transitive closure for tracing-based GCs,
180-
and MMTk performs transitive closure via work packets that calls the `process_edges` method.
181-
This tool measures the distribution of the sizes of these work packets, and also
182-
count root edges separately.
183-
184-
Sample output:
185-
```
186-
@process_edges_packet_size:
187-
[1] 238 |@@@@@ |
188-
[2, 4) 806 |@@@@@@@@@@@@@@@@@ |
189-
[4, 8) 1453 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
190-
[8, 16) 1105 |@@@@@@@@@@@@@@@@@@@@@@@ |
191-
[16, 32) 2410 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
192-
[32, 64) 1317 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
193-
[64, 128) 1252 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
194-
[128, 256) 1131 |@@@@@@@@@@@@@@@@@@@@@@@@ |
195-
[256, 512) 2017 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
196-
[512, 1K) 1270 |@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
197-
[1K, 2K) 1028 |@@@@@@@@@@@@@@@@@@@@@@ |
198-
[2K, 4K) 874 |@@@@@@@@@@@@@@@@@@ |
199-
[4K, 8K) 1024 |@@@@@@@@@@@@@@@@@@@@@@ |
200-
[8K, 16K) 58 |@ |
201-
[16K, 32K) 5 | |
202-
203-
@process_edges_root_packet_size:
204-
[1] 71 |@@@@@@@ |
205-
[2, 4) 4 | |
206-
[4, 8) 276 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
207-
[8, 16) 495 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
208-
[16, 32) 477 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
209-
[32, 64) 344 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
210-
[64, 128) 242 |@@@@@@@@@@@@@@@@@@@@@@@@@ |
211-
[128, 256) 109 |@@@@@@@@@@@ |
212-
[256, 512) 31 |@@@ |
213-
[512, 1K) 33 |@@@ |
214-
[1K, 2K) 75 |@@@@@@@ |
215-
[2K, 4K) 75 |@@@@@@@ |
216-
[4K, 8K) 336 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
217-
[8K, 16K) 56 |@@@@@ |
218-
[16K, 32K) 3 | |
219-
```
220-
221-
In the above output, we can see that overall, the sizes of the `process_edges`
222-
has a unimodal distribution with a peak around 16\~32 edges per packet.
223-
However, if we focus on root edges, the distribution is roughly bimodal, with a
224-
first peak around 8\~16 and a second peak around 4096\~8192.
42+
- `performance`: Print various GC-related statistics, such as the distribution of time spent in
43+
allocation slow path, the time spent in each GC stages, and the distribution of `process_edges`
44+
packet sizes.
45+
- `timeline`: Record the start and end time of each GC and each work packet, and visualize them on
46+
a timeline in Perfetto UI.
22547

22648
## Attribution
49+
22750
If used for research, please cite the following publication.
51+
22852
```bibtex
22953
@inproceedings{DBLP:conf/pppj/HuangBC23,
23054
author = {Claire Huang and

0 commit comments

Comments
 (0)