+ //! Profiling counters and their implementation.
+ //!
+ //! # Available counters
+ //!
+ //! Name (for [`Counter::by_name()`]) | Counter | OSes | CPUs
+ //! --------------------------------- | ------- | ---- | ----
+ //! `wall-time` | [`WallTime`] | any | any
+ //! `instructions:u` | [`Instructions`] | Linux | `x86_64`
+ //! `instructions-minus-irqs:u` | [`InstructionsMinusIrqs`] | Linux | `x86_64`<br>- AMD (since K8)<br>- Intel (since Sandy Bridge)
+ //! `instructions-minus-r0420:u` | [`InstructionsMinusRaw0420`] | Linux | `x86_64`<br>- AMD (Zen)
+ //!
+ //! *Note: `:u` suffixes for hardware performance counters come from the Linux `perf`
+ //! tool, and indicate that the counter is only active while userspace code executes
+ //! (i.e. it's paused while the kernel handles syscalls, interrupts, etc.).*
+ //!
+ //! # Limitations and caveats
+ //!
+ //! *Note: for more information, also see the GitHub PR which first implemented hardware
+ //! performance counter support ([#143](https://github.com/rust-lang/measureme/pull/143)).*
+ //!
+ //! The hardware performance counters (i.e. all counters other than `wall-time`) are limited to:
+ //! * nightly Rust (gated on `features = ["nightly"]`), for `asm!`
+ //! * Linux, for out-of-the-box performance counter reads from userspace
+ //!   * other OSes could work through custom kernel extensions/drivers, in the future
+ //! * `x86_64` CPUs, mostly due to lack of other available test hardware
+ //!   * new architectures would be easier to support (on Linux) than new OSes
+ //!   * easiest to add would be 32-bit `x86` (aka `i686`), which would reuse
+ //!     most of the `x86_64` CPU model detection logic
+ //! * specific (newer) CPU models, for certain non-standard counters
+ //!   * e.g. `instructions-minus-irqs:u` requires a "hardware interrupts" (aka "IRQs")
+ //!     counter, which is implemented differently between vendors / models (if at all)
+ //! * single-threaded programs (counters only work on the thread they were created on)
+ //!   * for profiling `rustc`, this means only "check mode" (`--emit=metadata`)
+ //!     is supported currently (`-Z no-llvm-threads` could also work)
+ //!   * unclear what the best approach for handling multiple threads would be
+ //!     * changing the API (e.g. to require per-thread profiler handles) could result
+ //!       in a more efficient implementation, but would also be less ergonomic
+ //!     * profiling data from multithreaded programs would be harder to use due to
+ //!       noise from synchronization mechanisms, non-deterministic work-stealing, etc.
+ //!
+ //! For ergonomic reasons, the public API doesn't vary based on `features` or target.
+ //! Instead, attempting to create any unsupported counter will return `Err`, just
+ //! like it does for any issue detected at runtime (e.g. incompatible CPU model).
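
Because unsupported counters surface as `Err` rather than as a missing API, callers can pick a preferred counter and fall back to `wall-time` at runtime. The following is a minimal sketch of that pattern only: the `Counter` enum, its `by_name` body, and the `counter_or_wall_time` helper are hypothetical stand-ins written for illustration, not the crate's actual types.

```rust
use std::error::Error;

/// Hypothetical stand-in for a profiling counter (not the real `Counter` type).
#[derive(Debug, PartialEq)]
enum Counter {
    WallTime,
    Instructions,
}

impl Counter {
    /// Assumed shape of the lookup: it exists on every target, and anything
    /// unsupported is reported as `Err` at runtime instead of failing to compile.
    fn by_name(name: &str) -> Result<Self, Box<dyn Error + Send + Sync>> {
        match name {
            "wall-time" => Ok(Counter::WallTime),
            "instructions:u" => {
                if cfg!(all(target_os = "linux", target_arch = "x86_64")) {
                    Ok(Counter::Instructions)
                } else {
                    Err("`instructions:u` is only supported on x86_64 Linux".into())
                }
            }
            _ => Err(format!("unknown counter `{}`", name).into()),
        }
    }
}

/// Tries the preferred counter first and falls back to `wall-time` on any error.
fn counter_or_wall_time(preferred: &str) -> Counter {
    Counter::by_name(preferred).unwrap_or_else(|err| {
        eprintln!("falling back to `wall-time`: {}", err);
        Counter::WallTime
    })
}

fn main() {
    let counter = counter_or_wall_time("instructions:u");
    println!("selected counter: {:?}", counter);
}
```

The same calling code compiles everywhere; only the value it ends up with differs between targets.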
+ //!
+ //! When counting instructions specifically, these factors will impact the profiling quality:
+ //! * high-level non-determinism (e.g. user interactions, networking)
+ //!   * the ideal use-case is a mostly-deterministic program, e.g. a compiler like `rustc`
+ //!   * if I/O can be isolated to separate profiling events, and doesn't impact
+ //!     execution in a more subtle way (see below), the deterministic parts of
+ //!     the program can still be profiled with high accuracy
+ //! * low-level non-determinism (e.g. ASLR, randomized `HashMap`s, thread scheduling)
+ //!   * ASLR ("Address Space Layout Randomization") may be provided by the OS for
+ //!     security reasons, or accidentally caused through allocations that depend on
+ //!     random data (even as low-entropy as e.g. the base 10 length of a process ID)
+ //!     * on Linux, ASLR can be disabled by running the process under `setarch -R`
+ //!     * this impacts `rustc` and LLVM, which rely on keying `HashMap`s by addresses
+ //!       (typically of interned data) as an optimization, and while non-deterministic
+ //!       outputs are considered bugs, the instructions executed can still vary a lot,
+ //!       even when the externally observable behavior is perfectly repeatable
+ //!   * `HashMap`s are involved in more than one way:
+ //!     * both the executed instructions, and the shape of the allocations depend
+ //!       on both the hasher state and choice of keys (as the buckets are in
+ //!       a flat array indexed by some of the lower bits of the key hashes)
+ //!     * so every `HashMap` with keys being/containing addresses will amplify
+ //!       ASLR and ASLR-like effects, making the entire program more sensitive
+ //!     * the default hasher is randomized, and while `rustc` doesn't use it,
+ //!       proc macros can (and will), and it's harder to disable than Linux ASLR
+ //!   * `jemalloc` (the allocator used by `rustc`, at least in official releases)
+ //!     has a 10 second "purge timer", which can introduce an ASLR-like effect,
+ //!     unless disabled with `MALLOC_CONF=dirty_decay_ms:0,muzzy_decay_ms:0`
+ //! * hardware flaws (whether in the design or implementation)
+ //!   * hardware interrupts ("IRQs") and exceptions (like page faults) cause
+ //!     overcounting (1 instruction per interrupt, possibly the `iret` from the
+ //!     kernel handler back to the interrupted userspace program)
+ //!     * this is the reason why `instructions-minus-irqs:u` should be preferred
+ //!       to `instructions:u`, where the former is available
+ //!     * there are system-wide options (e.g. `CONFIG_NO_HZ_FULL`) for removing
+ //!       some interrupts from the cores used for profiling, but they're not as
+ //!       complete a solution, nor easy to set up in the first place
+ //!   * AMD Zen CPUs have a speculative execution feature (dubbed `SpecLockMap`),
+ //!     which can cause non-deterministic overcounting for instructions following
+ //!     an atomic instruction (such as found in heap allocators, or `measureme`)
+ //!     * this is automatically detected, with a `log` message pointing the user
+ //!       to <https://github.com/mozilla/rr/wiki/Zen> for guidance on how to
+ //!       disable `SpecLockMap` on their system (sadly requires root access)
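
Two of the mitigations listed above (`setarch -R` for ASLR and the `MALLOC_CONF` setting for jemalloc's purge timer) can be combined when launching the profiled process. A minimal sketch, assuming a hypothetical helper named `low_noise_command` and a placeholder `main.rs` input; it only builds the command and does not run it.

```rust
use std::process::Command;

/// Builds (but does not spawn) a command that runs `program` under `setarch -R`
/// with jemalloc's decay-based purging disabled through `MALLOC_CONF`.
fn low_noise_command(program: &str, args: &[&str]) -> Command {
    let mut cmd = Command::new("setarch");
    // Assumption: this `setarch` accepts `-R` without an architecture argument;
    // some versions expect one (e.g. the output of `uname -m`) before `-R`.
    cmd.arg("-R")
        .arg(program)
        .args(args)
        .env("MALLOC_CONF", "dirty_decay_ms:0,muzzy_decay_ms:0");
    cmd
}

fn main() {
    // `main.rs` is a placeholder input file, used here only for illustration.
    let cmd = low_noise_command("rustc", &["--emit=metadata", "main.rs"]);
    println!("{:?}", cmd);
}
```

Calling `.status()` or `.spawn()` on the returned `Command` would start the process; the sketch stops short of that so it has no side effects.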
+ //!
+ //! Even if some of the above caveats apply to a given profiling setup, as long as
+ //! the counters function, they can still be used, and compared with `wall-time`.
+ //! Chances are, they will still have less variance, as everything that impacts
+ //! instruction counts will also impact any time measurements.
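
One way to make the variance comparison above concrete is to compute a unit-free spread for each counter across repeated runs. A minimal sketch, assuming a hypothetical `coefficient_of_variation` helper (population standard deviation divided by the mean); the sample values are made up for illustration and are not measurements.

```rust
/// Relative spread of a set of samples: population standard deviation / mean.
/// Being unit-free, it allows comparing nanoseconds against instruction counts.
fn coefficient_of_variation(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    variance.sqrt() / mean
}

fn main() {
    // Made-up numbers for illustration only, not real profiling results.
    let wall_time_ns = [1_020_000.0, 980_000.0, 1_100_000.0, 950_000.0];
    let instructions = [5_000_120.0, 5_000_090.0, 5_000_150.0, 5_000_100.0];
    println!("wall-time CV:    {:.6}", coefficient_of_variation(&wall_time_ns));
    println!("instructions CV: {:.6}", coefficient_of_variation(&instructions));
}
```

A lower value for the instruction counter than for `wall-time` on the same runs would indicate that the counter is still the less noisy signal, even with some of the caveats in play.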
+ //!
+ //! Also keep in mind that instruction counts do not properly reflect all kinds
+ //! of workloads, e.g. SIMD throughput and cache locality are unaccounted for.
+
  use std::error::Error;
  use std::time::Instant;

@@ -60,6 +154,9 @@ impl Counter {
      }
  }

+ /// "Monotonic clock" with nanosecond precision (using [`std::time::Instant`]).
+ ///
+ /// Can be obtained with `Counter::by_name("wall-time")`.
  pub struct WallTime {
      start: Instant,
  }
@@ -79,6 +176,9 @@ impl WallTime {
      }
  }

+ /// "Instructions retired" hardware performance counter (userspace-only).
+ ///
+ /// Can be obtained with `Counter::by_name("instructions:u")`.
  pub struct Instructions {
      instructions: hw::Counter,
      start: u64,
@@ -103,6 +203,9 @@ impl Instructions {
      }
  }

+ /// More accurate [`Instructions`] (subtracting hardware interrupt counts).
+ ///
+ /// Can be obtained with `Counter::by_name("instructions-minus-irqs:u")`.
  pub struct InstructionsMinusIrqs {
      instructions: hw::Counter,
      irqs: hw::Counter,
@@ -132,6 +235,10 @@ impl InstructionsMinusIrqs {
      }
  }

+ /// (Experimental) Like [`InstructionsMinusIrqs`] (but using an undocumented `r0420:u` counter).
+ ///
+ /// Can be obtained with `Counter::by_name("instructions-minus-r0420:u")`.
+ //
  // HACK(eddyb) this is a variant of `instructions-minus-irqs:u`, where `r0420`
  // is subtracted, instead of the usual "hardware interrupts" (aka IRQs).
  // `r0420` is an undocumented counter on AMD Zen CPUs which appears to count