Commit 0df8218

Merge tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux
Pull perf tools updates from Arnaldo Carvalho de Melo:
 "Miscellaneous:

   - Add Ian Rogers to MAINTAINERS as a perf tools reviewer.

   - Add support for retire latency feature (pipeline stall of a
     instruction compared to the previous one, in cycles) present on
     some Intel processors.

   - Add 'perf c2c' report option to show false sharing with adjacent
     cachelines, to be used in machines with cacheline prefetching,
     where accesses to a cacheline brings the next one too.

   - Skip 'perf test bpf' when the required kernel-debuginfo package
     isn't installed.

   - Avoid d3-flame-graph package dependency in 'perf script flamegraph',
     making this feature more generally available.

   - Add JSON metric events to present CPI stall cycles in Power10.

   - Assorted improvements/refactorings on the JSON metrics parsing code.

  perf lock contention:

   - Add -o/--lock-owner option:

       $ sudo ./perf lock contention -abo -- ./perf bench sched pipe
       # Running 'sched/pipe' benchmark:
       # Executed 1000000 pipe operations between two processes

            Total time: 4.766 [sec]

              4.766540 usecs/op
                209795 ops/sec

        contended   total wait     max wait     avg wait          pid   owner

              403    565.32 us     26.81 us      1.40 us           -1   Unknown
                4     27.99 us      8.57 us      7.00 us      1583145   sched-pipe
                1      8.25 us      8.25 us      8.25 us      1583144   sched-pipe
                1      2.03 us      2.03 us      2.03 us         5068   chrome

     The owner is unknown in most cases. Filtering only for the mutex
     locks, it will more likely get the owners.

   - -S/--callstack-filter is to limit display entries having the given
     string in the callstack:

       $ sudo ./perf lock contention -abv -S net sleep 1
       ...
        contended   total wait     max wait     avg wait         type   caller

                5     70.20 us     16.13 us     14.04 us     spinlock   __dev_queue_xmit+0xb6d
                               0xffffffffa5dd1c60  _raw_spin_lock+0x30
                               0xffffffffa5b8f6ed  __dev_queue_xmit+0xb6d
                               0xffffffffa5cd8267  ip6_finish_output2+0x2c7
                               0xffffffffa5cdac14  ip6_finish_output+0x1d4
                               0xffffffffa5cdb477  ip6_xmit+0x457
                               0xffffffffa5d1fd17  inet6_csk_xmit+0xd7
                               0xffffffffa5c5f4aa  __tcp_transmit_skb+0x54a
                               0xffffffffa5c6467d  tcp_keepalive_timer+0x2fd

     Please note that to have the -b option (BPF) working above one has
     to build with BUILD_BPF_SKEL=1.

   - Add more 'perf test' entries to test these new features.

  perf script:

   - Add 'cgroup' field for 'perf script' output:

       $ perf record --all-cgroups -- true
       $ perf script -F comm,pid,cgroup
                 true 337112  /user.slice/user-657345.slice/user@657345.service/...
                 true 337112  /user.slice/user-657345.slice/user@657345.service/...
                 true 337112  /user.slice/user-657345.slice/user@657345.service/...
                 true 337112  /user.slice/user-657345.slice/user@657345.service/...

   - Add support for showing branch speculation information in 'perf
     script' and in the 'perf report' raw dump (-D).

  perf record:

   - Fix 'perf record' segfault with --overwrite and --max-size.

  perf test/bench:

   - Switch basic BPF filtering test to use syscall tracepoint to avoid
     the variable number of probes inserted when using the previous
     probe point (do_epoll_wait) that happens on different CPU
     architectures.

   - Fix DWARF unwind test by adding non-inline to expected function in
     a backtrace.

   - Use 'grep -c' where the longer form 'grep | wc -l' was being used.

   - Add getpid and execve benchmarks to 'perf bench syscall'.

  Intel PT:

   - Add support for synthesizing "cycle" events from Intel PT traces as
     we support "instruction" events when Intel PT CYC packets are
     available. This enables much more accurate profiles than when using
     the regular 'perf record -e cycles' (the default) when the workload
     lasts for very short periods (<10ms).

   - .plt symbol handling improvements, better handling IBT (in the past
     MPX) done in the context of decoding Intel PT processor traces,
     IFUNC symbols on x86_64, static executables, understanding .plt.got
     symbols on x86_64.

   - Add a 'perf test' to test symbol resolution, part of the .plt
     improvements series, this tests things like symbol size in contexts
     where only the symbol start is available (kallsyms), etc.

   - Better handle auxtrace/Intel PT data when using pipe mode (perf
     record sleep 1|perf report).

   - Fix symbol lookup with kcore with multiple segments match stext,
     getting the symbol resolution to just show DSOs as unknown.

  ARM:

   - Timestamp improvements for ARM64 systems with ETMv4 (Embedded Trace
     Macrocell v4).

   - Ensure ARM64 CoreSight timestamps don't go backwards.

   - Document that ARM64 SPE (Statistical Profiling Extension) is used
     with 'perf c2c/mem'.

   - Add raw decoding for ARM64 SPEv1.2 previous branch address.

   - Update neoverse-n2-v2 ARM vendor events (JSON tables): topdown L1,
     TLB, cache, branch, PE utilization and instruction mix metrics.

   - Update decoder code for OpenCSD version 1.4, on ARM64 systems.

   - Fix command line auto-complete of CPU events on aarch64.

  Build:

   - Fix 'perf probe' and 'perf test' when libtraceevent isn't linked,
     as several tests use tracepoints, those should be skipped.

   - More fallout fixes for the removal of tools/lib/traceevent/.

   - Fix build error when linking with libpfm"

* tag 'perf-tools-for-v6.3-1-2023-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux: (114 commits)
  perf tests stat_all_metrics: Change true workload to sleep workload for system wide check
  perf vendor events power10: Add JSON metric events to present CPI stall cycles in powerpc
  perf intel-pt: Synthesize cycle events
  perf c2c: Add report option to show false sharing in adjacent cachelines
  perf record: Fix segfault with --overwrite and --max-size
  perf stat: Avoid merging/aggregating metric counts twice
  perf tools: Fix perf tool build error in util/pfm.c
  perf tools: Fix auto-complete on aarch64
  perf lock contention: Support old rw_semaphore type
  perf lock contention: Add -o/--lock-owner option
  perf lock contention: Fix to save callstack for the default modified
  perf test bpf: Skip test if kernel-debuginfo is not present
  perf probe: Update the exit error codes in function try_to_find_probe_trace_event
  perf script: Fix missing Retire Latency fields option documentation
  perf event x86: Add retire_lat when synthesizing PERF_SAMPLE_WEIGHT_STRUCT
  perf test x86: Support the retire_lat (Retire Latency) sample_type check
  perf test bpf: Check for libtraceevent support
  perf script: Support Retire Latency
  perf report: Support Retire Latency
  perf lock contention: Support filters for different aggregation
  ...
2 parents b72b5fe + f9fa077 commit 0df8218

129 files changed, +3210 -1092 lines changed


MAINTAINERS

Lines changed: 1 addition & 0 deletions
@@ -16323,6 +16323,7 @@ R: Mark Rutland <mark.rutland@arm.com>
 R: Alexander Shishkin <alexander.shishkin@linux.intel.com>
 R: Jiri Olsa <jolsa@kernel.org>
 R: Namhyung Kim <namhyung@kernel.org>
+R: Ian Rogers <irogers@google.com>
 L: linux-perf-users@vger.kernel.org
 L: linux-kernel@vger.kernel.org
 S: Supported
Lines changed: 16 additions & 7 deletions
@@ -1,16 +1,25 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NR_perf_event_open
-# define __NR_perf_event_open 336
+#ifndef __NR_execve
+#define __NR_execve 11
 #endif
-#ifndef __NR_futex
-# define __NR_futex 240
+#ifndef __NR_getppid
+#define __NR_getppid 64
+#endif
+#ifndef __NR_getpgid
+#define __NR_getpgid 132
 #endif
 #ifndef __NR_gettid
-# define __NR_gettid 224
+#define __NR_gettid 224
+#endif
+#ifndef __NR_futex
+#define __NR_futex 240
 #endif
 #ifndef __NR_getcpu
-# define __NR_getcpu 318
+#define __NR_getcpu 318
+#endif
+#ifndef __NR_perf_event_open
+#define __NR_perf_event_open 336
 #endif
 #ifndef __NR_setns
-# define __NR_setns 346
+#define __NR_setns 346
 #endif
Lines changed: 16 additions & 7 deletions
@@ -1,16 +1,25 @@
 /* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __NR_perf_event_open
-# define __NR_perf_event_open 298
+#ifndef __NR_execve
+#define __NR_execve 59
 #endif
-#ifndef __NR_futex
-# define __NR_futex 202
+#ifndef __NR_getppid
+#define __NR_getppid 110
+#endif
+#ifndef __NR_getpgid
+#define __NR_getpgid 121
 #endif
 #ifndef __NR_gettid
-# define __NR_gettid 186
+#define __NR_gettid 186
 #endif
-#ifndef __NR_getcpu
-# define __NR_getcpu 309
+#ifndef __NR_futex
+#define __NR_futex 202
+#endif
+#ifndef __NR_perf_event_open
+#define __NR_perf_event_open 298
 #endif
 #ifndef __NR_setns
 #define __NR_setns 308
 #endif
+#ifndef __NR_getcpu
+#define __NR_getcpu 309
+#endif

tools/build/Makefile.build

Lines changed: 1 addition & 0 deletions
@@ -53,6 +53,7 @@ build-file := $(dir)/Build
 
 quiet_cmd_flex = FLEX $@
 quiet_cmd_bison = BISON $@
+quiet_cmd_test = TEST $@
 
 # Create directory unless it exists
 quiet_cmd_mkdir = MKDIR $(dir $@)

tools/perf/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@ arch/*/include/generated/
 trace/beauty/generated/
 pmu-events/pmu-events.c
 pmu-events/jevents
+pmu-events/metric_test.log
 feature/
 libapi/
 libbpf/

tools/perf/Documentation/itrace.txt

Lines changed: 2 additions & 1 deletion
@@ -1,4 +1,5 @@
 i synthesize instructions events
+y synthesize cycles events
 b synthesize branches events (branch misses for Arm SPE)
 c synthesize branches events (calls only)
 r synthesize branches events (returns only)
@@ -25,7 +26,7 @@
 A approximate IPC
 Z prefer to ignore timestamps (so-called "timeless" decoding)
 
-The default is all events i.e. the same as --itrace=ibxwpe,
+The default is all events i.e. the same as --itrace=iybxwpe,
 except for perf script where it is --itrace=ce
 
 In addition, the period (default 100000, except for perf script where it is 1)

tools/perf/Documentation/perf-bench.txt

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ COMMON OPTIONS
 --------------
 -r::
 --repeat=::
-Specify amount of times to repeat the run (default 10).
+Specify number of times to repeat the run (default 10).
 
 -f::
 --format=::

tools/perf/Documentation/perf-c2c.txt

Lines changed: 13 additions & 3 deletions
@@ -22,7 +22,11 @@ you to track down the cacheline contentions.
 On Intel, the tool is based on load latency and precise store facility events
 provided by Intel CPUs. On PowerPC, the tool uses random instruction sampling
 with thresholding feature. On AMD, the tool uses IBS op pmu (due to hardware
-limitations, perf c2c is not supported on Zen3 cpus).
+limitations, perf c2c is not supported on Zen3 cpus). On Arm64 it uses SPE to
+sample load and store operations, therefore hardware and kernel support is
+required. See linkperf:perf-arm-spe[1] for a setup guide. Due to the
+statistical nature of Arm SPE sampling, not every memory operation will be
+sampled.
 
 These events provide:
 - memory address of the access
@@ -121,11 +125,17 @@ REPORT OPTIONS
 perf c2c record --call-graph lbr.
 Disabled by default. In common cases with call stack overflows,
 it can recreate better call stacks than the default lbr call stack
-output. But this approach is not full proof. There can be cases
+output. But this approach is not foolproof. There can be cases
 where it creates incorrect call stacks from incorrect matches.
 The known limitations include exception handing such as
 setjmp/longjmp will have calls/returns not match.
 
+--double-cl::
+Group the detection of shared cacheline events into double cacheline
+granularity. Some architectures have an Adjacent Cacheline Prefetch
+feature, which causes cacheline sharing to behave like the cacheline
+size is doubled.
+
 C2C RECORD
 ----------
 The perf c2c record command setup options related to HITM cacheline analysis
@@ -333,4 +343,4 @@ Check Joe's blog on c2c tool for detailed use case explanation:
 
 SEE ALSO
 --------
-linkperf:perf-record[1], linkperf:perf-mem[1]
+linkperf:perf-record[1], linkperf:perf-mem[1], linkperf:perf-arm-spe[1]

tools/perf/Documentation/perf-intel-pt.txt

Lines changed: 54 additions & 12 deletions
@@ -101,12 +101,12 @@ data is available you can use the 'perf script' tool with all itrace sampling
 options, which will list all the samples.
 
 perf record -e intel_pt//u ls
-perf script --itrace=ibxwpe
+perf script --itrace=iybxwpe
 
 An interesting field that is not printed by default is 'flags' which can be
 displayed as follows:
 
-perf script --itrace=ibxwpe -F+flags
+perf script --itrace=iybxwpe -F+flags
 
 The flags are "bcrosyiABExghDt" which stand for branch, call, return, conditional,
 system, asynchronous, interrupt, transaction abort, trace begin, trace end,
@@ -147,16 +147,17 @@ displayed as follows:
 There are two ways that instructions-per-cycle (IPC) can be calculated depending
 on the recording.
 
-If the 'cyc' config term (see config terms section below) was used, then IPC is
-calculated using the cycle count from CYC packets, otherwise MTC packets are
-used - refer to the 'mtc' config term. When MTC is used, however, the values
-are less accurate because the timing is less accurate.
+If the 'cyc' config term (see config terms section below) was used, then IPC
+and cycle events are calculated using the cycle count from CYC packets, otherwise
+MTC packets are used - refer to the 'mtc' config term. When MTC is used, however,
+the values are less accurate because the timing is less accurate.
 
 Because Intel PT does not update the cycle count on every branch or instruction,
 the values will often be zero. When there are values, they will be the number
 of instructions and number of cycles since the last update, and thus represent
-the average IPC since the last IPC for that event type. Note IPC for "branches"
-events is calculated separately from IPC for "instructions" events.
+the average IPC cycle count since the last IPC for that event type.
+Note IPC for "branches" events is calculated separately from IPC for "instructions"
+events.
 
 Even with the 'cyc' config term, it is possible to produce IPC information for
 every change of timestamp, but at the expense of accuracy. That is selected by
@@ -900,11 +901,12 @@ Having no option is the same as
 
 which, in turn, is the same as
 
---itrace=cepwx
+--itrace=cepwxy
 
 The letters are:
 
 i synthesize "instructions" events
+y synthesize "cycles" events
 b synthesize "branches" events
 x synthesize "transactions" events
 w synthesize "ptwrite" events
@@ -927,16 +929,26 @@ The letters are:
 "Instructions" events look like they were recorded by "perf record -e
 instructions".
 
+"Cycles" events look like they were recorded by "perf record -e cycles"
+(ie., the default). Note that even with CYC packets enabled and no sampling,
+these are not fully accurate, since CYC packets are not emitted for each
+instruction, only when some other event (like an indirect branch, or a
+TNT packet representing multiple branches) happens causes a packet to
+be emitted. Thus, it is more effective for attributing cycles to functions
+(and possibly basic blocks) than to individual instructions, although it
+is not even perfect for functions (although it becomes better if the noretcomp
+option is active).
+
 "Branches" events look like they were recorded by "perf record -e branches". "c"
 and "r" can be combined to get calls and returns.
 
 "Transactions" events correspond to the start or end of transactions. The
 'flags' field can be used in perf script to determine whether the event is a
 transaction start, commit or abort.
 
-Note that "instructions", "branches" and "transactions" events depend on code
-flow packets which can be disabled by using the config term "branch=0". Refer
-to the config terms section above.
+Note that "instructions", "cycles", "branches" and "transactions" events
+depend on code flow packets which can be disabled by using the config term
+"branch=0". Refer to the config terms section above.
 
 "ptwrite" events record the payload of the ptwrite instruction and whether
 "fup_on_ptw" was used. "ptwrite" events depend on PTWRITE packets which are
@@ -1821,6 +1833,36 @@ Can be compiled and traced:
 $
 
 
+Pipe mode
+---------
+Pipe mode is a problem for Intel PT and possibly other auxtrace users.
+It's not recommended to use a pipe as data output with Intel PT because
+of the following reason.
+
+Essentially the auxtrace buffers do not behave like the regular perf
+event buffers. That is because the head and tail are updated by
+software, but in the auxtrace case the data is written by hardware.
+So the head and tail do not get updated as data is written.
+
+In the Intel PT case, the head and tail are updated only when the trace
+is disabled by software, for example:
+- full-trace, system wide : when buffer passes watermark
+- full-trace, not system-wide : when buffer passes watermark or
+context switches
+- snapshot mode : as above but also when a snapshot is made
+- sample mode : as above but also when a sample is made
+
+That means finished-round ordering doesn't work. An auxtrace buffer
+can turn up that has data that extends back in time, possibly to the
+very beginning of tracing.
+
+For a perf.data file, that problem is solved by going through the trace
+and queuing up the auxtrace buffers in advance.
+
+For pipe mode, the order of events and timestamps can presumably
+be messed up.
+
+
 EXAMPLE
 -------
 

tools/perf/Documentation/perf-list.txt

Lines changed: 1 addition & 1 deletion
@@ -232,7 +232,7 @@ This can be overridden by setting the kernel.perf_event_paranoid
 sysctl to -1, which allows non root to use these events.
 
 For accessing trace point events perf needs to have read access to
-/sys/kernel/debug/tracing, even when perf_event_paranoid is in a relaxed
+/sys/kernel/tracing, even when perf_event_paranoid is in a relaxed
 setting.
 
 TRACING
