|
| 1 | +.. SPDX-License-Identifier: GPL-2.0 |
| 2 | +
|
| 3 | +========================== |
| 4 | +Fprobe-based Event Tracing |
| 5 | +========================== |
| 6 | + |
| 7 | +.. Author: Masami Hiramatsu <mhiramat@kernel.org> |
| 8 | +
|
| 9 | +Overview |
| 10 | +-------- |
| 11 | + |
| 12 | +Fprobe event is similar to the kprobe event, but limited to probe on |
| 13 | +the function entry and exit only. It is good enough for many use cases |
| 14 | +which only traces some specific functions. |
| 15 | + |
| 16 | +This document also covers tracepoint probe events (tprobe) since this |
| 17 | +is also works only on the tracepoint entry. User can trace a part of |
| 18 | +tracepoint argument, or the tracepoint without trace-event, which is |
| 19 | +not exposed on tracefs. |
| 20 | + |
| 21 | +As same as other dynamic events, fprobe events and tracepoint probe |
| 22 | +events are defined via `dynamic_events` interface file on tracefs. |
| 23 | + |
| 24 | +Synopsis of fprobe-events |
| 25 | +------------------------- |
| 26 | +:: |
| 27 | + |
| 28 | + f[:[GRP1/][EVENT1]] SYM [FETCHARGS] : Probe on function entry |
| 29 | + f[MAXACTIVE][:[GRP1/][EVENT1]] SYM%return [FETCHARGS] : Probe on function exit |
| 30 | + t[:[GRP2/][EVENT2]] TRACEPOINT [FETCHARGS] : Probe on tracepoint |
| 31 | + |
| 32 | + GRP1 : Group name for fprobe. If omitted, use "fprobes" for it. |
| 33 | + GRP2 : Group name for tprobe. If omitted, use "tracepoints" for it. |
| 34 | + EVENT1 : Event name for fprobe. If omitted, the event name is |
| 35 | + "SYM__entry" or "SYM__exit". |
| 36 | + EVENT2 : Event name for tprobe. If omitted, the event name is |
| 37 | + the same as "TRACEPOINT", but if the "TRACEPOINT" starts |
| 38 | + with a digit character, "_TRACEPOINT" is used. |
| 39 | + MAXACTIVE : Maximum number of instances of the specified function that |
| 40 | + can be probed simultaneously, or 0 for the default value |
| 41 | + as defined in Documentation/trace/fprobe.rst |
| 42 | + |
| 43 | + FETCHARGS : Arguments. Each probe can have up to 128 args. |
| 44 | + ARG : Fetch "ARG" function argument using BTF (only for function |
| 45 | + entry or tracepoint.) (\*1) |
| 46 | + @ADDR : Fetch memory at ADDR (ADDR should be in kernel) |
| 47 | + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) |
| 48 | + $stackN : Fetch Nth entry of stack (N >= 0) |
| 49 | + $stack : Fetch stack address. |
| 50 | + $argN : Fetch the Nth function argument. (N >= 1) (\*2) |
| 51 | + $retval : Fetch return value.(\*3) |
| 52 | + $comm : Fetch current task comm. |
| 53 | + +|-[u]OFFS(FETCHARG) : Fetch memory at FETCHARG +|- OFFS address.(\*4)(\*5) |
| 54 | + \IMM : Store an immediate value to the argument. |
| 55 | + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. |
| 56 | + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types |
| 57 | + (u8/u16/u32/u64/s8/s16/s32/s64), hexadecimal types |
| 58 | + (x8/x16/x32/x64), "char", "string", "ustring", "symbol", "symstr" |
| 59 | + and bitfield are supported. |
| 60 | + |
| 61 | + (\*1) This is available only when BTF is enabled. |
| 62 | + (\*2) only for the probe on function entry (offs == 0). |
| 63 | + (\*3) only for return probe. |
| 64 | + (\*4) this is useful for fetching a field of data structures. |
| 65 | + (\*5) "u" means user-space dereference. |
| 66 | + |
| 67 | +For the details of TYPE, see :ref:`kprobetrace documentation <kprobetrace_types>`. |
| 68 | + |
| 69 | +BTF arguments |
| 70 | +------------- |
| 71 | +BTF (BPF Type Format) argument allows user to trace function and tracepoint |
| 72 | +parameters by its name instead of ``$argN``. This feature is available if the |
| 73 | +kernel is configured with CONFIG_BPF_SYSCALL and CONFIG_DEBUG_INFO_BTF. |
| 74 | +If user only specify the BTF argument, the event's argument name is also |
| 75 | +automatically set by the given name. :: |
| 76 | + |
| 77 | + # echo 'f:myprobe vfs_read count pos' >> dynamic_events |
| 78 | + # cat dynamic_events |
| 79 | + f:fprobes/myprobe vfs_read count=count pos=pos |
| 80 | + |
| 81 | +It also chooses the fetch type from BTF information. For example, in the above |
| 82 | +example, the ``count`` is unsigned long, and the ``pos`` is a pointer. Thus, both |
| 83 | +are converted to 64bit unsigned long, but only ``pos`` has "%Lx" print-format as |
| 84 | +below :: |
| 85 | + |
| 86 | + # cat events/fprobes/myprobe/format |
| 87 | + name: myprobe |
| 88 | + ID: 1313 |
| 89 | + format: |
| 90 | + field:unsigned short common_type; offset:0; size:2; signed:0; |
| 91 | + field:unsigned char common_flags; offset:2; size:1; signed:0; |
| 92 | + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; |
| 93 | + field:int common_pid; offset:4; size:4; signed:1; |
| 94 | + |
| 95 | + field:unsigned long __probe_ip; offset:8; size:8; signed:0; |
| 96 | + field:u64 count; offset:16; size:8; signed:0; |
| 97 | + field:u64 pos; offset:24; size:8; signed:0; |
| 98 | + |
| 99 | + print fmt: "(%lx) count=%Lu pos=0x%Lx", REC->__probe_ip, REC->count, REC->pos |
| 100 | + |
| 101 | +If user unsures the name of arguments, ``$arg*`` will be helpful. The ``$arg*`` |
| 102 | +is expanded to all function arguments of the function or the tracepoint. :: |
| 103 | + |
| 104 | + # echo 'f:myprobe vfs_read $arg*' >> dynamic_events |
| 105 | + # cat dynamic_events |
| 106 | + f:fprobes/myprobe vfs_read file=file buf=buf count=count pos=pos |
| 107 | + |
| 108 | +BTF also affects the ``$retval``. If user doesn't set any type, the retval type is |
| 109 | +automatically picked from the BTF. If the function returns ``void``, ``$retval`` |
| 110 | +is rejected. |
| 111 | + |
| 112 | +Usage examples |
| 113 | +-------------- |
| 114 | +Here is an example to add fprobe events on ``vfs_read()`` function entry |
| 115 | +and exit, with BTF arguments. |
| 116 | +:: |
| 117 | + |
| 118 | + # echo 'f vfs_read $arg*' >> dynamic_events |
| 119 | + # echo 'f vfs_read%return $retval' >> dynamic_events |
| 120 | + # cat dynamic_events |
| 121 | + f:fprobes/vfs_read__entry vfs_read file=file buf=buf count=count pos=pos |
| 122 | + f:fprobes/vfs_read__exit vfs_read%return arg1=$retval |
| 123 | + # echo 1 > events/fprobes/enable |
| 124 | + # head -n 20 trace | tail |
| 125 | + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION |
| 126 | + # | | | ||||| | | |
| 127 | + sh-70 [000] ...1. 335.883195: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08 |
| 128 | + sh-70 [000] ..... 335.883208: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 |
| 129 | + sh-70 [000] ...1. 335.883220: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08 |
| 130 | + sh-70 [000] ..... 335.883224: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 |
| 131 | + sh-70 [000] ...1. 335.883232: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c687a count=1 pos=0xffffc900005aff08 |
| 132 | + sh-70 [000] ..... 335.883237: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 |
| 133 | + sh-70 [000] ...1. 336.050329: vfs_read__entry: (vfs_read+0x4/0x340) file=0xffff888005cf9a80 buf=0x7ffef36c6879 count=1 pos=0xffffc900005aff08 |
| 134 | + sh-70 [000] ..... 336.050343: vfs_read__exit: (ksys_read+0x75/0x100 <- vfs_read) arg1=1 |
| 135 | + |
| 136 | +You can see all function arguments and return values are recorded as signed int. |
| 137 | + |
| 138 | +Also, here is an example of tracepoint events on ``sched_switch`` tracepoint. |
| 139 | +To compare the result, this also enables the ``sched_switch`` traceevent too. |
| 140 | +:: |
| 141 | + |
| 142 | + # echo 't sched_switch $arg*' >> dynamic_events |
| 143 | + # echo 1 > events/sched/sched_switch/enable |
| 144 | + # echo 1 > events/tracepoints/sched_switch/enable |
| 145 | + # echo > trace |
| 146 | + # head -n 20 trace | tail |
| 147 | + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION |
| 148 | + # | | | ||||| | | |
| 149 | + sh-70 [000] d..2. 3912.083993: sched_switch: prev_comm=sh prev_pid=70 prev_prio=120 prev_state=S ==> next_comm=swapper/0 next_pid=0 next_prio=120 |
| 150 | + sh-70 [000] d..3. 3912.083995: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff88800664e100 next=0xffffffff828229c0 prev_state=1 |
| 151 | + <idle>-0 [000] d..2. 3912.084183: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120 |
| 152 | + <idle>-0 [000] d..3. 3912.084184: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0 |
| 153 | + rcu_preempt-16 [000] d..2. 3912.084196: sched_switch: prev_comm=rcu_preempt prev_pid=16 prev_prio=120 prev_state=I ==> next_comm=swapper/0 next_pid=0 next_prio=120 |
| 154 | + rcu_preempt-16 [000] d..3. 3912.084196: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffff888004208000 next=0xffffffff828229c0 prev_state=1026 |
| 155 | + <idle>-0 [000] d..2. 3912.085191: sched_switch: prev_comm=swapper/0 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=16 next_prio=120 |
| 156 | + <idle>-0 [000] d..3. 3912.085191: sched_switch: (__probestub_sched_switch+0x4/0x10) preempt=0 prev=0xffffffff828229c0 next=0xffff888004208000 prev_state=0 |
| 157 | + |
| 158 | +As you can see, the ``sched_switch`` trace-event shows *cooked* parameters, on |
| 159 | +the other hand, the ``sched_switch`` tracepoint probe event shows *raw* |
| 160 | +parameters. This means you can access any field values in the task |
| 161 | +structure pointed by the ``prev`` and ``next`` arguments. |
| 162 | + |
| 163 | +For example, usually ``task_struct::start_time`` is not traced, but with this |
| 164 | +traceprobe event, you can trace it as below. |
| 165 | +:: |
| 166 | + |
| 167 | + # echo 't sched_switch comm=+1896(next):string start_time=+1728(next):u64' > dynamic_events |
| 168 | + # head -n 20 trace | tail |
| 169 | + # TASK-PID CPU# ||||| TIMESTAMP FUNCTION |
| 170 | + # | | | ||||| | | |
| 171 | + sh-70 [000] d..3. 5606.686577: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000 |
| 172 | + rcu_preempt-16 [000] d..3. 5606.686602: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="sh" usage=1 start_time=1596095526 |
| 173 | + sh-70 [000] d..3. 5606.686637: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0 |
| 174 | + <idle>-0 [000] d..3. 5606.687190: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="rcu_preempt" usage=1 start_time=245000000 |
| 175 | + rcu_preempt-16 [000] d..3. 5606.687202: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0 |
| 176 | + <idle>-0 [000] d..3. 5606.690317: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000 |
| 177 | + kworker/0:1-14 [000] d..3. 5606.690339: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="swapper/0" usage=2 start_time=0 |
| 178 | + <idle>-0 [000] d..3. 5606.692368: sched_switch: (__probestub_sched_switch+0x4/0x10) comm="kworker/0:1" usage=1 start_time=137000000 |
| 179 | + |
| 180 | +Currently, to find the offset of a specific field in the data structure, |
| 181 | +you need to build kernel with debuginfo and run `perf probe` command with |
| 182 | +`-D` option. e.g. |
| 183 | +:: |
| 184 | + |
| 185 | + # perf probe -D "__probestub_sched_switch next->comm:string next->start_time" |
| 186 | + p:probe/__probestub_sched_switch __probestub_sched_switch+0 comm=+1896(%cx):string start_time=+1728(%cx):u64 |
| 187 | + |
| 188 | +And replace the ``%cx`` with the ``next``. |
0 commit comments