How Linux kprobes work

How do Linux kernel probes (kprobes) work

Everything below comes from reading kprobes.txt at [1].

Summary

There are kprobes and jprobes. A jprobe is a specialized kprobe placed on a function's entry point, which makes it easier to read the function's parameters. This is useful, for example, to filter function call traces by the value of a parameter. Under restricted conditions a kprobe can be optimized: the optimized kprobe uses a jump instead of a CPU trap and runs up to 10 times faster.

Normal kprobe

1.1 How Does a Kprobe Work?

When a kprobe is registered, Kprobes makes a copy of the probed
instruction and replaces the first byte(s) of the probed instruction
with a breakpoint instruction (e.g., int3 on i386 and x86_64).

When a CPU hits the breakpoint instruction, a trap occurs, the CPU's
registers are saved, and control passes to Kprobes via the
notifier_call_chain mechanism.  Kprobes executes the "pre_handler"
associated with the kprobe, passing the handler the addresses of the
kprobe struct and the saved registers.

Next, Kprobes single-steps its copy of the probed instruction.
(It would be simpler to single-step the actual instruction in place,
but then Kprobes would have to temporarily remove the breakpoint
instruction.  This would open a small time window when another CPU
could sail right past the probepoint.)

After the instruction is single-stepped, Kprobes executes the
"post_handler," if any, that is associated with the kprobe.
Execution then continues with the instruction following the probepoint.

Notes:
  • On x86, the trap is the breakpoint exception (#BP, interrupt vector 3), raised by the int3 instruction. See "Interrupt 3—Breakpoint Exception" in [2].
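
The breakpoint sequence described above can be sketched as a small user-space toy model. This is not kernel code: the "machine" is an assumption made up for illustration, where each byte of text is an instruction that adds its value to an accumulator, and all toy_* names are hypothetical. It only shows the order of events: save a copy of the instruction, patch in the breakpoint, then on a hit run pre_handler, execute the copy, and run post_handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte value as x86 int3; used here only as a marker in the toy text. */
#define TOY_BREAKPOINT 0xCC

/* Hypothetical stand-in for struct kprobe (not the kernel definition). */
struct toy_probe {
    size_t addr;          /* index of the probed instruction */
    uint8_t saved_insn;   /* copy of the probed instruction */
    void (*pre_handler)(struct toy_probe *p, const int *acc);
    void (*post_handler)(struct toy_probe *p, const int *acc);
    int pre_hits, post_hits;
    int acc_before, acc_after;
};

static void toy_count_pre(struct toy_probe *p, const int *acc)
{
    p->pre_hits++;
    p->acc_before = *acc;   /* state seen before the probed instruction */
}

static void toy_count_post(struct toy_probe *p, const int *acc)
{
    p->post_hits++;
    p->acc_after = *acc;    /* state seen after the probed instruction */
}

/* "Register" the probe: keep a copy, then overwrite with the breakpoint. */
static void toy_register_probe(uint8_t *text, struct toy_probe *p)
{
    p->saved_insn = text[p->addr];
    text[p->addr] = TOY_BREAKPOINT;
}

/* Run the toy text; every non-breakpoint byte n means "acc += n". */
static int toy_bp_run(const uint8_t *text, size_t len, struct toy_probe *p)
{
    int acc = 0;
    for (size_t pc = 0; pc < len; pc++) {
        if (text[pc] == TOY_BREAKPOINT) {
            assert(p != NULL && p->addr == pc);   /* the "trap" finds its probe */
            if (p->pre_handler)
                p->pre_handler(p, &acc);
            acc += p->saved_insn;                 /* "single-step" the copy */
            if (p->post_handler)
                p->post_handler(p, &acc);
            continue;                             /* resume at the next instruction */
        }
        acc += text[pc];
    }
    return acc;
}
```

Note that the text keeps the breakpoint byte in place the whole time, which mirrors the reason given above for single-stepping a copy rather than the original instruction.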

jprobe

1.2 How Does a Jprobe Work?

A jprobe is implemented using a kprobe that is placed on a function's
entry point.  It employs a simple mirroring principle to allow
seamless access to the probed function's arguments.  The jprobe
handler routine should have the same signature (arg list and return
type) as the function being probed, and must always end by calling
the Kprobes function jprobe_return().

Here's how it works.  When the probe is hit, Kprobes makes a copy of
the saved registers and a generous portion of the stack (see below).
Kprobes then points the saved instruction pointer at the jprobe's
handler routine, and returns from the trap.  As a result, control
passes to the handler, which is presented with the same register and
stack contents as the probed function.  When it is done, the handler
calls jprobe_return(), which traps again to restore the original stack
contents and processor state and switch to the probed function.

By convention, the callee owns its arguments, so gcc may produce code
that unexpectedly modifies that portion of the stack.  This is why
Kprobes saves a copy of the stack and restores it after the jprobe
handler has run.  Up to MAX_STACK_SIZE bytes are copied -- e.g.,
64 bytes on i386.
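
The save-and-restore step can be illustrated with a user-space sketch. It is a toy, not the kernel's implementation: the stack is a plain byte array, TOY_MAX_STACK_SIZE simply reuses the 64-byte figure quoted above for i386, and every toy_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: reuse the 64-byte i386 figure quoted in the text. */
#define TOY_MAX_STACK_SIZE 64

struct toy_jprobe_state {
    unsigned char saved_stack[TOY_MAX_STACK_SIZE];
    size_t saved_len;
};

/* Copy at most TOY_MAX_STACK_SIZE bytes starting at the stack pointer. */
static size_t toy_save_stack(struct toy_jprobe_state *st,
                             const unsigned char *sp, size_t avail)
{
    st->saved_len = avail < TOY_MAX_STACK_SIZE ? avail : TOY_MAX_STACK_SIZE;
    memcpy(st->saved_stack, sp, st->saved_len);
    return st->saved_len;
}

/* Stand-in for a handler whose compiled code overwrites its own arguments. */
static void toy_clobbering_handler(unsigned char *sp, size_t len)
{
    memset(sp, 0xAA, len);
}

/* Put the saved bytes back, as happens after the handler has run. */
static void toy_restore_stack(const struct toy_jprobe_state *st,
                              unsigned char *sp)
{
    assert(st->saved_len <= TOY_MAX_STACK_SIZE);
    memcpy(sp, st->saved_stack, st->saved_len);
}
```

Anything beyond the copied window is not protected in this model, which is the same limitation the text describes for arguments that live outside MAX_STACK_SIZE.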

Note that the probed function's args may be passed on the stack
or in registers.  The jprobe will work in either case, so long as the
handler's prototype matches that of the probed function.

Note that in some architectures (e.g.: arm64 and sparc64) the stack
copy is not done, as the actual location of stacked parameters may be
outside of a reasonable MAX_STACK_SIZE value and because that location
cannot be determined by the jprobes code. In this case the jprobes
user must be careful to make certain the calling signature of the
function does not cause parameters to be passed on the stack (e.g.:
more than eight function arguments, an argument of more than sixteen
bytes, or more than 64 bytes of argument data, depending on
architecture).

Optimized kprobe

1.4 How Does Jump Optimization Work?

If your kernel is built with CONFIG_OPTPROBES=y (currently this flag
is automatically set 'y' on x86/x86-64, non-preemptive kernel) and
the "debug.kprobes_optimization" kernel parameter is set to 1 (see
sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump
instruction instead of a breakpoint instruction at each probepoint.

1.4.1 Init a Kprobe

[...]

1.4.2 Safety Check

[...]

1.4.3 Preparing Detour Buffer

Next, Kprobes prepares a "detour" buffer, which contains the following
instruction sequence:
- code to push the CPU's registers (emulating a breakpoint trap)
- a call to the trampoline code which calls user's probe handlers.
- code to restore registers
- the instructions from the optimized region
- a jump back to the original execution path.
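
As a rough illustration of that sequence, here is a user-space toy model. It assumes a made-up machine in which each byte of text adds its value to an accumulator; the TOY_JUMP marker, the structures, and all toy_* names are hypothetical and do not correspond to the kernel's data structures. The five steps of the detour buffer appear in order inside toy_run_detour().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same byte value as the x86 near-jump opcode; only a marker in this toy. */
#define TOY_JUMP 0xE9

struct toy_regs { int acc; };

struct toy_detour {
    size_t probe_addr;     /* where the jump was patched in */
    uint8_t copied_insn;   /* instruction from the optimized region */
    size_t return_addr;    /* where to resume the original path */
    void (*handler)(const struct toy_regs *saved);
};

static int toy_handler_calls;
static int toy_last_seen_acc;

static void toy_handler(const struct toy_regs *saved)
{
    toy_handler_calls++;
    toy_last_seen_acc = saved->acc;
}

/* Prepare the detour and replace the probed instruction with a "jump". */
static void toy_optimize(uint8_t *text, struct toy_detour *d, size_t addr)
{
    d->probe_addr = addr;
    d->copied_insn = text[addr];
    d->return_addr = addr + 1;
    d->handler = toy_handler;
    text[addr] = TOY_JUMP;
}

static size_t toy_run_detour(const struct toy_detour *d, struct toy_regs *regs)
{
    struct toy_regs saved = *regs;   /* 1. push registers */
    d->handler(&saved);              /* 2. trampoline calls the user handler */
    *regs = saved;                   /* 3. restore registers */
    regs->acc += d->copied_insn;     /* 4. run the copied instruction */
    return d->return_addr;           /* 5. jump back to the original path */
}

/* Run the toy text; every ordinary byte n means "acc += n". No trap is taken. */
static int toy_opt_run(const uint8_t *text, size_t len, const struct toy_detour *d)
{
    struct toy_regs regs = { 0 };
    size_t pc = 0;
    while (pc < len) {
        if (text[pc] == TOY_JUMP) {
            assert(d != NULL && d->probe_addr == pc);
            pc = toy_run_detour(d, &regs);
            continue;
        }
        regs.acc += text[pc];
        pc++;
    }
    return regs.acc;
}
```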

1.4.4 Pre-optimization

After preparing the detour buffer, Kprobes verifies that none of the
following situations exist:
- The probe has either a break_handler (i.e., it's a jprobe) or a
post_handler.
- Other instructions in the optimized region are probed.
- The probe is disabled.
In any of the above cases, Kprobes won't start optimizing the probe.
Since these are temporary situations, Kprobes tries to start
optimizing it again if the situation is changed.
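
The three conditions above amount to a simple predicate. The sketch below is a user-space illustration only; the structure and field names are hypothetical and are not the kernel's.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the facts checked before optimization starts. */
struct toy_kprobe_info {
    bool has_break_handler;       /* i.e., the probe is a jprobe */
    bool has_post_handler;
    bool region_has_other_probe;  /* another probe inside the optimized region */
    bool disabled;
};

/* Returns true only if none of the listed blocking situations exist. */
static bool toy_may_optimize(const struct toy_kprobe_info *kp)
{
    assert(kp != NULL);
    if (kp->has_break_handler || kp->has_post_handler)
        return false;
    if (kp->region_has_other_probe)
        return false;
    if (kp->disabled)
        return false;
    return true;
}
```

Because the conditions are temporary, a caller in this model would simply evaluate the predicate again after the probe's state changes.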

If the kprobe can be optimized, Kprobes enqueues the kprobe to an
optimizing list, and kicks the kprobe-optimizer workqueue to optimize
it.  If the to-be-optimized probepoint is hit before being optimized,
Kprobes returns control to the original instruction path by setting
the CPU's instruction pointer to the copied code in the detour buffer
-- thus at least avoiding the single-step.

Performance: ptrace vs uprobes

From [3]:

References

[1] Linux kernel, kprobes docs, https://www.kernel.org/doc/Documentation/kprobes.txt
[2] Intel 64 and IA-32 Architectures Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf
[3] "Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps", https://landley.net/kdocs/ols/2007/ols2007v1-pages-215-224.pdf