How do Linux kernel probes (kprobes) work
Everything below comes from reading kprobes.txt at [1].
Summary
There are kprobes and jprobes. A jprobe is a specialized kprobe placed on a function's entry point that makes it easier to read the function's parameters. This is useful for filtering function call traces by the value of a parameter. Under restricted conditions a kprobe can be optimized: the optimized kprobe uses a jump instead of a breakpoint, so it avoids the CPU trap and runs up to 10 times faster.
Normal kprobe
1.1 How Does a Kprobe Work?

When a kprobe is registered, Kprobes makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).

When a CPU hits the breakpoint instruction, a trap occurs, the CPU's registers are saved, and control passes to Kprobes via the notifier_call_chain mechanism. Kprobes executes the "pre_handler" associated with the kprobe, passing the handler the addresses of the kprobe struct and the saved registers.

Next, Kprobes single-steps its copy of the probed instruction. (It would be simpler to single-step the actual instruction in place, but then Kprobes would have to temporarily remove the breakpoint instruction. This would open a small time window when another CPU could sail right past the probepoint.)

After the instruction is single-stepped, Kprobes executes the "post_handler," if any, that is associated with the kprobe. Execution then continues with the instruction following the probepoint.

Notes:
- trap is interrupt 3 (aka int3). See "Interrupt 3—Breakpoint Exception" in [2].
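
The breakpoint flow from section 1.1 can be modeled in user space. What follows is only a sketch under loud assumptions: every `toy_*` name is hypothetical, the "instruction set" is invented (each byte is simply added to an accumulator), and none of this is the kernel's API or implementation. It shows only the order of events: save a copy of the instruction, patch in the breakpoint byte, then on a hit run the pre-handler, single-step the copy, and run the post-handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BREAKPOINT 0xCC /* opcode byte of int3 on x86 */

/* Toy register set: an instruction pointer and one accumulator. */
struct toy_regs { size_t ip; int acc; };

struct toy_kprobe;
typedef void (*toy_handler)(struct toy_kprobe *, struct toy_regs *);

/* Hypothetical stand-in for a probe descriptor (not struct kprobe). */
struct toy_kprobe {
    size_t addr;          /* index of the probed instruction */
    uint8_t saved_insn;   /* copy of the original instruction byte */
    toy_handler pre_handler;
    toy_handler post_handler;
    int pre_hits, post_hits;
};

/* Invented ISA: executing a byte adds it to acc and advances ip. */
void toy_exec(uint8_t insn, struct toy_regs *regs) {
    regs->acc += insn;
    regs->ip++;
}

/* Registration: keep a copy, then overwrite the text with a breakpoint. */
void toy_register_kprobe(uint8_t *text, struct toy_kprobe *p) {
    p->saved_insn = text[p->addr];
    text[p->addr] = TOY_BREAKPOINT;
}

/* Unregistration: put the original byte back. */
void toy_unregister_kprobe(uint8_t *text, struct toy_kprobe *p) {
    text[p->addr] = p->saved_insn;
}

/* Run the text; a breakpoint at the probe address plays the role of the trap. */
void toy_run(const uint8_t *text, size_t len, struct toy_regs *regs,
             struct toy_kprobe *p) {
    while (regs->ip < len) {
        uint8_t insn = text[regs->ip];
        if (p && insn == TOY_BREAKPOINT && regs->ip == p->addr) {
            if (p->pre_handler)
                p->pre_handler(p, regs);
            /* Single-step the saved copy; the breakpoint stays in place. */
            toy_exec(p->saved_insn, regs);
            if (p->post_handler)
                p->post_handler(p, regs);
        } else {
            toy_exec(insn, regs);
        }
    }
}

/* Example handlers that only count how often they ran. */
void toy_count_pre(struct toy_kprobe *p, struct toy_regs *regs) {
    (void)regs;
    p->pre_hits++;
}

void toy_count_post(struct toy_kprobe *p, struct toy_regs *regs) {
    (void)regs;
    p->post_hits++;
}
```

With the text `{1, 2, 3, 4}` and a probe at index 2, the run still accumulates 10, because the saved copy of the instruction is executed in place of the breakpoint byte, and each handler fires once.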
jprobe
1.2 How Does a Jprobe Work?

A jprobe is implemented using a kprobe that is placed on a function's entry point. It employs a simple mirroring principle to allow seamless access to the probed function's arguments. The jprobe handler routine should have the same signature (arg list and return type) as the function being probed, and must always end by calling the Kprobes function jprobe_return().

Here's how it works. When the probe is hit, Kprobes makes a copy of the saved registers and a generous portion of the stack (see below). Kprobes then points the saved instruction pointer at the jprobe's handler routine, and returns from the trap. As a result, control passes to the handler, which is presented with the same register and stack contents as the probed function. When it is done, the handler calls jprobe_return(), which traps again to restore the original stack contents and processor state and switch to the probed function.

By convention, the callee owns its arguments, so gcc may produce code that unexpectedly modifies that portion of the stack. This is why Kprobes saves a copy of the stack and restores it after the jprobe handler has run. Up to MAX_STACK_SIZE bytes are copied -- e.g., 64 bytes on i386.

Note that the probed function's args may be passed on the stack or in registers. The jprobe will work in either case, so long as the handler's prototype matches that of the probed function.

Note that in some architectures (e.g.: arm64 and sparc64) the stack copy is not done, as the actual location of stacked parameters may be outside of a reasonable MAX_STACK_SIZE value and because that location cannot be determined by the jprobes code. In this case the jprobes user must be careful to make certain the calling signature of the function does not cause parameters to be passed on the stack (e.g.: more than eight function arguments, an argument of more than sixteen bytes, or more than 64 bytes of argument data, depending on architecture).
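
The mirroring and the stack save/restore from section 1.2 can be sketched as a user-space model. This is an illustration under assumptions, not kernel code: all `toy_*` names are hypothetical, arguments live in an explicit array standing in for the stack, and the restore step merely plays the role of jprobe_return(). The handler has the same signature as the probed function, filters on the value of one parameter, and then deliberately clobbers an argument to show why the saved copy is needed.

```c
#include <assert.h>
#include <string.h>

#define TOY_STACK_SLOTS 8 /* size of the saved stack portion, chosen arbitrarily */

/* The probed function and its handler share this signature. */
typedef long (*toy_fn)(long *stack);

/* Hypothetical stand-in for a jprobe descriptor. */
struct toy_jprobe {
    toy_fn entry;  /* handler, mirrors the probed function's signature */
    toy_fn target; /* the probed function */
};

long toy_traced_calls; /* calls that passed the filter */
long toy_traced_bytes; /* sum of their "count" argument */

/* Probed function: stack[0] is a descriptor, stack[1] a byte count. */
long toy_do_write(long *stack) {
    return stack[0] + stack[1];
}

/* Handler: trace only calls whose first argument equals 2. */
long toy_write_handler(long *stack) {
    if (stack[0] == 2) {
        toy_traced_calls++;
        toy_traced_bytes += stack[1];
    }
    stack[0] = -1; /* the callee owns its arguments and may overwrite them */
    return 0;
}

/* Probe hit: save the stack, run the handler, restore, run the target. */
long toy_jprobe_hit(struct toy_jprobe *jp, long *stack) {
    long saved[TOY_STACK_SLOTS];

    assert(jp->entry && jp->target);
    memcpy(saved, stack, sizeof(saved)); /* copy taken before the handler */
    jp->entry(stack);                    /* handler sees the same arguments */
    memcpy(stack, saved, sizeof(saved)); /* restore, as after jprobe_return() */
    return jp->target(stack);            /* probed function runs unaffected */
}
```

Calling `toy_jprobe_hit` with arguments `{2, 100}` records one traced call of 100 bytes and still returns 102 from the probed function, because the handler's write to `stack[0]` is undone before the target runs. A call with a first argument of 1 passes through without being traced.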

Optimized kprobe
1.4 How Does Jump Optimization Work?

If your kernel is built with CONFIG_OPTPROBES=y (currently this flag is automatically set 'y' on x86/x86-64, non-preemptive kernel) and the "debug.kprobes_optimization" kernel parameter is set to 1 (see sysctl(8)), Kprobes tries to reduce probe-hit overhead by using a jump instruction instead of a breakpoint instruction at each probepoint.

1.4.1 Init a Kprobe

[...]

1.4.2 Safety Check

[...]

1.4.3 Preparing Detour Buffer

Next, Kprobes prepares a "detour" buffer, which contains the following instruction sequence:

- code to push the CPU's registers (emulating a breakpoint trap)
- a call to the trampoline code which calls user's probe handlers.
- code to restore registers
- the instructions from the optimized region
- a jump back to the original execution path.

1.4.4 Pre-optimization

After preparing the detour buffer, Kprobes verifies that none of the following situations exist:

- The probe has either a break_handler (i.e., it's a jprobe) or a post_handler.
- Other instructions in the optimized region are probed.
- The probe is disabled.

In any of the above cases, Kprobes won't start optimizing the probe. Since these are temporary situations, Kprobes tries to start optimizing it again if the situation is changed.

If the kprobe can be optimized, Kprobes enqueues the kprobe to an optimizing list, and kicks the kprobe-optimizer workqueue to optimize it. If the to-be-optimized probepoint is hit before being optimized, Kprobes returns control to the original instruction path by setting the CPU's instruction pointer to the copied code in the detour buffer -- thus at least avoiding the single-step.
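
The pre-optimization check in 1.4.4 boils down to a predicate over the probe's current state. The sketch below is a user-space model with hypothetical names (`toy_probe`, `toy_can_optimize`), not the kernel's data structures. It encodes only the three conditions listed above and shows that a probe becomes eligible again once the blocking condition goes away.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical summary of the state that the pre-optimization step inspects. */
struct toy_probe {
    bool has_break_handler;       /* true for a jprobe */
    bool has_post_handler;
    bool region_has_other_probes; /* another probe sits in the optimized region */
    bool disabled;
};

/* A probe may be queued for optimization only if none of the conditions hold. */
bool toy_can_optimize(const struct toy_probe *p) {
    assert(p != 0);
    if (p->has_break_handler || p->has_post_handler)
        return false;
    if (p->region_has_other_probes)
        return false;
    if (p->disabled)
        return false;
    return true;
}
```

Since the conditions are temporary, the same predicate can simply be evaluated again after a change: a probe that is rejected while it has a post_handler passes the check once that handler is removed.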

References
[1] Linux kernel, kprobes docs, https://www.kernel.org/doc/Documentation/kprobes.txt
[2] Intel 64 and IA-32 Architectures, Software Developer’s Manual, Volume 3A: System Programming Guide, Part 1, http://www.intel.com/Assets/en_US/PDF/manual/253668.pdf
[3] "Ptrace, Utrace, Uprobes: Lightweight, Dynamic Tracing of User Apps", https://landley.net/kdocs/ols/2007/ols2007v1-pages-215-224.pdf