| -rw-r--r-- | Documentation/static-keys.txt | 286 | 
1 files changed, 286 insertions, 0 deletions
diff --git a/Documentation/static-keys.txt b/Documentation/static-keys.txt
new file mode 100644
index 00000000000..d93f3c00f24
--- /dev/null
+++ b/Documentation/static-keys.txt
@@ -0,0 +1,286 @@
+			Static Keys
+			-----------
+
+By: Jason Baron <jbaron@redhat.com>
+
+0) Abstract
+
+Static keys allow the inclusion of seldom-used features in
+performance-sensitive fast-path kernel code, via a GCC feature and a code
+patching technique. A quick example:
+
+	struct static_key key = STATIC_KEY_INIT_FALSE;
+
+	...
+
+        if (static_key_false(&key))
+                do unlikely code
+        else
+                do likely code
+
+	...
+	static_key_slow_inc(&key);
+	...
+	static_key_slow_inc(&key);
+	...
+
+The static_key_false() branch will be generated into the code with as little
+impact on the likely code path as possible.
+
+
+1) Motivation
+
+
+Currently, tracepoints are implemented using a conditional branch. The
+conditional check requires checking a global variable for each tracepoint.
+Although the overhead of this check is small, it increases when the memory
+cache comes under pressure (memory cache lines for these global variables may
+be shared with other memory accesses). As we increase the number of tracepoints
+in the kernel this overhead may become more of an issue. In addition,
+tracepoints are often dormant (disabled) and provide no direct kernel
+functionality. Thus, it is highly desirable to reduce their impact as much as
+possible. Although tracepoints are the original motivation for this work, other
+kernel code paths should be able to make use of the static keys facility.
+
+
+2) Solution
+
+
+gcc (v4.5) adds a new 'asm goto' statement that allows branching to a label:
+
+http://gcc.gnu.org/ml/gcc-patches/2009-07/msg01556.html
+
+Using the 'asm goto', we can create branches that are either taken or not taken
+by default, without the need to check memory. Then, at run-time, we can patch
+the branch site to change the branch direction.
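To make the 'asm goto' idea concrete, here is a minimal userspace sketch. It is
not the kernel's implementation: it assumes GCC 4.5 or later (or a compatible
compiler) and a target whose assembler accepts a plain 'nop', and the names
'branch_site' and 'branch_site_check' are hypothetical. The asm statement emits
only a no-op, so the fall-through path runs, while the listed label tells the
compiler that control could arrive there if the no-op were later rewritten into
a jump. No patching is performed here.

```c
#include <assert.h>

/*
 * Hypothetical illustration of an 'asm goto' branch site.
 * Returns 0 on the default (fall-through) path and 1 if control
 * ever reaches the label, which only a patched jump could cause.
 */
int branch_site(void)
{
	/* A lone no-op; 'l_yes' is declared as a possible jump target. */
	__asm__ goto("nop" : : : : l_yes);
	return 0;
l_yes:
	return 1;
}

/* Self-check: without patching, the fall-through path is taken. */
int branch_site_check(void)
{
	assert(branch_site() == 0);
	return 0;
}
```

Note that the default path contains no load and no compare, only the no-op.
That is the property the rest of this document relies on.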
+
+For example, if we have a simple branch that is disabled by default:
+
+	if (static_key_false(&key))
+		printk("I am the true branch\n");
+
+Thus, by default the 'printk' will not be executed. And the code generated will
+consist of a single atomic 'no-op' instruction (5 bytes on x86), in the
+straight-line code path. When the branch is 'flipped', we will patch the
+'no-op' in the straight-line codepath with a 'jump' instruction to the
+out-of-line true branch. Thus, changing branch direction is expensive but
+branch selection is basically 'free'. That is the basic tradeoff of this
+optimization.
+
+This low-level patching mechanism is called 'jump label patching', and it forms
+the basis for the static keys facility.
+
+3) Static key label API, usage and examples:
+
+
+In order to make use of this optimization you must first define a key:
+
+	struct static_key key;
+
+Which is initialized as:
+
+	struct static_key key = STATIC_KEY_INIT_TRUE;
+
+or:
+
+	struct static_key key = STATIC_KEY_INIT_FALSE;
+
+If the key is not initialized, it defaults to false. The 'struct static_key'
+must be a 'global'. That is, it can't be allocated on the stack or dynamically
+allocated at run-time.
+
+The key is then used in code as:
+
+        if (static_key_false(&key))
+                do unlikely code
+        else
+                do likely code
+
+Or:
+
+        if (static_key_true(&key))
+                do likely code
+        else
+                do unlikely code
+
+A key that is initialized via 'STATIC_KEY_INIT_FALSE' must be used in a
+'static_key_false()' construct. Likewise, a key initialized via
+'STATIC_KEY_INIT_TRUE' must be used in a 'static_key_true()' construct. A
+single key can be used in many branches, but all the branches must match the
+way that the key has been initialized.
+
+The branch(es) can then be switched via:
+
+	static_key_slow_inc(&key);
+	...
+	static_key_slow_dec(&key);
+
+Thus, 'static_key_slow_inc()' means 'make the branch true', and
+'static_key_slow_dec()' means 'make the branch false' with appropriate
+reference counting. For example, if the key is initialized true, a
+'static_key_slow_dec()' will switch the branch to false. And a subsequent
+'static_key_slow_inc()' will change the branch back to true. Likewise, if the
+key is initialized false, a 'static_key_slow_inc()' will change the branch to
+true. And then a 'static_key_slow_dec()' will again make the branch false.
+
+An example usage in the kernel is the implementation of tracepoints:
+
+        static inline void trace_##name(proto)                          \
+        {                                                               \
+                if (static_key_false(&__tracepoint_##name.key))		\
+                        __DO_TRACE(&__tracepoint_##name,                \
+                                TP_PROTO(data_proto),                   \
+                                TP_ARGS(data_args),                     \
+                                TP_CONDITION(cond));                    \
+        }
+
+Tracepoints are disabled by default, and can be placed in performance-critical
+pieces of the kernel. Thus, by using a static key, the tracepoints can have
+absolutely minimal impact when not in use.
+
+
+4) Architecture level code patching interface, 'jump labels'
+
+
+There are a few functions and macros that architectures must implement in order
+to take advantage of this optimization. If there is no architecture support, we
+simply fall back to a traditional load, test, and jump sequence.
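The fallback behaviour and the reference counting described above can be
modelled in plain userspace C. The sketch below is only an illustration under
stated assumptions: 'struct model_key' and every 'model_key_*' name are
hypothetical stand-ins, not the kernel's definitions, and C11 atomics are
assumed to be available. The branch test is an ordinary load and compare, which
is the cost that jump label patching removes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for 'struct static_key'. */
struct model_key {
	atomic_int enabled;	/* reference count; nonzero means "true" */
};

#define MODEL_KEY_INIT_FALSE	{ 0 }
#define MODEL_KEY_INIT_TRUE	{ 1 }

/* Fallback branch test: load the counter and test it; the caller jumps. */
static inline bool model_key_enabled(struct model_key *key)
{
	return atomic_load(&key->enabled) > 0;
}

/* Take a reference; the first one makes the branch true. */
static inline void model_key_inc(struct model_key *key)
{
	atomic_fetch_add(&key->enabled, 1);
}

/* Drop a reference; dropping the last one makes the branch false. */
static inline void model_key_dec(struct model_key *key)
{
	int old = atomic_fetch_sub(&key->enabled, 1);

	assert(old > 0);	/* unbalanced dec would be a caller bug */
}
```

With a key initialized via MODEL_KEY_INIT_FALSE, two increments followed by one
decrement leave the branch true, and only the second decrement makes it false
again.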
+
+* select HAVE_ARCH_JUMP_LABEL, see: arch/x86/Kconfig
+
+* #define JUMP_LABEL_NOP_SIZE, see: arch/x86/include/asm/jump_label.h
+
+* __always_inline bool arch_static_branch(struct static_key *key), see:
+					arch/x86/include/asm/jump_label.h
+
+* void arch_jump_label_transform(struct jump_entry *entry, enum jump_label_type type),
+					see: arch/x86/kernel/jump_label.c
+
+* __init_or_module void arch_jump_label_transform_static(struct jump_entry *entry, enum jump_label_type type),
+					see: arch/x86/kernel/jump_label.c
+
+
+* struct jump_entry, see: arch/x86/include/asm/jump_label.h
+
+
+5) Static keys / jump label analysis, results (x86_64):
+
+
+As an example, let's add the following branch to 'getppid()', such that the
+system call now looks like:
+
+SYSCALL_DEFINE0(getppid)
+{
+        int pid;
+
++       if (static_key_false(&key))
++               printk("I am the true branch\n");
+
+        rcu_read_lock();
+        pid = task_tgid_vnr(rcu_dereference(current->real_parent));
+        rcu_read_unlock();
+
+        return pid;
+}
+
+The resulting instructions generated by GCC with jump labels are:
+
+ffffffff81044290 <sys_getppid>:
+ffffffff81044290:       55                      push   %rbp
+ffffffff81044291:       48 89 e5                mov    %rsp,%rbp
+ffffffff81044294:       e9 00 00 00 00          jmpq   ffffffff81044299 <sys_getppid+0x9>
+ffffffff81044299:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
+ffffffff810442a0:       00 00
+ffffffff810442a2:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
+ffffffff810442a9:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
+ffffffff810442b0:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
+ffffffff810442b7:       e8 f4 d9 00 00          callq  ffffffff81051cb0 <pid_vnr>
+ffffffff810442bc:       5d                      pop    %rbp
+ffffffff810442bd:       48 98                   cltq
+ffffffff810442bf:       c3                      retq
+ffffffff810442c0:       48 c7 c7 e3 54 98 81    mov    $0xffffffff819854e3,%rdi
+ffffffff810442c7:       31 c0                   xor    %eax,%eax
+ffffffff810442c9:       e8 71 13 6d 00          callq  ffffffff8171563f <printk>
+ffffffff810442ce:       eb c9                   jmp    ffffffff81044299 <sys_getppid+0x9>
+
+Without the jump label optimization it looks like:
+
+ffffffff810441f0 <sys_getppid>:
+ffffffff810441f0:       8b 05 8a 52 d8 00       mov    0xd8528a(%rip),%eax        # ffffffff81dc9480 <key>
+ffffffff810441f6:       55                      push   %rbp
+ffffffff810441f7:       48 89 e5                mov    %rsp,%rbp
+ffffffff810441fa:       85 c0                   test   %eax,%eax
+ffffffff810441fc:       75 27                   jne    ffffffff81044225 <sys_getppid+0x35>
+ffffffff810441fe:       65 48 8b 04 25 c0 b6    mov    %gs:0xb6c0,%rax
+ffffffff81044205:       00 00
+ffffffff81044207:       48 8b 80 80 02 00 00    mov    0x280(%rax),%rax
+ffffffff8104420e:       48 8b 80 b0 02 00 00    mov    0x2b0(%rax),%rax
+ffffffff81044215:       48 8b b8 e8 02 00 00    mov    0x2e8(%rax),%rdi
+ffffffff8104421c:       e8 2f da 00 00          callq  ffffffff81051c50 <pid_vnr>
+ffffffff81044221:       5d                      pop    %rbp
+ffffffff81044222:       48 98                   cltq
+ffffffff81044224:       c3                      retq
+ffffffff81044225:       48 c7 c7 13 53 98 81    mov    $0xffffffff81985313,%rdi
+ffffffff8104422c:       31 c0                   xor    %eax,%eax
+ffffffff8104422e:       e8 60 0f 6d 00          callq  ffffffff81715193 <printk>
+ffffffff81044233:       eb c9                   jmp    ffffffff810441fe <sys_getppid+0xe>
+ffffffff81044235:       66 66 2e 0f 1f 84 00    data32 nopw %cs:0x0(%rax,%rax,1)
+ffffffff8104423c:       00 00 00 00
+
+Thus, the disabled jump label case adds 'mov', 'test' and 'jne' instructions,
+whereas the jump label case has just a 'no-op' or 'jmp 0'. (The 'jmp 0' is
+patched to a 5-byte atomic no-op instruction at boot-time.)
+In bytes, the disabled jump label case adds:
+
+6 (mov) + 2 (test) + 2 (jne) = 10 bytes; 10 - 5 (5-byte jmp 0) = 5 additional bytes.
+
+If we then include the padding bytes, the jump label code saves 16 total bytes
+of instruction memory for this small function. In this case the non-jump label
+function is 80 bytes long. Thus, we have saved 20% of the instruction
+footprint. We can in fact improve this even further, since the 5-byte no-op
+really can be a 2-byte no-op since we can reach the branch with a 2-byte jmp.
+However, we have not yet implemented optimal no-op sizes (they are currently
+hard-coded).
+
+Since there are a number of static key API uses in the scheduler paths,
+'pipe-test' (also known as 'perf bench sched pipe') can be used to show the
+performance improvement. Testing done on 3.3.0-rc2:
+
+jump label disabled:
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+        855.700314 task-clock                #    0.534 CPUs utilized            ( +-  0.11% )
+           200,003 context-switches          #    0.234 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +- 39.58% )
+               487 page-faults               #    0.001 M/sec                    ( +-  0.02% )
+     1,474,374,262 cycles                    #    1.723 GHz                      ( +-  0.17% )
+   <not supported> stalled-cycles-frontend
+   <not supported> stalled-cycles-backend
+     1,178,049,567 instructions              #    0.80  insns per cycle          ( +-  0.06% )
+       208,368,926 branches                  #  243.507 M/sec                    ( +-  0.06% )
+         5,569,188 branch-misses             #    2.67% of all branches          ( +-  0.54% )
+
+       1.601607384 seconds time elapsed                                          ( +-  0.07% )
+
+jump label enabled:
+
+ Performance counter stats for 'bash -c /tmp/pipe-test' (50 runs):
+
+        841.043185 task-clock                #    0.533 CPUs utilized            ( +-  0.12% )
+           200,004 context-switches          #    0.238 M/sec                    ( +-  0.00% )
+                 0 CPU-migrations            #    0.000 M/sec                    ( +- 40.87% )
+               487 page-faults               #    0.001 M/sec                    ( +-  0.05% )
+     1,432,559,428 cycles                    #    1.703 GHz                      ( +-  0.18% )
+   <not supported> stalled-cycles-frontend
+   <not supported> stalled-cycles-backend
+     1,175,363,994 instructions              #    0.82  insns per cycle          ( +-  0.04% )
+       206,859,359 branches                  #  245.956 M/sec                    ( +-  0.04% )
+         4,884,119 branch-misses             #    2.36% of all branches          ( +-  0.85% )
+
+       1.579384366 seconds time elapsed
+
+The percentage of saved branches is 0.7%, and we've saved 12% on
+'branch-misses'. This is where we would expect to get the most savings, since
+this optimization is about reducing the number of branches. In addition, we've
+saved 0.2% on instructions, 2.8% on cycles and 1.4% on elapsed time.
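The quoted percentages can be rechecked from the two perf runs with a few lines
of C. This is only a worked-arithmetic sketch: the helper names 'pct_saved' and
'print_savings' are hypothetical, and the inputs are the counter values copied
from the tables above.

```c
#include <assert.h>
#include <stdio.h>

/* Percentage by which 'after' is smaller than 'before'. */
static double pct_saved(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/* Print the savings for each counter quoted in the two perf runs. */
void print_savings(void)
{
	printf("branches:      %.2f%%\n", pct_saved(208368926.0, 206859359.0));
	printf("branch-misses: %.2f%%\n", pct_saved(5569188.0, 4884119.0));
	printf("instructions:  %.2f%%\n", pct_saved(1178049567.0, 1175363994.0));
	printf("cycles:        %.2f%%\n", pct_saved(1474374262.0, 1432559428.0));
	printf("elapsed time:  %.2f%%\n", pct_saved(1.601607384, 1.579384366));
}
```

Each value is computed as (disabled - enabled) / disabled, so the figures are
relative to the run with jump labels disabled.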