author Alexei Starovoitov <ast@plumgrid.com> 2014-03-28 13:58:26 -0400
committer David S. Miller <davem@davemloft.net> 2014-03-31 00:45:09 -0400
commit 9a985cdc5ccb0d557720221d01bd70c19f04bb8c
tree 495b67bcf755829a5409da5b7444ea9b93f60b35 /Documentation/networking
parent bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8
doc: filter: extend BPF documentation to document new internals
Further extend the current BPF documentation to document new BPF engine
internals. Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/filter.txt 125
1 files changed, 125 insertions, 0 deletions
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d2f5cc..81f940f4e884 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>:
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------
Internally, the kernel interpreter uses a different BPF instruction set
format that follows the same underlying principles as the BPF described in
the previous paragraphs. However, the new format is modelled more closely on
the underlying architecture, mimicking native instruction sets so that
better performance can be achieved (more details later).

It is designed to be JITed with a one-to-one mapping, which also opens up
the possibility for GCC/LLVM compilers to generate optimized BPF code through
a BPF backend that performs almost as fast as natively compiled code.

The new instruction set was originally designed with the goal in mind of
writing programs in "restricted C" and compiling them into BPF with an
optional GCC/LLVM backend, so that BPF can be just-in-time mapped to modern
64-bit CPUs with minimal performance overhead over the two steps, that is,
C -> BPF -> native code.

Currently, the new format is used for running user BPF programs, which
include seccomp BPF, classic socket filters, the cls_bpf traffic classifier,
the team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, the PTP dissector/classifier, and more. They are all internally
converted by the kernel into the new instruction set representation and run
in the extended interpreter. For in-kernel handlers, this all works
transparently: use sk_unattached_filter_create() to set up the filter and
sk_unattached_filter_destroy() to destroy it. The macro
SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
run the filter. 'filter' is a pointer to the struct sk_filter obtained from
sk_unattached_filter_create(), and 'ctx' is the given context (e.g. an skb
pointer). All constraints and restrictions from sk_chk_filter() are applied
before the conversion to the new layout is done behind the scenes.

Currently, the user BPF format is still used for JITing, and the current BPF
JIT compilers are reused whenever possible. In other words, we do not (yet!)
perform JIT compilation in the new layout. Future work will successively
migrate the traditional JIT compilers to the new instruction format as well,
so that they profit from the very same benefits. Thus, whenever JIT is
mentioned in the following, a (yet to be written) JIT compiler for the new
instruction format is meant.

Some core changes of the new internal format:

- Number of registers increases from 2 to 10:

  The old format had two registers, A and X, and a hidden frame pointer. The
  new layout extends this to 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass function arguments in registers, the
  number of arguments from a BPF program to an in-kernel function is
  restricted to 5, and one register is used to accept the return value from
  an in-kernel function. Natively, x86_64 passes the first 6 arguments in
  registers, and aarch64/sparcv9/mips64 have 7 - 8 registers for arguments;
  x86_64 has 6 callee saved registers, and aarch64/sparcv9/mips64 have 11 or
  more callee saved registers.

  Therefore, the BPF calling convention is defined as:

    * R0      - return value from in-kernel function
    * R1 - R5 - arguments from BPF program to in-kernel function
    * R6 - R9 - callee saved registers that in-kernel function will preserve
    * R10     - read-only frame pointer to access stack

  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and the BPF calling convention maps directly to the ABIs used by the
  kernel on 64-bit architectures.

  On 32-bit architectures, a JIT may map programs that use only 32-bit
  arithmetic and let more complex programs be interpreted.

  R0 - R5 are scratch registers, and a BPF program needs to spill/fill them
  if necessary across calls. Note that there is only one BPF program (== one
  BPF main routine); it cannot call other BPF functions, it can only call
  predefined in-kernel functions.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit when they are written to. That
  behavior maps directly to the x86_64 and arm64 subregister definitions,
  but makes other JITs more difficult.

  32-bit architectures run 64-bit internal BPF programs via the interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  the native instruction set and let the rest be interpreted.

  Operation is 64-bit because, on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in and out of kernel
  functions. 32-bit BPF registers would otherwise require defining a
  register-pair ABI; a direct BPF register to HW register mapping would then
  not be possible, and the JIT would need to do combine/split/move operations
  for every register going in and out of a function, which is complex, bug
  prone and slow. Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as "if (cond) jump_true;
  else jump_false;", they are replaced with alternative constructs like
  "if (cond) jump_true; /* else fall-through */".

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  After a kernel function call, R1 - R5 are reset to unreadable and R0 has
  the return type of the function. Since R6 - R9 are callee saved, their
  state is preserved across the call.

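The register-related changes in the list above can be sketched in user space
as follows. This is a minimal model under stated assumptions, not kernel
code: the enum, POISON, alu32_add(), alu64_add(), call_helper() and sum5()
are names invented for the illustration, and "reset to unreadable" is
modelled by overwriting R1 - R5 with a marker value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbering follows the calling convention above; the enum and
 * function names are illustrative, not kernel identifiers. */
enum { R0, R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, NR_REGS };

/* Stand-in for "unreadable": a model value, not something the kernel writes. */
#define POISON 0xdeadbeefdeadbeefULL

/* Shape of a callable in-kernel function: five arguments, one result. */
typedef uint64_t (*helper_fn)(uint64_t, uint64_t, uint64_t, uint64_t, uint64_t);

/* 64-bit add: operates on the full register. */
static void alu64_add(uint64_t *regs, int dst, int src)
{
	regs[dst] += regs[src];
}

/* 32-bit add: computed on the lower subregisters; writing the result
 * zero-extends it into the full 64-bit register. */
static void alu32_add(uint64_t *regs, int dst, int src)
{
	regs[dst] = (uint32_t)((uint32_t)regs[dst] + (uint32_t)regs[src]);
}

/* Call: arguments come from R1 - R5, the result lands in R0, and R6 - R9
 * plus the frame pointer R10 are left alone. */
static void call_helper(uint64_t *regs, helper_fn fn)
{
	int i;

	assert(fn != NULL);
	regs[R0] = fn(regs[R1], regs[R2], regs[R3], regs[R4], regs[R5]);
	for (i = R1; i <= R5; i++)
		regs[i] = POISON;	/* caller must not rely on R1 - R5 */
}

/* Hypothetical helper used only for the demonstration. */
static uint64_t sum5(uint64_t a, uint64_t b, uint64_t c, uint64_t d, uint64_t e)
{
	return a + b + c + d + e;
}
```

With R1 = 0xffffffff and R2 = 1, alu32_add() leaves 0 in R1 while alu64_add()
leaves 0x100000000. A value kept in R6 survives call_helper(), whereas
anything still needed from R1 - R5 has to be spilled before the call.
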
Also in the new design, BPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Both original BPF and the new format use two-operand instructions,
which helps with a one-to-one mapping between BPF insns and x86 insns during
JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by the specific use case. For seccomp, register R1
points to seccomp_data; for converted BPF filters, R1 points to an skb.

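As a rough illustration of the generic context pointer, the sketch below
uses one entry point shape for two different contexts. The struct layouts
(demo_seccomp_data, demo_skb), the run_fn type and both run functions are
simplified assumptions made for this example; they are not the kernel's
struct seccomp_data or struct sk_buff, and the policy in run_seccomp() is
made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for the two context types; simplified assumptions. */
struct demo_seccomp_data {
	int nr;			/* syscall number */
	uint64_t args[6];
};

struct demo_skb {
	const uint8_t *data;
	unsigned int len;
};

/* One generic entry point shape: the context arrives as an opaque pointer,
 * i.e. the value a program would find in R1. */
typedef unsigned int (*run_fn)(const void *ctx);

/* Interprets ctx as seccomp data; hypothetical policy: match syscall 39. */
static unsigned int run_seccomp(const void *ctx)
{
	const struct demo_seccomp_data *sd = ctx;

	assert(sd != NULL);
	return sd->nr == 39 ? 1 : 0;
}

/* Interprets ctx as a packet; returns the packet length. */
static unsigned int run_socket(const void *ctx)
{
	const struct demo_skb *skb = ctx;

	assert(skb != NULL);
	return skb->len;
}
```

Both functions fit the same run_fn type, so a caller only needs to know
which context type belongs to which use case.
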
A program that is translated internally consists of the following elements:

  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32

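The two layouts, and the jt/jf to jt/fall-through change described earlier,
might look like this in C. It is a sketch under assumptions: the struct and
function names, the demo opcode values and DEMO_REG_A are invented here, the
signedness of off and imm is assumed, and only a "jump if A == k" insn is
modelled. The caller is assumed to have already translated the jump targets
into absolute indices of the new program, which a real conversion has to
work out itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Original layout: op:16, jt:8, jf:8, k:32 */
struct old_insn {
	uint16_t op;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* New layout: op:8, a_reg:4, x_reg:4, off:16, imm:32.
 * Signed off/imm are an assumption of this sketch. */
struct new_insn {
	uint8_t  op;
	uint8_t  a_reg:4;
	uint8_t  x_reg:4;
	int16_t  off;
	int32_t  imm;
};

/* Opcode values and register number picked for the sketch. */
enum { DEMO_OP_JEQ_K = 0x15, DEMO_OP_JA = 0x05, DEMO_REG_A = 0 };

/*
 * Convert "if (A == k) goto jt; else goto jf;" located at index pos of the
 * new program. Offsets are relative to the insn following the jump.
 * Emits the conditional jump, plus an unconditional jump only when the old
 * insn had a non-zero jf (otherwise the false path simply falls through).
 * Returns the number of insns written to out[] (1 or 2).
 */
static int convert_jump(const struct old_insn *in, int pos,
			int tgt_true, int tgt_false, struct new_insn *out)
{
	int n = 0;

	assert(in != NULL && out != NULL);

	out[n] = (struct new_insn){ .op = DEMO_OP_JEQ_K, .a_reg = DEMO_REG_A,
				    .off = (int16_t)(tgt_true - (pos + 1)),
				    .imm = (int32_t)in->k };
	n++;

	if (in->jf != 0) {
		/* The extra jump sits at pos + 1, so it is relative to pos + 2. */
		out[n] = (struct new_insn){ .op = DEMO_OP_JA,
					    .off = (int16_t)(tgt_false - (pos + 2)) };
		n++;
	}
	return n;
}
```

On common compilers both structs occupy 8 bytes, so one old insn still maps
to one new insn of the same size; the unconditional jump is only needed for
a non-zero jf.
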
Just like the original BPF, the new format runs within a controlled
environment, is deterministic, and the kernel can easily prove that. The
safety of a program can be determined in two steps: the first step does a
depth-first search to disallow loops and performs other CFG validation; the
second step starts from the first insn and descends all possible paths. It
simulates the execution of every insn and observes the state change of
registers and stack.

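The first of the two steps can be sketched as a depth-first search that
rejects any edge leading back to an insn still on the current path. This is
a simplified model, not the kernel's checker: struct demo_insn, the insn
kinds, visit() and check_cfg() are assumptions made for the example, and the
second step (simulating register and stack state) is not covered.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_INSNS 4096

/* Demo classification of insns by their control flow. */
enum kind { K_PLAIN, K_COND_JMP, K_JMP, K_EXIT };

struct demo_insn {
	enum kind kind;
	int off;	/* jump offset relative to the next insn */
};

enum { WHITE, GREY, BLACK };	/* unvisited, on current path, done */

/* Returns 0 if everything reachable from insn i is loop-free and stays
 * inside the program, -1 otherwise. */
static int visit(const struct demo_insn *prog, int len, int i,
		 unsigned char *color)
{
	int succ[2], n = 0, s;

	if (i < 0 || i >= len)
		return -1;		/* edge leaves the program */
	if (color[i] == GREY)
		return -1;		/* back-edge: loop */
	if (color[i] == BLACK)
		return 0;		/* already explored */

	color[i] = GREY;
	switch (prog[i].kind) {
	case K_PLAIN:
		succ[n++] = i + 1;
		break;
	case K_COND_JMP:
		succ[n++] = i + 1;			/* fall-through */
		succ[n++] = i + 1 + prog[i].off;	/* jump taken */
		break;
	case K_JMP:
		succ[n++] = i + 1 + prog[i].off;
		break;
	case K_EXIT:
		break;
	}
	for (s = 0; s < n; s++)
		if (visit(prog, len, succ[s], color))
			return -1;
	color[i] = BLACK;
	return 0;
}

/* Entry point: 0 if the program passes this check, -1 if not. */
static int check_cfg(const struct demo_insn *prog, int len)
{
	unsigned char color[MAX_INSNS] = { WHITE };

	assert(prog != NULL);
	if (len <= 0 || len > MAX_INSNS)
		return -1;
	return visit(prog, len, 0, color);
}
```

A forward branch whose two paths join again passes, while a backward jump or
a program that runs off its end is rejected. The sketch does not flag
unreachable insns, which a fuller CFG validation would likely also check.
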
Misc
----

@@ -561,3 +685,4 @@ the underlying architecture.

Jay Schulist <jschlst@samba.org>
Daniel Borkmann <dborkman@redhat.com>
Alexei Starovoitov <ast@plumgrid.com>