author | Alexei Starovoitov <ast@plumgrid.com> | 2014-03-28 13:58:26 -0400 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2014-03-31 00:45:09 -0400 |
commit | 9a985cdc5ccb0d557720221d01bd70c19f04bb8c (patch) | |
tree | 495b67bcf755829a5409da5b7444ea9b93f60b35 /Documentation/networking | |
parent | bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8 (diff) |
doc: filter: extend BPF documentation to document new internals
Further extend the current BPF documentation to document new BPF
engine internals. Joint work with Daniel Borkmann.
Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/filter.txt | 125 |
1 files changed, 125 insertions, 0 deletions
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d2f5cc..81f940f4e884 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>: | |||
546 | For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful | 546 | For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful |
547 | toolchain for developing and testing the kernel's JIT compiler. | 547 | toolchain for developing and testing the kernel's JIT compiler. |
548 | 548 | ||
549 | BPF kernel internals | ||
550 | -------------------- | ||
551 | Internally, the kernel interpreter uses a different BPF instruction set | ||
552 | format, built on the same underlying principles as the BPF described in the | ||
553 | previous paragraphs. However, the instruction set format is modelled more | ||
554 | closely on the underlying architecture to mimic native instruction sets, so | ||
555 | that better performance can be achieved (more details later). | ||
556 | |||
557 | It is designed to be JITed with a one-to-one mapping, which also opens up | ||
558 | the possibility for GCC/LLVM compilers to generate optimized BPF code through | ||
559 | a BPF backend that performs almost as fast as natively compiled code. | ||
560 | |||
561 | The new instruction set was originally designed with the possible goal of | ||
562 | writing programs in "restricted C" and compiling them into BPF with an optional | ||
563 | GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with | ||
564 | minimal performance overhead over two steps, that is, C -> BPF -> native code. | ||
565 | |||
566 | Currently, the new format is being used for running user BPF programs, which | ||
567 | includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, | ||
568 | team driver's classifier for its load-balancing mode, netfilter's xt_bpf | ||
569 | extension, PTP dissector/classifier, and much more. They are all internally | ||
570 | converted by the kernel into the new instruction set representation and run | ||
571 | in the extended interpreter. For in-kernel handlers, this all works | ||
572 | transparently by using sk_unattached_filter_create() for setting up the | ||
573 | filter, and sk_unattached_filter_destroy() for destroying it. The macro | ||
574 | SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to | ||
575 | run the filter. 'filter' is a pointer to struct sk_filter that we got from | ||
576 | sk_unattached_filter_create(), and 'ctx' is the given context (e.g. skb | ||
577 | pointer). All constraints and restrictions from sk_chk_filter() apply before | ||
578 | the conversion to the new layout is done behind the scenes! | ||
579 | |||
580 | Currently, for JITing, the user BPF format is being used and current BPF JIT | ||
581 | compilers are reused whenever possible. In other words, we do not (yet!) | ||
582 | perform JIT compilation in the new layout. However, future work will | ||
583 | successively migrate traditional JIT compilers to the new instruction format | ||
584 | as well, so that they will profit from the very same benefits. Thus, when | ||
585 | speaking about JIT in the following, a JIT compiler (TBD) for the new | ||
586 | instruction format is meant. | ||
587 | |||
588 | Some core changes of the new internal format: | ||
589 | |||
590 | - Number of registers increases from 2 to 10: | ||
591 | |||
592 | The old format had two registers A and X, and a hidden frame pointer. The | ||
593 | new layout extends this to 10 internal registers and a read-only frame | ||
594 | pointer. Since 64-bit CPUs pass arguments to functions via registers, the | ||
595 | number of args from a BPF program to an in-kernel function is restricted | ||
596 | to 5, and one register is used to accept the return value from an in-kernel | ||
597 | function. Natively, x86_64 passes the first 6 arguments in registers, aarch64/ | ||
598 | sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved | ||
599 | registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. | ||
600 | |||
601 | Therefore, the BPF calling convention is defined as: | ||
602 | |||
603 | * R0 - return value from in-kernel function | ||
604 | * R1 - R5 - arguments from BPF program to in-kernel function | ||
605 | * R6 - R9 - callee saved registers that in-kernel function will preserve | ||
606 | * R10 - read-only frame pointer to access stack | ||
607 | |||
608 | Thus, all BPF registers map one to one to HW registers on x86_64, aarch64, | ||
609 | etc., and the BPF calling convention maps directly to ABIs used by the kernel | ||
610 | on 64-bit architectures. | ||
611 | |||
612 | On 32-bit architectures, a JIT may map programs that use only 32-bit arithmetic | ||
613 | and may let more complex programs be interpreted. | ||
614 | |||
615 | R0 - R5 are scratch registers, and a BPF program needs to spill/fill them if | ||
616 | necessary across calls. Note that there is only one BPF program (== one BPF | ||
617 | main routine) and it cannot call other BPF functions; it can only call | ||
618 | predefined in-kernel functions. | ||
619 | |||
620 | - Register width increases from 32-bit to 64-bit: | ||
621 | |||
622 | Still, the semantics of the original 32-bit ALU operations are preserved | ||
623 | via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower | ||
624 | subregisters that zero-extend into 64-bit when they are written to. | ||
625 | That behavior maps directly to the x86_64 and arm64 subregister definitions, but | ||
626 | makes other JITs more difficult. | ||
627 | |||
628 | 32-bit architectures run 64-bit internal BPF programs via the interpreter. | ||
629 | Their JITs may convert BPF programs that only use 32-bit subregisters into | ||
630 | the native instruction set and let the rest be interpreted. | ||
631 | |||
632 | Operation is 64-bit because, on 64-bit architectures, pointers are also | ||
633 | 64-bit wide, and we want to pass 64-bit values in/out of kernel functions. | ||
634 | 32-bit BPF registers would otherwise require defining a register-pair ABI; | ||
635 | a direct BPF register to HW register mapping would then not be possible, | ||
636 | and the JIT would need to do combine/split/move operations for every | ||
637 | register in and out of the function, which is complex, bug prone and slow. | ||
638 | Another reason is the use of atomic 64-bit counters. | ||
639 | |||
640 | - Conditional jt/jf targets replaced with jt/fall-through: | ||
641 | |||
642 | While the original design has constructs such as "if (cond) jump_true; | ||
643 | else jump_false;", they are being replaced with alternative constructs like | ||
644 | "if (cond) jump_true; /* else fall-through */". | ||
645 | |||
646 | - Introduces bpf_call insn and register passing convention for zero overhead | ||
647 | calls from/to other kernel functions: | ||
648 | |||
649 | After a kernel function call, R1 - R5 are reset to unreadable and R0 has the | ||
650 | return type of the function. Since R6 - R9 are callee saved, their state is | ||
651 | preserved across the call. | ||
652 | |||
653 | Also in the new design, BPF is limited to 4096 insns, which means that any | ||
654 | program will terminate quickly and will only call a fixed number of kernel | ||
655 | functions. Both original BPF and the new format use two-operand instructions, | ||
656 | which helps to do one-to-one mapping between BPF insn and x86 insn during JIT. | ||
657 | |||
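The 32-bit subregister behaviour described in the list above can be illustrated in plain C. This is only a sketch of the semantics, not kernel code: the helper names alu64_add(), alu32_add() and zero_extension_demo() are hypothetical, and an ordinary uint64_t stands in for a BPF register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration; these are not kernel functions. */

/* 64-bit ALU add: uses the full register width. */
uint64_t alu64_add(uint64_t dst, uint64_t src)
{
	return dst + src;
}

/*
 * 32-bit ALU add: only the lower 32 bits of the operands take part,
 * the sum wraps modulo 2^32, and the upper 32 bits of the destination
 * register are cleared (zero-extension).
 */
uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	uint32_t lo = (uint32_t)dst + (uint32_t)src;

	return (uint64_t)lo;
}

void zero_extension_demo(void)
{
	/* The upper half of dst is discarded by the 32-bit operation. */
	assert(alu32_add(0xffffffff00000001ULL, 2) == 3);
	/* A 32-bit wrap-around does not carry into bit 32. */
	assert(alu32_add(0xffffffffULL, 1) == 0);
	/* The 64-bit operation does carry. */
	assert(alu64_add(0xffffffffULL, 1) == 0x100000000ULL);
}
```

A JIT targeting x86_64 or arm64 gets this zero-extension for free from the hardware's own 32-bit subregister writes, which is the direct mapping the text refers to.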
658 | The input context pointer for invoking the interpreter function is generic; | ||
659 | its content is defined by a specific use case. For seccomp, register R1 points | ||
660 | to seccomp_data; for converted BPF filters, R1 points to a skb. | ||
661 | |||
662 | A program that is translated internally consists of the following elements: | ||
663 | |||
664 | op:16, jt:8, jf:8, k:32 ==> op:8, a_reg:4, x_reg:4, off:16, imm:32 | ||
665 | |||
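The two field layouts above can be written down as C structures to make the widths concrete. This is a sketch under assumptions: the type names classic_insn and internal_insn are hypothetical, the signedness chosen for 'off' and 'imm' is an assumption (the text only gives the widths), and the 4-bit fields rely on the compiler packing both into one byte, as GCC and Clang do.

```c
#include <assert.h>
#include <stdint.h>

/* Classic layout: op:16, jt:8, jf:8, k:32 (hypothetical type name). */
struct classic_insn {
	uint16_t op;	/* opcode */
	uint8_t  jt;	/* jump offset taken when the condition is true */
	uint8_t  jf;	/* jump offset taken when the condition is false */
	uint32_t k;	/* generic 32-bit field */
};

/* Internal layout: op:8, a_reg:4, x_reg:4, off:16, imm:32 (hypothetical). */
struct internal_insn {
	uint8_t op;		/* opcode */
	uint8_t a_reg:4;	/* first register operand */
	uint8_t x_reg:4;	/* second register operand */
	int16_t off;		/* offset; signedness is an assumption */
	int32_t imm;		/* immediate; signedness is an assumption */
};

void insn_layout_demo(void)
{
	struct internal_insn insn = {
		.op = 0, .a_reg = 10, .x_reg = 1, .off = -8, .imm = 0,
	};

	/* Both encodings occupy 64 bits per instruction. */
	assert(sizeof(struct classic_insn) == 8);
	assert(sizeof(insn) == 8);
	/* Four bits per register field are enough to name R0 - R10. */
	assert(insn.a_reg == 10 && insn.x_reg == 1 && insn.off == -8);
}
```

Both encodings are the same size; the new one gives up the separate jt/jf fields in exchange for two register fields and a single 16-bit offset.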
666 | Just like the original BPF, the new format runs within a controlled environment, | ||
667 | is deterministic and the kernel can easily prove that. The safety of the program | ||
668 | can be determined in two steps: the first step does a depth-first search to | ||
669 | disallow loops and to perform other CFG validation; the second step starts from | ||
670 | the first insn and descends all possible paths. It simulates execution of every | ||
671 | insn and observes the state change of registers and stack. | ||
672 | |||
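The first of the two safety steps, the depth-first search that disallows loops, might look roughly like the following. This is a simplified user-space sketch, not the kernel's checker: the instruction model (enum kind, struct node) and the function names are hypothetical, only reachable instructions are examined, and the second step (simulating register and stack state along every path) is not shown.

```c
#include <assert.h>

#define MAX_INSNS 4096	/* program size limit mentioned in the text */

/* Hypothetical, heavily simplified instruction model. */
enum kind { K_NEXT, K_JMP, K_COND, K_EXIT };

struct node {
	enum kind kind;
	int off;	/* K_JMP/K_COND target is index + 1 + off */
};

enum color { WHITE, GREY, BLACK };	/* unvisited, on DFS stack, done */

/* Returns 0 if everything reachable from insn i is loop-free and in range. */
static int visit(const struct node *prog, int n, int i, unsigned char *color)
{
	int succ[2], nsucc = 0, k;

	if (i < 0 || i >= n)
		return -1;	/* edge leaves the program */
	if (color[i] == GREY)
		return -1;	/* back-edge: a loop */
	if (color[i] == BLACK)
		return 0;	/* already validated */

	color[i] = GREY;
	switch (prog[i].kind) {
	case K_NEXT:
		succ[nsucc++] = i + 1;
		break;
	case K_JMP:
		succ[nsucc++] = i + 1 + prog[i].off;
		break;
	case K_COND:	/* jump-true target or fall-through */
		succ[nsucc++] = i + 1;
		succ[nsucc++] = i + 1 + prog[i].off;
		break;
	case K_EXIT:
		break;
	}
	for (k = 0; k < nsucc; k++)
		if (visit(prog, n, succ[k], color))
			return -1;
	color[i] = BLACK;
	return 0;
}

/* Returns 1 if the program is accepted by this check, 0 otherwise. */
int cfg_is_loop_free(const struct node *prog, int n)
{
	unsigned char color[MAX_INSNS] = { WHITE };

	if (n <= 0 || n > MAX_INSNS)
		return 0;
	return visit(prog, n, 0, color) == 0;
}

void cfg_demo(void)
{
	/* 0: if (cond) goto 2;  1: fall through;  2: exit */
	const struct node ok[] = { { K_COND, 1 }, { K_NEXT, 0 }, { K_EXIT, 0 } };
	/* 0: fall through;  1: goto 0 (backward jump);  2: exit */
	const struct node loop[] = { { K_NEXT, 0 }, { K_JMP, -2 }, { K_EXIT, 0 } };

	assert(cfg_is_loop_free(ok, 3));
	assert(!cfg_is_loop_free(loop, 3));
}
```

Because a back-edge is rejected and the instruction count is bounded, every accepted program in this model reaches an exit after a bounded number of steps, which matches the termination argument made earlier in the text.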
549 | Misc | 673 | Misc |
550 | ---- | 674 | ---- |
551 | 675 | ||
@@ -561,3 +685,4 @@ the underlying architecture. | |||
561 | 685 | ||
562 | Jay Schulist <jschlst@samba.org> | 686 | Jay Schulist <jschlst@samba.org> |
563 | Daniel Borkmann <dborkman@redhat.com> | 687 | Daniel Borkmann <dborkman@redhat.com> |
688 | Alexei Starovoitov <ast@plumgrid.com> | ||