field | value | date |
---|---|---|
author | Alexei Starovoitov <ast@plumgrid.com> | 2014-06-10 11:44:06 -0400 |
committer | David S. Miller <davem@davemloft.net> | 2014-06-11 18:39:18 -0400 |
commit | e4ad403269ff0ecdfb137b2a72349c30941cec7a | |
tree | 059c9ca9c07dbcba990ddf8e2032cec35ee19699 /Documentation | |
parent | 9709674e68646cee5a24e3000b3558d25412203a | |
net: filter: mention eBPF terminology as well

Since the term eBPF is used anyway on mailing list discussions, lets
also document that in the main BPF documentation file and replace a
couple of occurrences with eBPF terminology to be more clear.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/networking/filter.txt | 85 |
1 files changed, 43 insertions, 42 deletions
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 9f49b8690500..1c7fc6baed84 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -561,42 +561,43 @@ toolchain for developing and testing the kernel's JIT compiler. | |||
561 | 561 | ||
562 | BPF kernel internals | 562 | BPF kernel internals |
563 | -------------------- | 563 | -------------------- |
564 | Internally, for the kernel interpreter, a different BPF instruction set | 564 | Internally, for the kernel interpreter, a different instruction set |
565 | format with similar underlying principles from BPF described in previous | 565 | format with similar underlying principles from BPF described in previous |
566 | paragraphs is being used. However, the instruction set format is modelled | 566 | paragraphs is being used. However, the instruction set format is modelled |
567 | closer to the underlying architecture to mimic native instruction sets, so | 567 | closer to the underlying architecture to mimic native instruction sets, so |
568 | that a better performance can be achieved (more details later). | 568 | that a better performance can be achieved (more details later). This new |
569 | ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which | ||
570 | originates from [e]xtended BPF is not the same as BPF extensions! While | ||
571 | eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' | ||
572 | of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) | ||
569 | 573 | ||
570 | It is designed to be JITed with one to one mapping, which can also open up | 574 | It is designed to be JITed with one to one mapping, which can also open up |
571 | the possibility for GCC/LLVM compilers to generate optimized BPF code through | 575 | the possibility for GCC/LLVM compilers to generate optimized eBPF code through |
572 | a BPF backend that performs almost as fast as natively compiled code. | 576 | an eBPF backend that performs almost as fast as natively compiled code. |
573 | 577 | ||
574 | The new instruction set was originally designed with the possible goal in | 578 | The new instruction set was originally designed with the possible goal in |
575 | mind to write programs in "restricted C" and compile into BPF with a optional | 579 | mind to write programs in "restricted C" and compile into eBPF with a optional |
576 | GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with | 580 | GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with |
577 | minimal performance overhead over two steps, that is, C -> BPF -> native code. | 581 | minimal performance overhead over two steps, that is, C -> eBPF -> native code. |
578 | 582 | ||
579 | Currently, the new format is being used for running user BPF programs, which | 583 | Currently, the new format is being used for running user BPF programs, which |
580 | includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, | 584 | includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, |
581 | team driver's classifier for its load-balancing mode, netfilter's xt_bpf | 585 | team driver's classifier for its load-balancing mode, netfilter's xt_bpf |
582 | extension, PTP dissector/classifier, and much more. They are all internally | 586 | extension, PTP dissector/classifier, and much more. They are all internally |
583 | converted by the kernel into the new instruction set representation and run | 587 | converted by the kernel into the new instruction set representation and run |
584 | in the extended interpreter. For in-kernel handlers, this all works | 588 | in the eBPF interpreter. For in-kernel handlers, this all works transparently |
585 | transparently by using sk_unattached_filter_create() for setting up the | 589 | by using sk_unattached_filter_create() for setting up the filter, resp. |
586 | filter, resp. sk_unattached_filter_destroy() for destroying it. The macro | 590 | sk_unattached_filter_destroy() for destroying it. The macro |
587 | SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to | 591 | SK_RUN_FILTER(filter, ctx) transparently invokes eBPF interpreter or JITed |
588 | run the filter. 'filter' is a pointer to struct sk_filter that we got from | 592 | code to run the filter. 'filter' is a pointer to struct sk_filter that we |
589 | sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer). | 593 | got from sk_unattached_filter_create(), and 'ctx' the given context (e.g. |
590 | All constraints and restrictions from sk_chk_filter() apply before a | 594 | skb pointer). All constraints and restrictions from sk_chk_filter() apply |
591 | conversion to the new layout is being done behind the scenes! | 595 | before a conversion to the new layout is being done behind the scenes! |
592 | 596 | ||
593 | Currently, for JITing, the user BPF format is being used and current BPF JIT | 597 | Currently, the classic BPF format is being used for JITing on most of the |
594 | compilers reused whenever possible. In other words, we do not (yet!) perform | 598 | architectures. Only x86-64 performs JIT compilation from eBPF instruction set, |
595 | a JIT compilation in the new layout, however, future work will successively | 599 | however, future work will migrate other JIT compilers as well, so that they |
596 | migrate traditional JIT compilers into the new instruction format as well, so | 600 | will profit from the very same benefits. |
597 | that they will profit from the very same benefits. Thus, when speaking about | ||
598 | JIT in the following, a JIT compiler (TBD) for the new instruction format is | ||
599 | meant in this context. | ||
600 | 601 | ||
601 | Some core changes of the new internal format: | 602 | Some core changes of the new internal format: |
602 | 603 | ||
@@ -605,35 +606,35 @@ Some core changes of the new internal format: | |||
605 | The old format had two registers A and X, and a hidden frame pointer. The | 606 | The old format had two registers A and X, and a hidden frame pointer. The |
606 | new layout extends this to be 10 internal registers and a read-only frame | 607 | new layout extends this to be 10 internal registers and a read-only frame |
607 | pointer. Since 64-bit CPUs are passing arguments to functions via registers | 608 | pointer. Since 64-bit CPUs are passing arguments to functions via registers |
608 | the number of args from BPF program to in-kernel function is restricted | 609 | the number of args from eBPF program to in-kernel function is restricted |
609 | to 5 and one register is used to accept return value from an in-kernel | 610 | to 5 and one register is used to accept return value from an in-kernel |
610 | function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ | 611 | function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ |
611 | sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved | 612 | sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved |
612 | registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. | 613 | registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. |
613 | 614 | ||
614 | Therefore, BPF calling convention is defined as: | 615 | Therefore, eBPF calling convention is defined as: |
615 | 616 | ||
616 | * R0 - return value from in-kernel function, and exit value for BPF program | 617 | * R0 - return value from in-kernel function, and exit value for eBPF program |
617 | * R1 - R5 - arguments from BPF program to in-kernel function | 618 | * R1 - R5 - arguments from eBPF program to in-kernel function |
618 | * R6 - R9 - callee saved registers that in-kernel function will preserve | 619 | * R6 - R9 - callee saved registers that in-kernel function will preserve |
619 | * R10 - read-only frame pointer to access stack | 620 | * R10 - read-only frame pointer to access stack |
620 | 621 | ||
621 | Thus, all BPF registers map one to one to HW registers on x86_64, aarch64, | 622 | Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, |
622 | etc, and BPF calling convention maps directly to ABIs used by the kernel on | 623 | etc, and eBPF calling convention maps directly to ABIs used by the kernel on |
623 | 64-bit architectures. | 624 | 64-bit architectures. |
624 | 625 | ||
625 | On 32-bit architectures JIT may map programs that use only 32-bit arithmetic | 626 | On 32-bit architectures JIT may map programs that use only 32-bit arithmetic |
626 | and may let more complex programs to be interpreted. | 627 | and may let more complex programs to be interpreted. |
627 | 628 | ||
628 | R0 - R5 are scratch registers and BPF program needs spill/fill them if | 629 | R0 - R5 are scratch registers and eBPF program needs spill/fill them if |
629 | necessary across calls. Note that there is only one BPF program (== one BPF | 630 | necessary across calls. Note that there is only one eBPF program (== one |
630 | main routine) and it cannot call other BPF functions, it can only call | 631 | eBPF main routine) and it cannot call other eBPF functions, it can only |
631 | predefined in-kernel functions, though. | 632 | call predefined in-kernel functions, though. |
632 | 633 | ||
633 | - Register width increases from 32-bit to 64-bit: | 634 | - Register width increases from 32-bit to 64-bit: |
634 | 635 | ||
635 | Still, the semantics of the original 32-bit ALU operations are preserved | 636 | Still, the semantics of the original 32-bit ALU operations are preserved |
636 | via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower | 637 | via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower |
637 | subregisters that zero-extend into 64-bit if they are being written to. | 638 | subregisters that zero-extend into 64-bit if they are being written to. |
638 | That behavior maps directly to x86_64 and arm64 subregister definition, but | 639 | That behavior maps directly to x86_64 and arm64 subregister definition, but |
639 | makes other JITs more difficult. | 640 | makes other JITs more difficult. |
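Reviewer note on the hunk above: the zero-extension rule for 32-bit subregisters can be illustrated with a small C model. This is only a sketch under stated assumptions. The helper names alu32_add(), alu64_add() and alu32_mov() are hypothetical and are not kernel functions. They model the register semantics described in the text with plain integer arithmetic, not the interpreter's actual code.

```c
#include <stdint.h>

/* Hypothetical model: a 32-bit ALU add operates on the lower 32 bits
 * only, and the 32-bit result is zero-extended into the full 64-bit
 * register. */
static uint64_t alu32_add(uint64_t dst, uint64_t src)
{
	uint32_t res = (uint32_t)dst + (uint32_t)src;

	return (uint64_t)res;	/* upper 32 bits become zero */
}

/* Hypothetical model: a 64-bit ALU add uses the full register width. */
static uint64_t alu64_add(uint64_t dst, uint64_t src)
{
	return dst + src;
}

/* Hypothetical model: a 32-bit move discards the upper half of the
 * source and zero-extends into the destination. */
static uint64_t alu32_mov(uint64_t src)
{
	return (uint64_t)(uint32_t)src;
}
```

In this model, a 32-bit add that overflows wraps within the lower half and leaves the upper half zero, whereas the 64-bit add carries into bit 32. This matches how x86_64 treats writes to its 32-bit subregisters, which is presumably why the text says the behavior maps directly there.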
@@ -644,8 +645,8 @@ Some core changes of the new internal format: | |||
644 | 645 | ||
645 | Operation is 64-bit, because on 64-bit architectures, pointers are also | 646 | Operation is 64-bit, because on 64-bit architectures, pointers are also |
646 | 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, | 647 | 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, |
647 | so 32-bit BPF registers would otherwise require to define register-pair | 648 | so 32-bit eBPF registers would otherwise require to define register-pair |
648 | ABI, thus, there won't be able to use a direct BPF register to HW register | 649 | ABI, thus, there won't be able to use a direct eBPF register to HW register |
649 | mapping and JIT would need to do combine/split/move operations for every | 650 | mapping and JIT would need to do combine/split/move operations for every |
650 | register in and out of the function, which is complex, bug prone and slow. | 651 | register in and out of the function, which is complex, bug prone and slow. |
651 | Another reason is the use of atomic 64-bit counters. | 652 | Another reason is the use of atomic 64-bit counters. |
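Reviewer note on the hunk above: the combine/split work that a register-pair ABI would impose can be sketched in C. This is an illustration only. The struct reg_pair type and the pair_split() and pair_combine() helpers are hypothetical names introduced here, not part of the kernel or of any JIT.

```c
#include <stdint.h>

/* Hypothetical register pair: two 32-bit registers that together
 * hold one 64-bit value, as a 32-bit register design would need. */
struct reg_pair {
	uint32_t lo;
	uint32_t hi;
};

/* Split a 64-bit value into a pair, e.g. on return from a kernel
 * function that produced a 64-bit result. */
static struct reg_pair pair_split(uint64_t v)
{
	struct reg_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };

	return p;
}

/* Combine a pair back into one 64-bit value, e.g. before passing a
 * pointer-sized argument to a kernel function. */
static uint64_t pair_combine(struct reg_pair p)
{
	return ((uint64_t)p.hi << 32) | p.lo;
}
```

With 32-bit registers, a step like this would be needed for every 64-bit argument and return value crossing a call boundary. With 64-bit registers the value stays in one register and no such step is required, which is the direct mapping the text argues for.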
@@ -690,7 +691,7 @@ Some core changes of the new internal format: | |||
690 | subq %rsi, %rax | 691 | subq %rsi, %rax |
691 | ret | 692 | ret |
692 | 693 | ||
693 | Function f2 in BPF may look like: | 694 | Function f2 in eBPF may look like: |
694 | 695 | ||
695 | f2: | 696 | f2: |
696 | bpf_mov R2, R1 | 697 | bpf_mov R2, R1 |
@@ -702,7 +703,7 @@ Some core changes of the new internal format: | |||
702 | returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to | 703 | returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to |
703 | be used to call into f2. | 704 | be used to call into f2. |
704 | 705 | ||
705 | For practical reasons all BPF programs have only one argument 'ctx' which is | 706 | For practical reasons all eBPF programs have only one argument 'ctx' which is |
706 | already placed into R1 (e.g. on __sk_run_filter() startup) and the programs | 707 | already placed into R1 (e.g. on __sk_run_filter() startup) and the programs |
707 | can call kernel functions with up to 5 arguments. Calls with 6 or more arguments | 708 | can call kernel functions with up to 5 arguments. Calls with 6 or more arguments |
708 | are currently not supported, but these restrictions can be lifted if necessary | 709 | are currently not supported, but these restrictions can be lifted if necessary |
@@ -779,9 +780,9 @@ Some core changes of the new internal format: | |||
779 | 780 | ||
780 | In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 | 781 | In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 |
781 | arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper | 782 | arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper |
782 | registers and place their return value into '%rax' which is R0 in BPF. | 783 | registers and place their return value into '%rax' which is R0 in eBPF. |
783 | Prologue and epilogue are emitted by JIT and are implicit in the | 784 | Prologue and epilogue are emitted by JIT and are implicit in the |
784 | interpreter. R0-R5 are scratch registers, so BPF program needs to preserve | 785 | interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve |
785 | them across the calls as defined by calling convention. | 786 | them across the calls as defined by calling convention. |
786 | 787 | ||
787 | For example the following program is invalid: | 788 | For example the following program is invalid: |
@@ -792,12 +793,12 @@ Some core changes of the new internal format: | |||
792 | bpf_exit | 793 | bpf_exit |
793 | 794 | ||
794 | After the call the registers R1-R5 contain junk values and cannot be read. | 795 | After the call the registers R1-R5 contain junk values and cannot be read. |
795 | In the future a BPF verifier can be used to validate internal BPF programs. | 796 | In the future an eBPF verifier can be used to validate internal BPF programs. |
796 | 797 | ||
797 | Also in the new design, BPF is limited to 4096 insns, which means that any | 798 | Also in the new design, eBPF is limited to 4096 insns, which means that any |
798 | program will terminate quickly and will only call a fixed number of kernel | 799 | program will terminate quickly and will only call a fixed number of kernel |
799 | functions. Original BPF and the new format are two operand instructions, | 800 | functions. Original BPF and the new format are two operand instructions, |
800 | which helps to do one-to-one mapping between BPF insn and x86 insn during JIT. | 801 | which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. |
801 | 802 | ||
802 | The input context pointer for invoking the interpreter function is generic, | 803 | The input context pointer for invoking the interpreter function is generic, |
803 | its content is defined by a specific use case. For seccomp register R1 points | 804 | its content is defined by a specific use case. For seccomp register R1 points |