doc: filter: extend BPF documentation to document new internals

Further extend the current BPF documentation to document new BPF engine
internals. Joint work with Daniel Borkmann.

Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
author: Alexei Starovoitov <ast@plumgrid.com> 2014-03-28 13:58:26 -0400
committer: David S. Miller <davem@davemloft.net> 2014-03-31 00:45:09 -0400
commit: 9a985cdc5ccb0d557720221d01bd70c19f04bb8c (patch)
tree: 495b67bcf755829a5409da5b7444ea9b93f60b35 /Documentation/networking
parent: bd4cf0ed331a275e9bf5a49e6d0fd55dffc551b8 (diff)
1 files changed, 125 insertions, 0 deletions
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d2f5cc..81f940f4e884 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>:
 For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
 toolchain for developing and testing the kernel's JIT compiler.
+BPF kernel internals
+--------------------
+Internally, the kernel interpreter uses a different BPF instruction set
+format that follows the same underlying principles as the BPF described in
+the previous paragraphs. However, this format is modelled more closely on
+the underlying architecture and mimics native instruction sets, so that
+better performance can be achieved (more details later).
+
+It is designed to be JITed with a one-to-one mapping, which also opens up
+the possibility for GCC/LLVM compilers to generate optimized BPF code through
+a BPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind of writing programs in "restricted C" and compiling them into BPF with
+an optional GCC/LLVM backend, so that it can be just-in-time mapped to modern
+64-bit CPUs with minimal performance overhead over two steps, that is,
+C -> BPF -> native code.
+
+Currently, the new format is used for running user BPF programs, which
+include seccomp BPF, classic socket filters, the cls_bpf traffic classifier,
+the team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, the PTP dissector/classifier, and much more. They are all
+internally converted by the kernel into the new instruction set
+representation and run in the extended interpreter. For in-kernel handlers,
+this all works transparently by using sk_unattached_filter_create() for
+setting up the filter and sk_unattached_filter_destroy() for destroying it.
+The macro SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF
+function to run the filter. 'filter' is a pointer to the struct sk_filter
+that we got from sk_unattached_filter_create(), and 'ctx' is the given
+context (e.g. an skb pointer). All constraints and restrictions from
+sk_chk_filter() apply before the conversion to the new layout is done behind
+the scenes!
+
+Currently, for JITing, the user BPF format is used and the current BPF JIT
+compilers are reused whenever possible. In other words, we do not (yet!)
+perform JIT compilation in the new layout. However, future work will
+successively migrate the traditional JIT compilers to the new instruction
+format as well, so that they profit from the very same benefits. Thus, when
+speaking about JIT in the following, a JIT compiler (TBD) for the new
+instruction format is meant.
+
+Some core changes of the new internal format:
+- Number of registers increases from 2 to 10:
+
+  The old format had two registers, A and X, and a hidden frame pointer. The
+  new layout extends this to 10 internal registers and a read-only frame
+  pointer. Since 64-bit CPUs pass arguments to functions via registers, the
+  number of args from a BPF program to an in-kernel function is restricted
+  to 5, and one register is used to accept the return value from an in-kernel
+  function. Natively, x86_64 passes the first 6 arguments in registers, and
+  aarch64/sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6
+  callee saved registers, and aarch64/sparcv9/mips64 have 11 or more callee
+  saved registers.
+
+  Therefore, the BPF calling convention is defined as:
+
+    * R0        - return value from in-kernel function
+    * R1 - R5   - arguments from BPF program to in-kernel function
+    * R6 - R9   - callee saved registers that in-kernel function will preserve
+    * R10       - read-only frame pointer to access stack
+
+  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
+  etc., and the BPF calling convention maps directly to the ABIs used by the
+  kernel on 64-bit architectures.
+
+  On 32-bit architectures, a JIT may map programs that use only 32-bit
+  arithmetic and let more complex programs be interpreted.
+
+  R0 - R5 are scratch registers, and a BPF program needs to spill/fill them
+  if necessary across calls. Note that there is only one BPF program (== one
+  BPF main routine) and it cannot call other BPF functions; it can only call
+  predefined in-kernel functions.
+
+- Register width increases from 32-bit to 64-bit:
+
+  Still, the semantics of the original 32-bit ALU operations are preserved
+  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
+  subregisters that zero-extend into 64-bit when they are written to.
+  That behavior maps directly to the x86_64 and arm64 subregister definition,
+  but makes other JITs more difficult.
+
+  32-bit architectures run 64-bit internal BPF programs via the interpreter.
+  Their JITs may convert BPF programs that only use 32-bit subregisters into
+  the native instruction set and let the rest be interpreted.
+
+  Operation is 64-bit because, on 64-bit architectures, pointers are also
+  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
+  32-bit BPF registers would otherwise require defining a register-pair ABI.
+  A direct BPF register to HW register mapping would then not be possible,
+  and the JIT would need to do combine/split/move operations for every
+  register in and out of the function, which is complex, bug prone and slow.
+  Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+  While the original design has constructs such as "if (cond) jump_true;
+  else jump_false;", they are replaced with alternative constructs like
+  "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+  calls from/to other kernel functions:
+
+  After a kernel function call, R1 - R5 are reset to unreadable and R0 has
+  the return type of the function. Since R6 - R9 are callee saved, their
+  state is preserved across the call.
+
+Also in the new design, BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Both original BPF and the new format use two-operand instructions,
+which helps to do a one-to-one mapping between a BPF insn and an x86 insn
+during JIT.
+
+The input context pointer for invoking the interpreter function is generic;
+its content is defined by the specific use case. For seccomp, register R1
+points to seccomp_data; for converted BPF filters, R1 points to an skb.
+
+A program that is translated internally consists of instructions with the
+following elements:
+
+  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32
+
+Just like the original BPF, the new format runs within a controlled
+environment, is deterministic, and the kernel can easily prove that. The
+safety of the program can be determined in two steps: the first step does a
+depth-first search to disallow loops and performs other CFG validation; the
+second step starts from the first insn and descends all possible paths. It
+simulates the execution of every insn and observes the state change of
+registers and stack.
+
 Misc
 ----
@@ -561,3 +685,4 @@ the underlying architecture.
 Jay Schulist <jschlst@samba.org>
 Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
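
The layout line added by this patch (op:16, jt:8, jf:8, k:32 ==> op:8, a_reg:4, x_reg:4, off:16, imm:32) may be easier to picture as two 8-byte C structs. The following is only an illustrative sketch, not the kernel's definition: the struct names and the make_insn() helper are hypothetical, the signedness of 'off' and 'imm' is an assumption, and bit-fields on an 8-bit type rely on GCC/Clang behaviour. Only the field names and widths are taken from the patch text.

```c
#include <assert.h>
#include <stdint.h>

/* Classic layout from the patch text: op:16, jt:8, jf:8, k:32. */
struct classic_insn {		/* hypothetical name */
	uint16_t op;		/* opcode */
	uint8_t  jt;		/* jump offset if condition is true */
	uint8_t  jf;		/* jump offset if condition is false */
	uint32_t k;		/* generic 32-bit operand */
};

/* Internal layout from the patch text: op:8, a_reg:4, x_reg:4, off:16, imm:32. */
struct internal_insn {		/* hypothetical name */
	uint8_t  op;		/* opcode */
	uint8_t  a_reg:4;	/* first register operand (R0 - R10 fit in 4 bits) */
	uint8_t  x_reg:4;	/* second register operand */
	int16_t  off;		/* offset; signedness is an assumption */
	int32_t  imm;		/* immediate; signedness is an assumption */
};

/* Both encodings occupy the same 64 bits per instruction. */
static_assert(sizeof(struct classic_insn) == 8, "classic insn is 8 bytes");
static_assert(sizeof(struct internal_insn) == 8, "internal insn is 8 bytes");

/* Hypothetical helper: fill in one internal instruction. */
static struct internal_insn make_insn(uint8_t op, uint8_t a_reg, uint8_t x_reg,
				      int16_t off, int32_t imm)
{
	struct internal_insn insn = {
		.op    = op,
		.a_reg = a_reg & 0xf,	/* keep register numbers within 4 bits */
		.x_reg = x_reg & 0xf,
		.off   = off,
		.imm   = imm,
	};

	return insn;
}
```

Two 4-bit register fields are enough to name R0 through R10, and the single 16-bit 'off' field takes the place of the two 8-bit jt/jf fields, which fits the jt/fall-through change described in the patch.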