| -rw-r--r-- | Documentation/networking/filter.txt | 161 |
1 files changed, 161 insertions, 0 deletions
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 1c7fc6baed84..ee78eba78a9d 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt | |||
| @@ -834,6 +834,167 @@ loops and other CFG validation; second step starts from the first insn and | |||
| 834 | descends all possible paths. It simulates execution of every insn and observes | 834 | descends all possible paths. It simulates execution of every insn and observes |
| 835 | the state change of registers and stack. | 835 | the state change of registers and stack. |
| 836 | 836 | ||
| 837 | eBPF opcode encoding | ||
| 838 | -------------------- | ||
| 839 | |||
| 840 | eBPF reuses most of the opcode encoding from classic BPF to simplify conversion | ||
| 841 | of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' | ||
| 842 | field is divided into three parts: | ||
| 843 | |||
| 844 | +----------------+--------+--------------------+ | ||
| 845 | | 4 bits | 1 bit | 3 bits | | ||
| 846 | | operation code | source | instruction class | | ||
| 847 | +----------------+--------+--------------------+ | ||
| 848 | (MSB) (LSB) | ||
| 849 | |||
| 850 | The three LSB bits store the instruction class, which is one of: | ||
| 851 | |||
| 852 | Classic BPF classes: eBPF classes: | ||
| 853 | |||
| 854 | BPF_LD 0x00 BPF_LD 0x00 | ||
| 855 | BPF_LDX 0x01 BPF_LDX 0x01 | ||
| 856 | BPF_ST 0x02 BPF_ST 0x02 | ||
| 857 | BPF_STX 0x03 BPF_STX 0x03 | ||
| 858 | BPF_ALU 0x04 BPF_ALU 0x04 | ||
| 859 | BPF_JMP 0x05 BPF_JMP 0x05 | ||
| 860 | BPF_RET 0x06 [ class 6 unused, for future if needed ] | ||
| 861 | BPF_MISC 0x07 BPF_ALU64 0x07 | ||
| 862 | |||
| 863 | When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... | ||
| 864 | |||
| 865 | BPF_K 0x00 | ||
| 866 | BPF_X 0x08 | ||
| 867 | |||
| 868 | * in classic BPF, this means: | ||
| 869 | |||
| 870 | BPF_SRC(code) == BPF_X - use register X as source operand | ||
| 871 | BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand | ||
| 872 | |||
| 873 | * in eBPF, this means: | ||
| 874 | |||
| 875 | BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand | ||
| 876 | BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand | ||
| 877 | |||
| 878 | ... and four MSB bits store operation code. | ||
| 879 | |||
| 880 | If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: | ||
| 881 | |||
| 882 | BPF_ADD 0x00 | ||
| 883 | BPF_SUB 0x10 | ||
| 884 | BPF_MUL 0x20 | ||
| 885 | BPF_DIV 0x30 | ||
| 886 | BPF_OR 0x40 | ||
| 887 | BPF_AND 0x50 | ||
| 888 | BPF_LSH 0x60 | ||
| 889 | BPF_RSH 0x70 | ||
| 890 | BPF_NEG 0x80 | ||
| 891 | BPF_MOD 0x90 | ||
| 892 | BPF_XOR 0xa0 | ||
| 893 | BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ | ||
| 894 | BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ | ||
| 895 | BPF_END 0xd0 /* eBPF only: endianness conversion */ | ||
| 896 | |||
| 897 | If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: | ||
| 898 | |||
| 899 | BPF_JA 0x00 | ||
| 900 | BPF_JEQ 0x10 | ||
| 901 | BPF_JGT 0x20 | ||
| 902 | BPF_JGE 0x30 | ||
| 903 | BPF_JSET 0x40 | ||
| 904 | BPF_JNE 0x50 /* eBPF only: jump != */ | ||
| 905 | BPF_JSGT 0x60 /* eBPF only: signed '>' */ | ||
| 906 | BPF_JSGE 0x70 /* eBPF only: signed '>=' */ | ||
| 907 | BPF_CALL 0x80 /* eBPF only: function call */ | ||
| 908 | BPF_EXIT 0x90 /* eBPF only: function return */ | ||
| 909 | |||
| 910 | So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF | ||
| 911 | and eBPF. There are only two registers in classic BPF, so it means A += X. | ||
| 912 | In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, | ||
| 913 | BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous | ||
| 914 | dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF. | ||
| 915 | |||
| 916 | Classic BPF uses the BPF_MISC class to represent A = X and X = A moves. | ||
| 917 | eBPF uses the BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no | ||
| 918 | BPF_MISC operations in eBPF, class 7 is used as BPF_ALU64 to mean | ||
| 919 | exactly the same operations as BPF_ALU, but with 64-bit wide operands | ||
| 920 | instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: | ||
| 921 | dst_reg = dst_reg + src_reg | ||
| 922 | |||
| 923 | Classic BPF wastes the whole BPF_RET class to represent a single 'ret' | ||
| 924 | operation. Classic BPF_RET | BPF_K means copy imm32 into return register | ||
| 925 | and perform function exit. eBPF is modeled to match a CPU, so BPF_JMP | BPF_EXIT | ||
| 926 | in eBPF means function exit only. The eBPF program needs to store return | ||
| 927 | value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently | ||
| 928 | unused and reserved for future use. | ||
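The ALU/JMP encoding described above can be sketched in C as follows. This is a minimal illustration, not kernel code: the constants are copied from the tables in this section, the three mask macros follow the 4/1/3 bit split shown in the diagram, and the helper names is_alu64(), uses_src_reg() and encoding_examples() are hypothetical, introduced only for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks for the 8-bit 'code' of arithmetic and jump instructions. */
#define BPF_CLASS(code) ((code) & 0x07)	/* 3 LSB bits: instruction class */
#define BPF_SRC(code)   ((code) & 0x08)	/* 4th bit: source operand */
#define BPF_OP(code)    ((code) & 0xf0)	/* 4 MSB bits: operation code */

/* Values copied from the tables in this section. */
#define BPF_ALU   0x04
#define BPF_JMP   0x05
#define BPF_ALU64 0x07
#define BPF_K     0x00
#define BPF_X     0x08
#define BPF_ADD   0x00
#define BPF_XOR   0xa0
#define BPF_MOV   0xb0
#define BPF_EXIT  0x90

/* Hypothetical helper: true if the opcode works on 64-bit wide operands. */
static inline int is_alu64(uint8_t code)
{
	return BPF_CLASS(code) == BPF_ALU64;
}

/* Hypothetical helper: true if the source operand is 'src_reg'. */
static inline int uses_src_reg(uint8_t code)
{
	return BPF_SRC(code) == BPF_X;
}

/* Hypothetical self-check: the combinations discussed in the text. */
static inline void encoding_examples(void)
{
	assert((BPF_ADD | BPF_X | BPF_ALU) == 0x0c);	/* 32-bit add, register source */
	assert((BPF_ADD | BPF_X | BPF_ALU64) == 0x0f);	/* 64-bit add, register source */
	assert((BPF_XOR | BPF_K | BPF_ALU) == 0xa4);	/* 32-bit xor, immediate source */
	assert((BPF_MOV | BPF_X | BPF_ALU) == 0xbc);	/* 32-bit register move */
	assert((BPF_JMP | BPF_EXIT) == 0x95);		/* function exit */
}
```

Because the three fields occupy disjoint bits, an opcode is built by OR-ing one value from each table and taken apart again with the masks.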
| 929 | |||
| 930 | For load and store instructions the 8-bit 'code' field is divided as: | ||
| 931 | |||
| 932 | +--------+--------+-------------------+ | ||
| 933 | | 3 bits | 2 bits | 3 bits | | ||
| 934 | | mode | size | instruction class | | ||
| 935 | +--------+--------+-------------------+ | ||
| 936 | (MSB) (LSB) | ||
| 937 | |||
| 938 | Size modifier is one of ... | ||
| 939 | |||
| 940 | BPF_W 0x00 /* word */ | ||
| 941 | BPF_H 0x08 /* half word */ | ||
| 942 | BPF_B 0x10 /* byte */ | ||
| 943 | BPF_DW 0x18 /* eBPF only, double word */ | ||
| 944 | |||
| 945 | ... which encodes size of load/store operation: | ||
| 946 | |||
| 947 | B - 1 byte | ||
| 948 | H - 2 bytes | ||
| 949 | W - 4 bytes | ||
| 950 | DW - 8 bytes (eBPF only) | ||
| 951 | |||
| 952 | Mode modifier is one of: | ||
| 953 | |||
| 954 | BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ | ||
| 955 | BPF_ABS 0x20 | ||
| 956 | BPF_IND 0x40 | ||
| 957 | BPF_MEM 0x60 | ||
| 958 | BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ | ||
| 959 | BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ | ||
| 960 | BPF_XADD 0xc0 /* eBPF only, exclusive add */ | ||
| 961 | |||
| 962 | eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and | ||
| 963 | (BPF_IND | <size> | BPF_LD) which are used to access packet data. | ||
| 964 | |||
| 965 | They had to be carried over from classic BPF so that socket filters keep | ||
| 966 | strong performance when running in the eBPF interpreter. These instructions | ||
| 967 | can only be used when the interpreter context is a pointer to 'struct sk_buff', | ||
| 968 | and they have seven implicit operands. Register R6 is an implicit input that | ||
| 969 | must contain a pointer to the sk_buff. Register R0 is an implicit output that | ||
| 970 | contains the data fetched from the packet. Registers R1-R5 are scratch | ||
| 971 | registers and must not be used to store data across BPF_ABS | BPF_LD or | ||
| 972 | BPF_IND | BPF_LD instructions. | ||
| 973 | |||
| 974 | These instructions have an implicit program exit condition as well. When an | ||
| 975 | eBPF program tries to access data beyond the packet boundary, the | ||
| 976 | interpreter aborts execution of the program. JIT compilers must therefore | ||
| 977 | preserve this property. The src_reg and imm32 fields are explicit inputs to | ||
| 978 | these instructions. | ||
| 979 | |||
| 980 | For example: | ||
| 981 | |||
| 982 | BPF_IND | BPF_W | BPF_LD means: | ||
| 983 | |||
| 984 | R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) | ||
| 985 | and R1 - R5 are scratched. | ||
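The BPF_IND | BPF_W | BPF_LD semantics above, including the implicit exit on an out-of-bounds access, can be modelled in user space roughly as below. This is a simplified sketch under stated assumptions: a plain byte buffer and length stand in for the packet reachable through the sk_buff in R6, a return value of -1 stands in for the interpreter aborting the program, and any negative offset is simply rejected. The function names model_ld_ind_w() and ld_ind_example() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space model of BPF_IND | BPF_W | BPF_LD.
 * Returns 0 and stores the host-order word in *r0, or returns -1 where
 * the interpreter would abort the program.
 */
static inline int model_ld_ind_w(const uint8_t *data, size_t len,
				 uint32_t src_reg, int32_t imm32, uint32_t *r0)
{
	int64_t off = (int64_t)src_reg + imm32;

	/* Simplification: reject negative offsets and reads past the end. */
	if (off < 0 || (uint64_t)off + 4 > len)
		return -1;

	/* Packet data is big-endian; assembling the bytes mirrors ntohl(). */
	*r0 = ((uint32_t)data[off] << 24) | ((uint32_t)data[off + 1] << 16) |
	      ((uint32_t)data[off + 2] << 8) | (uint32_t)data[off + 3];
	return 0;
}

/* Hypothetical self-check on a five-byte "packet". */
static inline void ld_ind_example(void)
{
	static const uint8_t pkt[] = { 0x12, 0x34, 0x56, 0x78, 0x9a };
	uint32_t r0 = 0;

	assert(model_ld_ind_w(pkt, sizeof(pkt), 1, 0, &r0) == 0);
	assert(r0 == 0x3456789a);
	/* Offset 2 would need bytes 2..5, but only 0..4 exist. */
	assert(model_ld_ind_w(pkt, sizeof(pkt), 0, 2, &r0) == -1);
}
```

The model does not touch R1-R5; a real implementation is free to clobber them, which is why a program must not keep live values there across these instructions.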
| 986 | |||
| 987 | Unlike the classic BPF instruction set, eBPF has generic load/store operations: | ||
| 988 | |||
| 989 | BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg | ||
| 990 | BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 | ||
| 991 | BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) | ||
| 992 | BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg | ||
| 993 | BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg | ||
| 994 | |||
| 995 | Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and | ||
| 996 | 2 byte atomic increments are not supported. | ||
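The load/store encoding can be decoded in the same way as the ALU/JMP one. The sketch below copies its constants from the tables in this section and derives the two mask macros from the 3/2/3 bit split shown in the diagram; the helpers bpf_size_bytes(), xadd_is_supported() and ldst_examples() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks for the 8-bit 'code' of load and store instructions. */
#define BPF_CLASS(code) ((code) & 0x07)	/* 3 LSB bits: instruction class */
#define BPF_SIZE(code)  ((code) & 0x18)	/* 2 bits: size modifier */
#define BPF_MODE(code)  ((code) & 0xe0)	/* 3 MSB bits: mode modifier */

/* Values copied from the tables in this section. */
#define BPF_LD   0x00
#define BPF_LDX  0x01
#define BPF_ST   0x02
#define BPF_STX  0x03
#define BPF_W    0x00
#define BPF_H    0x08
#define BPF_B    0x10
#define BPF_DW   0x18
#define BPF_IND  0x40
#define BPF_MEM  0x60
#define BPF_XADD 0xc0

/* Hypothetical helper: width in bytes of the access encoded in 'code'. */
static inline int bpf_size_bytes(uint8_t code)
{
	switch (BPF_SIZE(code)) {
	case BPF_B:
		return 1;
	case BPF_H:
		return 2;
	case BPF_W:
		return 4;
	default:
		return 8;	/* BPF_DW, the only remaining 2-bit value */
	}
}

/* Hypothetical helper: only 4 and 8 byte atomic adds are supported. */
static inline int xadd_is_supported(uint8_t code)
{
	if (BPF_CLASS(code) != BPF_STX || BPF_MODE(code) != BPF_XADD)
		return 0;
	return BPF_SIZE(code) == BPF_W || BPF_SIZE(code) == BPF_DW;
}

/* Hypothetical self-check using opcodes named in the text. */
static inline void ldst_examples(void)
{
	assert((BPF_IND | BPF_W | BPF_LD) == 0x40);
	assert((BPF_MEM | BPF_DW | BPF_STX) == 0x7b);
	assert((BPF_XADD | BPF_W | BPF_STX) == 0xc3);
	assert((BPF_XADD | BPF_DW | BPF_STX) == 0xdb);
}
```

A verifier-style check for the "1 and 2 byte atomic increments are not supported" rule then reduces to a single call to xadd_is_supported() on the opcode byte.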
| 997 | |||
| 837 | Testing | 998 | Testing |
| 838 | ------- | 999 | ------- |
| 839 | 1000 | ||
