author | Alexei Starovoitov <ast@fb.com> | 2016-05-05 22:49:13 -0400 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2016-05-06 16:01:54 -0400 |
commit | f9c8d19d6c7c15a59963f80ec47e68808914abd4 (patch) | |
tree | aa1ef70115c0fb206623a613956412f3aa330cec /Documentation/networking | |
parent | db58ba45920255e967cc1d62a430cebd634b5046 (diff) |
bpf: add documentation for 'direct packet access'
explain how verifier checks safety of packet access
and update email addresses.
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/filter.txt | 85 |
1 files changed, 83 insertions, 2 deletions
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index 96da119a47e7..6aef0b5f3bc7 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1095,6 +1095,87 @@ all use cases. | |||
1095 | 1095 | ||
1096 | See details of eBPF verifier in kernel/bpf/verifier.c | 1096 | See details of eBPF verifier in kernel/bpf/verifier.c |
1097 | 1097 | ||
1098 | Direct packet access | ||
1099 | -------------------- | ||
1100 | In cls_bpf and act_bpf programs the verifier allows direct access to the packet | ||
1101 | data via skb->data and skb->data_end pointers. | ||
1102 | Ex: | ||
1103 | 1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ | ||
1104 | 2: r3 = *(u32 *)(r1 +76) /* load skb->data */ | ||
1105 | 3: r5 = r3 | ||
1106 | 4: r5 += 14 | ||
1107 | 5: if r5 > r4 goto pc+16 | ||
1108 | R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp | ||
1109 | 6: r0 = *(u16 *)(r3 +12) /* access bytes 12 and 13 of the packet */ | ||
1110 | |||
1111 | This 2-byte load from the packet is safe, since the program author | ||
1112 | did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5, which | ||
1113 | means that in the fall-through case the register R3 (which points to skb->data) | ||
1114 | has at least 14 directly accessible bytes. The verifier marks it | ||
1115 | as R3=pkt(id=0,off=0,r=14). | ||
1116 | id=0 means that no additional variables were added to the register. | ||
1117 | off=0 means that no additional constants were added. | ||
1118 | r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. | ||
1119 | Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points | ||
1120 | to the packet data, but constant 14 was added to the register, so | ||
1121 | it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) | ||
1122 | which is zero bytes. | ||
1123 | |||
1124 | More complex packet access may look like: | ||
1125 | R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp | ||
1126 | 6: r0 = *(u8 *)(r3 +7) /* load the byte at offset 7 of the packet */ | ||
1127 | 7: r4 = *(u8 *)(r3 +12) | ||
1128 | 8: r4 *= 14 | ||
1129 | 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ | ||
1130 | 10: r3 += r4 | ||
1131 | 11: r2 = r1 | ||
1132 | 12: r2 <<= 48 | ||
1133 | 13: r2 >>= 48 | ||
1134 | 14: r3 += r2 | ||
1135 | 15: r2 = r3 | ||
1136 | 16: r2 += 8 | ||
1137 | 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ | ||
1138 | 18: if r2 > r1 goto pc+2 | ||
1139 | R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp | ||
1140 | 19: r1 = *(u8 *)(r3 +4) | ||
1141 | The state of the register R3 is R3=pkt(id=2,off=0,r=8) | ||
1142 | id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some | ||
1143 | offset within a packet and since the program author did | ||
1144 | 'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). | ||
1145 | The verifier allows only the 'add' operation on packet registers. Any other | ||
1146 | operation will set the register state to 'unknown_value' and it won't be | ||
1147 | available for direct packet access. | ||
1148 | The operation 'r3 += rX' may overflow and become less than the original skb->data, | ||
1149 | so the verifier has to prevent that. It tracks the number of | ||
1150 | upper zero bits in all 'unknown_value' registers; when it sees an | ||
1151 | 'r3 += rX' instruction where rX is wider than a 16-bit value, it rejects the program with: | ||
1152 | "cannot add integer value with N upper zero bits to ptr_to_packet" | ||
1153 | Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is | ||
1154 | R4=inv56 which means that the upper 56 bits of the register are guaranteed | ||
1155 | to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since | ||
1156 | multiplying 8-bit value by constant 14 will keep upper 52 bits as zero. | ||
1157 | Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign | ||
1158 | extending. This logic is implemented in evaluate_reg_alu() function. | ||
1159 | |||
1160 | The end result is that a bpf program author can access the packet directly | ||
1161 | using normal C code such as: | ||
1162 | void *data = (void *)(long)skb->data; | ||
1163 | void *data_end = (void *)(long)skb->data_end; | ||
1164 | struct eth_hdr *eth = data; | ||
1165 | struct iphdr *iph = data + sizeof(*eth); | ||
1166 | struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); | ||
1167 | |||
1168 | if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) | ||
1169 | return 0; | ||
1170 | if (eth->h_proto != htons(ETH_P_IP)) | ||
1171 | return 0; | ||
1172 | if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) | ||
1173 | return 0; | ||
1174 | if (udp->dest == htons(53) || udp->source == htons(9)) | ||
1175 | ...; | ||
1176 | which makes such programs easier to write than with the LD_ABS insn, | ||
1177 | and significantly faster. | ||
1178 | |||
1098 | eBPF maps | 1179 | eBPF maps |
1099 | --------- | 1180 | --------- |
1100 | 'maps' is a generic storage of different types for sharing data between kernel | 1181 | 'maps' is a generic storage of different types for sharing data between kernel |
@@ -1293,5 +1374,5 @@ to give potential BPF hackers or security auditors a better overview of | |||
1293 | the underlying architecture. | 1374 | the underlying architecture. |
1294 | 1375 | ||
1295 | Jay Schulist <jschlst@samba.org> | 1376 | Jay Schulist <jschlst@samba.org> |
1296 | Daniel Borkmann <dborkman@redhat.com> | 1377 | Daniel Borkmann <daniel@iogearbox.net> |
1297 | Alexei Starovoitov <ast@plumgrid.com> | 1378 | Alexei Starovoitov <ast@kernel.org> |
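
The bookkeeping described in the 'Direct packet access' section added by this patch can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the code in kernel/bpf/verifier.c: the struct pkt_reg type and every helper name below are hypothetical, chosen to mirror the pkt(id,off,r) and invN notation used in the text.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a packet-pointer state, pkt(id=..,off=..,r=..). */
struct pkt_reg {
	int id;    /* changes when a variable is added to the pointer */
	int off;   /* sum of the constants added to the pointer */
	int range; /* r: bytes proven safe, counted from the off=0 pointer */
};

/* Model of 'rX += imm' on a packet pointer: only 'off' moves. */
static inline struct pkt_reg pkt_add_const(struct pkt_reg reg, int imm)
{
	reg.off += imm;
	return reg;
}

/* Is a load of 'size' bytes at 'reg + insn_off' inside the safe range? */
static inline bool pkt_access_ok(struct pkt_reg reg, int insn_off, int size)
{
	int start = reg.off + insn_off;

	return start >= 0 && size > 0 && start + size <= reg.range;
}

/* invN after a zero-extending load of 'size' bytes (u8 load gives inv56). */
static inline int zero_bits_after_load(int size)
{
	return 64 - size * 8;
}

/* invN after 'rX *= imm': the product may grow by the bit width of imm. */
static inline int zero_bits_after_mul_const(int zero_bits, uint64_t imm)
{
	int imm_bits = 0;

	while (imm) {
		imm_bits++;
		imm >>= 1;
	}
	zero_bits -= imm_bits;
	return zero_bits < 0 ? 0 : zero_bits;
}

/* invN after a logical 'rX >>= shift': the bits shifted in are zero. */
static inline int zero_bits_after_rshift(int zero_bits, int shift)
{
	zero_bits += shift;
	return zero_bits > 64 ? 64 : zero_bits;
}

/* 'pkt_ptr += rX' is accepted only if rX fits in 16 bits. */
static inline bool can_add_to_pkt_ptr(int zero_bits)
{
	return zero_bits >= 48;
}
```

Under this model, R3=pkt(id=0,off=0,r=14) passes pkt_access_ok(r3, 12, 2), while the R5 obtained via pkt_add_const(r3, 14) fails even a 1-byte load, matching the zero-byte range [R5, R5 + 14 - 14). Likewise a u8 load gives 56 upper zero bits, multiplying by 14 (a 4-bit constant) leaves 52, and a right shift by 48 of a fully unknown value leaves 48, so both values may be added to a packet pointer.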