Diffstat (limited to 'Documentation/networking/packet_mmap.txt')
-rw-r--r-- | Documentation/networking/packet_mmap.txt | 399 |
1 files changed, 399 insertions, 0 deletions
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt new file mode 100644 index 000000000000..8d4cf78258e4 --- /dev/null +++ b/Documentation/networking/packet_mmap.txt | |||
@@ -0,0 +1,399 @@ | |||
1 | -------------------------------------------------------------------------------- | ||
2 | + ABSTRACT | ||
3 | -------------------------------------------------------------------------------- | ||
4 | |||
5 | This file documents the CONFIG_PACKET_MMAP option available with the PACKET | ||
6 | socket interface on 2.4 and 2.6 kernels. This type of socket is used to | ||
7 | capture network traffic with utilities like tcpdump or any other that uses | ||
8 | the libpcap library. | ||
9 | |||
10 | You can find the latest version of this document at | ||
11 | |||
12 | http://pusa.uv.es/~ulisses/packet_mmap/ | ||
13 | |||
14 | Please send me your comments to | ||
15 | |||
16 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> | ||
17 | |||
18 | ------------------------------------------------------------------------------- | ||
19 | + Why use PACKET_MMAP | ||
20 | -------------------------------------------------------------------------------- | ||
21 | |||
22 | In Linux 2.4/2.6, if PACKET_MMAP is not enabled, the capture process is very | ||
23 | inefficient. It uses very limited buffers and requires one system call | ||
24 | to capture each packet, or two if you want the packet's | ||
25 | timestamp (as libpcap always does). | ||
26 | |||
27 | On the other hand, PACKET_MMAP is very efficient. It provides a circular buffer | ||
28 | of configurable size mapped in user space. This way, reading packets just | ||
29 | means waiting for them; most of the time there is no need to issue a single | ||
30 | system call. Sharing the buffer between the kernel and the user process | ||
31 | also has the benefit of minimizing packet copies. | ||
32 | |||
33 | Using PACKET_MMAP improves the performance of the capture process, | ||
34 | but it isn't everything. At least, if you are capturing at high speeds (this | ||
35 | is relative to the CPU speed), you should check whether the device driver of your | ||
36 | network interface card supports some sort of interrupt load mitigation or | ||
37 | (even better) NAPI, and make sure it is enabled. | ||
38 | |||
39 | -------------------------------------------------------------------------------- | ||
40 | + How to use CONFIG_PACKET_MMAP | ||
41 | -------------------------------------------------------------------------------- | ||
42 | |||
43 | From the user standpoint, you should use the higher level libpcap library, which | ||
44 | is a de facto standard, portable across nearly all operating systems | ||
45 | including Win32. | ||
46 | |||
47 | That said, at the time of this writing, official libpcap 0.8.1 is out and doesn't include | ||
48 | support for PACKET_MMAP, and the libpcap included in your distribution probably doesn't either. | ||
49 | |||
50 | I'm aware of two implementations of PACKET_MMAP in libpcap: | ||
51 | |||
52 | http://pusa.uv.es/~ulisses/packet_mmap/ (by Simon Patarin, based on libpcap 0.6.2) | ||
53 | http://public.lanl.gov/cpw/ (by Phil Wood, based on latest libpcap) | ||
54 | |||
55 | The rest of this document is intended for people who want to understand | ||
56 | the low level details or want to improve libpcap by including PACKET_MMAP | ||
57 | support. | ||
58 | |||
59 | -------------------------------------------------------------------------------- | ||
60 | + How to use CONFIG_PACKET_MMAP directly | ||
61 | -------------------------------------------------------------------------------- | ||
62 | |||
63 | From the system call standpoint, the use of PACKET_MMAP involves | ||
64 | the following process: | ||
65 | |||
66 | |||
67 | [setup]     socket() -------> creation of the capture socket | ||
68 |             setsockopt() ---> allocation of the circular buffer (ring) | ||
69 |             mmap() ---------> mapping of the allocated buffer to the | ||
70 |                               user process | ||
71 | |||
72 | [capture]   poll() ---------> to wait for incoming packets | ||
73 | |||
74 | [shutdown]  close() --------> destruction of the capture socket and | ||
75 |                               deallocation of all associated | ||
76 |                               resources. | ||
77 | |||
78 | |||
79 | Socket creation and destruction is straightforward, and is done | ||
80 | the same way with or without PACKET_MMAP: | ||
81 | |||
82 | int fd; | ||
83 | |||
84 | fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); | ||
85 | |||
86 | where mode is SOCK_RAW for the raw interface, where link level | ||
87 | information can be captured, or SOCK_DGRAM for the cooked | ||
88 | interface, where link level information capture is not | ||
89 | supported and a link level pseudo-header is provided | ||
90 | by the kernel. | ||
91 | |||
92 | The destruction of the socket and all associated resources | ||
93 | is done by a simple call to close(fd). | ||
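The creation and destruction steps can be sketched as follows. This is only
an illustration: the helper names open_capture_socket and
close_capture_socket are hypothetical, and opening a packet socket normally
requires privileges, so the call may fail with errno set.

```c
#include <arpa/inet.h>        /* htons */
#include <assert.h>
#include <errno.h>
#include <linux/if_ether.h>   /* ETH_P_ALL */
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a packet socket in the given mode
 * (SOCK_RAW or SOCK_DGRAM). Returns the descriptor, or -1 with errno set. */
static int open_capture_socket(int mode)
{
    int fd, saved_errno;

    assert(mode == SOCK_RAW || mode == SOCK_DGRAM);

    fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));
    if (fd < 0) {
        saved_errno = errno;            /* perror may modify errno */
        perror("socket(PF_PACKET)");
        errno = saved_errno;
    }
    return fd;
}

/* Hypothetical helper: closing the socket also releases every resource
 * associated with it. */
static int close_capture_socket(int fd)
{
    return close(fd);
}
```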
94 | |||
95 | Next I will describe the PACKET_MMAP settings and their constraints, | ||
96 | then the mapping of the circular buffer in the user process and | ||
97 | the use of this buffer. | ||
98 | |||
99 | -------------------------------------------------------------------------------- | ||
100 | + PACKET_MMAP settings | ||
101 | -------------------------------------------------------------------------------- | ||
102 | |||
103 | |||
104 | PACKET_MMAP is set up from user level code with a call like | ||
105 | |||
106 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) | ||
107 | |||
108 | The most significant argument in the previous call is the req parameter, | ||
109 | which must have the following structure: | ||
110 | |||
111 | struct tpacket_req | ||
112 | { | ||
113 | unsigned int tp_block_size; /* Minimal size of contiguous block */ | ||
114 | unsigned int tp_block_nr; /* Number of blocks */ | ||
115 | unsigned int tp_frame_size; /* Size of frame */ | ||
116 | unsigned int tp_frame_nr; /* Total number of frames */ | ||
117 | }; | ||
118 | |||
119 | This structure is defined in /usr/include/linux/if_packet.h and establishes a | ||
120 | circular buffer (ring) of unswappable memory mapped in the capture process. | ||
121 | Being mapped in the capture process allows reading the captured frames and | ||
122 | related meta-information like timestamps without requiring a system call. | ||
123 | |||
124 | Captured frames are grouped in blocks. Each block is a physically contiguous | ||
125 | region of memory and holds tp_block_size/tp_frame_size frames. The total number | ||
126 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because | ||
127 | |||
128 | frames_per_block = tp_block_size/tp_frame_size | ||
129 | |||
130 | indeed, packet_set_ring checks that the following condition is true | ||
131 | |||
132 | frames_per_block * tp_block_nr == tp_frame_nr | ||
133 | |||
134 | |||
135 | Let's see an example, with the following values: | ||
136 | |||
137 | tp_block_size= 4096 | ||
138 | tp_frame_size= 2048 | ||
139 | tp_block_nr = 4 | ||
140 | tp_frame_nr = 8 | ||
141 | |||
142 | we will get the following buffer structure: | ||
143 | |||
144 |         block #1                 block #2 | ||
145 | +---------+---------+    +---------+---------+ | ||
146 | | frame 1 | frame 2 |    | frame 3 | frame 4 | | ||
147 | +---------+---------+    +---------+---------+ | ||
148 | |||
149 |         block #3                 block #4 | ||
150 | +---------+---------+    +---------+---------+ | ||
151 | | frame 5 | frame 6 |    | frame 7 | frame 8 | | ||
152 | +---------+---------+    +---------+---------+ | ||
153 | |||
154 | A frame can be of any size, with the only condition that it fits in a block. A block | ||
155 | can only hold an integer number of frames, or in other words, a frame cannot | ||
156 | span two blocks, so there are some details you have to take into | ||
157 | account when choosing the frame_size. See "Mapping and use of the circular | ||
158 | buffer (ring)". | ||
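A minimal sketch of filling the request and passing it to the kernel,
assuming the fields described above. The helper names fill_ring_req and
request_rx_ring are hypothetical; the fallback value for SOL_PACKET is the
one used on Linux and is only defined here in case the C library headers
do not provide it.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_req, PACKET_RX_RING */
#include <string.h>
#include <sys/socket.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263        /* Linux value, normally supplied by libc */
#endif

/* Hypothetical helper: fill req so that tp_frame_nr is consistent with
 * the other three fields. */
static void fill_ring_req(struct tpacket_req *req, unsigned int block_size,
                          unsigned int block_nr, unsigned int frame_size)
{
    assert(frame_size != 0 && frame_size <= block_size);

    memset(req, 0, sizeof(*req));
    req->tp_block_size = block_size;
    req->tp_block_nr   = block_nr;
    req->tp_frame_size = frame_size;
    /* frames_per_block * tp_block_nr, the value packet_set_ring expects */
    req->tp_frame_nr   = (block_size / frame_size) * block_nr;
}

/* Hypothetical helper: ask the kernel to allocate the ring on fd.
 * Returns 0 on success, -1 with errno set on failure. */
static int request_rx_ring(int fd, const struct tpacket_req *req)
{
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```

With the example values (block size 4096, 4 blocks, frame size 2048),
fill_ring_req sets tp_frame_nr to 8.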
159 | |||
160 | |||
161 | -------------------------------------------------------------------------------- | ||
162 | + PACKET_MMAP setting constraints | ||
163 | -------------------------------------------------------------------------------- | ||
164 | |||
165 | In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch), | ||
166 | the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or | ||
167 | 16384 in a 64 bit architecture. For information on these kernel versions | ||
168 | see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt | ||
169 | |||
170 | Block size limit | ||
171 | ------------------ | ||
172 | |||
173 | As stated earlier, each block is a contiguous physical region of memory. These | ||
174 | memory regions are allocated with calls to the __get_free_pages() function. As | ||
175 | the name indicates, this function allocates pages of memory, and the second | ||
176 | argument is "order", meaning a power-of-two number of pages, that is | ||
177 | (for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes, | ||
178 | order=2 ==> 16384 bytes, etc. The maximum size of a | ||
179 | region allocated by __get_free_pages is determined by the MAX_ORDER macro. More | ||
180 | precisely the limit can be calculated as: | ||
181 | |||
182 | PAGE_SIZE << MAX_ORDER | ||
183 | |||
184 | In an i386 architecture PAGE_SIZE is 4096 bytes | ||
185 | In a 2.4/i386 kernel MAX_ORDER is 10 | ||
186 | In a 2.6/i386 kernel MAX_ORDER is 11 | ||
187 | |||
188 | So __get_free_pages can allocate as much as 4MB in a 2.4 kernel or 8MB in a 2.6 | ||
189 | kernel, with an i386 architecture. | ||
190 | |||
191 | User space programs can include /usr/include/sys/user.h and | ||
192 | /usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations. | ||
193 | |||
194 | The page size can also be determined dynamically with the getpagesize (2) | ||
195 | system call. | ||
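The block size limit can be computed with a few lines of C. This is a
sketch: max_block_size and runtime_page_size are hypothetical names, and
MAX_ORDER is passed as a parameter because it is a kernel build constant
that user space cannot query in a portable way. sysconf(_SC_PAGESIZE) is
used here as an equivalent of getpagesize.

```c
#include <assert.h>
#include <unistd.h>

/* PAGE_SIZE << MAX_ORDER, following the formula in the text. */
static unsigned long max_block_size(unsigned long page_size,
                                    unsigned int max_order)
{
    assert(page_size != 0);
    return page_size << max_order;
}

/* Page size of the running system, or 0 if it cannot be determined. */
static unsigned long runtime_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);

    return sz > 0 ? (unsigned long)sz : 0;
}
```

For the i386 values quoted above, max_block_size(4096, 10) is 4194304
bytes and max_block_size(4096, 11) is 8388608 bytes.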
196 | |||
197 | |||
198 | Block number limit | ||
199 | -------------------- | ||
200 | |||
201 | To understand the constraints of PACKET_MMAP, we have to see the structure | ||
202 | used to hold the pointers to each block. | ||
203 | |||
204 | Currently, this structure is a vector called pg_vec, dynamically allocated | ||
205 | with kmalloc; its size limits the number of blocks that can be allocated. | ||
206 | |||
207 |     +---+---+---+---+ | ||
208 |     | x | x | x | x | | ||
209 |     +---+---+---+---+ | ||
210 |       |   |   |   | | ||
211 |       |   |   |   v | ||
212 |       |   |   v  block #4 | ||
213 |       |   v  block #3 | ||
214 |       v  block #2 | ||
215 |      block #1 | ||
216 | |||
217 | |||
218 | kmalloc allocates any number of bytes of physically contiguous memory from | ||
219 | a pool of pre-determined sizes. This pool of memory is maintained by the slab | ||
220 | allocator, which is ultimately responsible for doing the allocation and | ||
221 | hence imposes the maximum memory that kmalloc can allocate. | ||
222 | |||
223 | In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The | ||
224 | predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" | ||
225 | entries of /proc/slabinfo | ||
226 | |||
227 | In a 32 bit architecture, pointers are 4 bytes long, so the total number of | ||
228 | pointers to blocks is | ||
229 | |||
230 | 131072/4 = 32768 blocks | ||
231 | |||
232 | |||
233 | PACKET_MMAP buffer size calculator | ||
234 | ------------------------------------ | ||
235 | |||
236 | Definitions: | ||
237 | |||
238 | <size-max>    : is the maximum size allocable with kmalloc (see /proc/slabinfo) | ||
239 | <pointer size>: depends on the architecture -- sizeof(void *) | ||
240 | <pagesize>    : depends on the architecture -- PAGE_SIZE or getpagesize (2) | ||
241 | <max-order>   : is the value defined with MAX_ORDER | ||
242 | <frame size>  : it's an upper bound of frame's capture size (more on this later) | ||
243 | |||
244 | from these definitions we will derive | ||
245 | |||
246 | <block number> = <size-max>/<pointer size> | ||
247 | <block size> = <pagesize> << <max-order> | ||
248 | |||
249 | so, the max buffer size is | ||
250 | |||
251 | <block number> * <block size> | ||
252 | |||
253 | and the number of frames is | ||
254 | |||
255 | <block number> * <block size> / <frame size> | ||
256 | |||
257 | Suppose the following parameters, which apply to a 2.6 kernel and an | ||
258 | i386 architecture: | ||
259 | |||
260 | <size-max> = 131072 bytes | ||
261 | <pointer size> = 4 bytes | ||
262 | <pagesize> = 4096 bytes | ||
263 | <max-order> = 11 | ||
264 | |||
265 | and a value for <frame size> of 2048 bytes. These parameters will yield | ||
266 | |||
267 | <block number> = 131072/4 = 32768 blocks | ||
268 | <block size> = 4096 << 11 = 8 MiB. | ||
269 | |||
270 | and hence the buffer will have a 262144 MiB size. So it can hold | ||
271 | 262144 MiB / 2048 bytes = 134217728 frames | ||
272 | |||
273 | |||
274 | Actually, this buffer size is not possible with an i386 architecture. | ||
275 | Remember that the memory is allocated in kernel space; in the case of | ||
276 | i386, the kernel's memory size is limited to 1GiB. | ||
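The calculator above translates directly into code. The sketch below uses
hypothetical names (struct ring_limits, ring_limits_calc) and 64 bit
arithmetic so the theoretical result does not overflow; it computes the
upper bounds only and does not account for the kernel memory limit just
mentioned.

```c
#include <assert.h>

struct ring_limits {                 /* hypothetical result type */
    unsigned long long block_nr;     /* <block number> */
    unsigned long long block_size;   /* <block size>, in bytes */
    unsigned long long buffer_size;  /* max buffer size, in bytes */
    unsigned long long frame_nr;     /* frames that fit in the buffer */
};

static struct ring_limits ring_limits_calc(unsigned long long size_max,
                                           unsigned long long pointer_size,
                                           unsigned long long pagesize,
                                           unsigned int max_order,
                                           unsigned long long frame_size)
{
    struct ring_limits l;

    assert(pointer_size != 0 && frame_size != 0);

    l.block_nr    = size_max / pointer_size;    /* <size-max>/<pointer size> */
    l.block_size  = pagesize << max_order;      /* <pagesize> << <max-order> */
    l.buffer_size = l.block_nr * l.block_size;
    l.frame_nr    = l.buffer_size / frame_size;
    return l;
}
```

Calling ring_limits_calc(131072, 4, 4096, 11, 2048) reproduces the figures
of the example: 32768 blocks of 8 MiB, 262144 MiB in total and 134217728
frames.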
277 | |||
278 | Memory allocations are not freed until the socket is closed. The memory | ||
279 | allocations are done with GFP_KERNEL priority, which basically means that | ||
280 | the allocation can wait and swap other processes' memory in order to allocate | ||
281 | the necessary memory, so normally the limits can be reached. | ||
282 | |||
283 | Other constraints | ||
284 | ------------------- | ||
285 | |||
286 | If you check the source code you will see that what I draw here as a frame | ||
287 | is not only the link level frame. At the beginning of each frame there is a | ||
288 | header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level frame's | ||
289 | meta information, like the timestamp. So what we draw here as a frame is really | ||
290 | the following (from include/linux/if_packet.h): | ||
291 | |||
292 | /* | ||
293 | Frame structure: | ||
294 | |||
295 | - Start. Frame must be aligned to TPACKET_ALIGNMENT=16 | ||
296 | - struct tpacket_hdr | ||
297 | - pad to TPACKET_ALIGNMENT=16 | ||
298 | - struct sockaddr_ll | ||
299 | - Gap, chosen so that packet data (Start+tp_net) alignes to | ||
300 | TPACKET_ALIGNMENT=16 | ||
301 | - Start+tp_mac: [ Optional MAC header ] | ||
302 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. | ||
303 | - Pad to align to TPACKET_ALIGNMENT=16 | ||
304 | */ | ||
305 | |||
306 | |||
307 | The following are conditions that are checked in packet_set_ring | ||
308 | |||
309 | tp_block_size must be a multiple of PAGE_SIZE (1) | ||
310 | tp_frame_size must be greater than TPACKET_HDRLEN (obvious) | ||
311 | tp_frame_size must be a multiple of TPACKET_ALIGNMENT | ||
312 | tp_frame_nr must be exactly frames_per_block*tp_block_nr | ||
313 | |||
314 | Note that tp_block_size should be chosen to be a power of two or there will | ||
315 | be a waste of memory. | ||
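The conditions listed above can be checked in user space before calling
setsockopt. This is a sketch of such a check, not the kernel code: the
name ring_req_is_valid is hypothetical, and the page size is a parameter
so the caller can pass the value obtained at run time.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_req, TPACKET_HDRLEN,
                                 TPACKET_ALIGNMENT */

/* Returns 1 if req satisfies the conditions listed in the text,
 * 0 otherwise. */
static int ring_req_is_valid(const struct tpacket_req *req,
                             unsigned int page_size)
{
    unsigned int frames_per_block;

    assert(page_size != 0);

    /* tp_block_size must be a multiple of PAGE_SIZE */
    if (req->tp_block_size == 0 || req->tp_block_size % page_size != 0)
        return 0;
    /* tp_frame_size must be greater than TPACKET_HDRLEN */
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return 0;
    /* tp_frame_size must be a multiple of TPACKET_ALIGNMENT */
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return 0;
    /* a block must hold at least one whole frame */
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return 0;
    /* tp_frame_nr must be exactly frames_per_block * tp_block_nr */
    return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}
```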
316 | |||
317 | -------------------------------------------------------------------------------- | ||
318 | + Mapping and use of the circular buffer (ring) | ||
319 | -------------------------------------------------------------------------------- | ||
320 | |||
321 | The mapping of the buffer in the user process is done with the conventional | ||
322 | mmap function. Even though the circular buffer is composed of several physically | ||
323 | discontiguous blocks of memory, they are contiguous in user space, hence | ||
324 | just one call to mmap is needed: | ||
325 | |||
326 | mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); | ||
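A sketch of the mapping step with error reporting left to the caller. The
helper names ring_map_length and map_ring are hypothetical. The length is
taken as tp_block_size * tp_block_nr, since the ring consists of
tp_block_nr blocks of tp_block_size bytes each.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_req */
#include <stddef.h>
#include <sys/mman.h>

/* Total size of the ring in bytes. */
static size_t ring_map_length(const struct tpacket_req *req)
{
    return (size_t)req->tp_block_size * req->tp_block_nr;
}

/* Map the whole ring with a single call. Returns MAP_FAILED on error,
 * with errno set by mmap. */
static void *map_ring(int fd, const struct tpacket_req *req)
{
    size_t len = ring_map_length(req);

    assert(len != 0);
    return mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
}
```

The result must be compared against MAP_FAILED, not against NULL.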
327 | |||
328 | If tp_frame_size is a divisor of tp_block_size, frames will be | ||
329 | contiguously spaced by tp_frame_size bytes. If not, after each group of | ||
330 | tp_block_size/tp_frame_size frames there will be a gap between | ||
331 | the frames. This is because a frame cannot span two | ||
332 | blocks. | ||
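The position of a frame inside the mapped area follows from this rule. The
function below is a sketch with a hypothetical name (frame_offset); it
returns the byte offset of frame number index, counting from 0, relative
to the start of the mapping.

```c
#include <assert.h>
#include <stddef.h>

static size_t frame_offset(unsigned int block_size, unsigned int frame_size,
                           unsigned int index)
{
    unsigned int frames_per_block;

    assert(frame_size != 0 && frame_size <= block_size);

    frames_per_block = block_size / frame_size;
    /* whole blocks before this frame, plus its position inside its block */
    return (size_t)(index / frames_per_block) * block_size
         + (size_t)(index % frames_per_block) * frame_size;
}
```

With tp_block_size 4096 and tp_frame_size 1536, each block holds two
frames: frame 1 starts at offset 1536 and frame 2 starts at offset 4096,
leaving a 1024 byte gap at the end of the first block.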
333 | |||
334 | At the beginning of each frame there is a status field (see | ||
335 | struct tpacket_hdr). If this field is 0, the frame is ready | ||
336 | to be used by the kernel. If not, there is a frame the user can read | ||
337 | and the following flags apply: | ||
338 | |||
339 | from include/linux/if_packet.h | ||
340 | |||
341 | #define TP_STATUS_COPY 2 | ||
342 | #define TP_STATUS_LOSING 4 | ||
343 | #define TP_STATUS_CSUMNOTREADY 8 | ||
344 | |||
345 | |||
346 | TP_STATUS_COPY : This flag indicates that the frame (and associated | ||
347 | meta information) has been truncated because it's | ||
348 | larger than tp_frame_size. This packet can be | ||
349 | read entirely with recvfrom(). | ||
350 | |||
351 | In order to make this work it must be | ||
352 | enabled previously with setsockopt() and | ||
353 | the PACKET_COPY_THRESH option. | ||
354 | |||
355 | The number of frames that can be buffered to | ||
356 | be read with recvfrom is limited like a normal socket. | ||
357 | See the SO_RCVBUF option in the socket (7) man page. | ||
358 | |||
359 | TP_STATUS_LOSING : indicates there were packet drops since the last time | ||
360 | statistics were checked with getsockopt() and | ||
361 | the PACKET_STATISTICS option. | ||
362 | |||
363 | TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose | ||
364 | checksum will be computed in hardware. So while | ||
365 | reading the packet we should not try to check the | ||
366 | checksum. | ||
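Checking the statistics mentioned for TP_STATUS_LOSING can be sketched as
follows. The helper name read_capture_stats is hypothetical; struct
tpacket_stats and PACKET_STATISTICS come from linux/if_packet.h, and the
SOL_PACKET fallback is only there in case the C library headers do not
define it.

```c
#include <assert.h>
#include <linux/if_packet.h>  /* struct tpacket_stats, PACKET_STATISTICS */
#include <stddef.h>
#include <sys/socket.h>

#ifndef SOL_PACKET
#define SOL_PACKET 263        /* Linux value, normally supplied by libc */
#endif

/* Fetch the packet and drop counters of a capture socket.
 * Returns 0 on success, -1 with errno set on failure. */
static int read_capture_stats(int fd, struct tpacket_stats *stats)
{
    socklen_t len = sizeof(*stats);

    assert(stats != NULL);
    return getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, stats, &len);
}
```

On success, stats->tp_packets and stats->tp_drops hold the counters.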
367 | |||
368 | for convenience there are also the following defines: | ||
369 | |||
370 | #define TP_STATUS_KERNEL 0 | ||
371 | #define TP_STATUS_USER 1 | ||
372 | |||
373 | The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel | ||
374 | receives a packet it puts it in the buffer and updates the status with | ||
375 | at least the TP_STATUS_USER flag. Then the user can read the packet; | ||
376 | once the packet is read the user must zero the status field, so the kernel | ||
377 | can use that frame buffer again. | ||
378 | |||
379 | The user can use poll (any other variant should apply too) to check if new | ||
380 | packets are in the ring: | ||
381 | |||
382 | struct pollfd pfd; | ||
383 | |||
384 | pfd.fd = fd; | ||
385 | pfd.revents = 0; | ||
386 | pfd.events = POLLIN|POLLRDNORM|POLLERR; | ||
387 | |||
388 | if (status == TP_STATUS_KERNEL) /* status: tp_status of the next frame */ | ||
389 | retval = poll(&pfd, 1, timeout); | ||
390 | |||
391 | First checking the status value and then polling for frames does not | ||
392 | introduce a race condition. | ||
393 | |||
394 | -------------------------------------------------------------------------------- | ||
395 | + THANKS | ||
396 | -------------------------------------------------------------------------------- | ||
397 | |||
398 | Jesse Brandeburg, for fixing my grammatical/spelling errors | ||
399 | |||