diff options
author | Daniel Borkmann <dxchgb@gmail.com> | 2012-11-07 21:37:01 -0500 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2012-11-09 16:45:49 -0500 |
commit | d1ee40f96036e838f0849dd31c16e548a904176c (patch) | |
tree | cfe215729cfc5a002b830dfd3f08ede05553aa21 /Documentation/networking | |
parent | 56277f40d73f12d9ffb12b5a6bd7e0ad1f2284b3 (diff) |
doc: packet_mmap: update doc to implementation status
This improves the packet_mmap.txt document in the following ways:
* Add initial information about different TPACKET versions
* Add initial information about packet fanout
* Add pointer to BPF document (since this also could be of interest)
* 'Fix' minor, rather cosmetic things
Information partially taken from related commit messages.
Reported-by: Ronny Meeus <ronny.meeus@gmail.com>
Signed-off-by: Daniel Borkmann <daniel.borkmann@tik.ee.ethz.ch>
Cc: Ulisses Alonso CamarĂ³ <uaca@alumni.uv.es>
Cc: Johann Baudy <johann.baudy@gnu-log.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/packet_mmap.txt | 233 |
1 files changed, 209 insertions, 24 deletions
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 7cd879eba5dc..94444b152fbc 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt | |||
@@ -3,9 +3,9 @@ | |||
3 | -------------------------------------------------------------------------------- | 3 | -------------------------------------------------------------------------------- |
4 | 4 | ||
5 | This file documents the mmap() facility available with the PACKET | 5 | This file documents the mmap() facility available with the PACKET |
6 | socket interface on 2.4 and 2.6 kernels. This type of sockets is used for | 6 | socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for |
7 | capture network traffic with utilities like tcpdump or any other that needs | 7 | i) capture network traffic with utilities like tcpdump, ii) transmit network |
8 | raw access to network interface. | 8 | traffic, or any other that needs raw access to network interface. |
9 | 9 | ||
10 | You can find the latest version of this document at: | 10 | You can find the latest version of this document at: |
11 | http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap | 11 | http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap |
@@ -21,19 +21,18 @@ Please send your comments to | |||
21 | + Why use PACKET_MMAP | 21 | + Why use PACKET_MMAP |
22 | -------------------------------------------------------------------------------- | 22 | -------------------------------------------------------------------------------- |
23 | 23 | ||
24 | In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very | 24 | In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very |
25 | inefficient. It uses very limited buffers and requires one system call | 25 | inefficient. It uses very limited buffers and requires one system call to |
26 | to capture each packet, it requires two if you want to get packet's | 26 | capture each packet, it requires two if you want to get packet's timestamp |
27 | timestamp (like libpcap always does). | 27 | (like libpcap always does). |
28 | 28 | ||
29 | In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size | 29 | In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size |
30 | configurable circular buffer mapped in user space that can be used to either | 30 | configurable circular buffer mapped in user space that can be used to either |
31 | send or receive packets. This way reading packets just needs to wait for them, | 31 | send or receive packets. This way reading packets just needs to wait for them, |
32 | most of the time there is no need to issue a single system call. Concerning | 32 | most of the time there is no need to issue a single system call. Concerning |
33 | transmission, multiple packets can be sent through one system call to get the | 33 | transmission, multiple packets can be sent through one system call to get the |
34 | highest bandwidth. | 34 | highest bandwidth. By using a shared buffer between the kernel and the user |
35 | By using a shared buffer between the kernel and the user also has the benefit | 35 | also has the benefit of minimizing packet copies. |
36 | of minimizing packet copies. | ||
37 | 36 | ||
38 | It's fine to use PACKET_MMAP to improve the performance of the capture and | 37 | It's fine to use PACKET_MMAP to improve the performance of the capture and |
39 | transmission process, but it isn't everything. At least, if you are capturing | 38 | transmission process, but it isn't everything. At least, if you are capturing |
@@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the | |||
41 | device driver of your network interface card supports some sort of interrupt | 40 | device driver of your network interface card supports some sort of interrupt |
42 | load mitigation or (even better) if it supports NAPI, also make sure it is | 41 | load mitigation or (even better) if it supports NAPI, also make sure it is |
43 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and | 42 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and |
44 | supported by devices of your network. | 43 | supported by devices of your network. CPU IRQ pinning of your network interface |
44 | card can also be an advantage. | ||
45 | 45 | ||
46 | -------------------------------------------------------------------------------- | 46 | -------------------------------------------------------------------------------- |
47 | + How to use mmap() to improve capture process | 47 | + How to use mmap() to improve capture process |
@@ -87,9 +87,7 @@ the following process: | |||
87 | socket creation and destruction is straight forward, and is done | 87 | socket creation and destruction is straight forward, and is done |
88 | the same way with or without PACKET_MMAP: | 88 | the same way with or without PACKET_MMAP: |
89 | 89 | ||
90 | int fd; | 90 | int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); |
91 | |||
92 | fd= socket(PF_PACKET, mode, htons(ETH_P_ALL)) | ||
93 | 91 | ||
94 | where mode is SOCK_RAW for the raw interface were link level | 92 | where mode is SOCK_RAW for the raw interface were link level |
95 | information can be captured or SOCK_DGRAM for the cooked | 93 | information can be captured or SOCK_DGRAM for the cooked |
@@ -180,7 +178,6 @@ and the PACKET_TX_HAS_OFF option. | |||
180 | + PACKET_MMAP settings | 178 | + PACKET_MMAP settings |
181 | -------------------------------------------------------------------------------- | 179 | -------------------------------------------------------------------------------- |
182 | 180 | ||
183 | |||
184 | To setup PACKET_MMAP from user level code is done with a call like | 181 | To setup PACKET_MMAP from user level code is done with a call like |
185 | 182 | ||
186 | - Capture process | 183 | - Capture process |
@@ -214,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true | |||
214 | 211 | ||
215 | frames_per_block * tp_block_nr == tp_frame_nr | 212 | frames_per_block * tp_block_nr == tp_frame_nr |
216 | 213 | ||
217 | |||
218 | Lets see an example, with the following values: | 214 | Lets see an example, with the following values: |
219 | 215 | ||
220 | tp_block_size= 4096 | 216 | tp_block_size= 4096 |
@@ -240,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into | |||
240 | account when choosing the frame_size. See "Mapping and use of the circular | 236 | account when choosing the frame_size. See "Mapping and use of the circular |
241 | buffer (ring)". | 237 | buffer (ring)". |
242 | 238 | ||
243 | |||
244 | -------------------------------------------------------------------------------- | 239 | -------------------------------------------------------------------------------- |
245 | + PACKET_MMAP setting constraints | 240 | + PACKET_MMAP setting constraints |
246 | -------------------------------------------------------------------------------- | 241 | -------------------------------------------------------------------------------- |
@@ -277,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and | |||
277 | The pagesize can also be determined dynamically with the getpagesize (2) | 272 | The pagesize can also be determined dynamically with the getpagesize (2) |
278 | system call. | 273 | system call. |
279 | 274 | ||
280 | |||
281 | Block number limit | 275 | Block number limit |
282 | -------------------- | 276 | -------------------- |
283 | 277 | ||
@@ -297,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated. | |||
297 | v block #2 | 291 | v block #2 |
298 | block #1 | 292 | block #1 |
299 | 293 | ||
300 | |||
301 | kmalloc allocates any number of bytes of physically contiguous memory from | 294 | kmalloc allocates any number of bytes of physically contiguous memory from |
302 | a pool of pre-determined sizes. This pool of memory is maintained by the slab | 295 | a pool of pre-determined sizes. This pool of memory is maintained by the slab |
303 | allocator which is at the end the responsible for doing the allocation and | 296 | allocator which is at the end the responsible for doing the allocation and |
@@ -312,7 +305,6 @@ pointers to blocks is | |||
312 | 305 | ||
313 | 131072/4 = 32768 blocks | 306 | 131072/4 = 32768 blocks |
314 | 307 | ||
315 | |||
316 | PACKET_MMAP buffer size calculator | 308 | PACKET_MMAP buffer size calculator |
317 | ------------------------------------ | 309 | ------------------------------------ |
318 | 310 | ||
@@ -353,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield | |||
353 | and hence the buffer will have a 262144 MiB size. So it can hold | 345 | and hence the buffer will have a 262144 MiB size. So it can hold |
354 | 262144 MiB / 2048 bytes = 134217728 frames | 346 | 262144 MiB / 2048 bytes = 134217728 frames |
355 | 347 | ||
356 | |||
357 | Actually, this buffer size is not possible with an i386 architecture. | 348 | Actually, this buffer size is not possible with an i386 architecture. |
358 | Remember that the memory is allocated in kernel space, in the case of | 349 | Remember that the memory is allocated in kernel space, in the case of |
359 | an i386 kernel's memory size is limited to 1GiB. | 350 | an i386 kernel's memory size is limited to 1GiB. |
@@ -385,7 +376,6 @@ the following (from include/linux/if_packet.h): | |||
385 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. | 376 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. |
386 | - Pad to align to TPACKET_ALIGNMENT=16 | 377 | - Pad to align to TPACKET_ALIGNMENT=16 |
387 | */ | 378 | */ |
388 | |||
389 | 379 | ||
390 | The following are conditions that are checked in packet_set_ring | 380 | The following are conditions that are checked in packet_set_ring |
391 | 381 | ||
@@ -426,7 +416,6 @@ and the following flags apply: | |||
426 | #define TP_STATUS_LOSING 4 | 416 | #define TP_STATUS_LOSING 4 |
427 | #define TP_STATUS_CSUMNOTREADY 8 | 417 | #define TP_STATUS_CSUMNOTREADY 8 |
428 | 418 | ||
429 | |||
430 | TP_STATUS_COPY : This flag indicates that the frame (and associated | 419 | TP_STATUS_COPY : This flag indicates that the frame (and associated |
431 | meta information) has been truncated because it's | 420 | meta information) has been truncated because it's |
432 | larger than tp_frame_size. This packet can be | 421 | larger than tp_frame_size. This packet can be |
@@ -475,7 +464,6 @@ packets are in the ring: | |||
475 | It doesn't incur in a race condition to first check the status value and | 464 | It doesn't incur in a race condition to first check the status value and |
476 | then poll for frames. | 465 | then poll for frames. |
477 | 466 | ||
478 | |||
479 | ++ Transmission process | 467 | ++ Transmission process |
480 | Those defines are also used for transmission: | 468 | Those defines are also used for transmission: |
481 | 469 | ||
@@ -507,6 +495,196 @@ The user can also use poll() to check if a buffer is available: | |||
507 | retval = poll(&pfd, 1, timeout); | 495 | retval = poll(&pfd, 1, timeout); |
508 | 496 | ||
509 | ------------------------------------------------------------------------------- | 497 | ------------------------------------------------------------------------------- |
498 | + What TPACKET versions are available and when to use them? | ||
499 | ------------------------------------------------------------------------------- | ||
500 | |||
501 | int val = tpacket_version; | ||
502 | setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); | ||
503 | getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); | ||
504 | |||
505 | where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. | ||
506 | |||
507 | TPACKET_V1: | ||
508 | - Default if not otherwise specified by setsockopt(2) | ||
509 | - RX_RING, TX_RING available | ||
510 | - VLAN metadata information available for packets | ||
511 | (TP_STATUS_VLAN_VALID) | ||
512 | |||
513 | TPACKET_V1 --> TPACKET_V2: | ||
514 | - Made 64 bit clean due to unsigned long usage in TPACKET_V1 | ||
515 | structures, thus this also works on 64 bit kernel with 32 bit | ||
516 | userspace and the like | ||
517 | - Timestamp resolution in nanoseconds instead of microseconds | ||
518 | - RX_RING, TX_RING available | ||
519 | - How to switch to TPACKET_V2: | ||
520 | 1. Replace struct tpacket_hdr by struct tpacket2_hdr | ||
521 | 2. Query header len and save | ||
522 | 3. Set protocol version to 2, set up ring as usual | ||
523 | 4. For getting the sockaddr_ll, | ||
524 | use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of | ||
525 | (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) | ||
526 | |||
527 | TPACKET_V2 --> TPACKET_V3: | ||
528 | - Flexible buffer implementation: | ||
529 | 1. Blocks can be configured with non-static frame-size | ||
530 | 2. Read/poll is at a block-level (as opposed to packet-level) | ||
531 | 3. Added poll timeout to avoid indefinite user-space wait | ||
532 | on idle links | ||
533 | 4. Added user-configurable knobs: | ||
534 | 4.1 block::timeout | ||
535 | 4.2 tpkt_hdr::sk_rxhash | ||
536 | - RX Hash data available in user space | ||
537 | - Currently only RX_RING available | ||
538 | |||
539 | ------------------------------------------------------------------------------- | ||
540 | + AF_PACKET fanout mode | ||
541 | ------------------------------------------------------------------------------- | ||
542 | |||
543 | In the AF_PACKET fanout mode, packet reception can be load balanced among | ||
544 | processes. This also works in combination with mmap(2) on packet sockets. | ||
545 | |||
546 | Minimal example code by David S. Miller (try things like "./test eth0 hash", | ||
547 | "./test eth0 lb", etc.): | ||
548 | |||
549 | #include <stddef.h> | ||
550 | #include <stdlib.h> | ||
551 | #include <stdio.h> | ||
552 | #include <string.h> | ||
553 | |||
554 | #include <sys/types.h> | ||
555 | #include <sys/wait.h> | ||
556 | #include <sys/socket.h> | ||
557 | #include <sys/ioctl.h> | ||
558 | |||
559 | #include <unistd.h> | ||
560 | |||
561 | #include <linux/if_ether.h> | ||
562 | #include <linux/if_packet.h> | ||
563 | |||
564 | #include <net/if.h> | ||
565 | |||
566 | static const char *device_name; | ||
567 | static int fanout_type; | ||
568 | static int fanout_id; | ||
569 | |||
570 | #ifndef PACKET_FANOUT | ||
571 | # define PACKET_FANOUT 18 | ||
572 | # define PACKET_FANOUT_HASH 0 | ||
573 | # define PACKET_FANOUT_LB 1 | ||
574 | #endif | ||
575 | |||
576 | static int setup_socket(void) | ||
577 | { | ||
578 | int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); | ||
579 | struct sockaddr_ll ll; | ||
580 | struct ifreq ifr; | ||
581 | int fanout_arg; | ||
582 | |||
583 | if (fd < 0) { | ||
584 | perror("socket"); | ||
585 | return EXIT_FAILURE; | ||
586 | } | ||
587 | |||
588 | memset(&ifr, 0, sizeof(ifr)); | ||
589 | strcpy(ifr.ifr_name, device_name); | ||
590 | err = ioctl(fd, SIOCGIFINDEX, &ifr); | ||
591 | if (err < 0) { | ||
592 | perror("SIOCGIFINDEX"); | ||
593 | return EXIT_FAILURE; | ||
594 | } | ||
595 | |||
596 | memset(&ll, 0, sizeof(ll)); | ||
597 | ll.sll_family = AF_PACKET; | ||
598 | ll.sll_ifindex = ifr.ifr_ifindex; | ||
599 | err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); | ||
600 | if (err < 0) { | ||
601 | perror("bind"); | ||
602 | return EXIT_FAILURE; | ||
603 | } | ||
604 | |||
605 | fanout_arg = (fanout_id | (fanout_type << 16)); | ||
606 | err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, | ||
607 | &fanout_arg, sizeof(fanout_arg)); | ||
608 | if (err) { | ||
609 | perror("setsockopt"); | ||
610 | return EXIT_FAILURE; | ||
611 | } | ||
612 | |||
613 | return fd; | ||
614 | } | ||
615 | |||
616 | static void fanout_thread(void) | ||
617 | { | ||
618 | int fd = setup_socket(); | ||
619 | int limit = 10000; | ||
620 | |||
621 | if (fd < 0) | ||
622 | exit(fd); | ||
623 | |||
624 | while (limit-- > 0) { | ||
625 | char buf[1600]; | ||
626 | int err; | ||
627 | |||
628 | err = read(fd, buf, sizeof(buf)); | ||
629 | if (err < 0) { | ||
630 | perror("read"); | ||
631 | exit(EXIT_FAILURE); | ||
632 | } | ||
633 | if ((limit % 10) == 0) | ||
634 | fprintf(stdout, "(%d) \n", getpid()); | ||
635 | } | ||
636 | |||
637 | fprintf(stdout, "%d: Received 10000 packets\n", getpid()); | ||
638 | |||
639 | close(fd); | ||
640 | exit(0); | ||
641 | } | ||
642 | |||
643 | int main(int argc, char **argp) | ||
644 | { | ||
645 | int fd, err; | ||
646 | int i; | ||
647 | |||
648 | if (argc != 3) { | ||
649 | fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); | ||
650 | return EXIT_FAILURE; | ||
651 | } | ||
652 | |||
653 | if (!strcmp(argp[2], "hash")) | ||
654 | fanout_type = PACKET_FANOUT_HASH; | ||
655 | else if (!strcmp(argp[2], "lb")) | ||
656 | fanout_type = PACKET_FANOUT_LB; | ||
657 | else { | ||
658 | fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); | ||
659 | exit(EXIT_FAILURE); | ||
660 | } | ||
661 | |||
662 | device_name = argp[1]; | ||
663 | fanout_id = getpid() & 0xffff; | ||
664 | |||
665 | for (i = 0; i < 4; i++) { | ||
666 | pid_t pid = fork(); | ||
667 | |||
668 | switch (pid) { | ||
669 | case 0: | ||
670 | fanout_thread(); | ||
671 | |||
672 | case -1: | ||
673 | perror("fork"); | ||
674 | exit(EXIT_FAILURE); | ||
675 | } | ||
676 | } | ||
677 | |||
678 | for (i = 0; i < 4; i++) { | ||
679 | int status; | ||
680 | |||
681 | wait(&status); | ||
682 | } | ||
683 | |||
684 | return 0; | ||
685 | } | ||
686 | |||
687 | ------------------------------------------------------------------------------- | ||
510 | + PACKET_TIMESTAMP | 688 | + PACKET_TIMESTAMP |
511 | ------------------------------------------------------------------------------- | 689 | ------------------------------------------------------------------------------- |
512 | 690 | ||
@@ -532,6 +710,13 @@ the networking stack is used (the behavior before this setting was added). | |||
532 | See include/linux/net_tstamp.h and Documentation/networking/timestamping | 710 | See include/linux/net_tstamp.h and Documentation/networking/timestamping |
533 | for more information on hardware timestamps. | 711 | for more information on hardware timestamps. |
534 | 712 | ||
713 | ------------------------------------------------------------------------------- | ||
714 | + Miscellaneous bits | ||
715 | ------------------------------------------------------------------------------- | ||
716 | |||
717 | - Packet sockets work well together with Linux socket filters, thus you also | ||
718 | might want to have a look at Documentation/networking/filter.txt | ||
719 | |||
535 | -------------------------------------------------------------------------------- | 720 | -------------------------------------------------------------------------------- |
536 | + THANKS | 721 | + THANKS |
537 | -------------------------------------------------------------------------------- | 722 | -------------------------------------------------------------------------------- |