Diffstat (limited to 'Documentation/networking/packet_mmap.txt')
| -rw-r--r-- | Documentation/networking/packet_mmap.txt | 246 |
1 files changed, 222 insertions, 24 deletions
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 1c08a4b0981f..94444b152fbc 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
| @@ -3,9 +3,9 @@ | |||
| 3 | -------------------------------------------------------------------------------- | 3 | -------------------------------------------------------------------------------- |
| 4 | 4 | ||
| 5 | This file documents the mmap() facility available with the PACKET | 5 | This file documents the mmap() facility available with the PACKET |
| 6 | socket interface on 2.4 and 2.6 kernels. This type of sockets is used for | 6 | socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for |
| 7 | capture network traffic with utilities like tcpdump or any other that needs | 7 | i) capture network traffic with utilities like tcpdump, ii) transmit network |
| 8 | raw access to network interface. | 8 | traffic, or any other that needs raw access to network interface. |
| 9 | 9 | ||
| 10 | You can find the latest version of this document at: | 10 | You can find the latest version of this document at: |
| 11 | http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap | 11 | http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap |
| @@ -21,19 +21,18 @@ Please send your comments to | |||
| 21 | + Why use PACKET_MMAP | 21 | + Why use PACKET_MMAP |
| 22 | -------------------------------------------------------------------------------- | 22 | -------------------------------------------------------------------------------- |
| 23 | 23 | ||
| 24 | In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very | 24 | In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very |
| 25 | inefficient. It uses very limited buffers and requires one system call | 25 | inefficient. It uses very limited buffers and requires one system call to |
| 26 | to capture each packet, it requires two if you want to get packet's | 26 | capture each packet, it requires two if you want to get packet's timestamp |
| 27 | timestamp (like libpcap always does). | 27 | (like libpcap always does). |
| 28 | 28 | ||
| 29 | In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size | 29 | In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size |
| 30 | configurable circular buffer mapped in user space that can be used to either | 30 | configurable circular buffer mapped in user space that can be used to either |
| 31 | send or receive packets. This way reading packets just needs to wait for them, | 31 | send or receive packets. This way reading packets just needs to wait for them, |
| 32 | most of the time there is no need to issue a single system call. Concerning | 32 | most of the time there is no need to issue a single system call. Concerning |
| 33 | transmission, multiple packets can be sent through one system call to get the | 33 | transmission, multiple packets can be sent through one system call to get the |
| 34 | highest bandwidth. | 34 | highest bandwidth. By using a shared buffer between the kernel and the user |
| 35 | By using a shared buffer between the kernel and the user also has the benefit | 35 | also has the benefit of minimizing packet copies. |
| 36 | of minimizing packet copies. | ||
| 37 | 36 | ||
| 38 | It's fine to use PACKET_MMAP to improve the performance of the capture and | 37 | It's fine to use PACKET_MMAP to improve the performance of the capture and |
| 39 | transmission process, but it isn't everything. At least, if you are capturing | 38 | transmission process, but it isn't everything. At least, if you are capturing |
| @@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the | |||
| 41 | device driver of your network interface card supports some sort of interrupt | 40 | device driver of your network interface card supports some sort of interrupt |
| 42 | load mitigation or (even better) if it supports NAPI, also make sure it is | 41 | load mitigation or (even better) if it supports NAPI, also make sure it is |
| 43 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and | 42 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and |
| 44 | supported by devices of your network. | 43 | supported by devices of your network. CPU IRQ pinning of your network interface |
| 44 | card can also be an advantage. | ||
| 45 | 45 | ||
| 46 | -------------------------------------------------------------------------------- | 46 | -------------------------------------------------------------------------------- |
| 47 | + How to use mmap() to improve capture process | 47 | + How to use mmap() to improve capture process |
| @@ -87,9 +87,7 @@ the following process: | |||
| 87 | socket creation and destruction is straight forward, and is done | 87 | socket creation and destruction is straight forward, and is done |
| 88 | the same way with or without PACKET_MMAP: | 88 | the same way with or without PACKET_MMAP: |
| 89 | 89 | ||
| 90 | int fd; | 90 | int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); |
| 91 | |||
| 92 | fd= socket(PF_PACKET, mode, htons(ETH_P_ALL)) | ||
| 93 | 91 | ||
| 94 | where mode is SOCK_RAW for the raw interface were link level | 92 | where mode is SOCK_RAW for the raw interface were link level |
| 95 | information can be captured or SOCK_DGRAM for the cooked | 93 | information can be captured or SOCK_DGRAM for the cooked |
| @@ -163,11 +161,23 @@ As capture, each frame contains two parts: | |||
| 163 | 161 | ||
| 164 | A complete tutorial is available at: http://wiki.gnu-log.net/ | 162 | A complete tutorial is available at: http://wiki.gnu-log.net/ |
| 165 | 163 | ||
| 164 | By default, the user should put data at : | ||
| 165 | frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) | ||
| 166 | |||
| 167 | So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), | ||
| 168 | the beginning of the user data will be at : | ||
| 169 | frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) | ||
| 170 | |||
| 171 | If you wish to put user data at a custom offset from the beginning of | ||
| 172 | the frame (for payload alignment with SOCK_RAW mode for instance) you | ||
| 173 | can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order | ||
| 174 | to make this work it must be enabled previously with setsockopt() | ||
| 175 | and the PACKET_TX_HAS_OFF option. | ||
| 176 | |||
| 166 | -------------------------------------------------------------------------------- | 177 | -------------------------------------------------------------------------------- |
| 167 | + PACKET_MMAP settings | 178 | + PACKET_MMAP settings |
| 168 | -------------------------------------------------------------------------------- | 179 | -------------------------------------------------------------------------------- |
| 169 | 180 | ||
| 170 | |||
| 171 | To setup PACKET_MMAP from user level code is done with a call like | 181 | To setup PACKET_MMAP from user level code is done with a call like |
| 172 | 182 | ||
| 173 | - Capture process | 183 | - Capture process |
| @@ -201,7 +211,6 @@ indeed, packet_set_ring checks that the following condition is true | |||
| 201 | 211 | ||
| 202 | frames_per_block * tp_block_nr == tp_frame_nr | 212 | frames_per_block * tp_block_nr == tp_frame_nr |
| 203 | 213 | ||
| 204 | |||
| 205 | Lets see an example, with the following values: | 214 | Lets see an example, with the following values: |
| 206 | 215 | ||
| 207 | tp_block_size= 4096 | 216 | tp_block_size= 4096 |
| @@ -227,7 +236,6 @@ be spawned across two blocks, so there are some details you have to take into | |||
| 227 | account when choosing the frame_size. See "Mapping and use of the circular | 236 | account when choosing the frame_size. See "Mapping and use of the circular |
| 228 | buffer (ring)". | 237 | buffer (ring)". |
| 229 | 238 | ||
| 230 | |||
| 231 | -------------------------------------------------------------------------------- | 239 | -------------------------------------------------------------------------------- |
| 232 | + PACKET_MMAP setting constraints | 240 | + PACKET_MMAP setting constraints |
| 233 | -------------------------------------------------------------------------------- | 241 | -------------------------------------------------------------------------------- |
| @@ -264,7 +272,6 @@ User space programs can include /usr/include/sys/user.h and | |||
| 264 | The pagesize can also be determined dynamically with the getpagesize (2) | 272 | The pagesize can also be determined dynamically with the getpagesize (2) |
| 265 | system call. | 273 | system call. |
| 266 | 274 | ||
| 267 | |||
| 268 | Block number limit | 275 | Block number limit |
| 269 | -------------------- | 276 | -------------------- |
| 270 | 277 | ||
| @@ -284,7 +291,6 @@ called pg_vec, its size limits the number of blocks that can be allocated. | |||
| 284 | v block #2 | 291 | v block #2 |
| 285 | block #1 | 292 | block #1 |
| 286 | 293 | ||
| 287 | |||
| 288 | kmalloc allocates any number of bytes of physically contiguous memory from | 294 | kmalloc allocates any number of bytes of physically contiguous memory from |
| 289 | a pool of pre-determined sizes. This pool of memory is maintained by the slab | 295 | a pool of pre-determined sizes. This pool of memory is maintained by the slab |
| 290 | allocator which is at the end the responsible for doing the allocation and | 296 | allocator which is at the end the responsible for doing the allocation and |
| @@ -299,7 +305,6 @@ pointers to blocks is | |||
| 299 | 305 | ||
| 300 | 131072/4 = 32768 blocks | 306 | 131072/4 = 32768 blocks |
| 301 | 307 | ||
| 302 | |||
| 303 | PACKET_MMAP buffer size calculator | 308 | PACKET_MMAP buffer size calculator |
| 304 | ------------------------------------ | 309 | ------------------------------------ |
| 305 | 310 | ||
| @@ -340,7 +345,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield | |||
| 340 | and hence the buffer will have a 262144 MiB size. So it can hold | 345 | and hence the buffer will have a 262144 MiB size. So it can hold |
| 341 | 262144 MiB / 2048 bytes = 134217728 frames | 346 | 262144 MiB / 2048 bytes = 134217728 frames |
| 342 | 347 | ||
| 343 | |||
| 344 | Actually, this buffer size is not possible with an i386 architecture. | 348 | Actually, this buffer size is not possible with an i386 architecture. |
| 345 | Remember that the memory is allocated in kernel space, in the case of | 349 | Remember that the memory is allocated in kernel space, in the case of |
| 346 | an i386 kernel's memory size is limited to 1GiB. | 350 | an i386 kernel's memory size is limited to 1GiB. |
| @@ -372,7 +376,6 @@ the following (from include/linux/if_packet.h): | |||
| 372 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. | 376 | - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. |
| 373 | - Pad to align to TPACKET_ALIGNMENT=16 | 377 | - Pad to align to TPACKET_ALIGNMENT=16 |
| 374 | */ | 378 | */ |
| 375 | |||
| 376 | 379 | ||
| 377 | The following are conditions that are checked in packet_set_ring | 380 | The following are conditions that are checked in packet_set_ring |
| 378 | 381 | ||
| @@ -413,7 +416,6 @@ and the following flags apply: | |||
| 413 | #define TP_STATUS_LOSING 4 | 416 | #define TP_STATUS_LOSING 4 |
| 414 | #define TP_STATUS_CSUMNOTREADY 8 | 417 | #define TP_STATUS_CSUMNOTREADY 8 |
| 415 | 418 | ||
| 416 | |||
| 417 | TP_STATUS_COPY : This flag indicates that the frame (and associated | 419 | TP_STATUS_COPY : This flag indicates that the frame (and associated |
| 418 | meta information) has been truncated because it's | 420 | meta information) has been truncated because it's |
| 419 | larger than tp_frame_size. This packet can be | 421 | larger than tp_frame_size. This packet can be |
| @@ -462,7 +464,6 @@ packets are in the ring: | |||
| 462 | It doesn't incur in a race condition to first check the status value and | 464 | It doesn't incur in a race condition to first check the status value and |
| 463 | then poll for frames. | 465 | then poll for frames. |
| 464 | 466 | ||
| 465 | |||
| 466 | ++ Transmission process | 467 | ++ Transmission process |
| 467 | Those defines are also used for transmission: | 468 | Those defines are also used for transmission: |
| 468 | 469 | ||
| @@ -494,6 +495,196 @@ The user can also use poll() to check if a buffer is available: | |||
| 494 | retval = poll(&pfd, 1, timeout); | 495 | retval = poll(&pfd, 1, timeout); |
| 495 | 496 | ||
| 496 | ------------------------------------------------------------------------------- | 497 | ------------------------------------------------------------------------------- |
| 498 | + What TPACKET versions are available and when to use them? | ||
| 499 | ------------------------------------------------------------------------------- | ||
| 500 | |||
| 501 | int val = tpacket_version; | ||
| 502 | setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); | ||
| 503 | getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); | ||
| 504 | |||
| 505 | where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. | ||
| 506 | |||
| 507 | TPACKET_V1: | ||
| 508 | - Default if not otherwise specified by setsockopt(2) | ||
| 509 | - RX_RING, TX_RING available | ||
| 510 | - VLAN metadata information available for packets | ||
| 511 | (TP_STATUS_VLAN_VALID) | ||
| 512 | |||
| 513 | TPACKET_V1 --> TPACKET_V2: | ||
| 514 | - Made 64 bit clean due to unsigned long usage in TPACKET_V1 | ||
| 515 | structures, thus this also works on 64 bit kernel with 32 bit | ||
| 516 | userspace and the like | ||
| 517 | - Timestamp resolution in nanoseconds instead of microseconds | ||
| 518 | - RX_RING, TX_RING available | ||
| 519 | - How to switch to TPACKET_V2: | ||
| 520 | 1. Replace struct tpacket_hdr by struct tpacket2_hdr | ||
| 521 | 2. Query header len and save | ||
| 522 | 3. Set protocol version to 2, set up ring as usual | ||
| 523 | 4. For getting the sockaddr_ll, | ||
| 524 | use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of | ||
| 525 | (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) | ||
| 526 | |||
| 527 | TPACKET_V2 --> TPACKET_V3: | ||
| 528 | - Flexible buffer implementation: | ||
| 529 | 1. Blocks can be configured with non-static frame-size | ||
| 530 | 2. Read/poll is at a block-level (as opposed to packet-level) | ||
| 531 | 3. Added poll timeout to avoid indefinite user-space wait | ||
| 532 | on idle links | ||
| 533 | 4. Added user-configurable knobs: | ||
| 534 | 4.1 block::timeout | ||
| 535 | 4.2 tpkt_hdr::sk_rxhash | ||
| 536 | - RX Hash data available in user space | ||
| 537 | - Currently only RX_RING available | ||
| 538 | |||
| 539 | ------------------------------------------------------------------------------- | ||
| 540 | + AF_PACKET fanout mode | ||
| 541 | ------------------------------------------------------------------------------- | ||
| 542 | |||
| 543 | In the AF_PACKET fanout mode, packet reception can be load balanced among | ||
| 544 | processes. This also works in combination with mmap(2) on packet sockets. | ||
| 545 | |||
| 546 | Minimal example code by David S. Miller (try things like "./test eth0 hash", | ||
| 547 | "./test eth0 lb", etc.): | ||
| 548 | |||
| 549 | #include <stddef.h> | ||
| 550 | #include <stdlib.h> | ||
| 551 | #include <stdio.h> | ||
| 552 | #include <string.h> | ||
| 553 | |||
| 554 | #include <sys/types.h> | ||
| 555 | #include <sys/wait.h> | ||
| 556 | #include <sys/socket.h> | ||
| 557 | #include <sys/ioctl.h> | ||
| 558 | |||
| 559 | #include <unistd.h> | ||
| 560 | |||
| 561 | #include <linux/if_ether.h> | ||
| 562 | #include <linux/if_packet.h> | ||
| 563 | |||
| 564 | #include <net/if.h> | ||
| 565 | |||
| 566 | static const char *device_name; | ||
| 567 | static int fanout_type; | ||
| 568 | static int fanout_id; | ||
| 569 | |||
| 570 | #ifndef PACKET_FANOUT | ||
| 571 | # define PACKET_FANOUT 18 | ||
| 572 | # define PACKET_FANOUT_HASH 0 | ||
| 573 | # define PACKET_FANOUT_LB 1 | ||
| 574 | #endif | ||
| 575 | |||
| 576 | static int setup_socket(void) | ||
| 577 | { | ||
| 578 | int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); | ||
| 579 | struct sockaddr_ll ll; | ||
| 580 | struct ifreq ifr; | ||
| 581 | int fanout_arg; | ||
| 582 | |||
| 583 | if (fd < 0) { | ||
| 584 | perror("socket"); | ||
| 585 | return EXIT_FAILURE; | ||
| 586 | } | ||
| 587 | |||
| 588 | memset(&ifr, 0, sizeof(ifr)); | ||
| 589 | strcpy(ifr.ifr_name, device_name); | ||
| 590 | err = ioctl(fd, SIOCGIFINDEX, &ifr); | ||
| 591 | if (err < 0) { | ||
| 592 | perror("SIOCGIFINDEX"); | ||
| 593 | return EXIT_FAILURE; | ||
| 594 | } | ||
| 595 | |||
| 596 | memset(&ll, 0, sizeof(ll)); | ||
| 597 | ll.sll_family = AF_PACKET; | ||
| 598 | ll.sll_ifindex = ifr.ifr_ifindex; | ||
| 599 | err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); | ||
| 600 | if (err < 0) { | ||
| 601 | perror("bind"); | ||
| 602 | return EXIT_FAILURE; | ||
| 603 | } | ||
| 604 | |||
| 605 | fanout_arg = (fanout_id | (fanout_type << 16)); | ||
| 606 | err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, | ||
| 607 | &fanout_arg, sizeof(fanout_arg)); | ||
| 608 | if (err) { | ||
| 609 | perror("setsockopt"); | ||
| 610 | return EXIT_FAILURE; | ||
| 611 | } | ||
| 612 | |||
| 613 | return fd; | ||
| 614 | } | ||
| 615 | |||
| 616 | static void fanout_thread(void) | ||
| 617 | { | ||
| 618 | int fd = setup_socket(); | ||
| 619 | int limit = 10000; | ||
| 620 | |||
| 621 | if (fd < 0) | ||
| 622 | exit(fd); | ||
| 623 | |||
| 624 | while (limit-- > 0) { | ||
| 625 | char buf[1600]; | ||
| 626 | int err; | ||
| 627 | |||
| 628 | err = read(fd, buf, sizeof(buf)); | ||
| 629 | if (err < 0) { | ||
| 630 | perror("read"); | ||
| 631 | exit(EXIT_FAILURE); | ||
| 632 | } | ||
| 633 | if ((limit % 10) == 0) | ||
| 634 | fprintf(stdout, "(%d) \n", getpid()); | ||
| 635 | } | ||
| 636 | |||
| 637 | fprintf(stdout, "%d: Received 10000 packets\n", getpid()); | ||
| 638 | |||
| 639 | close(fd); | ||
| 640 | exit(0); | ||
| 641 | } | ||
| 642 | |||
| 643 | int main(int argc, char **argp) | ||
| 644 | { | ||
| 645 | int fd, err; | ||
| 646 | int i; | ||
| 647 | |||
| 648 | if (argc != 3) { | ||
| 649 | fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); | ||
| 650 | return EXIT_FAILURE; | ||
| 651 | } | ||
| 652 | |||
| 653 | if (!strcmp(argp[2], "hash")) | ||
| 654 | fanout_type = PACKET_FANOUT_HASH; | ||
| 655 | else if (!strcmp(argp[2], "lb")) | ||
| 656 | fanout_type = PACKET_FANOUT_LB; | ||
| 657 | else { | ||
| 658 | fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); | ||
| 659 | exit(EXIT_FAILURE); | ||
| 660 | } | ||
| 661 | |||
| 662 | device_name = argp[1]; | ||
| 663 | fanout_id = getpid() & 0xffff; | ||
| 664 | |||
| 665 | for (i = 0; i < 4; i++) { | ||
| 666 | pid_t pid = fork(); | ||
| 667 | |||
| 668 | switch (pid) { | ||
| 669 | case 0: | ||
| 670 | fanout_thread(); | ||
| 671 | |||
| 672 | case -1: | ||
| 673 | perror("fork"); | ||
| 674 | exit(EXIT_FAILURE); | ||
| 675 | } | ||
| 676 | } | ||
| 677 | |||
| 678 | for (i = 0; i < 4; i++) { | ||
| 679 | int status; | ||
| 680 | |||
| 681 | wait(&status); | ||
| 682 | } | ||
| 683 | |||
| 684 | return 0; | ||
| 685 | } | ||
| 686 | |||
| 687 | ------------------------------------------------------------------------------- | ||
| 497 | + PACKET_TIMESTAMP | 688 | + PACKET_TIMESTAMP |
| 498 | ------------------------------------------------------------------------------- | 689 | ------------------------------------------------------------------------------- |
| 499 | 690 | ||
| @@ -519,6 +710,13 @@ the networking stack is used (the behavior before this setting was added). | |||
| 519 | See include/linux/net_tstamp.h and Documentation/networking/timestamping | 710 | See include/linux/net_tstamp.h and Documentation/networking/timestamping |
| 520 | for more information on hardware timestamps. | 711 | for more information on hardware timestamps. |
| 521 | 712 | ||
| 713 | ------------------------------------------------------------------------------- | ||
| 714 | + Miscellaneous bits | ||
| 715 | ------------------------------------------------------------------------------- | ||
| 716 | |||
| 717 | - Packet sockets work well together with Linux socket filters, thus you also | ||
| 718 | might want to have a look at Documentation/networking/filter.txt | ||
| 719 | |||
| 522 | -------------------------------------------------------------------------------- | 720 | -------------------------------------------------------------------------------- |
| 523 | + THANKS | 721 | + THANKS |
| 524 | -------------------------------------------------------------------------------- | 722 | -------------------------------------------------------------------------------- |
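
Note on the fanout example added by this patch: setup_socket() builds the
PACKET_FANOUT argument as (fanout_id | (fanout_type << 16)), i.e. the group id
sits in the low 16 bits and the fanout type in the high 16 bits. The sketch
below only illustrates that packing. The helper names (fanout_pack,
fanout_group_id, fanout_type_of) are hypothetical and are not part of the patch
or of any kernel header; it assumes the type value is small enough that the
packed result fits in a non-negative int, as PACKET_FANOUT_HASH (0) and
PACKET_FANOUT_LB (1) do.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, for illustration only.
 * Pack a fanout group id (low 16 bits) and a fanout type (high 16 bits)
 * into the int that the example passes to setsockopt(PACKET_FANOUT).
 */
static inline int fanout_pack(uint16_t group_id, uint16_t type)
{
	return (int)((uint32_t)group_id | ((uint32_t)type << 16));
}

/* Recover the group id from a packed argument (low 16 bits). */
static inline uint16_t fanout_group_id(int arg)
{
	return (uint16_t)((uint32_t)arg & 0xffffu);
}

/* Recover the fanout type from a packed argument (high 16 bits). */
static inline uint16_t fanout_type_of(int arg)
{
	return (uint16_t)((uint32_t)arg >> 16);
}
```

With these helpers, the assignment in setup_socket() would correspond to
fanout_arg = fanout_pack(fanout_id, fanout_type), which is why main() masks
getpid() with 0xffff before using it as the group id.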
