author | Johann Baudy <johann.baudy@gnu-log.net> | 2009-05-19 01:11:22 -0400 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2009-05-19 01:11:22 -0400 |
commit | 69e3c75f4d541a6eb151b3ef91f34033cb3ad6e1 (patch) | |
tree | 24920f17ea435627978af9d5fe0e99763bf6a533 /Documentation/networking | |
parent | f67f34084914144de55c785163d047d5d8dddd2d (diff) |
net: TX_RING and packet mmap
New packet socket feature that makes packet socket more efficient for
transmission.
- It reduces number of system call through a PACKET_TX_RING mechanism,
based on PACKET_RX_RING (Circular buffer allocated in kernel space
which is mmapped from user space).
- It minimizes CPU copy using fragmented SKB (almost zero copy).
Signed-off-by: Johann Baudy <johann.baudy@gnu-log.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/packet_mmap.txt | 140 |
1 files changed, 121 insertions, 19 deletions
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 07c53d596035..a22fd85e3796 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt | |||
@@ -4,16 +4,18 @@ | |||
4 | 4 | ||
5 | This file documents the CONFIG_PACKET_MMAP option available with the PACKET | 5 | This file documents the CONFIG_PACKET_MMAP option available with the PACKET |
6 | socket interface on 2.4 and 2.6 kernels. This type of sockets is used for | 6 | socket interface on 2.4 and 2.6 kernels. This type of sockets is used for |
7 | capture network traffic with utilities like tcpdump or any other that uses | 7 | capture network traffic with utilities like tcpdump or any other that needs |
8 | the libpcap library. | 8 | raw access to network interface. |
9 | |||
10 | You can find the latest version of this document at | ||
11 | 9 | ||
10 | You can find the latest version of this document at: | ||
12 | http://pusa.uv.es/~ulisses/packet_mmap/ | 11 | http://pusa.uv.es/~ulisses/packet_mmap/ |
13 | 12 | ||
14 | Please send me your comments to | 13 | Howto can be found at: |
14 | http://wiki.gnu-log.net (packet_mmap) | ||
15 | 15 | ||
16 | Please send your comments to | ||
16 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> | 17 | Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> |
18 | Johann Baudy <johann.baudy@gnu-log.net> | ||
17 | 19 | ||
18 | ------------------------------------------------------------------------------- | 20 | ------------------------------------------------------------------------------- |
19 | + Why use PACKET_MMAP | 21 | + Why use PACKET_MMAP |
@@ -25,19 +27,24 @@ to capture each packet, it requires two if you want to get packet's | |||
25 | timestamp (like libpcap always does). | 27 | timestamp (like libpcap always does). |
26 | 28 | ||
27 | In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size | 29 | In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size |
28 | configurable circular buffer mapped in user space. This way reading packets just | 30 | configurable circular buffer mapped in user space that can be used to either |
29 | needs to wait for them, most of the time there is no need to issue a single | 31 | send or receive packets. This way reading packets just needs to wait for them, |
30 | system call. By using a shared buffer between the kernel and the user | 32 | most of the time there is no need to issue a single system call. Concerning |
31 | also has the benefit of minimizing packet copies. | 33 | transmission, multiple packets can be sent through one system call to get the |
32 | 34 | highest bandwidth. | |
33 | It's fine to use PACKET_MMAP to improve the performance of the capture process, | 35 | By using a shared buffer between the kernel and the user also has the benefit |
34 | but it isn't everything. At least, if you are capturing at high speeds (this | 36 | of minimizing packet copies. |
35 | is relative to the cpu speed), you should check if the device driver of your | 37 | |
36 | network interface card supports some sort of interrupt load mitigation or | 38 | It's fine to use PACKET_MMAP to improve the performance of the capture and |
37 | (even better) if it supports NAPI, also make sure it is enabled. | 39 | transmission process, but it isn't everything. At least, if you are capturing |
40 | at high speeds (this is relative to the cpu speed), you should check if the | ||
41 | device driver of your network interface card supports some sort of interrupt | ||
42 | load mitigation or (even better) if it supports NAPI, also make sure it is | ||
43 | enabled. For transmission, check the MTU (Maximum Transmission Unit) used and | ||
44 | supported by devices of your network. | ||
38 | 45 | ||
39 | -------------------------------------------------------------------------------- | 46 | -------------------------------------------------------------------------------- |
40 | + How to use CONFIG_PACKET_MMAP | 47 | + How to use CONFIG_PACKET_MMAP to improve capture process |
41 | -------------------------------------------------------------------------------- | 48 | -------------------------------------------------------------------------------- |
42 | 49 | ||
43 | From the user standpoint, you should use the higher level libpcap library, which | 50 | From the user standpoint, you should use the higher level libpcap library, which |
@@ -57,7 +64,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP | |||
57 | support. | 64 | support. |
58 | 65 | ||
59 | -------------------------------------------------------------------------------- | 66 | -------------------------------------------------------------------------------- |
60 | + How to use CONFIG_PACKET_MMAP directly | 67 | + How to use CONFIG_PACKET_MMAP directly to improve capture process |
61 | -------------------------------------------------------------------------------- | 68 | -------------------------------------------------------------------------------- |
62 | 69 | ||
63 | From the system calls stand point, the use of PACKET_MMAP involves | 70 | From the system calls stand point, the use of PACKET_MMAP involves |
@@ -66,6 +73,7 @@ the following process: | |||
66 | 73 | ||
67 | [setup] socket() -------> creation of the capture socket | 74 | [setup] socket() -------> creation of the capture socket |
68 | setsockopt() ---> allocation of the circular buffer (ring) | 75 | setsockopt() ---> allocation of the circular buffer (ring) |
76 | option: PACKET_RX_RING | ||
69 | mmap() ---------> mapping of the allocated buffer to the | 77 | mmap() ---------> mapping of the allocated buffer to the |
70 | user process | 78 | user process |
71 | 79 | ||
@@ -97,13 +105,75 @@ also the mapping of the circular buffer in the user process and | |||
97 | the use of this buffer. | 105 | the use of this buffer. |
98 | 106 | ||
99 | -------------------------------------------------------------------------------- | 107 | -------------------------------------------------------------------------------- |
108 | + How to use CONFIG_PACKET_MMAP directly to improve transmission process | ||
109 | -------------------------------------------------------------------------------- | ||
110 | Transmission process is similar to capture as shown below. | ||
111 | |||
112 | [setup] socket() -------> creation of the transmission socket | ||
113 | setsockopt() ---> allocation of the circular buffer (ring) | ||
114 | option: PACKET_TX_RING | ||
115 | bind() ---------> bind transmission socket with a network interface | ||
116 | mmap() ---------> mapping of the allocated buffer to the | ||
117 | user process | ||
118 | |||
119 | [transmission] poll() ---------> wait for free packets (optional) | ||
120 | send() ---------> send all packets that are set as ready in | ||
121 | the ring | ||
122 | The flag MSG_DONTWAIT can be used to return | ||
123 | before end of transfer. | ||
124 | |||
125 | [shutdown] close() --------> destruction of the transmission socket and | ||
126 | deallocation of all associated resources. | ||
127 | |||
128 | Binding the socket to your network interface is mandatory (with zero copy) to | ||
129 | know the header size of frames used in the circular buffer. | ||
130 | |||
131 | As capture, each frame contains two parts: | ||
132 | |||
133 | -------------------- | ||
134 | | struct tpacket_hdr | Header. It contains the status of | ||
135 | | | of this frame | ||
136 | |--------------------| | ||
137 | | data buffer | | ||
138 | . . Data that will be sent over the network interface. | ||
139 | . . | ||
140 | -------------------- | ||
141 | |||
142 | bind() associates the socket to your network interface thanks to | ||
143 | sll_ifindex parameter of struct sockaddr_ll. | ||
144 | |||
145 | Initialization example: | ||
146 | |||
147 | struct sockaddr_ll my_addr; | ||
148 | struct ifreq s_ifr; | ||
149 | ... | ||
150 | |||
151 | strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); | ||
152 | |||
153 | /* get interface index of eth0 */ | ||
154 | ioctl(this->socket, SIOCGIFINDEX, &s_ifr); | ||
155 | |||
156 | /* fill sockaddr_ll struct to prepare binding */ | ||
157 | my_addr.sll_family = AF_PACKET; | ||
158 | my_addr.sll_protocol = ETH_P_ALL; | ||
159 | my_addr.sll_ifindex = s_ifr.ifr_ifindex; | ||
160 | |||
161 | /* bind socket to eth0 */ | ||
162 | bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); | ||
163 | |||
164 | A complete tutorial is available at: http://wiki.gnu-log.net/ | ||
165 | |||
166 | -------------------------------------------------------------------------------- | ||
100 | + PACKET_MMAP settings | 167 | + PACKET_MMAP settings |
101 | -------------------------------------------------------------------------------- | 168 | -------------------------------------------------------------------------------- |
102 | 169 | ||
103 | 170 | ||
104 | To setup PACKET_MMAP from user level code is done with a call like | 171 | To setup PACKET_MMAP from user level code is done with a call like |
105 | 172 | ||
173 | - Capture process | ||
106 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) | 174 | setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) |
175 | - Transmission process | ||
176 | setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) | ||
107 | 177 | ||
108 | The most significant argument in the previous call is the req parameter, | 178 | The most significant argument in the previous call is the req parameter, |
109 | this parameter must to have the following structure: | 179 | this parameter must to have the following structure: |
@@ -117,11 +187,11 @@ this parameter must to have the following structure: | |||
117 | }; | 187 | }; |
118 | 188 | ||
119 | This structure is defined in /usr/include/linux/if_packet.h and establishes a | 189 | This structure is defined in /usr/include/linux/if_packet.h and establishes a |
120 | circular buffer (ring) of unswappable memory mapped in the capture process. | 190 | circular buffer (ring) of unswappable memory. |
121 | Being mapped in the capture process allows reading the captured frames and | 191 | Being mapped in the capture process allows reading the captured frames and |
122 | related meta-information like timestamps without requiring a system call. | 192 | related meta-information like timestamps without requiring a system call. |
123 | 193 | ||
124 | Captured frames are grouped in blocks. Each block is a physically contiguous | 194 | Frames are grouped in blocks. Each block is a physically contiguous |
125 | region of memory and holds tp_block_size/tp_frame_size frames. The total number | 195 | region of memory and holds tp_block_size/tp_frame_size frames. The total number |
126 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because | 196 | of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because |
127 | 197 | ||
@@ -336,6 +406,7 @@ struct tpacket_hdr). If this field is 0 means that the frame is ready | |||
336 | to be used for the kernel, If not, there is a frame the user can read | 406 | to be used for the kernel, If not, there is a frame the user can read |
337 | and the following flags apply: | 407 | and the following flags apply: |
338 | 408 | ||
409 | +++ Capture process: | ||
339 | from include/linux/if_packet.h | 410 | from include/linux/if_packet.h |
340 | 411 | ||
341 | #define TP_STATUS_COPY 2 | 412 | #define TP_STATUS_COPY 2 |
@@ -391,6 +462,37 @@ packets are in the ring: | |||
391 | It doesn't incur in a race condition to first check the status value and | 462 | It doesn't incur in a race condition to first check the status value and |
392 | then poll for frames. | 463 | then poll for frames. |
393 | 464 | ||
465 | |||
466 | ++ Transmission process | ||
467 | Those defines are also used for transmission: | ||
468 | |||
469 | #define TP_STATUS_AVAILABLE 0 // Frame is available | ||
470 | #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() | ||
471 | #define TP_STATUS_SENDING 2 // Frame is currently in transmission | ||
472 | #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct | ||
473 | |||
474 | First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a | ||
475 | packet, the user fills a data buffer of an available frame, sets tp_len to | ||
476 | current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. | ||
477 | This can be done on multiple frames. Once the user is ready to transmit, it | ||
478 | calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are | ||
479 | forwarded to the network device. The kernel updates each status of sent | ||
480 | frames with TP_STATUS_SENDING until the end of transfer. | ||
481 | At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. | ||
482 | |||
483 | header->tp_len = in_i_size; | ||
484 | header->tp_status = TP_STATUS_SEND_REQUEST; | ||
485 | retval = send(this->socket, NULL, 0, 0); | ||
486 | |||
487 | The user can also use poll() to check if a buffer is available: | ||
488 | (status == TP_STATUS_SENDING) | ||
489 | |||
490 | struct pollfd pfd; | ||
491 | pfd.fd = fd; | ||
492 | pfd.revents = 0; | ||
493 | pfd.events = POLLOUT; | ||
494 | retval = poll(&pfd, 1, timeout); | ||
495 | |||
394 | -------------------------------------------------------------------------------- | 496 | -------------------------------------------------------------------------------- |
395 | + THANKS | 497 | + THANKS |
396 | -------------------------------------------------------------------------------- | 498 | -------------------------------------------------------------------------------- |
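
The transmission flow documented by this patch (pick an available frame, fill its data buffer, set tp_len, mark it TP_STATUS_SEND_REQUEST, call send(), and let the kernel return it to TP_STATUS_AVAILABLE) can be tried out without a socket or root privileges. The following is a minimal user-space sketch under stated assumptions, not the kernel implementation. Every demo_* name is hypothetical. struct demo_frame_hdr is a simplified stand-in for struct tpacket_hdr. The DEMO_TP_STATUS_* values are copied from the defines quoted in the patch. demo_flush() only imitates what the kernel is described as doing after send(). The sketch also places the payload directly after the header, which is a simplification of the real frame layout.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Status values as listed in the patch for the transmission ring. */
enum {
    DEMO_TP_STATUS_AVAILABLE    = 0, /* frame is available               */
    DEMO_TP_STATUS_SEND_REQUEST = 1, /* frame will be sent on next send() */
    DEMO_TP_STATUS_SENDING      = 2, /* frame is currently in transmission */
    DEMO_TP_STATUS_WRONG_FORMAT = 4  /* frame format is not correct       */
};

/* Hypothetical, simplified stand-in for struct tpacket_hdr. */
struct demo_frame_hdr {
    unsigned long tp_status;
    unsigned int  tp_len;
};

/* Frames in the ring: tp_block_nr blocks, each holding
 * tp_block_size / tp_frame_size frames. */
unsigned int demo_frame_nr(unsigned int block_size, unsigned int block_nr,
                           unsigned int frame_size)
{
    if (frame_size == 0)
        return 0;
    return (block_size / frame_size) * block_nr;
}

/* Locate frame idx. A frame never straddles two blocks, so the block
 * index and the slot inside the block are computed separately. */
struct demo_frame_hdr *demo_frame_at(void *ring, unsigned int block_size,
                                     unsigned int frame_size, unsigned int idx)
{
    assert(frame_size != 0 && frame_size <= block_size);
    unsigned int per_block = block_size / frame_size;
    size_t off = (size_t)(idx / per_block) * block_size
               + (size_t)(idx % per_block) * frame_size;
    return (struct demo_frame_hdr *)((char *)ring + off);
}

/* User side: fill an available frame and request its transmission.
 * Returns 0 on success, -1 if the frame is busy or the data does not fit. */
int demo_queue_frame(struct demo_frame_hdr *hdr, unsigned int frame_size,
                     const void *data, unsigned int len)
{
    if (hdr->tp_status != DEMO_TP_STATUS_AVAILABLE)
        return -1;
    if (frame_size < sizeof(*hdr) || len > frame_size - sizeof(*hdr))
        return -1;
    memcpy((char *)hdr + sizeof(*hdr), data, len); /* simplified data offset */
    hdr->tp_len = len;
    hdr->tp_status = DEMO_TP_STATUS_SEND_REQUEST;
    return 0;
}

/* Simulated kernel side of send(): every SEND_REQUEST frame goes through
 * SENDING and comes back as AVAILABLE. Returns the number of frames sent. */
unsigned int demo_flush(void *ring, unsigned int block_size,
                        unsigned int block_nr, unsigned int frame_size)
{
    unsigned int n = demo_frame_nr(block_size, block_nr, frame_size);
    unsigned int sent = 0;
    for (unsigned int i = 0; i < n; i++) {
        struct demo_frame_hdr *hdr =
            demo_frame_at(ring, block_size, frame_size, i);
        if (hdr->tp_status == DEMO_TP_STATUS_SEND_REQUEST) {
            hdr->tp_status = DEMO_TP_STATUS_SENDING;   /* transfer in flight */
            hdr->tp_status = DEMO_TP_STATUS_AVAILABLE; /* transfer finished  */
            sent++;
        }
    }
    return sent;
}
```

With a real PACKET_TX_RING, the ring pointer would come from mmap() on the bound socket and demo_flush() would be replaced by send(fd, NULL, 0, 0), as in the snippet added by the patch. The status checks on the user side stay the same.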