-rw-r--r-- | Documentation/networking/scaling.txt | 371 |
1 files changed, 371 insertions, 0 deletions
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt new file mode 100644 index 000000000000..7254b4b5910e --- /dev/null +++ b/Documentation/networking/scaling.txt | |||
@@ -0,0 +1,371 @@ | |||
1 | Scaling in the Linux Networking Stack | ||
2 | |||
3 | |||
4 | Introduction | ||
5 | ============ | ||
6 | |||
7 | This document describes a set of complementary techniques in the Linux | ||
8 | networking stack to increase parallelism and improve performance for | ||
9 | multi-processor systems. | ||
10 | |||
11 | The following technologies are described: | ||
12 | |||
13 | RSS: Receive Side Scaling | ||
14 | RPS: Receive Packet Steering | ||
15 | RFS: Receive Flow Steering | ||
16 | Accelerated Receive Flow Steering | ||
17 | XPS: Transmit Packet Steering | ||
18 | |||
19 | |||
20 | RSS: Receive Side Scaling | ||
21 | ========================= | ||
22 | |||
23 | Contemporary NICs support multiple receive and transmit descriptor queues | ||
24 | (multi-queue). On reception, a NIC can send different packets to different | ||
25 | queues to distribute processing among CPUs. The NIC distributes packets by | ||
26 | applying a filter to each packet that assigns it to one of a small number | ||
27 | of logical flows. Packets for each flow are steered to a separate receive | ||
28 | queue, which in turn can be processed by separate CPUs. This mechanism is | ||
29 | generally known as “Receive-side Scaling” (RSS). The goal of RSS and | ||
30 | the other scaling techniques is to increase performance uniformly. | ||
31 | Multi-queue distribution can also be used for traffic prioritization, but | ||
32 | that is not the focus of these techniques. | ||
33 | |||
34 | The filter used in RSS is typically a hash function over the network | ||
35 | and/or transport layer headers, for example a 4-tuple hash over | ||
36 | IP addresses and TCP ports of a packet. The most common hardware | ||
37 | implementation of RSS uses a 128-entry indirection table where each entry | ||
38 | stores a queue number. The receive queue for a packet is determined | ||
39 | by keeping only the low-order seven bits of the computed hash for the | ||
40 | packet (usually a Toeplitz hash), taking this number as a key into the | ||
41 | indirection table and reading the corresponding value. | ||
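
The lookup described above can be sketched as follows. This is a minimal illustration, not driver code: the names build_default_table and rss_queue are hypothetical, the hash is taken as given rather than computed with Toeplitz, and the round-robin fill is an assumption about a typical default layout.

```python
TABLE_SIZE = 128               # 128-entry indirection table, as in the text
INDEX_MASK = TABLE_SIZE - 1    # 0x7f keeps the low-order seven bits


def build_default_table(num_queues):
    """Fill the table round-robin so queues are spread evenly (assumed default)."""
    return [i % num_queues for i in range(TABLE_SIZE)]


def rss_queue(flow_hash, table):
    """Return the receive queue for a packet with the given 32-bit hash."""
    return table[flow_hash & INDEX_MASK]


table = build_default_table(4)
print(rss_queue(0x12345685, table))   # low seven bits are 0x05, which maps to queue 1
```

Because only the table contents change when queues are re-weighted, the per-packet lookup stays a single mask and array read.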
42 | |||
43 | Some advanced NICs allow steering packets to queues based on | ||
44 | programmable filters. For example, webserver bound TCP port 80 packets | ||
45 | can be directed to their own receive queue. Such “n-tuple” filters can | ||
46 | be configured from ethtool (--config-ntuple). | ||
47 | |||
48 | ==== RSS Configuration | ||
49 | |||
50 | The driver for a multi-queue capable NIC typically provides a kernel | ||
51 | module parameter for specifying the number of hardware queues to | ||
52 | configure. In the bnx2x driver, for instance, this parameter is called | ||
53 | num_queues. A typical RSS configuration would be to have one receive queue | ||
54 | for each CPU if the device supports enough queues, or otherwise at least | ||
55 | one for each cache domain at a particular cache level (L1, L2, etc.). | ||
56 | |||
57 | The indirection table of an RSS device, which resolves a queue by masked | ||
58 | hash, is usually programmed by the driver at initialization. The | ||
59 | default mapping is to distribute the queues evenly in the table, but the | ||
60 | indirection table can be retrieved and modified at runtime using ethtool | ||
61 | commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the | ||
62 | indirection table could be done to give different queues different | ||
63 | relative weights. | ||
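
To illustrate what giving queues "different relative weights" means for the table contents, here is a small sketch. The helper weighted_table is hypothetical, and the way leftover slots are handed out after integer division is an arbitrary choice made for this example, not a description of what ethtool or any driver does.

```python
def weighted_table(weights, size=128):
    """Build an indirection table where queue i gets a share proportional to weights[i]."""
    total = sum(weights)
    table = []
    for queue, weight in enumerate(weights):
        table += [queue] * (size * weight // total)
    # Integer division may leave a few slots unfilled; hand them out in queue order.
    queue = 0
    while len(table) < size:
        table.append(queue % len(weights))
        queue += 1
    return table


table = weighted_table([3, 1])
print(table.count(0), table.count(1))   # 96 32
```

With weights 3:1, queue 0 owns three quarters of the entries and therefore receives roughly three quarters of the flows, assuming the hash spreads flows evenly.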
64 | |||
65 | == RSS IRQ Configuration | ||
66 | |||
67 | Each receive queue has a separate IRQ associated with it. The NIC triggers | ||
68 | this to notify a CPU when new packets arrive on the given queue. The | ||
69 | signaling path for PCIe devices uses message signaled interrupts (MSI-X), | ||
70 | which can route each interrupt to a particular CPU. The active mapping | ||
71 | of queues to IRQs can be determined from /proc/interrupts. By default, | ||
72 | an IRQ may be handled on any CPU. Because a non-negligible part of packet | ||
73 | processing takes place in receive interrupt handling, it is advantageous | ||
74 | to spread receive interrupts between CPUs. To manually adjust the IRQ | ||
75 | affinity of each interrupt, see Documentation/IRQ-affinity.txt. Some systems | ||
76 | will be running irqbalance, a daemon that dynamically optimizes IRQ | ||
77 | assignments and as a result may override any manual settings. | ||
78 | |||
79 | == Suggested Configuration | ||
80 | |||
81 | RSS should be enabled when latency is a concern or whenever receive | ||
82 | interrupt processing forms a bottleneck. Spreading load between CPUs | ||
83 | decreases queue length. For low latency networking, the optimal setting | ||
84 | is to allocate as many queues as there are CPUs in the system (or the | ||
85 | NIC maximum, if lower). Because the aggregate number of interrupts grows | ||
86 | with each additional queue, the most efficient high-rate configuration | ||
87 | is likely the one with the smallest number of receive queues where no | ||
88 | CPU that processes receive interrupts reaches 100% utilization. Per-CPU | ||
89 | load can be observed using the mpstat utility. | ||
90 | |||
91 | |||
92 | RPS: Receive Packet Steering | ||
93 | ============================ | ||
94 | |||
95 | Receive Packet Steering (RPS) is logically a software implementation of | ||
96 | RSS. Being in software, it is necessarily called later in the datapath. | ||
97 | Whereas RSS selects the queue and hence CPU that will run the hardware | ||
98 | interrupt handler, RPS selects the CPU to perform protocol processing | ||
99 | above the interrupt handler. This is accomplished by placing the packet | ||
100 | on the desired CPU’s backlog queue and waking up the CPU for processing. | ||
101 | RPS has some advantages over RSS: 1) it can be used with any NIC, | ||
102 | 2) software filters can easily be added to hash over new protocols, | ||
103 | 3) it does not increase hardware device interrupt rate (although it does | ||
104 | introduce inter-processor interrupts (IPIs)). | ||
105 | |||
106 | RPS is called during the bottom half of the receive interrupt handler, when | ||
107 | a driver sends a packet up the network stack with netif_rx() or | ||
108 | netif_receive_skb(). These call the get_rps_cpu() function, which | ||
109 | selects the CPU whose backlog queue should receive the packet. | ||
110 | |||
111 | The first step in determining the target CPU for RPS is to calculate a | ||
112 | flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash | ||
113 | depending on the protocol). This serves as a consistent hash of the | ||
114 | associated flow of the packet. The hash is either provided by hardware | ||
115 | or will be computed in the stack. Capable hardware can pass the hash in | ||
116 | the receive descriptor for the packet; this would usually be the same | ||
117 | hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in | ||
118 | skb->rxhash and can be used elsewhere in the stack as a hash of the | ||
119 | packet’s flow. | ||
120 | |||
121 | Each receive hardware queue has an associated list of CPUs to which | ||
122 | RPS may enqueue packets for processing. For each received packet, | ||
123 | an index into the list is computed from the flow hash modulo the size | ||
124 | of the list. The indexed CPU is the target for processing the packet, | ||
125 | and the packet is queued to the tail of that CPU’s backlog queue. At | ||
126 | the end of the bottom half routine, IPIs are sent to any CPUs for which | ||
127 | packets have been queued to their backlog queue. The IPI wakes backlog | ||
128 | processing on the remote CPU, and any queued packets are then processed | ||
129 | up the networking stack. | ||
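
The selection and enqueue steps in this paragraph can be modelled in a few lines. This is a user-space sketch only: rps_steer is a hypothetical name, the modulo follows the description in the text rather than the exact kernel arithmetic, and the returned list of CPUs stands in for the IPIs sent at the end of the bottom half.

```python
from collections import defaultdict, deque


def rps_steer(packets, cpu_list):
    """Queue each (flow_hash, packet) pair on the backlog of its target CPU.

    Returns the per-CPU backlogs and the CPUs that would need an IPI.
    """
    backlogs = defaultdict(deque)
    for flow_hash, packet in packets:
        target = cpu_list[flow_hash % len(cpu_list)]   # index into the queue's CPU list
        backlogs[target].append(packet)                # enqueue at the tail
    ipi_targets = sorted(backlogs)                     # every CPU that received work
    return backlogs, ipi_targets


backlogs, ipi_targets = rps_steer([(7, "a"), (9, "b"), (7, "c")], [0, 2, 3])
print(dict(backlogs), ipi_targets)
```

Packets "a" and "c" share a flow hash, so they land on the same backlog in arrival order, which is what keeps a flow's packets in sequence.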
130 | |||
131 | ==== RPS Configuration | ||
132 | |||
133 | RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on | ||
134 | by default for SMP). Even when compiled in, RPS remains disabled until | ||
135 | explicitly configured. The list of CPUs to which RPS may forward traffic | ||
136 | can be configured for each receive queue using a sysfs file entry: | ||
137 | |||
138 | /sys/class/net/<dev>/queues/rx-<n>/rps_cpus | ||
139 | |||
140 | This file implements a bitmap of CPUs. RPS is disabled when it is zero | ||
141 | (the default), in which case packets are processed on the interrupting | ||
142 | CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to | ||
143 | the bitmap. | ||
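
As an aid for building the value written to rps_cpus, the sketch below turns a list of CPU numbers into a hexadecimal bitmap string. The helper names are hypothetical, and the layout (32-bit hex groups separated by commas, most significant group first) is assumed here to match the CPU mask format the kernel uses for such files.

```python
def cpus_to_mask(cpus):
    """Set one bit per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask


def format_cpumask(mask):
    """Render the mask as comma-separated 32-bit hex groups, highest group first."""
    groups = []
    while True:
        groups.append(f"{mask & 0xffffffff:08x}")
        mask >>= 32
        if mask == 0:
            break
    return ",".join(reversed(groups))


print(format_cpumask(cpus_to_mask([0, 1, 2, 3])))   # 0000000f
print(format_cpumask(cpus_to_mask([0, 33])))        # 00000002,00000001
```

The resulting string would then be written to the rps_cpus file of the chosen receive queue; a mask of zero leaves RPS disabled for that queue.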
144 | |||
145 | == Suggested Configuration | ||
146 | |||
147 | For a single queue device, a typical RPS configuration would be to set | ||
148 | the rps_cpus to the CPUs in the same cache domain as the interrupting | ||
149 | CPU. If NUMA locality is not an issue, this could also be all CPUs in | ||
150 | the system. At high interrupt rate, it might be wise to exclude the | ||
151 | interrupting CPU from the map since that already performs much work. | ||
152 | |||
153 | For a multi-queue system, if RSS is configured so that a hardware | ||
154 | receive queue is mapped to each CPU, then RPS is probably redundant | ||
155 | and unnecessary. If there are fewer hardware queues than CPUs, then | ||
156 | RPS might be beneficial if the rps_cpus for each queue are the ones that | ||
157 | share the same cache domain as the interrupting CPU for that queue. | ||
158 | |||
159 | |||
160 | RFS: Receive Flow Steering | ||
161 | ========================== | ||
162 | |||
163 | While RPS steers packets solely based on hash, and thus generally | ||
164 | provides good load distribution, it does not take into account | ||
165 | application locality. This is accomplished by Receive Flow Steering | ||
166 | (RFS). The goal of RFS is to increase datacache hitrate by steering | ||
167 | kernel processing of packets to the CPU where the application thread | ||
168 | consuming the packet is running. RFS relies on the same RPS mechanisms | ||
169 | to enqueue packets onto the backlog of another CPU and to wake up that | ||
170 | CPU. | ||
171 | |||
172 | In RFS, packets are not forwarded directly by the value of their hash, | ||
173 | but the hash is used as an index into a flow lookup table. This table maps | ||
174 | flows to the CPUs where those flows are being processed. The flow hash | ||
175 | (see RPS section above) is used to calculate the index into this table. | ||
176 | The CPU recorded in each entry is the one which last processed the flow. | ||
177 | If an entry does not hold a valid CPU, then packets mapped to that entry | ||
178 | are steered using plain RPS. Multiple table entries may point to the | ||
179 | same CPU. Indeed, with many flows and few CPUs, it is very likely that | ||
180 | a single application thread handles flows with many different flow hashes. | ||
181 | |||
182 | rps_sock_flow_table is a global flow table that contains the *desired* CPU for | ||
183 | flows: the CPU that is currently processing the flow in userspace. Each | ||
184 | table value is a CPU index that is updated during calls to recvmsg and | ||
185 | sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() | ||
186 | and tcp_splice_read()). | ||
187 | |||
188 | When the scheduler moves a thread to a new CPU while it has outstanding | ||
189 | receive packets on the old CPU, packets may arrive out of order. To | ||
190 | avoid this, RFS uses a second flow table to track outstanding packets | ||
191 | for each flow: rps_dev_flow_table is a table specific to each hardware | ||
192 | receive queue of each device. Each table value stores a CPU index and a | ||
193 | counter. The CPU index represents the *current* CPU onto which packets | ||
194 | for this flow are enqueued for further kernel processing. Ideally, kernel | ||
195 | and userspace processing occur on the same CPU, and hence the CPU index | ||
196 | in both tables is identical. This is likely false if the scheduler has | ||
197 | recently migrated a userspace thread while the kernel still has packets | ||
198 | enqueued for kernel processing on the old CPU. | ||
199 | |||
200 | The counter in rps_dev_flow_table values records the length of the current | ||
201 | CPU's backlog when a packet in this flow was last enqueued. Each backlog | ||
202 | queue has a head counter that is incremented on dequeue. A tail counter | ||
203 | is computed as head counter + queue length. In other words, the counter | ||
204 | in rps_dev_flow_table[i] records the last element in flow i that has | ||
205 | been enqueued onto the currently designated CPU for flow i (of course, | ||
206 | entry i is actually selected by hash and multiple flows may hash to the | ||
207 | same entry i). | ||
208 | |||
209 | And now the trick for avoiding out of order packets: when selecting the | ||
210 | CPU for packet processing (from get_rps_cpu()) the rps_sock_flow_table | ||
211 | and the rps_dev_flow_table of the queue that the packet was received on | ||
212 | are compared. If the desired CPU for the flow (found in the | ||
213 | rps_sock_flow_table) matches the current CPU (found in the | ||
214 | rps_dev_flow_table), the packet is enqueued onto that CPU’s backlog. If they differ, | ||
215 | the current CPU is updated to match the desired CPU if one of the | ||
216 | following is true: | ||
217 | |||
218 | - The current CPU's queue head counter >= the recorded tail counter | ||
219 | value in rps_dev_flow_table[i] | ||
220 | - The current CPU is unset (equal to NR_CPUS) | ||
221 | - The current CPU is offline | ||
222 | |||
223 | After this check, the packet is sent to the (possibly updated) current | ||
224 | CPU. These rules aim to ensure that a flow only moves to a new CPU when | ||
225 | there are no packets outstanding on the old CPU, as the outstanding | ||
226 | packets could arrive later than those about to be processed on the new | ||
227 | CPU. | ||
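
The rules above can be exercised with a small simulation. Everything here is a simplified model under stated assumptions: Backlog, DevFlow, rfs_enqueue and NO_CPU are hypothetical names, NO_CPU stands in for the "unset" value (NR_CPUS in the text), and the desired CPU is passed in directly instead of being read from rps_sock_flow_table.

```python
NO_CPU = -1   # stand-in for the "unset" CPU value


class Backlog:
    def __init__(self):
        self.head = 0          # incremented on every dequeue
        self.queue = []

    def enqueue(self, packet):
        self.queue.append(packet)
        return self.head + len(self.queue)   # tail counter after this enqueue

    def dequeue(self):
        self.head += 1
        return self.queue.pop(0)


class DevFlow:
    def __init__(self):
        self.cpu = NO_CPU      # current CPU for this flow entry
        self.last_tail = 0     # tail counter recorded at the last enqueue


def rfs_enqueue(packet, desired, flow, backlogs, online):
    """Apply the three switch conditions, then enqueue on the current CPU."""
    current = flow.cpu
    if desired != current:
        if (current == NO_CPU or current not in online
                or backlogs[current].head >= flow.last_tail):
            current = flow.cpu = desired
    flow.last_tail = backlogs[current].enqueue(packet)
    return current


backlogs = {0: Backlog(), 1: Backlog()}
flow = DevFlow()
first = rfs_enqueue("p1", 0, flow, backlogs, {0, 1})    # entry unset: adopts CPU 0
second = rfs_enqueue("p2", 1, flow, backlogs, {0, 1})   # p1 still queued: stays on CPU 0
backlogs[0].dequeue()
backlogs[0].dequeue()                                   # CPU 0 drains its backlog
third = rfs_enqueue("p3", 1, flow, backlogs, {0, 1})    # nothing outstanding: moves to CPU 1
print(first, second, third)
```

The second packet follows the first onto the old CPU even though the thread has moved, which is exactly the case where switching early could reorder the flow.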
228 | |||
229 | ==== RFS Configuration | ||
230 | |||
231 | RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on | ||
232 | by default for SMP). The functionality remains disabled until explicitly | ||
233 | configured. The number of entries in the global flow table is set through: | ||
234 | |||
235 | /proc/sys/net/core/rps_sock_flow_entries | ||
236 | |||
237 | The number of entries in the per-queue flow table is set through: | ||
238 | |||
239 | /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt | ||
240 | |||
241 | == Suggested Configuration | ||
242 | |||
243 | Both of these need to be set before RFS is enabled for a receive queue. | ||
244 | Values for both are rounded up to the nearest power of two. The | ||
245 | suggested flow count depends on the expected number of active connections | ||
246 | at any given time, which may be significantly less than the number of open | ||
247 | connections. We have found that a value of 32768 for rps_sock_flow_entries | ||
248 | works fairly well on a moderately loaded server. | ||
249 | |||
250 | For a single queue device, the rps_flow_cnt value for the single queue | ||
251 | would normally be configured to the same value as rps_sock_flow_entries. | ||
252 | For a multi-queue device, the rps_flow_cnt for each queue might be | ||
253 | configured as rps_sock_flow_entries / N, where N is the number of | ||
254 | queues. So for instance, if rps_sock_flow_entries is set to 32768 and there | ||
255 | are 16 configured receive queues, rps_flow_cnt for each queue might be | ||
256 | configured as 2048. | ||
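
The sizing arithmetic in this section is short enough to write out. The helper names are hypothetical; the rounding mirrors the statement above that both values are rounded up to a power of two.

```python
def roundup_pow_of_two(n):
    """Smallest power of two that is >= n, for n >= 1."""
    return 1 << (n - 1).bit_length()


def per_queue_flow_cnt(sock_flow_entries, num_queues):
    """Suggested rps_flow_cnt: the global table size split across the receive queues."""
    return roundup_pow_of_two(sock_flow_entries // num_queues)


print(per_queue_flow_cnt(32768, 16))   # 2048, the example from the text
print(per_queue_flow_cnt(32768, 12))   # 2730 rounds up to 4096
```

When the queue count is not a power of two, the rounded per-queue sizes add up to more than the global table, which costs memory but does not affect correctness.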
257 | |||
258 | |||
259 | Accelerated RFS | ||
260 | =============== | ||
261 | |||
262 | Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load | ||
263 | balancing mechanism that uses soft state to steer flows based on where | ||
264 | the application thread consuming the packets of each flow is running. | ||
265 | Accelerated RFS should perform better than RFS since packets are sent | ||
266 | directly to a CPU local to the thread consuming the data. The target CPU | ||
267 | will either be the same CPU where the application runs, or at least a CPU | ||
268 | which is local to the application thread’s CPU in the cache hierarchy. | ||
269 | |||
270 | To enable accelerated RFS, the networking stack calls the | ||
271 | ndo_rx_flow_steer driver function to communicate the desired hardware | ||
272 | queue for packets matching a particular flow. The network stack | ||
273 | automatically calls this function every time a flow entry in | ||
274 | rps_dev_flow_table is updated. The driver in turn uses a device specific | ||
275 | method to program the NIC to steer the packets. | ||
276 | |||
277 | The hardware queue for a flow is derived from the CPU recorded in | ||
278 | rps_dev_flow_table. The stack consults a CPU to hardware queue map which | ||
279 | is maintained by the NIC driver. This is an auto-generated reverse map of | ||
280 | the IRQ affinity table shown by /proc/interrupts. Drivers can use | ||
281 | functions in the cpu_rmap (“CPU affinity reverse map”) kernel library | ||
282 | to populate the map. For each CPU, the corresponding queue in the map is | ||
283 | set to be one whose processing CPU is closest in cache locality. | ||
284 | |||
285 | ==== Accelerated RFS Configuration | ||
286 | |||
287 | Accelerated RFS is only available if the kernel is compiled with | ||
288 | CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. | ||
289 | It also requires that ntuple filtering is enabled via ethtool. The map | ||
290 | of CPU to queues is automatically deduced from the IRQ affinities | ||
291 | configured for each receive queue by the driver, so no additional | ||
292 | configuration should be necessary. | ||
293 | |||
294 | == Suggested Configuration | ||
295 | |||
296 | This technique should be enabled whenever one wants to use RFS and the | ||
297 | NIC supports hardware acceleration. | ||
298 | |||
299 | XPS: Transmit Packet Steering | ||
300 | ============================= | ||
301 | |||
302 | Transmit Packet Steering is a mechanism for intelligently selecting | ||
303 | which transmit queue to use when transmitting a packet on a multi-queue | ||
304 | device. To accomplish this, a mapping from CPU to hardware queue(s) is | ||
305 | recorded. The goal of this mapping is usually to assign queues | ||
306 | exclusively to a subset of CPUs, where the transmit completions for | ||
307 | these queues are processed on a CPU within this set. This choice | ||
308 | provides two benefits. First, contention on the device queue lock is | ||
309 | significantly reduced since fewer CPUs contend for the same queue | ||
310 | (contention can be eliminated completely if each CPU has its own | ||
311 | transmit queue). Second, cache miss rate on transmit completion is | ||
312 | reduced, in particular for data cache lines that hold the sk_buff | ||
313 | structures. | ||
314 | |||
315 | XPS is configured per transmit queue by setting a bitmap of CPUs that | ||
316 | may use that queue to transmit. The reverse mapping, from CPUs to | ||
317 | transmit queues, is computed and maintained for each network device. | ||
318 | When transmitting the first packet in a flow, the function | ||
319 | get_xps_queue() is called to select a queue. This function uses the ID | ||
320 | of the running CPU as a key into the CPU-to-queue lookup table. If the | ||
321 | ID matches a single queue, that is used for transmission. If multiple | ||
322 | queues match, one is selected by using the flow hash to compute an index | ||
323 | into the set. | ||
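
A rough model of this lookup is shown below. The names build_cpu_to_queues and pick_xps_queue are hypothetical, and using a modulo of the flow hash to index the candidate set is an assumption; the text only says the hash is used to compute an index.

```python
def build_cpu_to_queues(xps_cpus):
    """Invert {queue: CPU bitmap} into {cpu: [queues that CPU may use]}."""
    reverse = {}
    for queue in sorted(xps_cpus):
        mask = xps_cpus[queue]
        cpu = 0
        while mask:
            if mask & 1:
                reverse.setdefault(cpu, []).append(queue)
            mask >>= 1
            cpu += 1
    return reverse


def pick_xps_queue(cpu, flow_hash, cpu_to_queues):
    """Return a transmit queue for the running CPU, or None if it has no mapping."""
    queues = cpu_to_queues.get(cpu)
    if not queues:
        return None
    if len(queues) == 1:
        return queues[0]
    return queues[flow_hash % len(queues)]


# Queue 0 may be used by CPUs 0 and 1, queue 1 by CPUs 1 and 2.
cpu_map = build_cpu_to_queues({0: 0b0011, 1: 0b0110})
print(cpu_map)                          # {0: [0], 1: [0, 1], 2: [1]}
print(pick_xps_queue(1, 5, cpu_map))    # CPU 1 has two candidates; hash 5 picks index 1
```

A CPU with no entry in the map gets None here, standing in for the case where XPS offers no preference and the stack falls back to its normal queue selection.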
324 | |||
325 | The queue chosen for transmitting a particular flow is saved in the | ||
326 | corresponding socket structure for the flow (e.g. a TCP connection). | ||
327 | This transmit queue is used for subsequent packets sent on the flow to | ||
328 | prevent out of order (ooo) packets. The choice also amortizes the cost | ||
329 | of calling get_xps_queue() over all packets in the connection. To avoid | ||
330 | ooo packets, the queue for a flow can subsequently only be changed if | ||
331 | skb->ooo_okay is set for a packet in the flow. This flag indicates that | ||
332 | there are no outstanding packets in the flow, so the transmit queue can | ||
333 | change without the risk of generating out of order packets. The | ||
334 | transport layer is responsible for setting ooo_okay appropriately. TCP, | ||
335 | for instance, sets the flag when all data for a connection has been | ||
336 | acknowledged. | ||
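
The caching behaviour described above amounts to a "sticky" queue choice that is only revisited when reordering is impossible. In this sketch, Sock, select_tx_queue and the pick_queue callback are hypothetical stand-ins for the socket structure, the stack's selection path and the XPS lookup.

```python
class Sock:
    def __init__(self):
        self.tx_queue = None   # no queue recorded yet for this flow


def select_tx_queue(sock, ooo_okay, pick_queue):
    """Reuse the recorded queue unless none is set or ooo_okay permits a change."""
    if sock.tx_queue is None or ooo_okay:
        sock.tx_queue = pick_queue()
    return sock.tx_queue


sock = Sock()
a = select_tx_queue(sock, False, lambda: 2)   # first packet: queue is chosen and saved
b = select_tx_queue(sock, False, lambda: 5)   # data in flight: saved queue is reused
c = select_tx_queue(sock, True, lambda: 5)    # nothing outstanding: queue may change
print(a, b, c)
```

The second call ignores the new candidate queue because packets may still be outstanding on the old one; only the third call, with ooo_okay set, lets the flow move.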
337 | |||
338 | ==== XPS Configuration | ||
339 | |||
340 | XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by | ||
341 | default for SMP). The functionality remains disabled until explicitly | ||
342 | configured. To enable XPS, the bitmap of CPUs that may use a transmit | ||
343 | queue is configured using the sysfs file entry: | ||
344 | |||
345 | /sys/class/net/<dev>/queues/tx-<n>/xps_cpus | ||
346 | |||
347 | == Suggested Configuration | ||
348 | |||
349 | For a network device with a single transmission queue, XPS configuration | ||
350 | has no effect, since there is no choice in this case. In a multi-queue | ||
351 | system, XPS is preferably configured so that each CPU maps onto one queue. | ||
352 | If there are as many queues as there are CPUs in the system, then each | ||
353 | queue can also map onto one CPU, resulting in exclusive pairings that | ||
354 | experience no contention. If there are fewer queues than CPUs, then the | ||
355 | best CPUs to share a given queue are probably those that share the cache | ||
356 | with the CPU that processes transmit completions for that queue | ||
357 | (transmit interrupts). | ||
358 | |||
359 | |||
360 | Further Information | ||
361 | =================== | ||
362 | RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into | ||
363 | 2.6.38. Original patches were submitted by Tom Herbert | ||
364 | (therbert@google.com) | ||
365 | |||
366 | Accelerated RFS was introduced in 2.6.35. Original patches were | ||
367 | submitted by Ben Hutchings (bhutchings@solarflare.com) | ||
368 | |||
369 | Authors: | ||
370 | Tom Herbert (therbert@google.com) | ||
371 | Willem de Bruijn (willemb@google.com) | ||