diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/NAPI_HOWTO.txt | 766 | ||||
-rw-r--r-- | Documentation/networking/dccp.txt | 21 | ||||
-rw-r--r-- | Documentation/networking/dgrs.txt | 52 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 17 | ||||
-rw-r--r-- | Documentation/networking/mac80211-injection.txt | 32 | ||||
-rw-r--r-- | Documentation/networking/netconsole.txt | 99 | ||||
-rw-r--r-- | Documentation/networking/netdevices.txt | 15 |
7 files changed, 163 insertions, 839 deletions
diff --git a/Documentation/networking/NAPI_HOWTO.txt b/Documentation/networking/NAPI_HOWTO.txt deleted file mode 100644 index 7907435a661c..000000000000 --- a/Documentation/networking/NAPI_HOWTO.txt +++ /dev/null | |||
@@ -1,766 +0,0 @@ | |||
1 | HISTORY: | ||
2 | February 16/2002 -- revision 0.2.1: | ||
3 | COR typo corrected | ||
4 | February 10/2002 -- revision 0.2: | ||
5 | some spell checking ;-> | ||
6 | January 12/2002 -- revision 0.1 | ||
7 | This is still work in progress so may change. | ||
8 | To keep up to date please watch this space. | ||
9 | |||
10 | Introduction to NAPI | ||
11 | ==================== | ||
12 | |||
13 | NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique | ||
14 | to improve network performance on Linux. For more details please | ||
15 | read that paper. | ||
16 | NAPI provides a "inherent mitigation" which is bound by system capacity | ||
17 | as can be seen from the following data collected by Robert on Gigabit | ||
18 | ethernet (e1000): | ||
19 | |||
20 | Psize Ipps Tput Rxint Txint Done Ndone | ||
21 | --------------------------------------------------------------- | ||
22 | 60 890000 409362 17 27622 7 6823 | ||
23 | 128 758150 464364 21 9301 10 7738 | ||
24 | 256 445632 774646 42 15507 21 12906 | ||
25 | 512 232666 994445 241292 19147 241192 1062 | ||
26 | 1024 119061 1000003 872519 19258 872511 0 | ||
27 | 1440 85193 1000003 946576 19505 946569 0 | ||
28 | |||
29 | |||
30 | Legend: | ||
31 | "Ipps" stands for input packets per second. | ||
32 | "Tput" == packets out of total 1M that made it out. | ||
33 | "txint" == transmit completion interrupts seen | ||
34 | "Done" == The number of times that the poll() managed to pull all | ||
35 | packets out of the rx ring. Note from this that the lower the | ||
36 | load the more we could clean up the rxring | ||
37 | "Ndone" == is the converse of "Done". Note again, that the higher | ||
38 | the load the more times we couldn't clean up the rxring. | ||
39 | |||
40 | Observe that: | ||
41 | when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated. | ||
42 | The system cant handle the processing at 1 interrupt/packet at that load level. | ||
43 | At lower rates on the other hand, rx interrupts go up and therefore the | ||
44 | interrupt/packet ratio goes up (as observable from that table). So there is | ||
45 | possibility that under low enough input, you get one poll call for each | ||
46 | input packet caused by a single interrupt each time. And if the system | ||
47 | cant handle interrupt per packet ratio of 1, then it will just have to | ||
48 | chug along .... | ||
49 | |||
50 | |||
51 | 0) Prerequisites: | ||
52 | ================== | ||
53 | A driver MAY continue using the old 2.4 technique for interfacing | ||
54 | to the network stack and not benefit from the NAPI changes. | ||
55 | NAPI additions to the kernel do not break backward compatibility. | ||
56 | NAPI, however, requires the following features to be available: | ||
57 | |||
58 | A) DMA ring or enough RAM to store packets in software devices. | ||
59 | |||
60 | B) Ability to turn off interrupts or maybe events that send packets up | ||
61 | the stack. | ||
62 | |||
63 | NAPI processes packet events in what is known as dev->poll() method. | ||
64 | Typically, only packet receive events are processed in dev->poll(). | ||
65 | The rest of the events MAY be processed by the regular interrupt handler | ||
66 | to reduce processing latency (justified also because there are not that | ||
67 | many of them). | ||
68 | Note, however, NAPI does not enforce that dev->poll() only processes | ||
69 | receive events. | ||
70 | Tests with the tulip driver indicated slightly increased latency if | ||
71 | all of the interrupt handler is moved to dev->poll(). Also MII handling | ||
72 | gets a little trickier. | ||
73 | The example used in this document is to move the receive processing only | ||
74 | to dev->poll(); this is shown with the patch for the tulip driver. | ||
75 | For an example of code that moves all the interrupt driver to | ||
76 | dev->poll() look at the ported e1000 code. | ||
77 | |||
78 | There are caveats that might force you to go with moving everything to | ||
79 | dev->poll(). Different NICs work differently depending on their status/event | ||
80 | acknowledgement setup. | ||
81 | There are two types of event register ACK mechanisms. | ||
82 | I) what is known as Clear-on-read (COR). | ||
83 | when you read the status/event register, it clears everything! | ||
84 | The natsemi and sunbmac NICs are known to do this. | ||
85 | In this case your only choice is to move all to dev->poll() | ||
86 | |||
87 | II) Clear-on-write (COW) | ||
88 | i) you clear the status by writing a 1 in the bit-location you want. | ||
89 | These are the majority of the NICs and work the best with NAPI. | ||
90 | Put only receive events in dev->poll(); leave the rest in | ||
91 | the old interrupt handler. | ||
92 | ii) whatever you write in the status register clears every thing ;-> | ||
93 | Cant seem to find any supported by Linux which do this. If | ||
94 | someone knows such a chip email us please. | ||
95 | Move all to dev->poll() | ||
96 | |||
97 | C) Ability to detect new work correctly. | ||
98 | NAPI works by shutting down event interrupts when there's work and | ||
99 | turning them on when there's none. | ||
100 | New packets might show up in the small window while interrupts were being | ||
101 | re-enabled (refer to appendix 2). A packet might sneak in during the period | ||
102 | we are enabling interrupts. We only get to know about such a packet when the | ||
103 | next new packet arrives and generates an interrupt. | ||
104 | Essentially, there is a small window of opportunity for a race condition | ||
105 | which for clarity we'll refer to as the "rotting packet". | ||
106 | |||
107 | This is a very important topic and appendix 2 is dedicated for more | ||
108 | discussion. | ||
109 | |||
110 | Locking rules and environmental guarantees | ||
111 | ========================================== | ||
112 | |||
113 | -Guarantee: Only one CPU at any time can call dev->poll(); this is because | ||
114 | only one CPU can pick the initial interrupt and hence the initial | ||
115 | netif_rx_schedule(dev); | ||
116 | - The core layer invokes devices to send packets in a round robin format. | ||
117 | This implies receive is totally lockless because of the guarantee that only | ||
118 | one CPU is executing it. | ||
119 | - contention can only be the result of some other CPU accessing the rx | ||
120 | ring. This happens only in close() and suspend() (when these methods | ||
121 | try to clean the rx ring); | ||
122 | ****guarantee: driver authors need not worry about this; synchronization | ||
123 | is taken care for them by the top net layer. | ||
124 | -local interrupts are enabled (if you dont move all to dev->poll()). For | ||
125 | example link/MII and txcomplete continue functioning just same old way. | ||
126 | This improves the latency of processing these events. It is also assumed that | ||
127 | the receive interrupt is the largest cause of noise. Note this might not | ||
128 | always be true. | ||
129 | [according to Manfred Spraul, the winbond insists on sending one | ||
130 | txmitcomplete interrupt for each packet (although this can be mitigated)]. | ||
131 | For these broken drivers, move all to dev->poll(). | ||
132 | |||
133 | For the rest of this text, we'll assume that dev->poll() only | ||
134 | processes receive events. | ||
135 | |||
136 | new methods introduce by NAPI | ||
137 | ============================= | ||
138 | |||
139 | a) netif_rx_schedule(dev) | ||
140 | Called by an IRQ handler to schedule a poll for device | ||
141 | |||
142 | b) netif_rx_schedule_prep(dev) | ||
143 | puts the device in a state which allows for it to be added to the | ||
144 | CPU polling list if it is up and running. You can look at this as | ||
145 | the first half of netif_rx_schedule(dev) above; the second half | ||
146 | being c) below. | ||
147 | |||
148 | c) __netif_rx_schedule(dev) | ||
149 | Add device to the poll list for this CPU; assuming that _prep above | ||
150 | has already been called and returned 1. | ||
151 | |||
152 | d) netif_rx_reschedule(dev, undo) | ||
153 | Called to reschedule polling for device specifically for some | ||
154 | deficient hardware. Read Appendix 2 for more details. | ||
155 | |||
156 | e) netif_rx_complete(dev) | ||
157 | |||
158 | Remove interface from the CPU poll list: it must be in the poll list | ||
159 | on current cpu. This primitive is called by dev->poll(), when | ||
160 | it completes its work. The device cannot be out of poll list at this | ||
161 | call, if it is then clearly it is a BUG(). You'll know ;-> | ||
162 | |||
163 | All of the above methods are used below, so keep reading for clarity. | ||
164 | |||
165 | Device driver changes to be made when porting NAPI | ||
166 | ================================================== | ||
167 | |||
168 | Below we describe what kind of changes are required for NAPI to work. | ||
169 | |||
170 | 1) introduction of dev->poll() method | ||
171 | ===================================== | ||
172 | |||
173 | This is the method that is invoked by the network core when it requests | ||
174 | for new packets from the driver. A driver is allowed to send upto | ||
175 | dev->quota packets by the current CPU before yielding to the network | ||
176 | subsystem (so other devices can also get opportunity to send to the stack). | ||
177 | |||
178 | dev->poll() prototype looks as follows: | ||
179 | int my_poll(struct net_device *dev, int *budget) | ||
180 | |||
181 | budget is the remaining number of packets the network subsystem on the | ||
182 | current CPU can send up the stack before yielding to other system tasks. | ||
183 | *Each driver is responsible for decrementing budget by the total number of | ||
184 | packets sent. | ||
185 | Total number of packets cannot exceed dev->quota. | ||
186 | |||
187 | dev->poll() method is invoked by the top layer, the driver just sends if it | ||
188 | can to the stack the packet quantity requested. | ||
189 | |||
190 | more on dev->poll() below after the interrupt changes are explained. | ||
191 | |||
192 | 2) registering dev->poll() method | ||
193 | =================================== | ||
194 | |||
195 | dev->poll should be set in the dev->probe() method. | ||
196 | e.g: | ||
197 | dev->open = my_open; | ||
198 | . | ||
199 | . | ||
200 | /* two new additions */ | ||
201 | /* first register my poll method */ | ||
202 | dev->poll = my_poll; | ||
203 | /* next register my weight/quanta; can be overridden in /proc */ | ||
204 | dev->weight = 16; | ||
205 | . | ||
206 | . | ||
207 | dev->stop = my_close; | ||
208 | |||
209 | |||
210 | |||
211 | 3) scheduling dev->poll() | ||
212 | ============================= | ||
213 | This involves modifying the interrupt handler and the code | ||
214 | path which takes the packet off the NIC and sends them to the | ||
215 | stack. | ||
216 | |||
217 | it's important at this point to introduce the classical D Becker | ||
218 | interrupt processor: | ||
219 | |||
220 | ------------------ | ||
221 | static irqreturn_t | ||
222 | netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) | ||
223 | { | ||
224 | |||
225 | struct net_device *dev = (struct net_device *)dev_instance; | ||
226 | struct my_private *tp = (struct my_private *)dev->priv; | ||
227 | |||
228 | int work_count = my_work_count; | ||
229 | status = read_interrupt_status_reg(); | ||
230 | if (status == 0) | ||
231 | return IRQ_NONE; /* Shared IRQ: not us */ | ||
232 | if (status == 0xffff) | ||
233 | return IRQ_HANDLED; /* Hot unplug */ | ||
234 | if (status & error) | ||
235 | do_some_error_handling() | ||
236 | |||
237 | do { | ||
238 | acknowledge_ints_ASAP(); | ||
239 | |||
240 | if (status & link_interrupt) { | ||
241 | spin_lock(&tp->link_lock); | ||
242 | do_some_link_stat_stuff(); | ||
243 | spin_lock(&tp->link_lock); | ||
244 | } | ||
245 | |||
246 | if (status & rx_interrupt) { | ||
247 | receive_packets(dev); | ||
248 | } | ||
249 | |||
250 | if (status & rx_nobufs) { | ||
251 | make_rx_buffs_avail(); | ||
252 | } | ||
253 | |||
254 | if (status & tx_related) { | ||
255 | spin_lock(&tp->lock); | ||
256 | tx_ring_free(dev); | ||
257 | if (tx_died) | ||
258 | restart_tx(); | ||
259 | spin_unlock(&tp->lock); | ||
260 | } | ||
261 | |||
262 | status = read_interrupt_status_reg(); | ||
263 | |||
264 | } while (!(status & error) || more_work_to_be_done); | ||
265 | return IRQ_HANDLED; | ||
266 | } | ||
267 | |||
268 | ---------------------------------------------------------------------- | ||
269 | |||
270 | We now change this to what is shown below to NAPI-enable it: | ||
271 | |||
272 | ---------------------------------------------------------------------- | ||
273 | static irqreturn_t | ||
274 | netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) | ||
275 | { | ||
276 | struct net_device *dev = (struct net_device *)dev_instance; | ||
277 | struct my_private *tp = (struct my_private *)dev->priv; | ||
278 | |||
279 | status = read_interrupt_status_reg(); | ||
280 | if (status == 0) | ||
281 | return IRQ_NONE; /* Shared IRQ: not us */ | ||
282 | if (status == 0xffff) | ||
283 | return IRQ_HANDLED; /* Hot unplug */ | ||
284 | if (status & error) | ||
285 | do_some_error_handling(); | ||
286 | |||
287 | do { | ||
288 | /************************ start note *********************************/ | ||
289 | acknowledge_ints_ASAP(); // dont ack rx and rxnobuff here | ||
290 | /************************ end note *********************************/ | ||
291 | |||
292 | if (status & link_interrupt) { | ||
293 | spin_lock(&tp->link_lock); | ||
294 | do_some_link_stat_stuff(); | ||
295 | spin_unlock(&tp->link_lock); | ||
296 | } | ||
297 | /************************ start note *********************************/ | ||
298 | if (status & rx_interrupt || (status & rx_nobuffs)) { | ||
299 | if (netif_rx_schedule_prep(dev)) { | ||
300 | |||
301 | /* disable interrupts caused | ||
302 | * by arriving packets */ | ||
303 | disable_rx_and_rxnobuff_ints(); | ||
304 | /* tell system we have work to be done. */ | ||
305 | __netif_rx_schedule(dev); | ||
306 | } else { | ||
307 | printk("driver bug! interrupt while in poll\n"); | ||
308 | /* FIX by disabling interrupts */ | ||
309 | disable_rx_and_rxnobuff_ints(); | ||
310 | } | ||
311 | } | ||
312 | /************************ end note note *********************************/ | ||
313 | |||
314 | if (status & tx_related) { | ||
315 | spin_lock(&tp->lock); | ||
316 | tx_ring_free(dev); | ||
317 | |||
318 | if (tx_died) | ||
319 | restart_tx(); | ||
320 | spin_unlock(&tp->lock); | ||
321 | } | ||
322 | |||
323 | status = read_interrupt_status_reg(); | ||
324 | |||
325 | /************************ start note *********************************/ | ||
326 | } while (!(status & error) || more_work_to_be_done(status)); | ||
327 | /************************ end note note *********************************/ | ||
328 | return IRQ_HANDLED; | ||
329 | } | ||
330 | |||
331 | --------------------------------------------------------------------- | ||
332 | |||
333 | |||
334 | We note several things from above: | ||
335 | |||
336 | I) Any interrupt source which is caused by arriving packets is now | ||
337 | turned off when it occurs. Depending on the hardware, there could be | ||
338 | several reasons that arriving packets would cause interrupts; these are the | ||
339 | interrupt sources we wish to avoid. The two common ones are a) a packet | ||
340 | arriving (rxint) b) a packet arriving and finding no DMA buffers available | ||
341 | (rxnobuff) . | ||
342 | This means also acknowledge_ints_ASAP() will not clear the status | ||
343 | register for those two items above; clearing is done in the place where | ||
344 | proper work is done within NAPI; at the poll() and refill_rx_ring() | ||
345 | discussed further below. | ||
346 | netif_rx_schedule_prep() returns 1 if device is in running state and | ||
347 | gets successfully added to the core poll list. If we get a zero value | ||
348 | we can _almost_ assume are already added to the list (instead of not running. | ||
349 | Logic based on the fact that you shouldn't get interrupt if not running) | ||
350 | We rectify this by disabling rx and rxnobuf interrupts. | ||
351 | |||
352 | II) that receive_packets(dev) and make_rx_buffs_avail() may have disappeared. | ||
353 | These functionalities are still around actually...... | ||
354 | |||
355 | infact, receive_packets(dev) is very close to my_poll() and | ||
356 | make_rx_buffs_avail() is invoked from my_poll() | ||
357 | |||
358 | 4) converting receive_packets() to dev->poll() | ||
359 | =============================================== | ||
360 | |||
361 | We need to convert the classical D Becker receive_packets(dev) to my_poll() | ||
362 | |||
363 | First the typical receive_packets() below: | ||
364 | ------------------------------------------------------------------- | ||
365 | |||
366 | /* this is called by interrupt handler */ | ||
367 | static void receive_packets (struct net_device *dev) | ||
368 | { | ||
369 | |||
370 | struct my_private *tp = (struct my_private *)dev->priv; | ||
371 | rx_ring = tp->rx_ring; | ||
372 | cur_rx = tp->cur_rx; | ||
373 | int entry = cur_rx % RX_RING_SIZE; | ||
374 | int received = 0; | ||
375 | int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx; | ||
376 | |||
377 | while (rx_ring_not_empty) { | ||
378 | u32 rx_status; | ||
379 | unsigned int rx_size; | ||
380 | unsigned int pkt_size; | ||
381 | struct sk_buff *skb; | ||
382 | /* read size+status of next frame from DMA ring buffer */ | ||
383 | /* the number 16 and 4 are just examples */ | ||
384 | rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); | ||
385 | rx_size = rx_status >> 16; | ||
386 | pkt_size = rx_size - 4; | ||
387 | |||
388 | /* process errors */ | ||
389 | if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || | ||
390 | (!(rx_status & RxStatusOK))) { | ||
391 | netdrv_rx_err (rx_status, dev, tp, ioaddr); | ||
392 | return; | ||
393 | } | ||
394 | |||
395 | if (--rx_work_limit < 0) | ||
396 | break; | ||
397 | |||
398 | /* grab a skb */ | ||
399 | skb = dev_alloc_skb (pkt_size + 2); | ||
400 | if (skb) { | ||
401 | . | ||
402 | . | ||
403 | netif_rx (skb); | ||
404 | . | ||
405 | . | ||
406 | } else { /* OOM */ | ||
407 | /*seems very driver specific ... some just pass | ||
408 | whatever is on the ring already. */ | ||
409 | } | ||
410 | |||
411 | /* move to the next skb on the ring */ | ||
412 | entry = (++tp->cur_rx) % RX_RING_SIZE; | ||
413 | received++ ; | ||
414 | |||
415 | } | ||
416 | |||
417 | /* store current ring pointer state */ | ||
418 | tp->cur_rx = cur_rx; | ||
419 | |||
420 | /* Refill the Rx ring buffers if they are needed */ | ||
421 | refill_rx_ring(); | ||
422 | . | ||
423 | . | ||
424 | |||
425 | } | ||
426 | ------------------------------------------------------------------- | ||
427 | We change it to a new one below; note the additional parameter in | ||
428 | the call. | ||
429 | |||
430 | ------------------------------------------------------------------- | ||
431 | |||
432 | /* this is called by the network core */ | ||
433 | static int my_poll (struct net_device *dev, int *budget) | ||
434 | { | ||
435 | |||
436 | struct my_private *tp = (struct my_private *)dev->priv; | ||
437 | rx_ring = tp->rx_ring; | ||
438 | cur_rx = tp->cur_rx; | ||
439 | int entry = cur_rx % RX_BUF_LEN; | ||
440 | /* maximum packets to send to the stack */ | ||
441 | /************************ note note *********************************/ | ||
442 | int rx_work_limit = dev->quota; | ||
443 | |||
444 | /************************ end note note *********************************/ | ||
445 | do { // outer beginning loop starts here | ||
446 | |||
447 | clear_rx_status_register_bit(); | ||
448 | |||
449 | while (rx_ring_not_empty) { | ||
450 | u32 rx_status; | ||
451 | unsigned int rx_size; | ||
452 | unsigned int pkt_size; | ||
453 | struct sk_buff *skb; | ||
454 | /* read size+status of next frame from DMA ring buffer */ | ||
455 | /* the number 16 and 4 are just examples */ | ||
456 | rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); | ||
457 | rx_size = rx_status >> 16; | ||
458 | pkt_size = rx_size - 4; | ||
459 | |||
460 | /* process errors */ | ||
461 | if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || | ||
462 | (!(rx_status & RxStatusOK))) { | ||
463 | netdrv_rx_err (rx_status, dev, tp, ioaddr); | ||
464 | return 1; | ||
465 | } | ||
466 | |||
467 | /************************ note note *********************************/ | ||
468 | if (--rx_work_limit < 0) { /* we got packets, but no quota */ | ||
469 | /* store current ring pointer state */ | ||
470 | tp->cur_rx = cur_rx; | ||
471 | |||
472 | /* Refill the Rx ring buffers if they are needed */ | ||
473 | refill_rx_ring(dev); | ||
474 | goto not_done; | ||
475 | } | ||
476 | /********************** end note **********************************/ | ||
477 | |||
478 | /* grab a skb */ | ||
479 | skb = dev_alloc_skb (pkt_size + 2); | ||
480 | if (skb) { | ||
481 | . | ||
482 | . | ||
483 | /************************ note note *********************************/ | ||
484 | netif_receive_skb (skb); | ||
485 | /********************** end note **********************************/ | ||
486 | . | ||
487 | . | ||
488 | } else { /* OOM */ | ||
489 | /*seems very driver specific ... common is just pass | ||
490 | whatever is on the ring already. */ | ||
491 | } | ||
492 | |||
493 | /* move to the next skb on the ring */ | ||
494 | entry = (++tp->cur_rx) % RX_RING_SIZE; | ||
495 | received++ ; | ||
496 | |||
497 | } | ||
498 | |||
499 | /* store current ring pointer state */ | ||
500 | tp->cur_rx = cur_rx; | ||
501 | |||
502 | /* Refill the Rx ring buffers if they are needed */ | ||
503 | refill_rx_ring(dev); | ||
504 | |||
505 | /* no packets on ring; but new ones can arrive since we last | ||
506 | checked */ | ||
507 | status = read_interrupt_status_reg(); | ||
508 | if (rx status is not set) { | ||
509 | /* If something arrives in this narrow window, | ||
510 | an interrupt will be generated */ | ||
511 | goto done; | ||
512 | } | ||
513 | /* done! at least that's what it looks like ;-> | ||
514 | if new packets came in after our last check on status bits | ||
515 | they'll be caught by the while check and we go back and clear them | ||
516 | since we havent exceeded our quota */ | ||
517 | } while (rx_status_is_set); | ||
518 | |||
519 | done: | ||
520 | |||
521 | /************************ note note *********************************/ | ||
522 | dev->quota -= received; | ||
523 | *budget -= received; | ||
524 | |||
525 | /* If RX ring is not full we are out of memory. */ | ||
526 | if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) | ||
527 | goto oom; | ||
528 | |||
529 | /* we are happy/done, no more packets on ring; put us back | ||
530 | to where we can start processing interrupts again */ | ||
531 | netif_rx_complete(dev); | ||
532 | enable_rx_and_rxnobuf_ints(); | ||
533 | |||
534 | /* The last op happens after poll completion. Which means the following: | ||
535 | * 1. it can race with disabling irqs in irq handler (which are done to | ||
536 | * schedule polls) | ||
537 | * 2. it can race with dis/enabling irqs in other poll threads | ||
538 | * 3. if an irq raised after the beginning of the outer beginning | ||
539 | * loop (marked in the code above), it will be immediately | ||
540 | * triggered here. | ||
541 | * | ||
542 | * Summarizing: the logic may result in some redundant irqs both | ||
543 | * due to races in masking and due to too late acking of already | ||
544 | * processed irqs. The good news: no events are ever lost. | ||
545 | */ | ||
546 | |||
547 | return 0; /* done */ | ||
548 | |||
549 | not_done: | ||
550 | if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || | ||
551 | tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) | ||
552 | refill_rx_ring(dev); | ||
553 | |||
554 | if (!received) { | ||
555 | printk("received==0\n"); | ||
556 | received = 1; | ||
557 | } | ||
558 | dev->quota -= received; | ||
559 | *budget -= received; | ||
560 | return 1; /* not_done */ | ||
561 | |||
562 | oom: | ||
563 | /* Start timer, stop polling, but do not enable rx interrupts. */ | ||
564 | start_poll_timer(dev); | ||
565 | return 0; /* we'll take it from here so tell core "done"*/ | ||
566 | |||
567 | /************************ End note note *********************************/ | ||
568 | } | ||
569 | ------------------------------------------------------------------- | ||
570 | |||
571 | From above we note that: | ||
572 | 0) rx_work_limit = dev->quota | ||
573 | 1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when | ||
574 | it does the work. | ||
575 | 2) We have a done and not_done state. | ||
576 | 3) instead of netif_rx() we call netif_receive_skb() to pass the skb. | ||
577 | 4) we have a new way of handling oom condition | ||
578 | 5) A new outer for (;;) loop has been added. This serves the purpose of | ||
579 | ensuring that if a new packet has come in, after we are all set and done, | ||
580 | and we have not exceeded our quota that we continue sending packets up. | ||
581 | |||
582 | |||
583 | ----------------------------------------------------------- | ||
584 | Poll timer code will need to do the following: | ||
585 | |||
586 | a) | ||
587 | |||
588 | if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || | ||
589 | tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) | ||
590 | refill_rx_ring(dev); | ||
591 | |||
592 | /* If RX ring is not full we are still out of memory. | ||
593 | Restart the timer again. Else we re-add ourselves | ||
594 | to the master poll list. | ||
595 | */ | ||
596 | |||
597 | if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) | ||
598 | restart_timer(); | ||
599 | |||
600 | else netif_rx_schedule(dev); /* we are back on the poll list */ | ||
601 | |||
602 | 5) dev->close() and dev->suspend() issues | ||
603 | ========================================== | ||
604 | The driver writer needn't worry about this; the top net layer takes | ||
605 | care of it. | ||
606 | |||
607 | 6) Adding new Stats to /proc | ||
608 | ============================= | ||
609 | In order to debug some of the new features, we introduce new stats | ||
610 | that need to be collected. | ||
611 | TODO: Fill this later. | ||
612 | |||
613 | APPENDIX 1: discussion on using ethernet HW FC | ||
614 | ============================================== | ||
615 | Most chips with FC only send a pause packet when they run out of Rx buffers. | ||
616 | Since packets are pulled off the DMA ring by a softirq in NAPI, | ||
617 | if the system is slow in grabbing them and we have a high input | ||
618 | rate (faster than the system's capacity to remove packets), then theoretically | ||
619 | there will only be one rx interrupt for all packets during a given packetstorm. | ||
620 | Under low load, we might have a single interrupt per packet. | ||
621 | FC should be programmed to apply in the case when the system cant pull out | ||
622 | packets fast enough i.e send a pause only when you run out of rx buffers. | ||
623 | Note FC in itself is a good solution but we have found it to not be | ||
624 | much of a commodity feature (both in NICs and switches) and hence falls | ||
625 | under the same category as using NIC based mitigation. Also, experiments | ||
626 | indicate that it's much harder to resolve the resource allocation | ||
627 | issue (aka lazy receiving that NAPI offers) and hence quantify its usefulness | ||
628 | proved harder. In any case, FC works even better with NAPI but is not | ||
629 | necessary. | ||
630 | |||
631 | |||
632 | APPENDIX 2: the "rotting packet" race-window avoidance scheme | ||
633 | ============================================================= | ||
634 | |||
635 | There are two types of associations seen here | ||
636 | |||
637 | 1) status/int which honors level triggered IRQ | ||
638 | |||
639 | If a status bit for receive or rxnobuff is set and the corresponding | ||
640 | interrupt-enable bit is not on, then no interrupts will be generated. However, | ||
641 | as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is | ||
642 | generated. [assuming the status bit was not turned off]. | ||
643 | Generally the concept of level triggered IRQs in association with a status and | ||
644 | interrupt-enable CSR register set is used to avoid the race. | ||
645 | |||
646 | If we take the example of the tulip: | ||
647 | "pending work" is indicated by the status bit(CSR5 in tulip). | ||
648 | the corresponding interrupt bit (CSR7 in tulip) might be turned off (but | ||
649 | the CSR5 will continue to be turned on with new packet arrivals even if | ||
650 | we clear it the first time) | ||
651 | Very important is the fact that if we turn on the interrupt bit on when | ||
652 | status is set that an immediate irq is triggered. | ||
653 | |||
654 | If we cleared the rx ring and proclaimed there was "no more work | ||
655 | to be done" and then went on to do a few other things; then when we enable | ||
656 | interrupts, there is a possibility that a new packet might sneak in during | ||
657 | this phase. It helps to look at the pseudo code for the tulip poll | ||
658 | routine: | ||
659 | |||
660 | -------------------------- | ||
661 | do { | ||
662 | ACK; | ||
663 | while (ring_is_not_empty()) { | ||
664 | work-work-work | ||
665 | if quota is exceeded: exit, no touching irq status/mask | ||
666 | } | ||
667 | /* No packets, but new can arrive while we are doing this*/ | ||
668 | CSR5 := read | ||
669 | if (CSR5 is not set) { | ||
670 | /* If something arrives in this narrow window here, | ||
671 | * where the comments are ;-> irq will be generated */ | ||
672 | unmask irqs; | ||
673 | exit poll; | ||
674 | } | ||
675 | } while (rx_status_is_set); | ||
676 | ------------------------ | ||
677 | |||
678 | CSR5 bit of interest is only the rx status. | ||
679 | If you look at the last if statement: | ||
680 | you just finished grabbing all the packets from the rx ring .. you check if | ||
681 | status bit says there are more packets just in ... it says none; you then | ||
682 | enable rx interrupts again; if a new packet just came in during this check, | ||
683 | we are counting that CSR5 will be set in that small window of opportunity | ||
684 | and that by re-enabling interrupts, we would actually trigger an interrupt | ||
685 | to register the new packet for processing. | ||
686 | |||
687 | [The above description nay be very verbose, if you have better wording | ||
688 | that will make this more understandable, please suggest it.] | ||
689 | |||
690 | 2) non-capable hardware | ||
691 | |||
692 | These do not generally respect level triggered IRQs. Normally, | ||
693 | irqs may be lost while being masked and the only way to leave poll is to do | ||
694 | a double check for new input after netif_rx_complete() is invoked | ||
695 | and re-enable polling (after seeing this new input). | ||
696 | |||
697 | Sample code: | ||
698 | |||
699 | --------- | ||
700 | . | ||
701 | . | ||
702 | restart_poll: | ||
703 | while (ring_is_not_empty()) { | ||
704 | work-work-work | ||
705 | if quota is exceeded: exit, not touching irq status/mask | ||
706 | } | ||
707 | . | ||
708 | . | ||
709 | . | ||
710 | enable_rx_interrupts() | ||
711 | netif_rx_complete(dev); | ||
712 | if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) { | ||
713 | disable_rx_and_rxnobufs() | ||
714 | goto restart_poll | ||
715 | } while (rx_status_is_set); | ||
716 | --------- | ||
717 | |||
718 | Basically netif_rx_complete() removes us from the poll list, but because a | ||
719 | new packet which will never be caught due to the possibility of a race | ||
720 | might come in, we attempt to re-add ourselves to the poll list. | ||
721 | |||
722 | |||
723 | |||
724 | |||
725 | APPENDIX 3: Scheduling issues. | ||
726 | ============================== | ||
727 | As seen NAPI moves processing to softirq level. Linux uses the ksoftirqd as the | ||
728 | general solution to schedule softirq's to run before next interrupt and by putting | ||
729 | them under scheduler control. Also this prevents consecutive softirq's from | ||
730 | monopolize the CPU. This also have the effect that the priority of ksoftirq needs | ||
731 | to be considered when running very CPU-intensive applications and networking to | ||
732 | get the proper balance of softirq/user balance. Increasing ksoftirq priority to 0 | ||
733 | (eventually more) is reported cure problems with low network performance at high | ||
734 | CPU load. | ||
735 | |||
736 | Most used processes in a GIGE router: | ||
737 | USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND | ||
738 | root 3 0.2 0.0 0 0 ? RWN Aug 15 602:00 (ksoftirqd_CPU0) | ||
739 | root 232 0.0 7.9 41400 40884 ? S Aug 15 74:12 gated | ||
740 | |||
741 | -------------------------------------------------------------------- | ||
742 | |||
743 | relevant sites: | ||
744 | ================== | ||
745 | ftp://robur.slu.se/pub/Linux/net-development/NAPI/ | ||
746 | |||
747 | |||
748 | -------------------------------------------------------------------- | ||
749 | TODO: Write net-skeleton.c driver. | ||
750 | ------------------------------------------------------------- | ||
751 | |||
752 | Authors: | ||
753 | ======== | ||
754 | Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> | ||
755 | Jamal Hadi Salim <hadi@cyberus.ca> | ||
756 | Robert Olsson <Robert.Olsson@data.slu.se> | ||
757 | |||
758 | Acknowledgements: | ||
759 | ================ | ||
760 | People who made this document better: | ||
761 | |||
762 | Lennert Buytenhek <buytenh@gnu.org> | ||
763 | Andrew Morton <akpm@zip.com.au> | ||
764 | Manfred Spraul <manfred@colorfullife.com> | ||
765 | Donald Becker <becker@scyld.com> | ||
766 | Jeff Garzik <jgarzik@pobox.com> | ||
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index 4504cc59e405..afb66f9a8aff 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt | |||
@@ -38,8 +38,13 @@ Socket options | |||
38 | DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of | 38 | DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of |
39 | service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, | 39 | service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, |
40 | the socket will fall back to 0 (which means that no meaningful service code | 40 | the socket will fall back to 0 (which means that no meaningful service code |
41 | is present). Connecting sockets set at most one service option; for | 41 | is present). On active sockets this is set before connect(); specifying more |
42 | listening sockets, multiple service codes can be specified. | 42 | than one code has no effect (all subsequent service codes are ignored). The |
43 | case is different for passive sockets, where multiple service codes (up to 32) | ||
44 | can be set before calling bind(). | ||
45 | |||
46 | DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet | ||
47 | size (application payload size) in bytes, see RFC 4340, section 14. | ||
43 | 48 | ||
44 | DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the | 49 | DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the |
45 | partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums | 50 | partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums |
@@ -50,12 +55,13 @@ be enabled at the receiver, too with suitable choice of CsCov. | |||
50 | DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the | 55 | DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the |
51 | range 0..15 are acceptable. The default setting is 0 (full coverage), | 56 | range 0..15 are acceptable. The default setting is 0 (full coverage), |
52 | values between 1..15 indicate partial coverage. | 57 | values between 1..15 indicate partial coverage. |
53 | DCCP_SOCKOPT_SEND_CSCOV is for the receiver and has a different meaning: it | 58 | DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it |
54 | sets a threshold, where again values 0..15 are acceptable. The default | 59 | sets a threshold, where again values 0..15 are acceptable. The default |
55 | of 0 means that all packets with a partial coverage will be discarded. | 60 | of 0 means that all packets with a partial coverage will be discarded. |
56 | Values in the range 1..15 indicate that packets with minimally such a | 61 | Values in the range 1..15 indicate that packets with minimally such a |
57 | coverage value are also acceptable. The higher the number, the more | 62 | coverage value are also acceptable. The higher the number, the more |
58 | restrictive this setting (see [RFC 4340, sec. 9.2.1]). | 63 | restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage |
64 | settings are inherited to the child socket after accept(). | ||
59 | 65 | ||
60 | The following two options apply to CCID 3 exclusively and are getsockopt()-only. | 66 | The following two options apply to CCID 3 exclusively and are getsockopt()-only. |
61 | In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned. | 67 | In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned. |
@@ -112,9 +118,14 @@ tx_qlen = 5 | |||
112 | The size of the transmit buffer in packets. A value of 0 corresponds | 118 | The size of the transmit buffer in packets. A value of 0 corresponds |
113 | to an unbounded transmit buffer. | 119 | to an unbounded transmit buffer. |
114 | 120 | ||
121 | sync_ratelimit = 125 ms | ||
122 | The timeout between subsequent DCCP-Sync packets sent in response to | ||
123 | sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit | ||
124 | of this parameter is milliseconds; a value of 0 disables rate-limiting. | ||
125 | |||
115 | Notes | 126 | Notes |
116 | ===== | 127 | ===== |
117 | 128 | ||
118 | DCCP does not travel through NAT successfully at present on many boxes. This is | 129 | DCCP does not travel through NAT successfully at present on many boxes. This is |
119 | because the checksum covers the psuedo-header as per TCP and UDP. Linux NAT | 130 | because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT |
120 | support for DCCP has been added. | 131 | support for DCCP has been added. |
diff --git a/Documentation/networking/dgrs.txt b/Documentation/networking/dgrs.txt deleted file mode 100644 index 1aa1bb3f94ab..000000000000 --- a/Documentation/networking/dgrs.txt +++ /dev/null | |||
@@ -1,52 +0,0 @@ | |||
1 | The Digi International RightSwitch SE-X (dgrs) Device Driver | ||
2 | |||
3 | This is a Linux driver for the Digi International RightSwitch SE-X | ||
4 | EISA and PCI boards. These are 4 (EISA) or 6 (PCI) port Ethernet | ||
5 | switches and a NIC combined into a single board. This driver can | ||
6 | be compiled into the kernel statically or as a loadable module. | ||
7 | |||
8 | There is also a companion management tool, called "xrightswitch". | ||
9 | The management tool lets you watch the performance graphically, | ||
10 | as well as set the SNMP agent IP and IPX addresses, IEEE Spanning | ||
11 | Tree, and Aging time. These can also be set from the command line | ||
12 | when the driver is loaded. The driver command line options are: | ||
13 | |||
14 | debug=NNN Debug printing level | ||
15 | dma=0/1 Disable/Enable DMA on PCI card | ||
16 | spantree=0/1 Disable/Enable IEEE spanning tree | ||
17 | hashexpire=NNN Change address aging time (default 300 seconds) | ||
18 | ipaddr=A,B,C,D Set SNMP agent IP address i.e. 199,86,8,221 | ||
19 | iptrap=A,B,C,D Set SNMP agent IP trap address i.e. 199,86,8,221 | ||
20 | ipxnet=NNN Set SNMP agent IPX network number | ||
21 | nicmode=0/1 Disable/Enable multiple NIC mode | ||
22 | |||
23 | There is also a tool for setting up input and output packet filters | ||
24 | on each port, called "dgrsfilt". | ||
25 | |||
26 | Both the management tool and the filtering tool are available | ||
27 | separately from the following FTP site: | ||
28 | |||
29 | ftp://ftp.dgii.com/drivers/rightswitch/linux/ | ||
30 | |||
31 | When nicmode=1, the board and driver operate as 4 or 6 individual | ||
32 | NIC ports (eth0...eth5) instead of as a switch. All switching | ||
33 | functions are disabled. In the future, the board firmware may include | ||
34 | a routing cache when in this mode. | ||
35 | |||
36 | Copyright 1995-1996 Digi International Inc. | ||
37 | |||
38 | This software may be used and distributed according to the terms | ||
39 | of the GNU General Public License, incorporated herein by reference. | ||
40 | |||
41 | For information on purchasing a RightSwitch SE-4 or SE-6 | ||
42 | board, please contact Digi's sales department at 1-612-912-3444 | ||
43 | or 1-800-DIGIBRD. Outside the U.S., please check our Web page at: | ||
44 | |||
45 | http://www.dgii.com | ||
46 | |||
47 | for sales offices worldwide. Tech support is also available through | ||
48 | the channels listed on the Web site, although as long as I am | ||
49 | employed on networking products at Digi I will be happy to provide | ||
50 | any bug fixes that may be needed. | ||
51 | |||
52 | -Rick Richardson, rick@dgii.com | ||
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 32c2e9da5f3a..6ae2feff3087 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt | |||
@@ -180,13 +180,20 @@ tcp_fin_timeout - INTEGER | |||
180 | to live longer. Cf. tcp_max_orphans. | 180 | to live longer. Cf. tcp_max_orphans. |
181 | 181 | ||
182 | tcp_frto - INTEGER | 182 | tcp_frto - INTEGER |
183 | Enables F-RTO, an enhanced recovery algorithm for TCP retransmission | 183 | Enables Forward RTO-Recovery (F-RTO) defined in RFC4138. |
184 | F-RTO is an enhanced recovery algorithm for TCP retransmission | ||
184 | timeouts. It is particularly beneficial in wireless environments | 185 | timeouts. It is particularly beneficial in wireless environments |
185 | where packet loss is typically due to random radio interference | 186 | where packet loss is typically due to random radio interference |
186 | rather than intermediate router congestion. If set to 1, basic | 187 | rather than intermediate router congestion. FRTO is sender-side |
187 | version is enabled. 2 enables SACK enhanced F-RTO, which is | 188 | only modification. Therefore it does not require any support from |
188 | EXPERIMENTAL. The basic version can be used also when SACK is | 189 | the peer, but in a typical case, however, where wireless link is |
189 | enabled for a flow through tcp_sack sysctl. | 190 | the local access link and most of the data flows downlink, the |
191 | faraway servers should have FRTO enabled to take advantage of it. | ||
192 | If set to 1, basic version is enabled. 2 enables SACK enhanced | ||
193 | F-RTO if flow uses SACK. The basic version can be used also when | ||
194 | SACK is in use though scenario(s) with it exists where FRTO | ||
195 | interacts badly with the packet counting of the SACK enabled TCP | ||
196 | flow. | ||
190 | 197 | ||
191 | tcp_frto_response - INTEGER | 198 | tcp_frto_response - INTEGER |
192 | When F-RTO has detected that a TCP retransmission timeout was | 199 | When F-RTO has detected that a TCP retransmission timeout was |
diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt index 53ef7a06f49c..84906ef3ed6e 100644 --- a/Documentation/networking/mac80211-injection.txt +++ b/Documentation/networking/mac80211-injection.txt | |||
@@ -13,15 +13,35 @@ The radiotap format is discussed in | |||
13 | ./Documentation/networking/radiotap-headers.txt. | 13 | ./Documentation/networking/radiotap-headers.txt. |
14 | 14 | ||
15 | Despite 13 radiotap argument types are currently defined, most only make sense | 15 | Despite 13 radiotap argument types are currently defined, most only make sense |
16 | to appear on received packets. Currently three kinds of argument are used by | 16 | to appear on received packets. The following information is parsed from the |
17 | the injection code, although it knows to skip any other arguments that are | 17 | radiotap headers and used to control injection: |
18 | present (facilitating replay of captured radiotap headers directly): | ||
19 | 18 | ||
20 | - IEEE80211_RADIOTAP_RATE - u8 arg in 500kbps units (0x02 --> 1Mbps) | 19 | * IEEE80211_RADIOTAP_RATE |
21 | 20 | ||
22 | - IEEE80211_RADIOTAP_ANTENNA - u8 arg, 0x00 = ant1, 0x01 = ant2 | 21 | rate in 500kbps units, automatic if invalid or not present |
23 | 22 | ||
24 | - IEEE80211_RADIOTAP_DBM_TX_POWER - u8 arg, dBm | 23 | |
24 | * IEEE80211_RADIOTAP_ANTENNA | ||
25 | |||
26 | antenna to use, automatic if not present | ||
27 | |||
28 | |||
29 | * IEEE80211_RADIOTAP_DBM_TX_POWER | ||
30 | |||
31 | transmit power in dBm, automatic if not present | ||
32 | |||
33 | |||
34 | * IEEE80211_RADIOTAP_FLAGS | ||
35 | |||
36 | IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated | ||
37 | IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available | ||
38 | IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the | ||
39 | current fragmentation threshold. Note that | ||
40 | this flag is only reliable when software | ||
41 | fragmentation is enabled) | ||
42 | |||
43 | The injection code can also skip all other currently defined radiotap fields | ||
44 | facilitating replay of captured radiotap headers directly. | ||
25 | 45 | ||
26 | Here is an example valid radiotap header defining these three parameters | 46 | Here is an example valid radiotap header defining these three parameters |
27 | 47 | ||
diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt index 1caa6c734691..3c2f2b328638 100644 --- a/Documentation/networking/netconsole.txt +++ b/Documentation/networking/netconsole.txt | |||
@@ -3,6 +3,10 @@ started by Ingo Molnar <mingo@redhat.com>, 2001.09.17 | |||
3 | 2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003 | 3 | 2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003 |
4 | 4 | ||
5 | Please send bug reports to Matt Mackall <mpm@selenic.com> | 5 | Please send bug reports to Matt Mackall <mpm@selenic.com> |
6 | and Satyam Sharma <satyam.sharma@gmail.com> | ||
7 | |||
8 | Introduction: | ||
9 | ============= | ||
6 | 10 | ||
7 | This module logs kernel printk messages over UDP allowing debugging of | 11 | This module logs kernel printk messages over UDP allowing debugging of |
8 | problem where disk logging fails and serial consoles are impractical. | 12 | problem where disk logging fails and serial consoles are impractical. |
@@ -13,6 +17,9 @@ the specified interface as soon as possible. While this doesn't allow | |||
13 | capture of early kernel panics, it does capture most of the boot | 17 | capture of early kernel panics, it does capture most of the boot |
14 | process. | 18 | process. |
15 | 19 | ||
20 | Sender and receiver configuration: | ||
21 | ================================== | ||
22 | |||
16 | It takes a string configuration parameter "netconsole" in the | 23 | It takes a string configuration parameter "netconsole" in the |
17 | following format: | 24 | following format: |
18 | 25 | ||
@@ -34,21 +41,113 @@ Examples: | |||
34 | 41 | ||
35 | insmod netconsole netconsole=@/,@10.0.0.2/ | 42 | insmod netconsole netconsole=@/,@10.0.0.2/ |
36 | 43 | ||
44 | It also supports logging to multiple remote agents by specifying | ||
45 | parameters for the multiple agents separated by semicolons and the | ||
46 | complete string enclosed in "quotes", thusly: | ||
47 | |||
48 | modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/" | ||
49 | |||
37 | Built-in netconsole starts immediately after the TCP stack is | 50 | Built-in netconsole starts immediately after the TCP stack is |
38 | initialized and attempts to bring up the supplied dev at the supplied | 51 | initialized and attempts to bring up the supplied dev at the supplied |
39 | address. | 52 | address. |
40 | 53 | ||
41 | The remote host can run either 'netcat -u -l -p <port>' or syslogd. | 54 | The remote host can run either 'netcat -u -l -p <port>' or syslogd. |
42 | 55 | ||
56 | Dynamic reconfiguration: | ||
57 | ======================== | ||
58 | |||
59 | Dynamic reconfigurability is a useful addition to netconsole that enables | ||
60 | remote logging targets to be dynamically added, removed, or have their | ||
61 | parameters reconfigured at runtime from a configfs-based userspace interface. | ||
62 | [ Note that the parameters of netconsole targets that were specified/created | ||
63 | from the boot/module option are not exposed via this interface, and hence | ||
64 | cannot be modified dynamically. ] | ||
65 | |||
66 | To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the | ||
67 | netconsole module (or kernel, if netconsole is built-in). | ||
68 | |||
69 | Some examples follow (where configfs is mounted at the /sys/kernel/config | ||
70 | mountpoint). | ||
71 | |||
72 | To add a remote logging target (target names can be arbitrary): | ||
73 | |||
74 | cd /sys/kernel/config/netconsole/ | ||
75 | mkdir target1 | ||
76 | |||
77 | Note that newly created targets have default parameter values (as mentioned | ||
78 | above) and are disabled by default -- they must first be enabled by writing | ||
79 | "1" to the "enabled" attribute (usually after setting parameters accordingly) | ||
80 | as described below. | ||
81 | |||
82 | To remove a target: | ||
83 | |||
84 | rmdir /sys/kernel/config/netconsole/othertarget/ | ||
85 | |||
86 | The interface exposes these parameters of a netconsole target to userspace: | ||
87 | |||
88 | enabled Is this target currently enabled? (read-write) | ||
89 | dev_name Local network interface name (read-write) | ||
90 | local_port Source UDP port to use (read-write) | ||
91 | remote_port Remote agent's UDP port (read-write) | ||
92 | local_ip Source IP address to use (read-write) | ||
93 | remote_ip Remote agent's IP address (read-write) | ||
94 | local_mac Local interface's MAC address (read-only) | ||
95 | remote_mac Remote agent's MAC address (read-write) | ||
96 | |||
97 | The "enabled" attribute is also used to control whether the parameters of | ||
98 | a target can be updated or not -- you can modify the parameters of only | ||
99 | disabled targets (i.e. if "enabled" is 0). | ||
100 | |||
101 | To update a target's parameters: | ||
102 | |||
103 | cat enabled # check if enabled is 1 | ||
104 | echo 0 > enabled # disable the target (if required) | ||
105 | echo eth2 > dev_name # set local interface | ||
106 | echo 10.0.0.4 > remote_ip # update some parameter | ||
107 | echo cb:a9:87:65:43:21 > remote_mac # update more parameters | ||
108 | echo 1 > enabled # enable target again | ||
109 | |||
110 | You can also update the local interface dynamically. This is especially | ||
111 | useful if you want to use interfaces that have newly come up (and may not | ||
112 | have existed when netconsole was loaded / initialized). | ||
113 | |||
114 | Miscellaneous notes: | ||
115 | ==================== | ||
116 | |||
43 | WARNING: the default target ethernet setting uses the broadcast | 117 | WARNING: the default target ethernet setting uses the broadcast |
44 | ethernet address to send packets, which can cause increased load on | 118 | ethernet address to send packets, which can cause increased load on |
45 | other systems on the same ethernet segment. | 119 | other systems on the same ethernet segment. |
46 | 120 | ||
121 | TIP: some LAN switches may be configured to suppress ethernet broadcasts | ||
122 | so it is advised to explicitly specify the remote agents' MAC addresses | ||
123 | from the config parameters passed to netconsole. | ||
124 | |||
125 | TIP: to find out the MAC address of, say, 10.0.0.2, you may try using: | ||
126 | |||
127 | ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2 | ||
128 | |||
129 | TIP: in case the remote logging agent is on a separate LAN subnet than | ||
130 | the sender, it is suggested to try specifying the MAC address of the | ||
131 | default gateway (you may use /sbin/route -n to find it out) as the | ||
132 | remote MAC address instead. | ||
133 | |||
47 | NOTE: the network device (eth1 in the above case) can run any kind | 134 | NOTE: the network device (eth1 in the above case) can run any kind |
48 | of other network traffic, netconsole is not intrusive. Netconsole | 135 | of other network traffic, netconsole is not intrusive. Netconsole |
49 | might cause slight delays in other traffic if the volume of kernel | 136 | might cause slight delays in other traffic if the volume of kernel |
50 | messages is high, but should have no other impact. | 137 | messages is high, but should have no other impact. |
51 | 138 | ||
139 | NOTE: if you find that the remote logging agent is not receiving or | ||
140 | printing all messages from the sender, it is likely that you have set | ||
141 | the "console_loglevel" parameter (on the sender) to only send high | ||
142 | priority messages to the console. You can change this at runtime using: | ||
143 | |||
144 | dmesg -n 8 | ||
145 | |||
146 | or by specifying "debug" on the kernel command line at boot, to send | ||
147 | all kernel messages to the console. A specific value for this parameter | ||
148 | can also be set using the "loglevel" kernel boot option. See the | ||
149 | dmesg(8) man page and Documentation/kernel-parameters.txt for details. | ||
150 | |||
52 | Netconsole was designed to be as instantaneous as possible, to | 151 | Netconsole was designed to be as instantaneous as possible, to |
53 | enable the logging of even the most critical kernel bugs. It works | 152 | enable the logging of even the most critical kernel bugs. It works |
54 | from IRQ contexts as well, and does not enable interrupts while | 153 | from IRQ contexts as well, and does not enable interrupts while |
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt index 37869295fc70..d0f71fc7f782 100644 --- a/Documentation/networking/netdevices.txt +++ b/Documentation/networking/netdevices.txt | |||
@@ -73,7 +73,8 @@ dev->hard_start_xmit: | |||
73 | has to lock by itself when needed. It is recommended to use a try lock | 73 | has to lock by itself when needed. It is recommended to use a try lock |
74 | for this and return NETDEV_TX_LOCKED when the spin lock fails. | 74 | for this and return NETDEV_TX_LOCKED when the spin lock fails. |
75 | The locking there should also properly protect against | 75 | The locking there should also properly protect against |
76 | set_multicast_list. | 76 | set_multicast_list. Note that the use of NETIF_F_LLTX is deprecated. |
77 | Dont use it for new drivers. | ||
77 | 78 | ||
78 | Context: Process with BHs disabled or BH (timer), | 79 | Context: Process with BHs disabled or BH (timer), |
79 | will be called with interrupts disabled by netconsole. | 80 | will be called with interrupts disabled by netconsole. |
@@ -95,9 +96,13 @@ dev->set_multicast_list: | |||
95 | Synchronization: netif_tx_lock spinlock. | 96 | Synchronization: netif_tx_lock spinlock. |
96 | Context: BHs disabled | 97 | Context: BHs disabled |
97 | 98 | ||
98 | dev->poll: | 99 | struct napi_struct synchronization rules |
99 | Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See | 100 | ======================================== |
100 | dev_close code and comments in net/core/dev.c for more info. | 101 | napi->poll: |
102 | Synchronization: NAPI_STATE_SCHED bit in napi->state. Device | ||
103 | driver's dev->close method will invoke napi_disable() on | ||
104 | all NAPI instances which will do a sleeping poll on the | ||
105 | NAPI_STATE_SCHED napi->state bit, waiting for all pending | ||
106 | NAPI activity to cease. | ||
101 | Context: softirq | 107 | Context: softirq |
102 | will be called with interrupts disabled by netconsole. | 108 | will be called with interrupts disabled by netconsole. |
103 | |||