Diffstat (limited to 'Documentation/networking')
-rw-r--r--  Documentation/networking/NAPI_HOWTO.txt           766
-rw-r--r--  Documentation/networking/dccp.txt                  21
-rw-r--r--  Documentation/networking/dgrs.txt                  52
-rw-r--r--  Documentation/networking/ip-sysctl.txt             17
-rw-r--r--  Documentation/networking/mac80211-injection.txt    32
-rw-r--r--  Documentation/networking/netconsole.txt            99
-rw-r--r--  Documentation/networking/netdevices.txt            15
7 files changed, 163 insertions, 839 deletions
diff --git a/Documentation/networking/NAPI_HOWTO.txt b/Documentation/networking/NAPI_HOWTO.txt
deleted file mode 100644
index 7907435a661c..000000000000
--- a/Documentation/networking/NAPI_HOWTO.txt
+++ /dev/null
@@ -1,766 +0,0 @@
HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique
to improve network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
Ethernet (e1000):

 Psize    Ipps     Tput     Rxint    Txint    Done     Ndone
 ---------------------------------------------------------------
    60    890000   409362       17    27622        7     6823
   128    758150   464364       21     9301       10     7738
   256    445632   774646       42    15507       21    12906
   512    232666   994445   241292    19147   241192     1062
  1024    119061  1000003   872519    19258   872511        0
  1440     85193  1000003   946576    19505   946569        0


Legend:
"Ipps" stands for input packets per second.
"Tput" == packets out of total 1M that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done" == the number of times that poll() managed to pull all
packets out of the rx ring. Note from this that the lower the
load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
the load, the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system cannot handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is a
possibility that under low enough input, you get one poll call for each
input packet, caused by a single interrupt each time. And if the system
cannot handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....
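
To make the interrupt/packet ratio concrete, the sketch below divides the
Rxint column of the table by the number of injected packets. It assumes that
every run injected 1,000,000 packets (the legend only says "out of total 1M");
the struct, macro and function names are made up for this illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: each test run injected one million packets. */
#define PACKETS_PER_RUN 1000000.0

struct table_row {
	unsigned int psize;	/* value from the Psize column */
	unsigned long rxint;	/* value from the Rxint column */
};

/* Psize and Rxint values copied from the table above. */
static const struct table_row rows[] = {
	{ 60, 17 }, { 128, 21 }, { 256, 42 },
	{ 512, 241292 }, { 1024, 872519 }, { 1440, 946576 },
};

/* Rx interrupts per injected packet for one row. */
static double rx_interrupts_per_packet(unsigned long rxint)
{
	return (double)rxint / PACKETS_PER_RUN;
}

/* Print the ratio for every row of the table. */
static void print_ratios(void)
{
	size_t i;

	for (i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%4u: %.6f rx interrupts per packet\n",
		       rows[i].psize, rx_interrupts_per_packet(rows[i].rxint));
}
```

Under that assumption the Psize 60 row works out to 0.000017 interrupts per
packet, while the Psize 1440 row comes close to one interrupt per packet.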


0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
	I) what is known as Clear-on-read (COR).
	When you read the status/event register, it clears everything!
	The natsemi and sunbmac NICs are known to do this.
	In this case your only choice is to move all to dev->poll()

	II) Clear-on-write (COW)
	 i) you clear the status by writing a 1 in the bit-location you want.
		These are the majority of the NICs and work the best with NAPI.
		Put only receive events in dev->poll(); leave the rest in
		the old interrupt handler.
	 ii) whatever you write in the status register clears everything ;->
		We can't seem to find any supported by Linux which do this. If
		someone knows such a chip, email us please.
		Move all to dev->poll()
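
The difference between the COR and COW (type i) mechanisms can be sketched
with a simulated status register. This is a user-space toy, not driver code:
the event bit layout and all names below are hypothetical.

```c
#include <stdint.h>

/* Hypothetical event bits; real NICs define their own layout. */
#define EV_RX   0x1u
#define EV_TX   0x2u
#define EV_LINK 0x4u

struct status_reg {
	uint32_t bits;	/* pending events */
};

/* Clear-on-read: the read returns every pending event and wipes them all. */
static uint32_t cor_read(struct status_reg *reg)
{
	uint32_t pending = reg->bits;

	reg->bits = 0;
	return pending;
}

/* Clear-on-write, type i: reading has no side effect ... */
static uint32_t cow_read(const struct status_reg *reg)
{
	return reg->bits;
}

/* ... and writing a 1 clears only the bits selected by the caller. */
static void cow_ack(struct status_reg *reg, uint32_t mask)
{
	reg->bits &= ~mask;
}
```

With cow_ack() the interrupt handler can acknowledge EV_TX and EV_LINK and
leave EV_RX pending for dev->poll(). With cor_read() the rx event is gone the
moment anything reads the register, which is why everything has to move to
dev->poll() on such hardware.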

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
A new packet might sneak in during the small window in which interrupts
are being re-enabled (refer to appendix 2). We only get to know about such
a packet when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to more
discussion.

Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true.
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
Called by an IRQ handler to schedule a poll for the device.

b) netif_rx_schedule_prep(dev)
Puts the device in a state which allows for it to be added to the
CPU polling list if it is up and running. You can look at this as
the first half of netif_rx_schedule(dev) above; the second half
being c) below.

c) __netif_rx_schedule(dev)
Adds the device to the poll list for this CPU, assuming that _prep above
has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
Called to reschedule polling for the device, specifically for some
deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
Removes the interface from the CPU poll list: it must be in the poll list
on the current CPU. This primitive is called by dev->poll() when
it completes its work. The device cannot be out of the poll list at this
call; if it is, then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.
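
How methods a), b), c) and e) relate can be sketched as a small state
machine. This is a user-space model under simplifying assumptions (one CPU,
no atomics, a flag instead of a real list); struct sim_dev and the sim_*
functions are hypothetical stand-ins, not the kernel implementation.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the parts of struct net_device used here. */
struct sim_dev {
	bool running;		/* device is up and running */
	bool rx_sched;		/* "scheduled for polling" state */
	bool on_poll_list;	/* linked into this CPU's poll list */
};

/* Models b): claim the scheduled state. Returns 1 only for the one
 * caller that wins; 0 if the device is down or already scheduled. */
static int sim_rx_schedule_prep(struct sim_dev *dev)
{
	if (!dev->running || dev->rx_sched)
		return 0;
	dev->rx_sched = true;
	return 1;
}

/* Models c): add to the poll list; prep must have returned 1. */
static void sim_rx_schedule_add(struct sim_dev *dev)
{
	dev->on_poll_list = true;
}

/* Models a): both halves in one call. */
static void sim_rx_schedule(struct sim_dev *dev)
{
	if (sim_rx_schedule_prep(dev))
		sim_rx_schedule_add(dev);
}

/* Models e): leave the poll list. Returns -1 where the kernel would
 * hit BUG(), i.e. when the device is not on the poll list. */
static int sim_rx_complete(struct sim_dev *dev)
{
	if (!dev->on_poll_list)
		return -1;
	dev->on_poll_list = false;
	dev->rx_sched = false;
	return 0;
}
```

The point of splitting a) into b) and c) is that the interrupt handler can
disable the rx interrupt sources between the two halves, as the handler
further below does.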

Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get the opportunity to send to the stack).

The dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
	Total number of packets cannot exceed dev->quota.

The dev->poll() method is invoked by the top layer; the driver just sends
the requested quantity of packets to the stack, if it can.
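
The quota/budget bookkeeping can be illustrated with a toy poll routine.
This is a sketch only: struct quota_dev, its ring_pending counter and
sim_poll() are hypothetical, and capping the work at the smaller of
dev->quota and *budget is an assumption of this example rather than
something this document prescribes.

```c
/* Hypothetical miniature of the state a poll routine works with. */
struct quota_dev {
	int quota;		/* packets this device may still send up */
	int ring_pending;	/* packets waiting in the simulated rx ring */
};

/* Returns 1 if packets remain on the ring ("not done"), else 0. */
static int sim_poll(struct quota_dev *dev, int *budget)
{
	/* Assumption: never exceed either limit. */
	int limit = dev->quota < *budget ? dev->quota : *budget;
	int received = 0;

	while (dev->ring_pending > 0 && received < limit) {
		dev->ring_pending--;	/* "send one packet up the stack" */
		received++;
	}

	/* The driver, not the core, does the accounting. */
	dev->quota -= received;
	*budget -= received;

	return dev->ring_pending > 0;
}
```

With a quota of 16, a budget of 300 and 40 packets pending, the first call
sends 16 packets, leaves 24 on the ring, and returns 1 so that the core
polls the device again later.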

More on dev->poll() below, after the interrupt changes are explained.

2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method.
e.g:
dev->open = my_open;
.
.
/* two new additions */
/* first register my poll method */
dev->poll = my_poll;
/* next register my weight/quanta; can be overridden in /proc */
dev->weight = 16;
.
.
dev->stop = my_close;


3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packet off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;
	int work_count = my_work_count;
	u32 status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
		acknowledge_ints_ASAP();

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}

		if (status & rx_interrupt) {
			receive_packets(dev);
		}

		if (status & rx_nobufs) {
			make_rx_buffs_avail();
		}

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);
			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

	/* keep looping only while there is no error and work remains */
	} while (!(status & error) && more_work_to_be_done);
	return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = (struct net_device *)dev_id;
	struct my_private *tp = (struct my_private *)dev->priv;
	u32 status;

	status = read_interrupt_status_reg();
	if (status == 0)
		return IRQ_NONE;	/* Shared IRQ: not us */
	if (status == 0xffff)
		return IRQ_HANDLED;	/* Hot unplug */
	if (status & error)
		do_some_error_handling();

	do {
/************************ start note *********************************/
		acknowledge_ints_ASAP();  // dont ack rx and rxnobuff here
/************************ end note *********************************/

		if (status & link_interrupt) {
			spin_lock(&tp->link_lock);
			do_some_link_stat_stuff();
			spin_unlock(&tp->link_lock);
		}
/************************ start note *********************************/
		if ((status & rx_interrupt) || (status & rx_nobufs)) {
			if (netif_rx_schedule_prep(dev)) {

				/* disable interrupts caused
				 * by arriving packets */
				disable_rx_and_rxnobuff_ints();
				/* tell system we have work to be done. */
				__netif_rx_schedule(dev);
			} else {
				printk("driver bug! interrupt while in poll\n");
				/* FIX by disabling interrupts */
				disable_rx_and_rxnobuff_ints();
			}
		}
/************************ end note note *********************************/

		if (status & tx_related) {
			spin_lock(&tp->lock);
			tx_ring_free(dev);

			if (tx_died)
				restart_tx();
			spin_unlock(&tp->lock);
		}

		status = read_interrupt_status_reg();

/************************ start note *********************************/
	} while (!(status & error) && more_work_to_be_done(status));
/************************ end note note *********************************/
	return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).
This also means acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in running state and
was successfully marked for addition to the core poll list. If we get a zero
value we can _almost_ assume that we are already added to the list (instead
of not running; the logic is based on the fact that you shouldn't get an
interrupt if not running). We rectify this by disabling rx and rxnobuf
interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() may appear to have
disappeared. These functionalities are actually still around......

In fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll().

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

	while (rx_ring_not_empty) {
		u32 rx_status;
		unsigned int rx_size;
		unsigned int pkt_size;
		struct sk_buff *skb;
		/* read size+status of next frame from DMA ring buffer */
		/* the number 16 and 4 are just examples */
		rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
		rx_size = rx_status >> 16;
		pkt_size = rx_size - 4;

		/* process errors */
		if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
		    (!(rx_status & RxStatusOK))) {
			netdrv_rx_err (rx_status, dev, tp, ioaddr);
			return;
		}

		if (--rx_work_limit < 0)
			break;

		/* grab a skb */
		skb = dev_alloc_skb (pkt_size + 2);
		if (skb) {
			.
			.
			netif_rx (skb);
			.
			.
		} else {  /* OOM */
			/*seems very driver specific ... some just pass
			  whatever is on the ring already. */
		}

		/* move to the next skb on the ring; advance the local copy,
		   which is written back to tp->cur_rx after the loop */
		entry = (++cur_rx) % RX_RING_SIZE;
		received++ ;

	}

	/* store current ring pointer state */
	tp->cur_rx = cur_rx;

	/* Refill the Rx ring buffers if they are needed */
	refill_rx_ring();
	.
	.

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{
	struct my_private *tp = (struct my_private *)dev->priv;
	rx_ring = tp->rx_ring;
	cur_rx = tp->cur_rx;
	int entry = cur_rx % RX_RING_SIZE;
	int received = 0;
	/* maximum packets to send to the stack */
/************************ note note *********************************/
	int rx_work_limit = dev->quota;

/************************ end note note *********************************/
	do {	// outer beginning loop starts here

		clear_rx_status_register_bit();

		while (rx_ring_not_empty) {
			u32 rx_status;
			unsigned int rx_size;
			unsigned int pkt_size;
			struct sk_buff *skb;
			/* read size+status of next frame from DMA ring buffer */
			/* the number 16 and 4 are just examples */
			rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
			rx_size = rx_status >> 16;
			pkt_size = rx_size - 4;

			/* process errors */
			if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
			    (!(rx_status & RxStatusOK))) {
				netdrv_rx_err (rx_status, dev, tp, ioaddr);
				return 1;
			}

/************************ note note *********************************/
			if (--rx_work_limit < 0) { /* we got packets, but no quota */
				/* store current ring pointer state */
				tp->cur_rx = cur_rx;

				/* Refill the Rx ring buffers if they are needed */
				refill_rx_ring(dev);
				goto not_done;
			}
/********************** end note **********************************/

			/* grab a skb */
			skb = dev_alloc_skb (pkt_size + 2);
			if (skb) {
				.
				.
/************************ note note *********************************/
				netif_receive_skb (skb);
/********************** end note **********************************/
				.
				.
			} else {  /* OOM */
				/*seems very driver specific ... common is just pass
				  whatever is on the ring already. */
			}

			/* move to the next skb on the ring; advance the local
			   copy, which is written back to tp->cur_rx below */
			entry = (++cur_rx) % RX_RING_SIZE;
			received++ ;

		}

		/* store current ring pointer state */
		tp->cur_rx = cur_rx;

		/* Refill the Rx ring buffers if they are needed */
		refill_rx_ring(dev);

		/* no packets on ring; but new ones can arrive since we last
		   checked  */
		status = read_interrupt_status_reg();
		if (!rx_status_is_set) {
			/* If something arrives in this narrow window,
			   an interrupt will be generated */
			goto done;
		}
		/* done! at least that's what it looks like ;->
		   if new packets came in after our last check on status bits
		   they'll be caught by the while check and we go back and clear them
		   since we havent exceeded our quota */
	} while (rx_status_is_set);

done:

/************************ note note *********************************/
	dev->quota -= received;
	*budget -= received;

	/* If RX ring is not full we are out of memory. */
	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		goto oom;

	/* we are happy/done, no more packets on ring; put us back
	   to where we can start processing interrupts again */
	netif_rx_complete(dev);
	enable_rx_and_rxnobuf_ints();

	/* The last op happens after poll completion. Which means the following:
	 * 1. it can race with disabling irqs in irq handler (which are done to
	 * schedule polls)
	 * 2. it can race with dis/enabling irqs in other poll threads
	 * 3. if an irq raised after the beginning of the outer beginning
	 * loop (marked in the code above), it will be immediately
	 * triggered here.
	 *
	 * Summarizing: the logic may result in some redundant irqs both
	 * due to races in masking and due to too late acking of already
	 * processed irqs. The good news: no events are ever lost.
	 */

	return 0;   /* done */

not_done:
	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	if (!received) {
		printk("received==0\n");
		received = 1;
	}
	dev->quota -= received;
	*budget -= received;
	return 1;  /* not_done */

oom:
	/* Start timer, stop polling, but do not enable rx interrupts. */
	start_poll_timer(dev);
	return 0;  /* we'll take it from here so tell core "done"*/

/************************ End note note *********************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) Instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) We have a new way of handling the oom condition.
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.


-----------------------------------------------------------
Poll timer code will need to do the following:

a)

	if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
	    tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		refill_rx_ring(dev);

	/* If RX ring is not full we are still out of memory.
	   Restart the timer again. Else we re-add ourselves
	   to the master poll list.
	 */

	if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
		restart_timer();
	else
		netif_rx_schedule(dev);  /* we are back on the poll list */

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with flow control (FC) only send a pause packet when they run out
of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system cannot pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note that FC in itself is a good solution, but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence it falls
under the same category as using NIC based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka the lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but is
not necessary.


APPENDIX 2: the "rotting packet" race-window avoidance scheme
=============================================================

There are two types of associations seen here

1) status/int which honors level triggered IRQ

If a status bit for receive or rxnobuff is set and the corresponding
interrupt-enable bit is not on, then no interrupts will be generated. However,
as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is
generated [assuming the status bit was not turned off].
Generally the concept of level triggered IRQs in association with a status and
interrupt-enable CSR register set is used to avoid the race.

If we take the example of the tulip:
"pending work" is indicated by the status bit (CSR5 in tulip).
The corresponding interrupt bit (CSR7 in tulip) might be turned off (but
CSR5 will continue to be turned on with new packet arrivals even if
we clear it the first time).
Very important is the fact that if we turn the interrupt bit on when
status is set, an immediate irq is triggered.

If we cleared the rx ring and proclaimed there was "no more work
to be done" and then went on to do a few other things, then when we enable
interrupts, there is a possibility that a new packet might sneak in during
this phase. It helps to look at the pseudo code for the tulip poll
routine:

--------------------------
	do {
		ACK;
		while (ring_is_not_empty()) {
			work-work-work
			if quota is exceeded: exit, no touching irq status/mask
		}
		/* No packets, but new can arrive while we are doing this*/
		CSR5 := read
		if (CSR5 is not set) {
			/* If something arrives in this narrow window here,
			 *  where the comments are ;-> irq will be generated */
			unmask irqs;
			exit poll;
		}
	} while (rx_status_is_set);
------------------------

The CSR5 bit of interest is only the rx status.
If you look at the last if statement:
you just finished grabbing all the packets from the rx ring .. you check if
the status bit says there are more packets just in ... it says none; you then
enable rx interrupts again; if a new packet just came in during this check,
we are counting on CSR5 being set in that small window of opportunity,
so that by re-enabling interrupts we actually trigger an interrupt
to register the new packet for processing.

[The above description may be very verbose; if you have better wording
that will make this more understandable, please suggest it.]
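
The level-triggered behaviour relied on above can be modelled in a few
lines. This is a user-space sketch, not tulip code: struct lt_nic, the
RX_STATUS bit and the lt_* helpers are hypothetical, with "status" playing
the role of CSR5 and "enable" the role of CSR7.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rx bit, at the same position in both registers. */
#define RX_STATUS 0x1u

struct lt_nic {
	uint32_t status;	/* pending events (CSR5-like) */
	uint32_t enable;	/* interrupt-enable mask (CSR7-like) */
};

/* Level triggered: the IRQ line is asserted for as long as any
 * enabled status bit is set. */
static bool lt_irq_asserted(const struct lt_nic *nic)
{
	return (nic->status & nic->enable) != 0;
}

/* A packet sets the rx status bit whether or not rx irqs are enabled. */
static void lt_packet_arrives(struct lt_nic *nic)
{
	nic->status |= RX_STATUS;
}

/* Mask rx interrupts, as the irq handler does before scheduling poll. */
static void lt_mask_rx(struct lt_nic *nic)
{
	nic->enable &= ~RX_STATUS;
}

/* Unmask rx interrupts at the end of poll; returns true if an
 * interrupt fires immediately because status was already set. */
static bool lt_unmask_rx(struct lt_nic *nic)
{
	nic->enable |= RX_STATUS;
	return lt_irq_asserted(nic);
}
```

A packet that arrives while rx interrupts are masked raises no interrupt at
that moment, but lt_unmask_rx() then returns true, so the packet cannot rot
on the ring.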

2) non-capable hardware

These do not generally respect level triggered IRQs. Normally,
irqs may be lost while being masked and the only way to leave poll is to do
a double check for new input after netif_rx_complete() is invoked
and re-enable polling (after seeing this new input).

Sample code:

---------
	.
	.
restart_poll:
	while (ring_is_not_empty()) {
		work-work-work
		if quota is exceeded: exit, not touching irq status/mask
	}
	.
	.
	.
	enable_rx_interrupts()
	netif_rx_complete(dev);
	if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) {
		disable_rx_and_rxnobufs()
		goto restart_poll
	}
---------

Basically netif_rx_complete() removes us from the poll list, but because a
new packet might come in during the race window and would otherwise never be
caught, we attempt to re-add ourselves to the poll list.
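
The double check can be modelled the same way for hardware that loses
interrupts while they are masked. This is a sketch under simplifying
assumptions: struct edge_nic and the edge_* helpers are hypothetical, and
the netif_rx_complete() / netif_rx_reschedule() calls are reduced to
comments and a return value.

```c
#include <stdbool.h>

struct edge_nic {
	int ring_pending;	/* packets sitting in the rx ring */
	bool rx_irq_enabled;	/* rx interrupt currently unmasked */
	bool irq_pending;	/* an interrupt was actually raised */
};

/* An arrival while masked raises no interrupt, now or later. */
static void edge_packet_arrives(struct edge_nic *nic)
{
	nic->ring_pending++;
	if (nic->rx_irq_enabled)
		nic->irq_pending = true;
}

/* Tail of the poll routine. Returns true when polling must restart. */
static bool edge_leave_poll(struct edge_nic *nic)
{
	nic->rx_irq_enabled = true;	/* enable_rx_interrupts() */
	/* netif_rx_complete(dev) would be called here */

	if (nic->ring_pending > 0) {	/* ring_has_new_packet() */
		/* stands in for a successful netif_rx_reschedule() */
		nic->rx_irq_enabled = false;
		return true;
	}
	return false;
}
```

Without the ring check in edge_leave_poll(), a packet that arrived while
masked would sit on the ring until some later packet raised an interrupt.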


APPENDIX 3: Scheduling issues.
==============================
As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the
general solution to schedule softirqs to run before the next interrupt, by
putting them under scheduler control. This also prevents consecutive softirqs
from monopolizing the CPU. It also has the effect that the priority of
ksoftirqd needs to be considered when running very CPU-intensive applications
and networking, to get the proper softirq/user balance. Increasing the
ksoftirqd priority to 0 (eventually more) is reported to cure problems with
low network performance at high CPU load.

Most used processes in a GIGE router:
USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND
root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0)
root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated

--------------------------------------------------------------------

relevant sites:
==================
ftp://robur.slu.se/pub/Linux/net-development/NAPI/


--------------------------------------------------------------------
TODO: Write net-skeleton.c driver.
-------------------------------------------------------------

Authors:
========
Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Jamal Hadi Salim <hadi@cyberus.ca>
Robert Olsson <Robert.Olsson@data.slu.se>

Acknowledgements:
================
People who made this document better:

Lennert Buytenhek <buytenh@gnu.org>
Andrew Morton <akpm@zip.com.au>
Manfred Spraul <manfred@colorfullife.com>
Donald Becker <becker@scyld.com>
Jeff Garzik <jgarzik@pobox.com>
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
index 4504cc59e405..afb66f9a8aff 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.txt
@@ -38,8 +38,13 @@ Socket options
 DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of
 service codes (RFC 4340, sec. 8.1.2); if this socket option is not set,
 the socket will fall back to 0 (which means that no meaningful service code
-is present). Connecting sockets set at most one service option; for
-listening sockets, multiple service codes can be specified.
+is present). On active sockets this is set before connect(); specifying more
+than one code has no effect (all subsequent service codes are ignored). The
+case is different for passive sockets, where multiple service codes (up to 32)
+can be set before calling bind().
+
+DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet
+size (application payload size) in bytes, see RFC 4340, section 14.
 
 DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the
 partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums
@@ -50,12 +55,13 @@ be enabled at the receiver, too with suitable choice of CsCov.
50DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the 55DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the
51 range 0..15 are acceptable. The default setting is 0 (full coverage), 56 range 0..15 are acceptable. The default setting is 0 (full coverage),
52 values between 1..15 indicate partial coverage. 57 values between 1..15 indicate partial coverage.
53DCCP_SOCKOPT_SEND_CSCOV is for the receiver and has a different meaning: it 58DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it
54 sets a threshold, where again values 0..15 are acceptable. The default 59 sets a threshold, where again values 0..15 are acceptable. The default
55 of 0 means that all packets with a partial coverage will be discarded. 60 of 0 means that all packets with a partial coverage will be discarded.
56 Values in the range 1..15 indicate that packets with minimally such a 61 Values in the range 1..15 indicate that packets with minimally such a
57 coverage value are also acceptable. The higher the number, the more 62 coverage value are also acceptable. The higher the number, the more
58 restrictive this setting (see [RFC 4340, sec. 9.2.1]). 63 restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage
64 settings are inherited by the child socket after accept().
59 65
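As a rough illustration of the two coverage options described above, the following Python sketch models how a CsCov value maps to covered payload bytes and how a receiver threshold filters packets. It is not kernel code. The function names are hypothetical, and the "(CsCov - 1) * 4 bytes" rule is taken from RFC 4340, sec. 9.2, as we read it.

```python
def covered_payload_bytes(cscov, payload_len):
    """Return how many application-data bytes the checksum covers.

    Assumption (RFC 4340, sec. 9.2): 0 means full coverage; 1..15 cover the
    header, options, and the first (cscov - 1) * 4 bytes of payload.
    """
    if not 0 <= cscov <= 15:
        raise ValueError("CsCov must be in the range 0..15")
    if cscov == 0:
        return payload_len
    return min((cscov - 1) * 4, payload_len)


def receiver_accepts(min_cscov, packet_cscov):
    """Model the DCCP_SOCKOPT_RECV_CSCOV threshold described above."""
    for value in (min_cscov, packet_cscov):
        if not 0 <= value <= 15:
            raise ValueError("coverage values must be in the range 0..15")
    if packet_cscov == 0:
        # Fully covered packets are always acceptable.
        return True
    # A threshold of 0 rejects every partially covered packet; otherwise the
    # packet must carry at least the configured coverage value.
    return min_cscov != 0 and packet_cscov >= min_cscov
```

With a receiver threshold of 4, for example, a packet sent with CsCov 3 would be discarded, while CsCov 4 or full coverage would pass.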
60The following two options apply to CCID 3 exclusively and are getsockopt()-only. 66The following two options apply to CCID 3 exclusively and are getsockopt()-only.
61In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned. 67In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
@@ -112,9 +118,14 @@ tx_qlen = 5
112 The size of the transmit buffer in packets. A value of 0 corresponds 118 The size of the transmit buffer in packets. A value of 0 corresponds
113 to an unbounded transmit buffer. 119 to an unbounded transmit buffer.
114 120
121sync_ratelimit = 125 ms
122 The timeout between subsequent DCCP-Sync packets sent in response to
123 sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit
124 of this parameter is milliseconds; a value of 0 disables rate-limiting.
125
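The effect of sync_ratelimit can be pictured with a small Python model. This is only a sketch of the behaviour the text describes (a minimum gap between DCCP-Sync packets, with 0 turning the limit off), not the kernel's implementation. The function name and the millisecond timestamps are assumptions made for illustration.

```python
def may_send_sync(now_ms, last_sync_ms, ratelimit_ms=125):
    """Decide whether another DCCP-Sync may be sent at time now_ms.

    last_sync_ms is the time the previous Sync was sent, or None if no Sync
    has been sent on this socket yet (hypothetical convention).
    """
    if ratelimit_ms == 0:
        # A value of 0 disables rate-limiting.
        return True
    if last_sync_ms is None:
        return True
    return now_ms - last_sync_ms >= ratelimit_ms
```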
115Notes 126Notes
116===== 127=====
117 128
118DCCP does not travel through NAT successfully at present on many boxes. This is 129DCCP does not travel through NAT successfully at present on many boxes. This is
119because the checksum covers the psuedo-header as per TCP and UDP. Linux NAT 130because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT
120support for DCCP has been added. 131support for DCCP has been added.
diff --git a/Documentation/networking/dgrs.txt b/Documentation/networking/dgrs.txt
deleted file mode 100644
index 1aa1bb3f94ab..000000000000
--- a/Documentation/networking/dgrs.txt
+++ /dev/null
@@ -1,52 +0,0 @@
1 The Digi International RightSwitch SE-X (dgrs) Device Driver
2
3This is a Linux driver for the Digi International RightSwitch SE-X
4EISA and PCI boards. These are 4 (EISA) or 6 (PCI) port Ethernet
5switches and a NIC combined into a single board. This driver can
6be compiled into the kernel statically or as a loadable module.
7
8There is also a companion management tool, called "xrightswitch".
9The management tool lets you watch the performance graphically,
10as well as set the SNMP agent IP and IPX addresses, IEEE Spanning
11Tree, and Aging time. These can also be set from the command line
12when the driver is loaded. The driver command line options are:
13
14 debug=NNN Debug printing level
15 dma=0/1 Disable/Enable DMA on PCI card
16 spantree=0/1 Disable/Enable IEEE spanning tree
17 hashexpire=NNN Change address aging time (default 300 seconds)
18 ipaddr=A,B,C,D Set SNMP agent IP address i.e. 199,86,8,221
19 iptrap=A,B,C,D Set SNMP agent IP trap address i.e. 199,86,8,221
20 ipxnet=NNN Set SNMP agent IPX network number
21 nicmode=0/1 Disable/Enable multiple NIC mode
22
23There is also a tool for setting up input and output packet filters
24on each port, called "dgrsfilt".
25
26Both the management tool and the filtering tool are available
27separately from the following FTP site:
28
29 ftp://ftp.dgii.com/drivers/rightswitch/linux/
30
31When nicmode=1, the board and driver operate as 4 or 6 individual
32NIC ports (eth0...eth5) instead of as a switch. All switching
33functions are disabled. In the future, the board firmware may include
34a routing cache when in this mode.
35
36Copyright 1995-1996 Digi International Inc.
37
38This software may be used and distributed according to the terms
39of the GNU General Public License, incorporated herein by reference.
40
41For information on purchasing a RightSwitch SE-4 or SE-6
42board, please contact Digi's sales department at 1-612-912-3444
43or 1-800-DIGIBRD. Outside the U.S., please check our Web page at:
44
45 http://www.dgii.com
46
47for sales offices worldwide. Tech support is also available through
48the channels listed on the Web site, although as long as I am
49employed on networking products at Digi I will be happy to provide
50any bug fixes that may be needed.
51
52-Rick Richardson, rick@dgii.com
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 32c2e9da5f3a..6ae2feff3087 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -180,13 +180,20 @@ tcp_fin_timeout - INTEGER
180 to live longer. Cf. tcp_max_orphans. 180 to live longer. Cf. tcp_max_orphans.
181 181
182tcp_frto - INTEGER 182tcp_frto - INTEGER
183 Enables F-RTO, an enhanced recovery algorithm for TCP retransmission 183 Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
184 F-RTO is an enhanced recovery algorithm for TCP retransmission
184 timeouts. It is particularly beneficial in wireless environments 185 timeouts. It is particularly beneficial in wireless environments
185 where packet loss is typically due to random radio interference 186 where packet loss is typically due to random radio interference
186 rather than intermediate router congestion. If set to 1, basic 187 rather than intermediate router congestion. FRTO is sender-side
187 version is enabled. 2 enables SACK enhanced F-RTO, which is 188 only modification. Therefore it does not require any support from
188 EXPERIMENTAL. The basic version can be used also when SACK is 189 the peer, but in a typical case, however, where wireless link is
189 enabled for a flow through tcp_sack sysctl. 190 the local access link and most of the data flows downlink, the
191 faraway servers should have FRTO enabled to take advantage of it.
192 If set to 1, basic version is enabled. 2 enables SACK enhanced
193 F-RTO if flow uses SACK. The basic version can be used also when
194 SACK is in use, though scenarios exist where F-RTO
195 interacts badly with the packet counting of the SACK-enabled TCP
196 flow.
190 197
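To check which F-RTO mode a host is running, the sysctl can be read from procfs. The Python sketch below assumes the usual /proc/sys/net/ipv4/tcp_frto location. The helper name is hypothetical, and treating 0 as "disabled" is an assumption, since the text above only spells out the values 1 and 2.

```python
from pathlib import Path

# Descriptions for 1 and 2 follow the text above; 0 is an assumed meaning.
FRTO_MODES = {
    0: "disabled (assumed)",
    1: "basic F-RTO",
    2: "SACK-enhanced F-RTO when the flow uses SACK",
}


def describe_tcp_frto(path="/proc/sys/net/ipv4/tcp_frto"):
    """Return (value, description) for the current tcp_frto setting."""
    value = int(Path(path).read_text().strip())
    return value, FRTO_MODES.get(value, "unknown value")
```

The path is a parameter so that the helper can be pointed at a test file instead of the live sysctl.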
191tcp_frto_response - INTEGER 198tcp_frto_response - INTEGER
192 When F-RTO has detected that a TCP retransmission timeout was 199 When F-RTO has detected that a TCP retransmission timeout was
diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt
index 53ef7a06f49c..84906ef3ed6e 100644
--- a/Documentation/networking/mac80211-injection.txt
+++ b/Documentation/networking/mac80211-injection.txt
@@ -13,15 +13,35 @@ The radiotap format is discussed in
13./Documentation/networking/radiotap-headers.txt. 13./Documentation/networking/radiotap-headers.txt.
14 14
15Despite 13 radiotap argument types are currently defined, most only make sense 15Despite 13 radiotap argument types are currently defined, most only make sense
16to appear on received packets. Currently three kinds of argument are used by 16to appear on received packets. The following information is parsed from the
17the injection code, although it knows to skip any other arguments that are 17radiotap headers and used to control injection:
18present (facilitating replay of captured radiotap headers directly):
19 18
20 - IEEE80211_RADIOTAP_RATE - u8 arg in 500kbps units (0x02 --> 1Mbps) 19 * IEEE80211_RADIOTAP_RATE
21 20
22 - IEEE80211_RADIOTAP_ANTENNA - u8 arg, 0x00 = ant1, 0x01 = ant2 21 rate in 500kbps units, automatic if invalid or not present
23 22
24 - IEEE80211_RADIOTAP_DBM_TX_POWER - u8 arg, dBm 23
24 * IEEE80211_RADIOTAP_ANTENNA
25
26 antenna to use, automatic if not present
27
28
29 * IEEE80211_RADIOTAP_DBM_TX_POWER
30
31 transmit power in dBm, automatic if not present
32
33
34 * IEEE80211_RADIOTAP_FLAGS
35
36 IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated
37 IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available
38 IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the
39 current fragmentation threshold (note that
40 this flag is only reliable when software
41 fragmentation is enabled)
42
43The injection code can also skip all other currently defined radiotap fields
44facilitating replay of captured radiotap headers directly.
25 45
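As a sketch of how such a header might be assembled in userspace, the Python snippet below packs a radiotap header carrying rate, transmit power and antenna. The helper name is hypothetical. The bit numbers (2, 10, 11) and the 8-byte fixed header layout are taken from the radiotap definition as we understand it, so check them against radiotap-headers.txt before relying on them.

```python
import struct

# Bit positions in the radiotap "present" bitmap (assumed from the radiotap
# field numbering: RATE = 2, DBM_TX_POWER = 10, ANTENNA = 11).
RADIOTAP_RATE = 2
RADIOTAP_DBM_TX_POWER = 10
RADIOTAP_ANTENNA = 11


def build_radiotap_header(rate_mbps, tx_power_dbm, antenna):
    """Build a radiotap header with rate, tx power and antenna fields."""
    present = (
        (1 << RADIOTAP_RATE)
        | (1 << RADIOTAP_DBM_TX_POWER)
        | (1 << RADIOTAP_ANTENNA)
    )
    # Fields follow the fixed header in ascending bit order; all three are
    # one byte wide, so no alignment padding is needed.
    fields = struct.pack(
        "<BbB",
        int(rate_mbps * 2),  # rate is expressed in 500kbps units
        tx_power_dbm,        # signed dBm
        antenna,
    )
    length = 8 + len(fields)  # version, pad, length, present bitmap = 8 bytes
    return struct.pack("<BBHI", 0, 0, length, present) + fields
```

For instance, build_radiotap_header(54, 12, 1) yields an 11-byte header whose rate byte is 0x6c (108 units of 500kbps).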
26Here is an example valid radiotap header defining these three parameters 46Here is an example valid radiotap header defining these three parameters
27 47
diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt
index 1caa6c734691..3c2f2b328638 100644
--- a/Documentation/networking/netconsole.txt
+++ b/Documentation/networking/netconsole.txt
@@ -3,6 +3,10 @@ started by Ingo Molnar <mingo@redhat.com>, 2001.09.17
32.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003 32.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003
4 4
5Please send bug reports to Matt Mackall <mpm@selenic.com> 5Please send bug reports to Matt Mackall <mpm@selenic.com>
6and Satyam Sharma <satyam.sharma@gmail.com>
7
8Introduction:
9=============
6 10
7This module logs kernel printk messages over UDP allowing debugging of 11This module logs kernel printk messages over UDP allowing debugging of
8problem where disk logging fails and serial consoles are impractical. 12problem where disk logging fails and serial consoles are impractical.
@@ -13,6 +17,9 @@ the specified interface as soon as possible. While this doesn't allow
13capture of early kernel panics, it does capture most of the boot 17capture of early kernel panics, it does capture most of the boot
14process. 18process.
15 19
20Sender and receiver configuration:
21==================================
22
16It takes a string configuration parameter "netconsole" in the 23It takes a string configuration parameter "netconsole" in the
17following format: 24following format:
18 25
@@ -34,21 +41,113 @@ Examples:
34 41
35 insmod netconsole netconsole=@/,@10.0.0.2/ 42 insmod netconsole netconsole=@/,@10.0.0.2/
36 43
44It also supports logging to multiple remote agents by specifying
45parameters for the multiple agents separated by semicolons and the
46complete string enclosed in "quotes", thusly:
47
48 modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/"
49
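The structure of the multi-target string can be made explicit with a small parser. The Python sketch below is illustrative only and is not how the kernel parses the option. The function name is hypothetical, the dictionary keys borrow the configfs attribute names, and omitted fields are returned as None rather than filled in with netconsole's defaults.

```python
def parse_netconsole(config):
    """Split a netconsole= value into one dict per remote logging target."""
    targets = []
    for spec in config.split(";"):          # targets are ';'-separated
        local, remote = spec.split(",", 1)  # local part, remote part
        local_addr, _, dev = local.partition("/")
        local_port, _, local_ip = local_addr.partition("@")
        remote_addr, _, remote_mac = remote.partition("/")
        remote_port, _, remote_ip = remote_addr.partition("@")
        targets.append({
            "local_port": int(local_port) if local_port else None,
            "local_ip": local_ip or None,
            "dev_name": dev or None,
            "remote_port": int(remote_port) if remote_port else None,
            "remote_ip": remote_ip or None,
            "remote_mac": remote_mac or None,
        })
    return targets
```

Applied to the modprobe example above, this returns two targets: one for 10.0.0.2 with every optional field omitted, and one for 10.0.0.3 on eth1 with remote port 6892.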
37Built-in netconsole starts immediately after the TCP stack is 50Built-in netconsole starts immediately after the TCP stack is
38initialized and attempts to bring up the supplied dev at the supplied 51initialized and attempts to bring up the supplied dev at the supplied
39address. 52address.
40 53
41The remote host can run either 'netcat -u -l -p <port>' or syslogd. 54The remote host can run either 'netcat -u -l -p <port>' or syslogd.
42 55
56Dynamic reconfiguration:
57========================
58
59Dynamic reconfigurability is a useful addition to netconsole that enables
60remote logging targets to be dynamically added, removed, or have their
61parameters reconfigured at runtime from a configfs-based userspace interface.
62[ Note that the parameters of netconsole targets that were specified/created
63from the boot/module option are not exposed via this interface, and hence
64cannot be modified dynamically. ]
65
66To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the
67netconsole module (or kernel, if netconsole is built-in).
68
69Some examples follow (where configfs is mounted at the /sys/kernel/config
70mountpoint).
71
72To add a remote logging target (target names can be arbitrary):
73
74 cd /sys/kernel/config/netconsole/
75 mkdir target1
76
77Note that newly created targets have default parameter values (as mentioned
78above) and are disabled by default -- they must first be enabled by writing
79"1" to the "enabled" attribute (usually after setting parameters accordingly)
80as described below.
81
82To remove a target:
83
84 rmdir /sys/kernel/config/netconsole/othertarget/
85
86The interface exposes these parameters of a netconsole target to userspace:
87
88 enabled Is this target currently enabled? (read-write)
89 dev_name Local network interface name (read-write)
90 local_port Source UDP port to use (read-write)
91 remote_port Remote agent's UDP port (read-write)
92 local_ip Source IP address to use (read-write)
93 remote_ip Remote agent's IP address (read-write)
94 local_mac Local interface's MAC address (read-only)
95 remote_mac Remote agent's MAC address (read-write)
96
97The "enabled" attribute is also used to control whether the parameters of
98a target can be updated or not -- you can modify the parameters of only
99disabled targets (i.e. if "enabled" is 0).
100
101To update a target's parameters:
102
103 cat enabled # check if enabled is 1
104 echo 0 > enabled # disable the target (if required)
105 echo eth2 > dev_name # set local interface
106 echo 10.0.0.4 > remote_ip # update some parameter
107 echo cb:a9:87:65:43:21 > remote_mac # update more parameters
108 echo 1 > enabled # enable target again
109
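The same disable, update, enable sequence can be scripted. The Python sketch below assumes configfs is mounted at /sys/kernel/config and that the target directory already exists. The helper name is hypothetical, and the attribute names are the ones listed in the table above.

```python
from pathlib import Path

# local_mac is listed above as read-only, so refuse to write it.
READ_ONLY_ATTRS = {"local_mac"}


def reconfigure_target(target_dir, params):
    """Disable a netconsole target, update its parameters, re-enable it.

    target_dir is e.g. /sys/kernel/config/netconsole/target1; params maps
    attribute names (dev_name, remote_ip, ...) to their new values.
    """
    bad = READ_ONLY_ATTRS & set(params)
    if bad:
        raise ValueError("read-only attribute(s): " + ", ".join(sorted(bad)))
    target = Path(target_dir)
    # Parameters can only be modified while the target is disabled.
    (target / "enabled").write_text("0\n")
    for name, value in params.items():
        (target / name).write_text(f"{value}\n")
    (target / "enabled").write_text("1\n")
```

A call such as reconfigure_target("/sys/kernel/config/netconsole/target1", {"dev_name": "eth2", "remote_ip": "10.0.0.4"}) would mirror the shell commands above. It needs root privileges and a kernel built with CONFIG_NETCONSOLE_DYNAMIC.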
110You can also update the local interface dynamically. This is especially
111useful if you want to use interfaces that have newly come up (and may not
112have existed when netconsole was loaded / initialized).
113
114Miscellaneous notes:
115====================
116
43WARNING: the default target ethernet setting uses the broadcast 117WARNING: the default target ethernet setting uses the broadcast
44ethernet address to send packets, which can cause increased load on 118ethernet address to send packets, which can cause increased load on
45other systems on the same ethernet segment. 119other systems on the same ethernet segment.
46 120
121TIP: some LAN switches may be configured to suppress ethernet broadcasts
122so it is advised to explicitly specify the remote agents' MAC addresses
123from the config parameters passed to netconsole.
124
125TIP: to find out the MAC address of, say, 10.0.0.2, you may try using:
126
127 ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2
128
129TIP: in case the remote logging agent is on a different LAN subnet than
130the sender, it is suggested to try specifying the MAC address of the
131default gateway (you may use /sbin/route -n to find it out) as the
132remote MAC address instead.
133
47NOTE: the network device (eth1 in the above case) can run any kind 134NOTE: the network device (eth1 in the above case) can run any kind
48of other network traffic, netconsole is not intrusive. Netconsole 135of other network traffic, netconsole is not intrusive. Netconsole
49might cause slight delays in other traffic if the volume of kernel 136might cause slight delays in other traffic if the volume of kernel
50messages is high, but should have no other impact. 137messages is high, but should have no other impact.
51 138
139NOTE: if you find that the remote logging agent is not receiving or
140printing all messages from the sender, it is likely that you have set
141the "console_loglevel" parameter (on the sender) to only send high
142priority messages to the console. You can change this at runtime using:
143
144 dmesg -n 8
145
146or by specifying "debug" on the kernel command line at boot, to send
147all kernel messages to the console. A specific value for this parameter
148can also be set using the "loglevel" kernel boot option. See the
149dmesg(8) man page and Documentation/kernel-parameters.txt for details.
150
52Netconsole was designed to be as instantaneous as possible, to 151Netconsole was designed to be as instantaneous as possible, to
53enable the logging of even the most critical kernel bugs. It works 152enable the logging of even the most critical kernel bugs. It works
54from IRQ contexts as well, and does not enable interrupts while 153from IRQ contexts as well, and does not enable interrupts while
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
index 37869295fc70..d0f71fc7f782 100644
--- a/Documentation/networking/netdevices.txt
+++ b/Documentation/networking/netdevices.txt
@@ -73,7 +73,8 @@ dev->hard_start_xmit:
73 has to lock by itself when needed. It is recommended to use a try lock 73 has to lock by itself when needed. It is recommended to use a try lock
74 for this and return NETDEV_TX_LOCKED when the spin lock fails. 74 for this and return NETDEV_TX_LOCKED when the spin lock fails.
75 The locking there should also properly protect against 75 The locking there should also properly protect against
76 set_multicast_list. 76 set_multicast_list. Note that the use of NETIF_F_LLTX is deprecated.
77 Don't use it for new drivers.
77 78
78 Context: Process with BHs disabled or BH (timer), 79 Context: Process with BHs disabled or BH (timer),
79 will be called with interrupts disabled by netconsole. 80 will be called with interrupts disabled by netconsole.
@@ -95,9 +96,13 @@ dev->set_multicast_list:
95 Synchronization: netif_tx_lock spinlock. 96 Synchronization: netif_tx_lock spinlock.
96 Context: BHs disabled 97 Context: BHs disabled
97 98
98dev->poll: 99struct napi_struct synchronization rules
99 Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See 100========================================
100 dev_close code and comments in net/core/dev.c for more info. 101napi->poll:
102 Synchronization: NAPI_STATE_SCHED bit in napi->state. Device
103 driver's dev->close method will invoke napi_disable() on
104 all NAPI instances which will do a sleeping poll on the
105 NAPI_STATE_SCHED napi->state bit, waiting for all pending
106 NAPI activity to cease.
101 Context: softirq 107 Context: softirq
102 will be called with interrupts disabled by netconsole. 108 will be called with interrupts disabled by netconsole.
103