author Linus Torvalds <torvalds@linux-foundation.org> 2013-07-09 21:24:39 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2013-07-09 21:24:39 -0400
commit 496322bc91e35007ed754184dcd447a02b6dd685
tree f5298d0a74c0a6e65c0e98050b594b8d020904c1 /Documentation/networking
parent 2e17c5a97e231f3cb426f4b7895eab5be5c5442e
parent 56e0ef527b184b3de2d7f88c6190812b2b2ac6bf
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "This is a re-do of the net-next pull request for the current merge
  window. The only difference from the one I made the other day is that
  this has Eliezer's interface renames and the timeout handling changes
  made based upon your feedback, as well as a few bug fixes that have
  trickled in.

  Highlights:

   1) Low latency device polling, eliminating the cost of interrupt
      handling and context switches. Allows direct polling of a network
      device from socket operations, such as recvmsg() and poll().

      Currently ixgbe, mlx4, and bnx2x support this feature.

      Full high level description, performance numbers, and design in
      commit 0a4db187a999 ("Merge branch 'll_poll'")

      From Eliezer Tamir.

   2) With the routing cache removed, ip_check_mc_rcu() gets exercised
      more than ever before in the case where we have lots of multicast
      addresses. Use a hash table instead of a simple linked list, from
      Eric Dumazet.

   3) Add driver for Atheros QCA98xx 802.11ac wireless devices, from
      Bartosz Markowski, Janusz Dziedzic, Kalle Valo, Marek Kwaczynski,
      Marek Puzyniak, Michal Kazior, and Sujith Manoharan.

   4) Support reporting the TUN device persist flag to userspace, from
      Pavel Emelyanov.

   5) Allow controlling network device VF link state using netlink, from
      Rony Efraim.

   6) Support GRE tunneling in openvswitch, from Pravin B Shelar.

   7) Adjust SOCK_MIN_RCVBUF and SOCK_MIN_SNDBUF for modern times, from
      Daniel Borkmann and Eric Dumazet.

   8) Allow controlling of TCP quickack behavior on a per-route basis,
      from Cong Wang.

   9) Several bug fixes and improvements to vxlan from Stephen
      Hemminger, Pravin B Shelar, and Mike Rapoport. In particular,
      support receiving on multiple UDP ports.

  10) Major cleanups, particularly in the area of debugging and cookie
      lifetime handling, to the SCTP protocol code. From Daniel
      Borkmann.

  11) Allow packets to cross network namespaces when traversing tunnel
      devices. From Nicolas Dichtel.

  12) Allow monitoring netlink traffic via AF_PACKET sockets, in a
      manner akin to how we monitor real network traffic via ptype_all.
      From Daniel Borkmann.

  13) Several bug fixes and improvements for the new alx device driver,
      from Johannes Berg.

  14) Fix scalability issues in the netem packet scheduler's time queue,
      by using an rbtree. From Eric Dumazet.

  15) Several bug fixes in TCP loss recovery handling, from Yuchung
      Cheng.

  16) Add support for GSO segmentation of MPLS packets, from Simon
      Horman.

  17) Make network notifiers have a real data type for the opaque
      pointer that's passed into them. Use this to properly handle
      network device flag changes in arp_netdev_event(). From Jiri
      Pirko and Timo Teräs.

  18) Convert several drivers over to module_pci_driver(), from Peter
      Huewe.

  19) tcp_fixup_rcvbuf() can loop 500 times over loopback, just use an
      O(1) calculation instead. From Eric Dumazet.

  20) Support setting of explicit tunnel peer addresses in ipv6, just
      like ipv4. From Nicolas Dichtel.

  21) Protect x86 BPF JIT against spraying attacks, from Eric Dumazet.

  22) Prevent a single high rate flow from overrunning an individual cpu
      during RX packet processing via selective flow shedding. From
      Willem de Bruijn.

  23) Don't use spinlocks in TCP md5 signing fast paths, from Eric
      Dumazet.

  24) Don't just drop GSO packets which are above the TBF scheduler's
      burst limit, chop them up so they are in-bounds instead. Also
      from Eric Dumazet.

  25) VLAN offloads are missed when configured on top of a bridge, fix
      from Vlad Yasevich.

  26) Support IPV6 in ping sockets. From Lorenzo Colitti.

  27) Receive flow steering targets should be updated at poll() time
      too, from David Majnemer.

  28) Fix several corner case regressions in PMTU/redirect handling due
      to the routing cache removal, from Timo Teräs.

  29) We have to be mindful of ipv4 mapped ipv6 sockets in
      udp_v6_push_pending_frames(). From Hannes Frederic Sowa.

  30) Fix L2TP sequence number handling bugs, from James Chapman."

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1214 commits)
  drivers/net: caif: fix wrong rtnl_is_locked() usage
  drivers/net: enic: release rtnl_lock on error-path
  vhost-net: fix use-after-free in vhost_net_flush
  net: mv643xx_eth: do not use port number as platform device id
  net: sctp: confirm route during forward progress
  virtio_net: fix race in RX VQ processing
  virtio: support unlocked queue poll
  net/cadence/macb: fix bug/typo in extracting gem_irq_read_clear bit
  Documentation: Fix references to defunct linux-net@vger.kernel.org
  net/fs: change busy poll time accounting
  net: rename low latency sockets functions to busy poll
  bridge: fix some kernel warning in multicast timer
  sfc: Fix memory leak when discarding scattered packets
  sit: fix tunnel update via netlink
  dt:net:stmmac: Add dt specific phy reset callback support.
  dt:net:stmmac: Add support to dwmac version 3.610 and 3.710
  dt:net:stmmac: Allocate platform data only if its NULL.
  net:stmmac: fix memleak in the open method
  ipv6: rt6_check_neigh should successfully verify neigh if no NUD information are available
  net: ipv6: fix wrong ping_v6_sendmsg return value
  ...

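Highlight 1 describes busy polling from socket operations such as recvmsg() and poll(). As a minimal sketch of how one might inspect the feature from userspace, the script below prints two sysctl knobs. The knob names busy_read and busy_poll under /proc/sys/net/core are an assumption based on the "rename low latency sockets functions to busy poll" change in the shortlog; the commit message itself does not name them. The helper show_busy_poll is hypothetical and only reads files.

```shell
#!/bin/sh
# Hypothetical helper: print the busy-poll timeouts (in microseconds)
# found in a sysctl directory. The default directory and the knob names
# busy_read / busy_poll are assumptions, not taken from the commit text.
show_busy_poll() {
    dir=${1:-/proc/sys/net/core}
    for knob in busy_read busy_poll; do
        if [ -r "$dir/$knob" ]; then
            # A value of 0 would mean busy polling is off for that path.
            echo "$knob=$(cat "$dir/$knob")"
        else
            # Kernel too old, or built without busy-poll support.
            echo "$knob=unavailable"
        fi
    done
}

show_busy_poll
```

The directory argument exists so the helper can be pointed at a scratch directory for testing. Changing the real values would need root and is deliberately left out of the sketch.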
Diffstat (limited to 'Documentation/networking')
-rw-r--r--  Documentation/networking/.gitignore           1
-rw-r--r--  Documentation/networking/00-INDEX             2
-rw-r--r--  Documentation/networking/Makefile             5
-rw-r--r--  Documentation/networking/arcnet.txt           7
-rw-r--r--  Documentation/networking/bonding.txt         79
-rw-r--r--  Documentation/networking/ifenslave.c       1105
-rw-r--r--  Documentation/networking/ip-sysctl.txt       11
-rw-r--r--  Documentation/networking/ipvs-sysctl.txt     13
-rw-r--r--  Documentation/networking/netlink_mmap.txt     6
-rw-r--r--  Documentation/networking/packet_mmap.txt    133
-rw-r--r--  Documentation/networking/scaling.txt         58
-rw-r--r--  Documentation/networking/vortex.txt           2
12 files changed, 152 insertions, 1270 deletions
diff --git a/Documentation/networking/.gitignore b/Documentation/networking/.gitignore
index 286a5680f490..e69de29bb2d1 100644
--- a/Documentation/networking/.gitignore
+++ b/Documentation/networking/.gitignore
@@ -1 +0,0 @@
-ifenslave
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index 258d9b92c36f..32dfbd924121 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -88,8 +88,6 @@ gianfar.txt
 	- Gianfar Ethernet Driver.
 ieee802154.txt
 	- Linux IEEE 802.15.4 implementation, API and drivers
-ifenslave.c
-	- Configure network interfaces for parallel routing (bonding).
 igb.txt
 	- README for the Intel Gigabit Ethernet Driver (igb).
 igbvf.txt
diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile
index 24c308dd3fd1..0aa1ac98fc2b 100644
--- a/Documentation/networking/Makefile
+++ b/Documentation/networking/Makefile
@@ -1,11 +1,6 @@
 # kbuild trick to avoid linker error. Can be omitted if a module is built.
 obj- := dummy.o
 
-# List of programs to build
-hostprogs-y := ifenslave
-
-HOSTCFLAGS_ifenslave.o += -I$(objtree)/usr/include
-
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
 
diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt
index 9ff579502151..aff97f47c05c 100644
--- a/Documentation/networking/arcnet.txt
+++ b/Documentation/networking/arcnet.txt
@@ -70,9 +70,10 @@ list, mail to linux-arcnet@tichy.ch.uj.edu.pl.
 There are archives of the mailing list at:
 	http://epistolary.org/mailman/listinfo.cgi/arcnet
 
-The people on linux-net@vger.kernel.org have also been known to be very
-helpful, especially when we're talking about ALPHA Linux kernels that may or
-may not work right in the first place.
+The people on linux-net@vger.kernel.org (now defunct, replaced by
+netdev@vger.kernel.org) have also been known to be very helpful, especially
+when we're talking about ALPHA Linux kernels that may or may not work right
+in the first place.
 
 
 Other Drivers and Info
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 10a015c384b8..87bbcfee2e06 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -104,8 +104,7 @@ Table of Contents
 ==============================
 
 	Most popular distro kernels ship with the bonding driver
-already available as a module and the ifenslave user level control
-program installed and ready for use. If your distro does not, or you
+already available as a module. If your distro does not, or you
 have need to compile bonding from source (e.g., configuring and
 installing a mainline kernel from kernel.org), you'll need to perform
 the following steps:
@@ -124,46 +123,13 @@ device support" section. It is recommended that you configure the
 driver as module since it is currently the only way to pass parameters
 to the driver or configure more than one bonding device.
 
-	Build and install the new kernel and modules, then continue
-below to install ifenslave.
+	Build and install the new kernel and modules.
 
-1.2 Install ifenslave Control Utility
+1.2 Bonding Control Utility
 -------------------------------------
 
-	The ifenslave user level control program is included in the
-kernel source tree, in the file Documentation/networking/ifenslave.c.
-It is generally recommended that you use the ifenslave that
-corresponds to the kernel that you are using (either from the same
-source tree or supplied with the distro), however, ifenslave
-executables from older kernels should function (but features newer
-than the ifenslave release are not supported). Running an ifenslave
-that is newer than the kernel is not supported, and may or may not
-work.
-
-	To install ifenslave, do the following:
-
-# gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave
-# cp ifenslave /sbin/ifenslave
-
-	If your kernel source is not in "/usr/src/linux," then replace
-"/usr/src/linux/include" in the above with the location of your kernel
-source include directory.
-
-	You may wish to back up any existing /sbin/ifenslave, or, for
-testing or informal use, tag the ifenslave to the kernel version
-(e.g., name the ifenslave executable /sbin/ifenslave-2.6.10).
-
-IMPORTANT NOTE:
-
-	If you omit the "-I" or specify an incorrect directory, you
-may end up with an ifenslave that is incompatible with the kernel
-you're trying to build it for. Some distros (e.g., Red Hat from 7.1
-onwards) do not have /usr/include/linux symbolically linked to the
-default kernel source include directory.
-
-SECOND IMPORTANT NOTE:
-	If you plan to configure bonding using sysfs or using the
-/etc/network/interfaces file, you do not need to use ifenslave.
+	It is recommended to configure bonding via iproute2 (netlink)
+or sysfs, the old ifenslave control utility is obsolete.
 
 2. Bonding Driver Options
 =========================
@@ -337,6 +303,12 @@ arp_validate
 	such a situation, validation of backup slaves must be
 	disabled.
 
+	The validation of ARP requests on backup slaves is mainly
+	helping bonding to decide which slaves are more likely to
+	work in case of the active slave failure, it doesn't really
+	guarantee that the backup slave will work if it's selected
+	as the next active slave.
+
 	This option is useful in network configurations in which
 	multiple bonding hosts are concurrently issuing ARPs to one or
 	more targets beyond a common switch. Should the link between
@@ -349,6 +321,25 @@ arp_validate
 
 	This option was added in bonding version 3.1.0.
 
+arp_all_targets
+
+	Specifies the quantity of arp_ip_targets that must be reachable
+	in order for the ARP monitor to consider a slave as being up.
+	This option affects only active-backup mode for slaves with
+	arp_validation enabled.
+
+	Possible values are:
+
+	any or 0
+
+		consider the slave up only when any of the arp_ip_targets
+		is reachable
+
+	all or 1
+
+		consider the slave up only when all of the arp_ip_targets
+		are reachable
+
 downdelay
 
 	Specifies the time, in milliseconds, to wait before disabling
@@ -851,7 +842,7 @@ resend_igmp
 ==============================
 
 	You can configure bonding using either your distro's network
-initialization scripts, or manually using either ifenslave or the
+initialization scripts, or manually using either iproute2 or the
 sysfs interface. Distros generally use one of three packages for the
 network initialization scripts: initscripts, sysconfig or interfaces.
 Recent versions of these packages have support for bonding, while older
@@ -1160,7 +1151,7 @@ not support this method for specifying multiple bonding interfaces; for
 those instances, see the "Configuring Multiple Bonds Manually" section,
 below.
 
-3.3 Configuring Bonding Manually with Ifenslave
+3.3 Configuring Bonding Manually with iproute2
 -----------------------------------------------
 
 	This section applies to distros whose network initialization
@@ -1171,7 +1162,7 @@ version 8.
 	The general method for these systems is to place the bonding
 module parameters into a config file in /etc/modprobe.d/ (as
 appropriate for the installed distro), then add modprobe and/or
-ifenslave commands to the system's global init script. The name of
+`ip link` commands to the system's global init script. The name of
 the global init script differs; for sysconfig, it is
 /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local.
 
@@ -1183,8 +1174,8 @@ reboots, edit the appropriate file (/etc/init.d/boot.local or
 modprobe bonding mode=balance-alb miimon=100
 modprobe e100
 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up
-ifenslave bond0 eth0
-ifenslave bond0 eth1
+ip link set eth0 master bond0
+ip link set eth1 master bond0
 
 	Replace the example bonding module parameters and bond0
 network configuration (IP address, netmask, etc) with the appropriate
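The bonding.txt change above swaps "ifenslave bond0 eth0" for "ip link set eth0 master bond0". As a rough sketch of that mapping, the script below prints the iproute2 commands for attaching and detaching slaves without running them. The function names bond_enslave_cmds and bond_release_cmds are hypothetical, and using "nomaster" as the counterpart of "ifenslave -d" is an assumption about iproute2 that the patch itself does not spell out.

```shell
#!/bin/sh
# Hypothetical dry-run helpers: they only print commands, nothing is
# executed. Review the output, then run it as root to apply it.

# Counterpart of "ifenslave <bond> <slave>...": one command per slave,
# in the form used by the updated bonding.txt.
bond_enslave_cmds() {
    bond=$1
    shift
    for slave in "$@"; do
        echo "ip link set $slave master $bond"
    done
}

# Assumed counterpart of "ifenslave -d <bond> <slave>...": "nomaster"
# detaches a device from whatever master it currently has.
bond_release_cmds() {
    for slave in "$@"; do
        echo "ip link set $slave nomaster"
    done
}

bond_enslave_cmds bond0 eth0 eth1
bond_release_cmds eth1
```

Printing rather than executing keeps the sketch safe on a machine without a bond0 device; the interface names are the ones used in the documentation example.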
diff --git a/Documentation/networking/ifenslave.c b/Documentation/networking/ifenslave.c
deleted file mode 100644
index ac5debb2f16c..000000000000
--- a/Documentation/networking/ifenslave.c
+++ /dev/null
@@ -1,1105 +0,0 @@
-/* Mode: C;
- * ifenslave.c: Configure network interfaces for parallel routing.
- *
- * This program controls the Linux implementation of running multiple
- * network interfaces in parallel.
- *
- * Author: Donald Becker <becker@cesdis.gsfc.nasa.gov>
- * Copyright 1994-1996 Donald Becker
- *
- * This program is free software; you can redistribute it
- * and/or modify it under the terms of the GNU General Public
- * License as published by the Free Software Foundation.
- *
- * The author may be reached as becker@CESDIS.gsfc.nasa.gov, or C/O
- * Center of Excellence in Space Data and Information Sciences
- * Code 930.5, Goddard Space Flight Center, Greenbelt MD 20771
- *
- * Changes :
- * - 2000/10/02 Willy Tarreau <willy at meta-x.org> :
- * - few fixes. Master's MAC address is now correctly taken from
- * the first device when not previously set ;
- * - detach support : call BOND_RELEASE to detach an enslaved interface.
- * - give a mini-howto from command-line help : # ifenslave -h
- *
- * - 2001/02/16 Chad N. Tindel <ctindel at ieee dot org> :
- * - Master is now brought down before setting the MAC address. In
- * the 2.4 kernel you can't change the MAC address while the device is
- * up because you get EBUSY.
- *
- * - 2001/09/13 Takao Indoh <indou dot takao at jp dot fujitsu dot com>
- * - Added the ability to change the active interface on a mode 1 bond
- * at runtime.
- *
- * - 2001/10/23 Chad N. Tindel <ctindel at ieee dot org> :
- * - No longer set the MAC address of the master. The bond device will
- * take care of this itself
- * - Try the SIOC*** versions of the bonding ioctls before using the
- * old versions
- * - 2002/02/18 Erik Habbinga <erik_habbinga @ hp dot com> :
- * - ifr2.ifr_flags was not initialized in the hwaddr_notset case,
- * SIOCGIFFLAGS now called before hwaddr_notset test
- *
- * - 2002/10/31 Tony Cureington <tony.cureington * hp_com> :
- * - If the master does not have a hardware address when the first slave
- * is enslaved, the master is assigned the hardware address of that
- * slave - there is a comment in bonding.c stating "ifenslave takes
- * care of this now." This corrects the problem of slaves having
- * different hardware addresses in active-backup mode when
- * multiple interfaces are specified on a single ifenslave command
- * (ifenslave bond0 eth0 eth1).
- *
- * - 2003/03/18 - Tsippy Mendelson <tsippy.mendelson at intel dot com> and
- * Shmulik Hen <shmulik.hen at intel dot com>
- * - Moved setting the slave's mac address and openning it, from
- * the application to the driver. This enables support of modes
- * that need to use the unique mac address of each slave.
- * The driver also takes care of closing the slave and restoring its
- * original mac address upon release.
- * In addition, block possibility of enslaving before the master is up.
- * This prevents putting the system in an undefined state.
- *
- * - 2003/05/01 - Amir Noam <amir.noam at intel dot com>
- * - Added ABI version control to restore compatibility between
- * new/old ifenslave and new/old bonding.
- * - Prevent adding an adapter that is already a slave.
- * Fixes the problem of stalling the transmission and leaving
- * the slave in a down state.
- *
- * - 2003/05/01 - Shmulik Hen <shmulik.hen at intel dot com>
- * - Prevent enslaving if the bond device is down.
- * Fixes the problem of leaving the system in unstable state and
- * halting when trying to remove the module.
- * - Close socket on all abnormal exists.
- * - Add versioning scheme that follows that of the bonding driver.
- * current version is 1.0.0 as a base line.
- *
- * - 2003/05/22 - Jay Vosburgh <fubar at us dot ibm dot com>
- * - ifenslave -c was broken; it's now fixed
- * - Fixed problem with routes vanishing from master during enslave
- * processing.
- *
- * - 2003/05/27 - Amir Noam <amir.noam at intel dot com>
- * - Fix backward compatibility issues:
- * For drivers not using ABI versions, slave was set down while
- * it should be left up before enslaving.
- * Also, master was not set down and the default set_mac_address()
- * would fail and generate an error message in the system log.
- * - For opt_c: slave should not be set to the master's setting
- * while it is running. It was already set during enslave. To
- * simplify things, it is now handled separately.
- *
- * - 2003/12/01 - Shmulik Hen <shmulik.hen at intel dot com>
- * - Code cleanup and style changes
- * set version to 1.1.0
- */
-
-#define APP_VERSION "1.1.0"
-#define APP_RELDATE "December 1, 2003"
-#define APP_NAME "ifenslave"
-
-static char *version =
-APP_NAME ".c:v" APP_VERSION " (" APP_RELDATE ")\n"
-"o Donald Becker (becker@cesdis.gsfc.nasa.gov).\n"
-"o Detach support added on 2000/10/02 by Willy Tarreau (willy at meta-x.org).\n"
-"o 2.4 kernel support added on 2001/02/16 by Chad N. Tindel\n"
-" (ctindel at ieee dot org).\n";
-
-static const char *usage_msg =
-"Usage: ifenslave [-f] <master-if> <slave-if> [<slave-if>...]\n"
-" ifenslave -d <master-if> <slave-if> [<slave-if>...]\n"
-" ifenslave -c <master-if> <slave-if>\n"
-" ifenslave --help\n";
-
-static const char *help_msg =
-"\n"
-" To create a bond device, simply follow these three steps :\n"
-" - ensure that the required drivers are properly loaded :\n"
-" # modprobe bonding ; modprobe <3c59x|eepro100|pcnet32|tulip|...>\n"
-" - assign an IP address to the bond device :\n"
-" # ifconfig bond0 <addr> netmask <mask> broadcast <bcast>\n"
-" - attach all the interfaces you need to the bond device :\n"
-" # ifenslave [{-f|--force}] bond0 eth0 [eth1 [eth2]...]\n"
-" If bond0 didn't have a MAC address, it will take eth0's. Then, all\n"
-" interfaces attached AFTER this assignment will get the same MAC addr.\n"
-" (except for ALB/TLB modes)\n"
-"\n"
-" To set the bond device down and automatically release all the slaves :\n"
-" # ifconfig bond0 down\n"
-"\n"
-" To detach a dead interface without setting the bond device down :\n"
-" # ifenslave {-d|--detach} bond0 eth0 [eth1 [eth2]...]\n"
-"\n"
-" To change active slave :\n"
-" # ifenslave {-c|--change-active} bond0 eth0\n"
-"\n"
-" To show master interface info\n"
-" # ifenslave bond0\n"
-"\n"
-" To show all interfaces info\n"
-" # ifenslave {-a|--all-interfaces}\n"
-"\n"
-" To be more verbose\n"
-" # ifenslave {-v|--verbose} ...\n"
-"\n"
-" # ifenslave {-u|--usage} Show usage\n"
-" # ifenslave {-V|--version} Show version\n"
-" # ifenslave {-h|--help} This message\n"
-"\n";
-
-#include <unistd.h>
-#include <stdlib.h>
-#include <stdio.h>
-#include <ctype.h>
-#include <string.h>
-#include <errno.h>
-#include <fcntl.h>
-#include <getopt.h>
-#include <sys/types.h>
-#include <sys/socket.h>
-#include <sys/ioctl.h>
-#include <linux/if.h>
-#include <net/if_arp.h>
-#include <linux/if_ether.h>
-#include <linux/if_bonding.h>
-#include <linux/sockios.h>
-
-typedef unsigned long long u64; /* hack, so we may include kernel's ethtool.h */
-typedef __uint32_t u32; /* ditto */
-typedef __uint16_t u16; /* ditto */
-typedef __uint8_t u8; /* ditto */
-#include <linux/ethtool.h>
-
-struct option longopts[] = {
-	/* { name has_arg *flag val } */
-	{"all-interfaces", 0, 0, 'a'}, /* Show all interfaces. */
-	{"change-active", 0, 0, 'c'}, /* Change the active slave. */
-	{"detach", 0, 0, 'd'}, /* Detach a slave interface. */
-	{"force", 0, 0, 'f'}, /* Force the operation. */
-	{"help", 0, 0, 'h'}, /* Give help */
-	{"usage", 0, 0, 'u'}, /* Give usage */
-	{"verbose", 0, 0, 'v'}, /* Report each action taken. */
-	{"version", 0, 0, 'V'}, /* Emit version information. */
-	{ 0, 0, 0, 0}
-};
-
-/* Command-line flags. */
-unsigned int
-opt_a = 0, /* Show-all-interfaces flag. */
-opt_c = 0, /* Change-active-slave flag. */
-opt_d = 0, /* Detach a slave interface. */
-opt_f = 0, /* Force the operation. */
-opt_h = 0, /* Help */
-opt_u = 0, /* Usage */
-opt_v = 0, /* Verbose flag. */
-opt_V = 0; /* Version */
-
-int skfd = -1; /* AF_INET socket for ioctl() calls.*/
-int abi_ver = 0; /* userland - kernel ABI version */
-int hwaddr_set = 0; /* Master's hwaddr is set */
-int saved_errno;
-
-struct ifreq master_mtu, master_flags, master_hwaddr;
-struct ifreq slave_mtu, slave_flags, slave_hwaddr;
-
-struct dev_ifr {
-	struct ifreq *req_ifr;
-	char *req_name;
-	int req_type;
-};
-
-struct dev_ifr master_ifra[] = {
-	{&master_mtu, "SIOCGIFMTU", SIOCGIFMTU},
-	{&master_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS},
-	{&master_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR},
-	{NULL, "", 0}
-};
-
-struct dev_ifr slave_ifra[] = {
-	{&slave_mtu, "SIOCGIFMTU", SIOCGIFMTU},
-	{&slave_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS},
-	{&slave_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR},
-	{NULL, "", 0}
-};
-
-static void if_print(char *ifname);
-static int get_drv_info(char *master_ifname);
-static int get_if_settings(char *ifname, struct dev_ifr ifra[]);
-static int get_slave_flags(char *slave_ifname);
-static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr);
-static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr);
-static int set_slave_mtu(char *slave_ifname, int mtu);
-static int set_if_flags(char *ifname, short flags);
-static int set_if_up(char *ifname, short flags);
-static int set_if_down(char *ifname, short flags);
-static int clear_if_addr(char *ifname);
-static int set_if_addr(char *master_ifname, char *slave_ifname);
-static int change_active(char *master_ifname, char *slave_ifname);
-static int enslave(char *master_ifname, char *slave_ifname);
-static int release(char *master_ifname, char *slave_ifname);
-#define v_print(fmt, args...) \
-	if (opt_v) \
-		fprintf(stderr, fmt, ## args )
-
-int main(int argc, char *argv[])
-{
-	char **spp, *master_ifname, *slave_ifname;
-	int c, i, rv;
-	int res = 0;
-	int exclusive = 0;
-
-	while ((c = getopt_long(argc, argv, "acdfhuvV", longopts, 0)) != EOF) {
-		switch (c) {
-		case 'a': opt_a++; exclusive++; break;
-		case 'c': opt_c++; exclusive++; break;
-		case 'd': opt_d++; exclusive++; break;
-		case 'f': opt_f++; exclusive++; break;
-		case 'h': opt_h++; exclusive++; break;
-		case 'u': opt_u++; exclusive++; break;
-		case 'v': opt_v++; break;
-		case 'V': opt_V++; exclusive++; break;
-
-		case '?':
-			fprintf(stderr, "%s", usage_msg);
-			res = 2;
-			goto out;
-		}
-	}
-
-	/* options check */
-	if (exclusive > 1) {
-		fprintf(stderr, "%s", usage_msg);
-		res = 2;
-		goto out;
-	}
-
-	if (opt_v || opt_V) {
-		printf("%s", version);
-		if (opt_V) {
-			res = 0;
-			goto out;
-		}
-	}
-
-	if (opt_u) {
-		printf("%s", usage_msg);
-		res = 0;
-		goto out;
-	}
-
-	if (opt_h) {
-		printf("%s", usage_msg);
-		printf("%s", help_msg);
-		res = 0;
-		goto out;
-	}
-
-	/* Open a basic socket */
-	if ((skfd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
-		perror("socket");
-		res = 1;
-		goto out;
-	}
-
-	if (opt_a) {
-		if (optind == argc) {
-			/* No remaining args */
-			/* show all interfaces */
-			if_print((char *)NULL);
-			goto out;
-		} else {
-			/* Just show usage */
-			fprintf(stderr, "%s", usage_msg);
-			res = 2;
-			goto out;
-		}
-	}
-
-	/* Copy the interface name */
-	spp = argv + optind;
-	master_ifname = *spp++;
-
-	if (master_ifname == NULL) {
-		fprintf(stderr, "%s", usage_msg);
-		res = 2;
-		goto out;
-	}
-
-	/* exchange abi version with bonding module */
-	res = get_drv_info(master_ifname);
-	if (res) {
-		fprintf(stderr,
-			"Master '%s': Error: handshake with driver failed. "
-			"Aborting\n",
-			master_ifname);
-		goto out;
-	}
-
-	slave_ifname = *spp++;
-
-	if (slave_ifname == NULL) {
-		if (opt_d || opt_c) {
-			fprintf(stderr, "%s", usage_msg);
-			res = 2;
-			goto out;
-		}
-
-		/* A single arg means show the
-		 * configuration for this interface
-		 */
-		if_print(master_ifname);
-		goto out;
-	}
-
-	res = get_if_settings(master_ifname, master_ifra);
-	if (res) {
-		/* Probably a good reason not to go on */
-		fprintf(stderr,
-			"Master '%s': Error: get settings failed: %s. "
-			"Aborting\n",
-			master_ifname, strerror(res));
-		goto out;
-	}
-
-	/* check if master is indeed a master;
-	 * if not then fail any operation
-	 */
-	if (!(master_flags.ifr_flags & IFF_MASTER)) {
-		fprintf(stderr,
-			"Illegal operation; the specified interface '%s' "
-			"is not a master. Aborting\n",
-			master_ifname);
-		res = 1;
-		goto out;
-	}
-
-	/* check if master is up; if not then fail any operation */
-	if (!(master_flags.ifr_flags & IFF_UP)) {
-		fprintf(stderr,
-			"Illegal operation; the specified master interface "
-			"'%s' is not up.\n",
-			master_ifname);
-		res = 1;
-		goto out;
-	}
-
-	/* Only for enslaving */
-	if (!opt_c && !opt_d) {
-		sa_family_t master_family = master_hwaddr.ifr_hwaddr.sa_family;
-		unsigned char *hwaddr =
-			(unsigned char *)master_hwaddr.ifr_hwaddr.sa_data;
-
-		/* The family '1' is ARPHRD_ETHER for ethernet. */
-		if (master_family != 1 && !opt_f) {
-			fprintf(stderr,
-				"Illegal operation: The specified master "
-				"interface '%s' is not ethernet-like.\n "
-				"This program is designed to work with "
-				"ethernet-like network interfaces.\n "
-				"Use the '-f' option to force the "
-				"operation.\n",
-				master_ifname);
-			res = 1;
-			goto out;
-		}
-
-		/* Check master's hw addr */
-		for (i = 0; i < 6; i++) {
-			if (hwaddr[i] != 0) {
-				hwaddr_set = 1;
-				break;
-			}
-		}
-
-		if (hwaddr_set) {
-			v_print("current hardware address of master '%s' "
-				"is %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, "
-				"type %d\n",
-				master_ifname,
-				hwaddr[0], hwaddr[1],
-				hwaddr[2], hwaddr[3],
-				hwaddr[4], hwaddr[5],
-				master_family);
-		}
-	}
-
-	/* Accepts only one slave */
-	if (opt_c) {
-		/* change active slave */
-		res = get_slave_flags(slave_ifname);
-		if (res) {
-			fprintf(stderr,
-				"Slave '%s': Error: get flags failed. "
-				"Aborting\n",
-				slave_ifname);
-			goto out;
-		}
-		res = change_active(master_ifname, slave_ifname);
-		if (res) {
-			fprintf(stderr,
-				"Master '%s', Slave '%s': Error: "
-				"Change active failed\n",
-				master_ifname, slave_ifname);
-		}
-	} else {
-		/* Accept multiple slaves */
-		do {
-			if (opt_d) {
-				/* detach a slave interface from the master */
-				rv = get_slave_flags(slave_ifname);
-				if (rv) {
-					/* Can't work with this slave. */
-					/* remember the error and skip it*/
-					fprintf(stderr,
-						"Slave '%s': Error: get flags "
-						"failed. Skipping\n",
-						slave_ifname);
-					res = rv;
-					continue;
-				}
-				rv = release(master_ifname, slave_ifname);
-				if (rv) {
-					fprintf(stderr,
-						"Master '%s', Slave '%s': Error: "
-						"Release failed\n",
-						master_ifname, slave_ifname);
-					res = rv;
-				}
-			} else {
-				/* attach a slave interface to the master */
-				rv = get_if_settings(slave_ifname, slave_ifra);
-				if (rv) {
-					/* Can't work with this slave. */
-					/* remember the error and skip it*/
-					fprintf(stderr,
-						"Slave '%s': Error: get "
-						"settings failed: %s. "
-						"Skipping\n",
-						slave_ifname, strerror(rv));
-					res = rv;
-					continue;
-				}
-				rv = enslave(master_ifname, slave_ifname);
-				if (rv) {
-					fprintf(stderr,
-						"Master '%s', Slave '%s': Error: "
-						"Enslave failed\n",
-						master_ifname, slave_ifname);
-					res = rv;
-				}
-			}
-		} while ((slave_ifname = *spp++) != NULL);
-	}
-
-out:
-	if (skfd >= 0) {
-		close(skfd);
-	}
-
-	return res;
-}
-
502static short mif_flags;
503
504/* Get the interface configuration from the kernel. */
505static int if_getconfig(char *ifname)
506{
507 struct ifreq ifr;
508 int metric, mtu; /* Parameters of the master interface. */
509 struct sockaddr dstaddr, broadaddr, netmask;
510 unsigned char *hwaddr;
511
512 strcpy(ifr.ifr_name, ifname);
513 if (ioctl(skfd, SIOCGIFFLAGS, &ifr) < 0)
514 return -1;
515 mif_flags = ifr.ifr_flags;
516 printf("The result of SIOCGIFFLAGS on %s is %x.\n",
517 ifname, ifr.ifr_flags);
518
519 strcpy(ifr.ifr_name, ifname);
520 if (ioctl(skfd, SIOCGIFADDR, &ifr) < 0)
521 return -1;
522 printf("The result of SIOCGIFADDR is %2.2x.%2.2x.%2.2x.%2.2x.\n",
523 ifr.ifr_addr.sa_data[0], ifr.ifr_addr.sa_data[1],
524 ifr.ifr_addr.sa_data[2], ifr.ifr_addr.sa_data[3]);
525
526 strcpy(ifr.ifr_name, ifname);
527 if (ioctl(skfd, SIOCGIFHWADDR, &ifr) < 0)
528 return -1;
529
530 /* Gotta convert from 'char' to unsigned for printf(). */
531 hwaddr = (unsigned char *)ifr.ifr_hwaddr.sa_data;
532 printf("The result of SIOCGIFHWADDR is type %d "
533 "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n",
534 ifr.ifr_hwaddr.sa_family, hwaddr[0], hwaddr[1],
535 hwaddr[2], hwaddr[3], hwaddr[4], hwaddr[5]);
536
537 strcpy(ifr.ifr_name, ifname);
538 if (ioctl(skfd, SIOCGIFMETRIC, &ifr) < 0) {
539 metric = 0;
540 } else
541 metric = ifr.ifr_metric;
542 printf("The result of SIOCGIFMETRIC is %d\n", metric);
543
544 strcpy(ifr.ifr_name, ifname);
545 if (ioctl(skfd, SIOCGIFMTU, &ifr) < 0)
546 mtu = 0;
547 else
548 mtu = ifr.ifr_mtu;
549 printf("The result of SIOCGIFMTU is %d\n", mtu);
550
551 strcpy(ifr.ifr_name, ifname);
552 if (ioctl(skfd, SIOCGIFDSTADDR, &ifr) < 0) {
553 memset(&dstaddr, 0, sizeof(struct sockaddr));
554 } else
555 dstaddr = ifr.ifr_dstaddr;
556
557 strcpy(ifr.ifr_name, ifname);
558 if (ioctl(skfd, SIOCGIFBRDADDR, &ifr) < 0) {
559 memset(&broadaddr, 0, sizeof(struct sockaddr));
560 } else
561 broadaddr = ifr.ifr_broadaddr;
562
563 strcpy(ifr.ifr_name, ifname);
564 if (ioctl(skfd, SIOCGIFNETMASK, &ifr) < 0) {
565 memset(&netmask, 0, sizeof(struct sockaddr));
566 } else
567 netmask = ifr.ifr_netmask;
568
569 return 0;
570}
571
572static void if_print(char *ifname)
573{
574 char buff[1024];
575 struct ifconf ifc;
576 struct ifreq *ifr;
577 int i;
578
579 if (ifname == (char *)NULL) {
580 ifc.ifc_len = sizeof(buff);
581 ifc.ifc_buf = buff;
582 if (ioctl(skfd, SIOCGIFCONF, &ifc) < 0) {
583 perror("SIOCGIFCONF failed");
584 return;
585 }
586
587 ifr = ifc.ifc_req;
588 for (i = ifc.ifc_len / sizeof(struct ifreq); --i >= 0; ifr++) {
589 if (if_getconfig(ifr->ifr_name) < 0) {
590 fprintf(stderr,
591 "%s: unknown interface.\n",
592 ifr->ifr_name);
593 continue;
594 }
595
596 if (((mif_flags & IFF_UP) == 0) && !opt_a) continue;
597 /*ife_print(&ife);*/
598 }
599 } else {
600 if (if_getconfig(ifname) < 0) {
601 fprintf(stderr,
602 "%s: unknown interface.\n", ifname);
603 }
604 }
605}
606
607static int get_drv_info(char *master_ifname)
608{
609 struct ifreq ifr;
610 struct ethtool_drvinfo info;
611 char *endptr;
612
613 memset(&ifr, 0, sizeof(ifr));
614 strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
615 ifr.ifr_data = (caddr_t)&info;
616
617 info.cmd = ETHTOOL_GDRVINFO;
618 strncpy(info.driver, "ifenslave", 32);
619 snprintf(info.fw_version, 32, "%d", BOND_ABI_VERSION);
620
621 if (ioctl(skfd, SIOCETHTOOL, &ifr) < 0) {
622 if (errno == EOPNOTSUPP) {
623 goto out;
624 }
625
626 saved_errno = errno;
627 v_print("Master '%s': Error: get bonding info failed %s\n",
628 master_ifname, strerror(saved_errno));
629 return 1;
630 }
631
632 abi_ver = strtoul(info.fw_version, &endptr, 0);
633 if (*endptr) {
634 v_print("Master '%s': Error: got invalid string as an ABI "
635 "version from the bonding module\n",
636 master_ifname);
637 return 1;
638 }
639
640out:
641 v_print("ABI ver is %d\n", abi_ver);
642
643 return 0;
644}
645
646static int change_active(char *master_ifname, char *slave_ifname)
647{
648 struct ifreq ifr;
649 int res = 0;
650
651 if (!(slave_flags.ifr_flags & IFF_SLAVE)) {
652 fprintf(stderr,
653 "Illegal operation: The specified slave interface "
654 "'%s' is not a slave\n",
655 slave_ifname);
656 return 1;
657 }
658
659 strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
660 strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ);
661 if ((ioctl(skfd, SIOCBONDCHANGEACTIVE, &ifr) < 0) &&
662 (ioctl(skfd, BOND_CHANGE_ACTIVE_OLD, &ifr) < 0)) {
663 saved_errno = errno;
664 v_print("Master '%s': Error: SIOCBONDCHANGEACTIVE failed: "
665 "%s\n",
666 master_ifname, strerror(saved_errno));
667 res = 1;
668 }
669
670 return res;
671}
672
673static int enslave(char *master_ifname, char *slave_ifname)
674{
675 struct ifreq ifr;
676 int res = 0;
677
678 if (slave_flags.ifr_flags & IFF_SLAVE) {
679 fprintf(stderr,
680 "Illegal operation: The specified slave interface "
681 "'%s' is already a slave\n",
682 slave_ifname);
683 return 1;
684 }
685
686 res = set_if_down(slave_ifname, slave_flags.ifr_flags);
687 if (res) {
688 fprintf(stderr,
689 "Slave '%s': Error: bring interface down failed\n",
690 slave_ifname);
691 return res;
692 }
693
694 if (abi_ver < 2) {
695 /* Older bonding versions would panic if the slave has no IP
696 * address, so get the IP setting from the master.
697 */
698 set_if_addr(master_ifname, slave_ifname);
699 } else {
700 res = clear_if_addr(slave_ifname);
701 if (res) {
702 fprintf(stderr,
703 "Slave '%s': Error: clear address failed\n",
704 slave_ifname);
705 return res;
706 }
707 }
708
709 if (master_mtu.ifr_mtu != slave_mtu.ifr_mtu) {
710 res = set_slave_mtu(slave_ifname, master_mtu.ifr_mtu);
711 if (res) {
712 fprintf(stderr,
713 "Slave '%s': Error: set MTU failed\n",
714 slave_ifname);
715 return res;
716 }
717 }
718
719 if (hwaddr_set) {
720 /* Master already has an hwaddr
721 * so set its hwaddr to the slave
722 */
723 if (abi_ver < 1) {
724 /* The driver is using an old ABI, so
725 * the application sets the slave's
726 * hwaddr
727 */
728 res = set_slave_hwaddr(slave_ifname,
729 &(master_hwaddr.ifr_hwaddr));
730 if (res) {
731 fprintf(stderr,
732 "Slave '%s': Error: set hw address "
733 "failed\n",
734 slave_ifname);
735 goto undo_mtu;
736 }
737
738 /* For old ABI the application needs to bring the
739 * slave back up
740 */
741 res = set_if_up(slave_ifname, slave_flags.ifr_flags);
742 if (res) {
743 fprintf(stderr,
744 "Slave '%s': Error: bring interface "
745 "up failed\n",
746 slave_ifname);
747 goto undo_slave_mac;
748 }
749 }
750 /* The driver is using a new ABI,
751 * so the driver takes care of setting
752 * the slave's hwaddr and bringing
753 * it up again
754 */
755 } else {
756 /* No hwaddr for master yet, so
757 * set the slave's hwaddr to it
758 */
759 if (abi_ver < 1) {
760 /* For old ABI, the master needs to be
761 * down before setting its hwaddr
762 */
763 res = set_if_down(master_ifname, master_flags.ifr_flags);
764 if (res) {
765 fprintf(stderr,
766 "Master '%s': Error: bring interface "
767 "down failed\n",
768 master_ifname);
769 goto undo_mtu;
770 }
771 }
772
773 res = set_master_hwaddr(master_ifname,
774 &(slave_hwaddr.ifr_hwaddr));
775 if (res) {
776 fprintf(stderr,
777 "Master '%s': Error: set hw address "
778 "failed\n",
779 master_ifname);
780 goto undo_mtu;
781 }
782
783 if (abi_ver < 1) {
784 /* For old ABI, bring the master
785 * back up
786 */
787 res = set_if_up(master_ifname, master_flags.ifr_flags);
788 if (res) {
789 fprintf(stderr,
790 "Master '%s': Error: bring interface "
791 "up failed\n",
792 master_ifname);
793 goto undo_master_mac;
794 }
795 }
796
797 hwaddr_set = 1;
798 }
799
800 /* Do the real thing */
801 strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
802 strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ);
803 if ((ioctl(skfd, SIOCBONDENSLAVE, &ifr) < 0) &&
804 (ioctl(skfd, BOND_ENSLAVE_OLD, &ifr) < 0)) {
805 saved_errno = errno;
806 v_print("Master '%s': Error: SIOCBONDENSLAVE failed: %s\n",
807 master_ifname, strerror(saved_errno));
808 res = 1;
809 }
810
811 if (res) {
812 goto undo_master_mac;
813 }
814
815 return 0;
816
817/* rollback (best effort) */
818undo_master_mac:
819 set_master_hwaddr(master_ifname, &(master_hwaddr.ifr_hwaddr));
820 hwaddr_set = 0;
821 goto undo_mtu;
822undo_slave_mac:
823 set_slave_hwaddr(slave_ifname, &(slave_hwaddr.ifr_hwaddr));
824undo_mtu:
825 set_slave_mtu(slave_ifname, slave_mtu.ifr_mtu);
826 return res;
827}
828
829static int release(char *master_ifname, char *slave_ifname)
830{
831 struct ifreq ifr;
832 int res = 0;
833
834 if (!(slave_flags.ifr_flags & IFF_SLAVE)) {
835 fprintf(stderr,
836 "Illegal operation: The specified slave interface "
837 "'%s' is not a slave\n",
838 slave_ifname);
839 return 1;
840 }
841
842 strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
843 strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ);
844 if ((ioctl(skfd, SIOCBONDRELEASE, &ifr) < 0) &&
845 (ioctl(skfd, BOND_RELEASE_OLD, &ifr) < 0)) {
846 saved_errno = errno;
847 v_print("Master '%s': Error: SIOCBONDRELEASE failed: %s\n",
848 master_ifname, strerror(saved_errno));
849 return 1;
850 } else if (abi_ver < 1) {
851 /* The driver is using an old ABI, so we'll set the interface
852 * down to avoid any conflicts due to same MAC/IP
853 */
854 res = set_if_down(slave_ifname, slave_flags.ifr_flags);
855 if (res) {
856 fprintf(stderr,
857 "Slave '%s': Error: bring interface "
858 "down failed\n",
859 slave_ifname);
860 }
861 }
862
863 /* set to default mtu */
864 set_slave_mtu(slave_ifname, 1500);
865
866 return res;
867}
868
869static int get_if_settings(char *ifname, struct dev_ifr ifra[])
870{
871 int i;
872 int res = 0;
873
874 for (i = 0; ifra[i].req_ifr; i++) {
875 strncpy(ifra[i].req_ifr->ifr_name, ifname, IFNAMSIZ);
876 res = ioctl(skfd, ifra[i].req_type, ifra[i].req_ifr);
877 if (res < 0) {
878 saved_errno = errno;
879 v_print("Interface '%s': Error: %s failed: %s\n",
880 ifname, ifra[i].req_name,
881 strerror(saved_errno));
882
883 return saved_errno;
884 }
885 }
886
887 return 0;
888}
889
890static int get_slave_flags(char *slave_ifname)
891{
892 int res = 0;
893
894 strncpy(slave_flags.ifr_name, slave_ifname, IFNAMSIZ);
895 res = ioctl(skfd, SIOCGIFFLAGS, &slave_flags);
896 if (res < 0) {
897 saved_errno = errno;
898 v_print("Slave '%s': Error: SIOCGIFFLAGS failed: %s\n",
899 slave_ifname, strerror(saved_errno));
900 } else {
901 v_print("Slave %s: flags %04X.\n",
902 slave_ifname, slave_flags.ifr_flags);
903 }
904
905 return res;
906}
907
908static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr)
909{
910 unsigned char *addr = (unsigned char *)hwaddr->sa_data;
911 struct ifreq ifr;
912 int res = 0;
913
914 strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
915 memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr));
916 res = ioctl(skfd, SIOCSIFHWADDR, &ifr);
917 if (res < 0) {
918 saved_errno = errno;
919 v_print("Master '%s': Error: SIOCSIFHWADDR failed: %s\n",
920 master_ifname, strerror(saved_errno));
921 return res;
922 } else {
923 v_print("Master '%s': hardware address set to "
924 "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n",
925 master_ifname, addr[0], addr[1], addr[2],
926 addr[3], addr[4], addr[5]);
927 }
928
929 return res;
930}
931
932static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr)
933{
934 unsigned char *addr = (unsigned char *)hwaddr->sa_data;
935 struct ifreq ifr;
936 int res = 0;
937
938 strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ);
939 memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr));
940 res = ioctl(skfd, SIOCSIFHWADDR, &ifr);
941 if (res < 0) {
942 saved_errno = errno;
943
944 v_print("Slave '%s': Error: SIOCSIFHWADDR failed: %s\n",
945 slave_ifname, strerror(saved_errno));
946
947 if (saved_errno == EBUSY) {
948 v_print(" The device is busy: it must be idle "
949 "before running this command.\n");
950 } else if (saved_errno == EOPNOTSUPP) {
951 v_print(" The device does not support setting "
952 "the MAC address.\n"
953 " Your kernel likely does not support slave "
954 "devices.\n");
955 } else if (saved_errno == EINVAL) {
956 v_print(" The device's address type does not match "
957 "the master's address type.\n");
958 }
959 return res;
960 } else {
961 v_print("Slave '%s': hardware address set to "
962 "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n",
963 slave_ifname, addr[0], addr[1], addr[2],
964 addr[3], addr[4], addr[5]);
965 }
966
967 return res;
968}
969
970static int set_slave_mtu(char *slave_ifname, int mtu)
971{
972 struct ifreq ifr;
973 int res = 0;
974
975 ifr.ifr_mtu = mtu;
976 strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ);
977
978 res = ioctl(skfd, SIOCSIFMTU, &ifr);
979 if (res < 0) {
980 saved_errno = errno;
981 v_print("Slave '%s': Error: SIOCSIFMTU failed: %s\n",
982 slave_ifname, strerror(saved_errno));
983 } else {
984 v_print("Slave '%s': MTU set to %d.\n", slave_ifname, mtu);
985 }
986
987 return res;
988}
989
990static int set_if_flags(char *ifname, short flags)
991{
992 struct ifreq ifr;
993 int res = 0;
994
995 ifr.ifr_flags = flags;
996 strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
997
998 res = ioctl(skfd, SIOCSIFFLAGS, &ifr);
999 if (res < 0) {
1000 saved_errno = errno;
1001 v_print("Interface '%s': Error: SIOCSIFFLAGS failed: %s\n",
1002 ifname, strerror(saved_errno));
1003 } else {
1004 v_print("Interface '%s': flags set to %04X.\n", ifname, flags);
1005 }
1006
1007 return res;
1008}
1009
1010static int set_if_up(char *ifname, short flags)
1011{
1012 return set_if_flags(ifname, flags | IFF_UP);
1013}
1014
1015static int set_if_down(char *ifname, short flags)
1016{
1017 return set_if_flags(ifname, flags & ~IFF_UP);
1018}
1019
1020static int clear_if_addr(char *ifname)
1021{
1022 struct ifreq ifr;
1023 int res = 0;
1024
1025 strncpy(ifr.ifr_name, ifname, IFNAMSIZ);
1026 ifr.ifr_addr.sa_family = AF_INET;
1027 memset(ifr.ifr_addr.sa_data, 0, sizeof(ifr.ifr_addr.sa_data));
1028
1029 res = ioctl(skfd, SIOCSIFADDR, &ifr);
1030 if (res < 0) {
1031 saved_errno = errno;
1032 v_print("Interface '%s': Error: SIOCSIFADDR failed: %s\n",
1033 ifname, strerror(saved_errno));
1034 } else {
1035 v_print("Interface '%s': address cleared\n", ifname);
1036 }
1037
1038 return res;
1039}
1040
1041static int set_if_addr(char *master_ifname, char *slave_ifname)
1042{
1043 struct ifreq ifr;
1044 int res;
1045 unsigned char *ipaddr;
1046 int i;
1047 struct {
1048 char *req_name;
1049 char *desc;
1050 int g_ioctl;
1051 int s_ioctl;
1052 } ifra[] = {
1053 {"IFADDR", "addr", SIOCGIFADDR, SIOCSIFADDR},
1054 {"DSTADDR", "destination addr", SIOCGIFDSTADDR, SIOCSIFDSTADDR},
1055 {"BRDADDR", "broadcast addr", SIOCGIFBRDADDR, SIOCSIFBRDADDR},
1056 {"NETMASK", "netmask", SIOCGIFNETMASK, SIOCSIFNETMASK},
1057 {NULL, NULL, 0, 0},
1058 };
1059
1060 for (i = 0; ifra[i].req_name; i++) {
1061 strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ);
1062 res = ioctl(skfd, ifra[i].g_ioctl, &ifr);
1063 if (res < 0) {
1064 int saved_errno = errno;
1065
1066 v_print("Interface '%s': Error: SIOCG%s failed: %s\n",
1067 master_ifname, ifra[i].req_name,
1068 strerror(saved_errno));
1069
1070 ifr.ifr_addr.sa_family = AF_INET;
1071 memset(ifr.ifr_addr.sa_data, 0,
1072 sizeof(ifr.ifr_addr.sa_data));
1073 }
1074
1075 strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ);
1076 res = ioctl(skfd, ifra[i].s_ioctl, &ifr);
1077 if (res < 0) {
1078 int saved_errno = errno;
1079
1080 v_print("Interface '%s': Error: SIOCS%s failed: %s\n",
1081 slave_ifname, ifra[i].req_name,
1082 strerror(saved_errno));
1083
1084 }
1085
1086 ipaddr = (unsigned char *)ifr.ifr_addr.sa_data;
1087 v_print("Interface '%s': set IP %s to %d.%d.%d.%d\n",
1088 slave_ifname, ifra[i].desc,
1089 ipaddr[0], ipaddr[1], ipaddr[2], ipaddr[3]);
1090 }
1091
1092 return 0;
1093}
1094
1095/*
1096 * Local variables:
1097 * version-control: t
1098 * kept-new-versions: 5
1099 * c-indent-level: 4
1100 * c-basic-offset: 4
1101 * tab-width: 4
1102 * compile-command: "gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave"
1103 * End:
1104 */
1105
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index aa68f3c630c0..10742902146f 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -685,6 +685,15 @@ ip_dynaddr - BOOLEAN
 	occurs.
 	Default: 0
 
+ip_early_demux - BOOLEAN
+	Optimize input packet processing down to one demux for
+	certain kinds of local sockets.  Currently we only do this
+	for established TCP sockets.
+
+	It may add an additional cost for pure routing workloads that
+	reduces overall throughput, in such case you should disable it.
+	Default: 1
+
 icmp_echo_ignore_all - BOOLEAN
 	If set non-zero, then the kernel will ignore all ICMP ECHO
 	requests sent to it.
@@ -729,7 +738,7 @@ icmp_ignore_bogus_error_responses - BOOLEAN
 	frames.  Such violations are normally logged via a kernel warning.
 	If this is set to TRUE, the kernel will not give such warnings, which
 	will avoid log file clutter.
-	Default: FALSE
+	Default: 1
 
 icmp_errors_use_inbound_ifaddr - BOOLEAN
 
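
Note on the ip_early_demux entry added above: the text says the option should be disabled for pure routing workloads. A minimal sketch of how a program might toggle such a boolean sysctl through procfs follows. The helper names sysctl_write() and sysctl_read() are hypothetical and not part of this commit; the sketch only assumes that the option is exposed as a plain text file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write a short value string to a sysctl-style
 * file. Returns 0 on success and -1 on failure.
 */
static int sysctl_write(const char *path, const char *value)
{
	FILE *f;
	int ok;

	assert(path != NULL && value != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fputs(value, f) >= 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}

/* Hypothetical helper: read the first line of a sysctl-style file into
 * buf and strip the trailing newline. Returns 0 on success, -1 on failure.
 */
static int sysctl_read(const char *path, char *buf, size_t len)
{
	FILE *f;

	assert(path != NULL && buf != NULL && len > 0);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, (int)len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';
	return 0;
}
```

On a router one would presumably call sysctl_write("/proc/sys/net/ipv4/ip_early_demux", "0\n") as root; the path is passed in as a parameter so the helpers can be exercised against any ordinary file.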
diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt
index 9573d0c48c6e..7a3c04729591 100644
--- a/Documentation/networking/ipvs-sysctl.txt
+++ b/Documentation/networking/ipvs-sysctl.txt
@@ -181,6 +181,19 @@ snat_reroute - BOOLEAN
 	always be the same as the original route so it is an optimisation
 	to disable snat_reroute and avoid the recalculation.
 
+sync_persist_mode - INTEGER
+	default 0
+
+	Controls the synchronisation of connections when using persistence
+
+	0: All types of connections are synchronised
+	1: Attempt to reduce the synchronisation traffic depending on
+	the connection type. For persistent services avoid synchronisation
+	for normal connections, do it only for persistence templates.
+	In such case, for TCP and SCTP it may need enabling sloppy_tcp and
+	sloppy_sctp flags on backup servers. For non-persistent services
+	such optimization is not applied, mode 0 is assumed.
+
 sync_version - INTEGER
 	default 1
 
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
index 9bd0f5211e9a..533378839546 100644
--- a/Documentation/networking/netlink_mmap.txt
+++ b/Documentation/networking/netlink_mmap.txt
@@ -114,7 +114,7 @@ Some parameters are constrained, specifically:
 - nm_frame_nr must equal the actual number of frames as specified above.
 
 When the kernel can't allocate physically continuous memory for a ring block,
-it will fall back to use physically discontinous memory. This might affect
+it will fall back to use physically discontinuous memory. This might affect
 performance negatively, in order to avoid this the nm_frame_size parameter
 should be chosen to be as small as possible for the required frame size and
 the number of blocks should be increased instead.
@@ -274,9 +274,9 @@ This example assumes some ring parameters of the ring setup are available.
 		/* Get next frame header */
 		hdr = rx_ring + frame_offset;
 
-		if (hdr->nm_status == NL_MMAP_STATUS_VALID)
+		if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
 			/* Regular memory mapped frame */
-			nlh = (void *hdr) + NL_MMAP_HDRLEN;
+			nlh = (void *)hdr + NL_MMAP_HDRLEN;
 			len = hdr->nm_len;
 
 			/* Release empty message immediately. May happen
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 23dd80e82b8e..8572796b1eb6 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -704,6 +704,12 @@ So it seems to be a good candidate to be used with packet fanout.
 Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
 it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
 
+/* Written from scratch, but kernel-to-user space API usage
+ * dissected from lolpcap:
+ *  Copyright 2011, Chetan Loke <loke.chetan@gmail.com>
+ *  License: GPL, version 2.0
+ */
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <stdint.h>
@@ -722,27 +728,6 @@ it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
 #include <linux/if_ether.h>
 #include <linux/ip.h>
 
-#define BLOCK_SIZE		(1 << 22)
-#define FRAME_SIZE		2048
-
-#define NUM_BLOCKS		64
-#define NUM_FRAMES		((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE)
-
-#define BLOCK_RETIRE_TOV_IN_MS	64
-#define BLOCK_PRIV_AREA_SZ	13
-
-#define ALIGN_8(x)		(((x) + 8 - 1) & ~(8 - 1))
-
-#define BLOCK_STATUS(x)		((x)->h1.block_status)
-#define BLOCK_NUM_PKTS(x)	((x)->h1.num_pkts)
-#define BLOCK_O2FP(x)		((x)->h1.offset_to_first_pkt)
-#define BLOCK_LEN(x)		((x)->h1.blk_len)
-#define BLOCK_SNUM(x)		((x)->h1.seq_num)
-#define BLOCK_O2PRIV(x)		((x)->offset_to_priv)
-#define BLOCK_PRIV(x)		((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x)))
-#define BLOCK_HDR_LEN		(ALIGN_8(sizeof(struct block_desc)))
-#define BLOCK_PLUS_PRIV(sz_pri)	(BLOCK_HDR_LEN + ALIGN_8((sz_pri)))
-
 #ifndef likely
 # define likely(x)		__builtin_expect(!!(x), 1)
 #endif
@@ -765,7 +750,7 @@ struct ring {
 static unsigned long packets_total = 0, bytes_total = 0;
 static sig_atomic_t sigint = 0;
 
-void sighandler(int num)
+static void sighandler(int num)
 {
 	sigint = 1;
 }
@@ -774,6 +759,8 @@ static int setup_socket(struct ring *ring, char *netdev)
 {
 	int err, i, fd, v = TPACKET_V3;
 	struct sockaddr_ll ll;
+	unsigned int blocksiz = 1 << 22, framesiz = 1 << 11;
+	unsigned int blocknum = 64;
 
 	fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
 	if (fd < 0) {
@@ -788,13 +775,12 @@ static int setup_socket(struct ring *ring, char *netdev)
 	}
 
 	memset(&ring->req, 0, sizeof(ring->req));
-	ring->req.tp_block_size = BLOCK_SIZE;
-	ring->req.tp_frame_size = FRAME_SIZE;
-	ring->req.tp_block_nr = NUM_BLOCKS;
-	ring->req.tp_frame_nr = NUM_FRAMES;
-	ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS;
-	ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ;
-	ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH;
+	ring->req.tp_block_size = blocksiz;
+	ring->req.tp_frame_size = framesiz;
+	ring->req.tp_block_nr = blocknum;
+	ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz;
+	ring->req.tp_retire_blk_tov = 60;
+	ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH;
 
 	err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
 			 sizeof(ring->req));
@@ -804,8 +790,7 @@ static int setup_socket(struct ring *ring, char *netdev)
 	}
 
 	ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
-			 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
-			 fd, 0);
+			 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0);
 	if (ring->map == MAP_FAILED) {
 		perror("mmap");
 		exit(1);
@@ -835,58 +820,6 @@ static int setup_socket(struct ring *ring, char *netdev)
 	return fd;
 }
 
-#ifdef __checked
-static uint64_t prev_block_seq_num = 0;
-
-void assert_block_seq_num(struct block_desc *pbd)
-{
-	if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) {
-		printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != "
-		       "actual seq:%"PRIu64"\n", prev_block_seq_num,
-		       prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd));
-		exit(1);
-	}
-
-	prev_block_seq_num = BLOCK_SNUM(pbd);
-}
-
-static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
-{
-	if (BLOCK_NUM_PKTS(pbd)) {
-		if (unlikely(bytes != BLOCK_LEN(pbd))) {
-			printf("block:%u with %upackets, expected len:%u != actual len:%u\n",
-			       block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd));
-			exit(1);
-		}
-	} else {
-		if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) {
-			printf("block:%u, expected len:%lu != actual len:%u\n",
-			       block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd));
-			exit(1);
-		}
-	}
-}
-
-static void assert_block_header(struct block_desc *pbd, const int block_num)
-{
-	uint32_t block_status = BLOCK_STATUS(pbd);
-
-	if (unlikely((block_status & TP_STATUS_USER) == 0)) {
-		printf("block:%u, not in TP_STATUS_USER\n", block_num);
-		exit(1);
-	}
-
-	assert_block_seq_num(pbd);
-}
-#else
-static inline void assert_block_header(struct block_desc *pbd, const int block_num)
-{
-}
-static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
-{
-}
-#endif
-
 static void display(struct tpacket3_hdr *ppd)
 {
 	struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
@@ -916,37 +849,27 @@ static void display(struct tpacket3_hdr *ppd)
 
 static void walk_block(struct block_desc *pbd, const int block_num)
 {
-	int num_pkts = BLOCK_NUM_PKTS(pbd), i;
+	int num_pkts = pbd->h1.num_pkts, i;
 	unsigned long bytes = 0;
-	unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ);
 	struct tpacket3_hdr *ppd;
 
-	assert_block_header(pbd, block_num);
-
-	ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd));
+	ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd +
+				       pbd->h1.offset_to_first_pkt);
 	for (i = 0; i < num_pkts; ++i) {
 		bytes += ppd->tp_snaplen;
-		if (ppd->tp_next_offset)
-			bytes_with_padding += ppd->tp_next_offset;
-		else
-			bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac);
-
 		display(ppd);
 
-		ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset);
-		__sync_synchronize();
+		ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd +
+					       ppd->tp_next_offset);
 	}
 
-	assert_block_len(pbd, bytes_with_padding, block_num);
-
 	packets_total += num_pkts;
 	bytes_total += bytes;
 }
 
-void flush_block(struct block_desc *pbd)
+static void flush_block(struct block_desc *pbd)
 {
-	BLOCK_STATUS(pbd) = TP_STATUS_KERNEL;
-	__sync_synchronize();
+	pbd->h1.block_status = TP_STATUS_KERNEL;
 }
 
 static void teardown_socket(struct ring *ring, int fd)
@@ -962,7 +885,7 @@ int main(int argc, char **argp)
 	socklen_t len;
 	struct ring ring;
 	struct pollfd pfd;
-	unsigned int block_num = 0;
+	unsigned int block_num = 0, blocks = 64;
 	struct block_desc *pbd;
 	struct tpacket_stats_v3 stats;
 
@@ -984,15 +907,15 @@ int main(int argc, char **argp)
 
 	while (likely(!sigint)) {
 		pbd = (struct block_desc *) ring.rd[block_num].iov_base;
-retry_block:
-		if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) {
+
+		if ((pbd->h1.block_status & TP_STATUS_USER) == 0) {
 			poll(&pfd, 1, -1);
-			goto retry_block;
+			continue;
 		}
 
 		walk_block(pbd, block_num);
 		flush_block(pbd);
-		block_num = (block_num + 1) % NUM_BLOCKS;
+		block_num = (block_num + 1) % blocks;
 	}
 
 	len = sizeof(stats);
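
Note on the walk_block() change above: after this patch the example locates the first packet through the block header's offset_to_first_pkt field and advances through tp_next_offset, without the removed padding checks. The traversal pattern can be sketched in isolation as follows. The structs demo_block_hdr and demo_pkt_hdr are simplified, hypothetical stand-ins for the kernel's block descriptor and struct tpacket3_hdr (the real layouts carry more fields), used here so the sketch compiles without kernel headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for the block descriptor (assumption: only the
 * two fields that the walk needs are modelled).
 */
struct demo_block_hdr {
	uint32_t num_pkts;
	uint32_t offset_to_first_pkt;
};

/* Simplified stand-in for struct tpacket3_hdr. */
struct demo_pkt_hdr {
	uint32_t next_offset;	/* distance to the next packet header */
	uint32_t snaplen;	/* captured length of this packet */
};

/* Walk all packets in one block: start at offset_to_first_pkt, then
 * follow next_offset num_pkts times. Returns the sum of snaplen and
 * stores the packet count in *pkts.
 */
static unsigned long demo_walk_block(const uint8_t *block, unsigned int *pkts)
{
	struct demo_block_hdr bh;
	struct demo_pkt_hdr ph;
	const uint8_t *p;
	unsigned long bytes = 0;
	uint32_t i;

	assert(block != NULL && pkts != NULL);
	memcpy(&bh, block, sizeof(bh));	/* memcpy avoids unaligned reads */
	p = block + bh.offset_to_first_pkt;
	for (i = 0; i < bh.num_pkts; i++) {
		memcpy(&ph, p, sizeof(ph));
		bytes += ph.snaplen;
		p += ph.next_offset;
	}
	*pkts = bh.num_pkts;
	return bytes;
}
```

The loop is bounded by the packet count rather than by a sentinel offset, so a zero next_offset on the last packet is never dereferenced.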
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index 579994afbe06..ca6977f5b2ed 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -163,6 +163,64 @@ and unnecessary. If there are fewer hardware queues than CPUs, then
163RPS might be beneficial if the rps_cpus for each queue are the ones that 163RPS might be beneficial if the rps_cpus for each queue are the ones that
164share the same memory domain as the interrupting CPU for that queue. 164share the same memory domain as the interrupting CPU for that queue.
165 165
166==== RPS Flow Limit
167
168RPS scales kernel receive processing across CPUs without introducing
169reordering. The trade-off to sending all packets from the same flow
170to the same CPU is CPU load imbalance if flows vary in packet rate.
171In the extreme case a single flow dominates traffic. Especially on
172common server workloads with many concurrent connections, such
173behavior indicates a problem such as a misconfiguration or spoofed
174source Denial of Service attack.
175
176Flow Limit is an optional RPS feature that prioritizes small flows
177during CPU contention by dropping packets from large flows slightly
178ahead of those from small flows. It is active only when an RPS or RFS
179destination CPU approaches saturation. Once a CPU's input packet
180queue exceeds half the maximum queue length (as set by sysctl
181net.core.netdev_max_backlog), the kernel starts a per-flow packet
182count over the last 256 packets. If a flow exceeds a set ratio (by
183default, half) of these packets when a new packet arrives, then the
184new packet is dropped. Packets from other flows are still only
185dropped once the input packet queue reaches netdev_max_backlog.
186No packets are dropped when the input packet queue length is below
187the threshold, so flow limit does not sever connections outright:
188even large flows maintain connectivity.
189
190== Interface
191
192Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not
193turned on. It is implemented for each CPU independently (to avoid lock
194and cache contention) and toggled per CPU by setting the relevant bit
195in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU
196bitmap interface as rps_cpus (see above) when accessed through procfs:
197
198 /proc/sys/net/core/flow_limit_cpu_bitmap
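
As a rough illustration of the bitmap format, the following C sketch builds the hexadecimal mask for a list of CPU ids. The helper names cpu_list_to_mask and format_cpu_mask, and the 64-CPU limit, are assumptions made for this example and are not part of the kernel interface.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: set bit N for each CPU id N in the list.
 * Assumes at most 64 CPUs so the mask fits in one uint64_t; ids
 * outside that range are ignored. */
static uint64_t cpu_list_to_mask(const int *cpus, int n)
{
	uint64_t mask = 0;

	for (int i = 0; i < n; i++)
		if (cpus[i] >= 0 && cpus[i] < 64)
			mask |= (uint64_t)1 << cpus[i];
	return mask;
}

/* Format the mask as lowercase hex into buf; returns snprintf's result. */
static int format_cpu_mask(uint64_t mask, char *buf, size_t len)
{
	return snprintf(buf, len, "%llx", (unsigned long long)mask);
}
```

For CPUs 0 through 3 the mask is f, which is the value one would write to the procfs file to enable flow limit on those CPUs. Systems with many CPUs may expect the mask split into comma-separated 32-bit groups, which this sketch does not produce.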
199
200Per-flow rate is calculated by hashing each packet into a hashtable
201bucket and incrementing a per-bucket counter. The hash function is
202the same one that selects a CPU in RPS, but as the number of buckets can
203be much larger than the number of CPUs, flow limit has finer-grained
204identification of large flows and fewer false positives. The default
205table has 4096 buckets. This value can be modified through sysctl
206
207 net.core.flow_limit_table_len
208
209The value is only consulted when a new table is allocated. Modifying
210it does not update active tables.
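
The per-flow accounting described in this section can be sketched in user-space C as follows. This is a minimal sketch under stated assumptions, not the kernel implementation: the struct and function names are hypothetical, the flow hash is passed in by the caller, and the table size is fixed at the default of 4096 buckets.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define HISTORY_LEN 256   /* flow history length, from the text */
#define NUM_BUCKETS 4096  /* default table size, from the text */

/* Hypothetical per-CPU state; field names are illustrative only. */
struct flow_limit_sketch {
	uint16_t history[HISTORY_LEN]; /* bucket of each recent packet */
	uint16_t count[NUM_BUCKETS];   /* packets per bucket in history */
	unsigned int head;             /* next history slot to overwrite */
};

static void flow_limit_init(struct flow_limit_sketch *fl)
{
	memset(fl, 0, sizeof(*fl));
}

/* Returns true if the arriving packet should be dropped. */
static bool flow_limit_should_drop(struct flow_limit_sketch *fl,
				   uint32_t flow_hash,
				   unsigned int qlen,
				   unsigned int max_backlog)
{
	/* Inactive while the queue is at or below half the backlog. */
	if (qlen <= max_backlog / 2)
		return false;

	uint16_t new_bucket = flow_hash % NUM_BUCKETS;
	uint16_t old_bucket = fl->history[fl->head];

	/* Replace the oldest history entry with the new packet's bucket. */
	fl->history[fl->head] = new_bucket;
	fl->head = (fl->head + 1) % HISTORY_LEN;

	/* The evicted packet no longer counts toward its bucket. */
	if (fl->count[old_bucket] > 0)
		fl->count[old_bucket]--;

	/* Drop if this bucket holds more than half of the history. */
	return ++fl->count[new_bucket] > HISTORY_LEN / 2;
}
```

Because several flows can hash to the same bucket, a larger table lowers the chance that a small flow shares a counter with a large one, which is the false-positive case mentioned above.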
211
212== Suggested Configuration
213
214Flow limit is useful on systems with many concurrent connections,
215where a single connection taking up 50% of a CPU indicates a problem.
216In such environments, enable the feature on all CPUs that handle
217network rx interrupts (as set in /proc/irq/N/smp_affinity).
218
219The feature depends on the input packet queue length exceeding
220the flow limit threshold (50%) plus the flow history length (256).
221Setting net.core.netdev_max_backlog to either 1000 or 10000
222performed well in experiments.
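
One reading of the queue length requirement above is that netdev_max_backlog should be larger than half of itself plus 256, so that the queue can grow past the threshold and a full history window before it is full. The C sketch below checks that condition; the function names are hypothetical and the interpretation is an assumption, not a rule stated by the kernel.

```c
#include <stdbool.h>

#define FLOW_HISTORY_LEN 256 /* flow history length, from the text */

/* Queue length at which flow limit becomes active: half the backlog. */
static unsigned int flow_limit_threshold(unsigned int max_backlog)
{
	return max_backlog / 2;
}

/* True if the queue can exceed threshold + history before it is full. */
static bool backlog_leaves_room(unsigned int max_backlog)
{
	return max_backlog >
	       flow_limit_threshold(max_backlog) + FLOW_HISTORY_LEN;
}
```

With a backlog of 1000 the threshold is 500 and 500 + 256 = 756 fits below the limit; the same holds for 10000. A backlog of 300 would not leave room under this reading.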
223
166 224
167RFS: Receive Flow Steering 225RFS: Receive Flow Steering
168========================== 226==========================
diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/vortex.txt
index b4038ffb3bc5..9a8041dcbb53 100644
--- a/Documentation/networking/vortex.txt
+++ b/Documentation/networking/vortex.txt
@@ -359,7 +359,7 @@ steps you should take:
359- OK, it's a driver problem. 359- OK, it's a driver problem.
360 360
361 You need to generate a report. Typically this is an email to the 361 You need to generate a report. Typically this is an email to the
362 maintainer and/or linux-net@vger.kernel.org. The maintainer's 362 maintainer and/or netdev@vger.kernel.org. The maintainer's
363 email address will be in the driver source or in the MAINTAINERS file. 363 email address will be in the driver source or in the MAINTAINERS file.
364 364
365- The contents of your report will vary a lot depending upon the 365- The contents of your report will vary a lot depending upon the