Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (183 commits)
  [TG3]: Update version to 3.78.
  [TG3]: Add missing NVRAM strapping.
  [TG3]: Enable auto MDI.
  [TG3]: Fix the polarity bit.
  [TG3]: Fix irq_sync race condition.
  [NET_SCHED]: ematch: module autoloading
  [TCP]: tcp probe wraparound handling and other changes
  [RTNETLINK]: rtnl_link: allow specifying initial device address
  [RTNETLINK]: rtnl_link API simplification
  [VLAN]: Fix MAC address handling
  [ETH]: Validate address in eth_mac_addr
  [NET]: Fix races in net_rx_action vs netpoll.
  [AF_UNIX]: Rewrite garbage collector, fixes race.
  [NETFILTER]: {ip, nf}_conntrack_sctp: fix remotely triggerable NULL ptr dereference (CVE-2007-2876)
  [NET]: Make all initialized struct seq_operations const.
  [UDP]: Fix length check.
  [IPV6]: Remove unneeded pointer idev from addrconf_cleanup().
  [DECNET]: Another unnecessary net/tcp.h inclusion in net/dn.h
  [IPV6]: Make IPV6_{RECV,2292}RTHDR boolean options.
  [IPV6]: Do not send RH0 anymore.
  ...

Fixed up trivial conflict in Documentation/feature-removal-schedule.txt
manually.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-07-12 16:31:22 -0400
committer: Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-07-12 16:31:22 -0400
commit: e1bd2ac5a6b7a8b625e40c9e9f8b6dea4cf22f85 (patch)
tree: 9366e9fb481da2c7195ca3f2bafeffebbf001363 /Documentation
parent: 0b9062f6b57a87f22309c6b920a51aaa66ce2a13 (diff)
parent: 15028aad00ddf241581fbe74a02ec89cbb28d35d (diff)
5 files changed, 321 insertions, 27 deletions
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 281458b47d75..0599a0c7c026 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -262,25 +262,6 @@ Who:	Richard Purdie <rpurdie@rpsys.net>
 ---------------------------
-What:   Multipath cached routing support in ipv4
-When:   in 2.6.23
-Why:    Code was merged, then submitter immediately disappeared leaving
-        us with no maintainer and lots of bugs.  The code should not have
-        been merged in the first place, and many aspects of it's
-        implementation are blocking more critical core networking
-        development.  It's marked EXPERIMENTAL and no distribution
-        enables it because it cause obscure crashes due to unfixable bugs
-        (interfaces don't return errors so memory allocation can't be
-        handled, calling contexts of these interfaces make handling
-        errors impossible too because they get called after we've
-        totally commited to creating a route object, for example).
-        This problem has existed for years and no forward progress
-        has ever been made, and nobody steps up to try and salvage
-        this code, so we're going to finally just get rid of it.
-Who:    David S. Miller <davem@davemloft.net>
---------------------------
 What:   read_dev_chars(), read_conf_data{,_lpm}() (s390 common I/O layer)
 When:   December 2007
 Why:    These functions are a leftover from 2.4 times. They have several
@@ -337,3 +318,11 @@ Who:	Jean Delvare <khali@linux-fr.org>
 ---------------------------
+What:   iptables SAME target
+When:   1.1. 2008
+Files:  net/ipv4/netfilter/ipt_SAME.c, include/linux/netfilter_ipv4/ipt_SAME.h
+Why:    Obsolete for multiple years now, NAT core provides the same behaviour.
+        Unfixable broken wrt. 32/64 bit cleanness.
+Who:    Patrick McHardy <kaber@trash.net>
+---------------------------
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index af6a63ab9026..09c184e41cf8 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -874,8 +874,7 @@ accept_redirects - BOOLEAN
 accept_source_route - INTEGER
        Accept source routing (routing extension header).
-        > 0: Accept routing header.
-        = 0: Accept only routing header type 2.
+        >= 0: Accept only routing header type 2.
        < 0: Do not accept routing header.
        Default: 0
diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt
new file mode 100644
index 000000000000..2451f551c505
--- /dev/null
+++ b/Documentation/networking/l2tp.txt
@@ -0,0 +1,169 @@
+This brief document describes how to use the kernel's PPPoL2TP driver
+to provide L2TP functionality. L2TP is a protocol that tunnels one or
+more PPP sessions over a UDP tunnel. It is commonly used for VPNs
+(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
+network infrastructure.
+Design
+======
+The PPPoL2TP driver, drivers/net/pppol2tp.c, provides a mechanism by
+which PPP frames carried through an L2TP session are passed through
+the kernel's PPP subsystem. The standard PPP daemon, pppd, handles all
+PPP interaction with the peer. PPP network interfaces are created for
+each local PPP endpoint.
+The L2TP protocol http://www.faqs.org/rfcs/rfc2661.html defines L2TP
+control and data frames. L2TP control frames carry messages between
+L2TP clients/servers and are used to setup / teardown tunnels and
+sessions. An L2TP client or server is implemented in userspace and
+will use a regular UDP socket per tunnel. L2TP data frames carry PPP
+frames, which may be PPP control or PPP data. The kernel's PPP
+subsystem arranges for PPP control frames to be delivered to pppd,
+while data frames are forwarded as usual.
+Each tunnel and session within a tunnel is assigned a unique tunnel_id
+and session_id. These ids are carried in the L2TP header of every
+control and data packet. The pppol2tp driver uses them to lookup
+internal tunnel and/or session contexts. Zero tunnel / session ids are
+treated specially - zero ids are never assigned to tunnels or sessions
+in the network. In the driver, the tunnel context keeps a pointer to
+the tunnel UDP socket. The session context keeps a pointer to the
+PPPoL2TP socket, as well as other data that lets the driver interface
+to the kernel PPP subsystem.
+Note that the pppol2tp kernel driver handles only L2TP data frames;
+L2TP control frames are simply passed up to userspace in the UDP
+tunnel socket. The kernel handles all datapath aspects of the
+protocol, including data packet resequencing (if enabled).
+There are a number of requirements on the userspace L2TP daemon in
+order to use the pppol2tp driver.
+1. Use a UDP socket per tunnel.
+2. Create a single PPPoL2TP socket per tunnel bound to a special null
+   session id. This is used only for communicating with the driver but
+   must remain open while the tunnel is active. Opening this tunnel
+   management socket causes the driver to mark the tunnel socket as an
+   L2TP UDP encapsulation socket and flags it for use by the
+   referenced tunnel id. This hooks up the UDP receive path via
+   udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed
+   in this special PPPoX socket.
+3. Create a PPPoL2TP socket per L2TP session. This is typically done
+   by starting pppd with the pppol2tp plugin and appropriate
+   arguments. A PPPoL2TP tunnel management socket (Step 2) must be
+   created before the first PPPoL2TP session socket is created.
+When creating PPPoL2TP sockets, the application provides information
+to the driver about the socket in a socket connect() call. Source and
+destination tunnel and session ids are provided, as well as the file
+descriptor of a UDP socket. See struct pppol2tp_addr in
+include/linux/if_ppp.h. Note that zero tunnel / session ids are
+treated specially. When creating the per-tunnel PPPoL2TP management
+socket in Step 2 above, zero source and destination session ids are
+specified, which tells the driver to prepare the supplied UDP file
+descriptor for use as an L2TP tunnel socket.
+Userspace may control behavior of the tunnel or session using
+setsockopt and ioctl on the PPPoX socket. The following socket
+options are supported:-
+DEBUG     - bitmask of debug message categories. See below.
+SENDSEQ   - 0 => don't send packets with sequence numbers
+            1 => send packets with sequence numbers
+RECVSEQ   - 0 => receive packet sequence numbers are optional
+            1 => drop receive packets without sequence numbers
+LNSMODE   - 0 => act as LAC.
+            1 => act as LNS.
+REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder.
+Only the DEBUG option is supported by the special tunnel management
+PPPoX socket.
+In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
+to retrieve tunnel and session statistics from the kernel using the
+PPPoX socket of the appropriate tunnel or session.
+Debugging
+=========
+The driver supports a flexible debug scheme where kernel trace
+messages may be optionally enabled per tunnel and per session. Care is
+needed when debugging a live system since the messages are not
+rate-limited and a busy system could be swamped. Userspace uses
+setsockopt on the PPPoX socket to set a debug mask.
+The following debug mask bits are available:
+PPPOL2TP_MSG_DEBUG    verbose debug (if compiled in)
+PPPOL2TP_MSG_CONTROL  userspace - kernel interface
+PPPOL2TP_MSG_SEQ      sequence numbers handling
+PPPOL2TP_MSG_DATA     data packets
+Sample Userspace Code
+=====================
+1. Create tunnel management PPPoX socket
+        kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
+        if (kernel_fd >= 0) {
+                struct sockaddr_pppol2tp sax;
+                struct sockaddr_in const *peer_addr;
+                peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
+                memset(&sax, 0, sizeof(sax));
+                sax.sa_family = AF_PPPOX;
+                sax.sa_protocol = PX_PROTO_OL2TP;
+                sax.pppol2tp.fd = udp_fd;       /* fd of tunnel UDP socket */
+                sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
+                sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
+                sax.pppol2tp.addr.sin_family = AF_INET;
+                sax.pppol2tp.s_tunnel = tunnel_id;
+                sax.pppol2tp.s_session = 0;     /* special case: mgmt socket */
+                sax.pppol2tp.d_tunnel = 0;
+                sax.pppol2tp.d_session = 0;     /* special case: mgmt socket */
+                if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
+                        perror("connect failed");
+                        result = -errno;
+                        goto err;
+                }
+        }
+2. Create session PPPoX data socket
+        struct sockaddr_pppol2tp sax;
+        int fd;
+        /* Note, the target socket must be bound already, else it will not be ready */
+        sax.sa_family = AF_PPPOX;
+        sax.sa_protocol = PX_PROTO_OL2TP;
+        sax.pppol2tp.fd = tunnel_fd;
+        sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
+        sax.pppol2tp.addr.sin_port = addr->sin_port;
+        sax.pppol2tp.addr.sin_family = AF_INET;
+        sax.pppol2tp.s_tunnel  = tunnel_id;
+        sax.pppol2tp.s_session = session_id;
+        sax.pppol2tp.d_tunnel  = peer_tunnel_id;
+        sax.pppol2tp.d_session = peer_session_id;
+        /* session_fd is the fd of the session's PPPoL2TP socket.
+         * tunnel_fd is the fd of the tunnel UDP socket.
+         */
+        fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
+        if (fd < 0 )    {
+                return -errno;
+        }
+        return 0;
+Miscellaneous
+=============
+The PPPoL2TP driver was developed as part of the OpenL2TP project by
+Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server,
+designed from the ground up to have the L2TP datapath in the
+kernel. The project also implemented the pppol2tp plugin for pppd
+which allows pppd to use the kernel driver. Details can be found at
+http://openl2tp.sourceforge.net.
diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt
new file mode 100644
index 000000000000..00b60cce2224
--- /dev/null
+++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,111 @@
+                HOWTO for multiqueue network device support
+                ===========================================
+Section 1: Base driver requirements for implementing multiqueue support
+Section 2: Qdisc support for multiqueue devices
+Section 3: Brief howto using PRIO or RR for multiqueue devices
+Intro: Kernel support for multiqueue devices
+---------------------------------------------------------
+Kernel support for multiqueue devices is only an API that is presented to the
+netdevice layer for base drivers to implement.  This feature is part of the
+core networking stack, and all network devices will be running on the
+multiqueue-aware stack.  If a base driver only has one queue, then these
+changes are transparent to that driver.
+Section 1: Base driver requirements for implementing multiqueue support
+-----------------------------------------------------------------------
+Base drivers are required to use the new alloc_etherdev_mq() or
+alloc_netdev_mq() functions to allocate the subqueues for the device.  The
+underlying kernel API will take care of the allocation and deallocation of
+the subqueue memory, as well as netdev configuration of where the queues
+exist in memory.
+The base driver will also need to manage the queues as it does the global
+netdev->queue_lock today.  Therefore base drivers should use the
+netif_{start|stop|wake}_subqueue() functions to manage each queue while the
+device is still operational.  netdev->queue_lock is still used when the device
+comes online or when it's completely shut down (unregister_netdev(), etc.).
+Finally, the base driver should indicate that it is a multiqueue device.  The
+feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
+bitmap on device initialization.  Below is an example from e1000:
+#ifdef CONFIG_E1000_MQ
+        if ( (adapter->hw.mac.type == e1000_82571) ||
+             (adapter->hw.mac.type == e1000_82572) ||
+             (adapter->hw.mac.type == e1000_80003es2lan))
+                netdev->features |= NETIF_F_MULTI_QUEUE;
+#endif
+Section 2: Qdisc support for multiqueue devices
+-----------------------------------------------
+Currently two qdiscs support multiqueue devices: a new round-robin qdisc,
+sch_rr, and sch_prio. The qdisc is responsible for classifying the skb's to
+bands and queues, and will store the queue mapping into skb->queue_mapping.
+Use this field in the base driver to determine which queue to send the skb
+to.
+sch_rr has been added for hardware that doesn't want scheduling policies from
+software, so it's a straight round-robin qdisc.  It uses the same syntax and
+classification priomap that sch_prio uses, so it should be intuitive to
+configure for people who've used sch_prio.
+The PRIO qdisc naturally plugs into a multiqueue device.  If PRIO has been
+built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
+bands requested is equal to the number of queues on the hardware.  If they
+are equal, it sets a one-to-one mapping up between the queues and bands.  If
+they're not equal, it will not load the qdisc.  This is the same behavior
+for RR.  Once the association is made, any skb that is classified will have
+skb->queue_mapping set, which will allow the driver to properly queue skb's
+to multiple queues.
+Section 3: Brief howto using PRIO and RR for multiqueue devices
+---------------------------------------------------------------
+The userspace command 'tc,' part of the iproute2 package, is used to configure
+qdiscs.  To add the PRIO qdisc to your network device, assuming the device is
+called eth0, run the following command:
+# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
+This will create 4 bands, 0 being highest priority, and associate those bands
+to the queues on your NIC.  Assuming eth0 has 4 Tx queues, the band mapping
+would look like:
+band 0 => queue 0
+band 1 => queue 1
+band 2 => queue 2
+band 3 => queue 3
+Traffic will begin flowing through each queue if your TOS values are assigning
+traffic across the various bands.  For example, ssh traffic will always try to
+go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
+so it will be sent out queue 0.  ICMP traffic (pings) falls into the "normal"
+traffic classification, which is band 1.  Therefore pings will be sent out
+queue 1 on the NIC.
+Note the use of the multiqueue keyword.  This is only in versions of iproute2
+that support multiqueue networking devices; if this is omitted when loading
+a qdisc onto a multiqueue device, the qdisc will load and operate the same
+as if it were loaded onto a single-queue device (i.e., it sends all traffic
+to queue 0).
+An alternative way to allocate multiqueue bands is to use the multiqueue
+option and specify 0 bands.  In that case, the qdisc will allocate a
+number of bands equal to the number of queues that the device reports,
+and bring the qdisc online.
+The behavior of tc filters remains the same, where it will override TOS priority
+classification.
+Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
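
The band-to-queue rules described in Sections 2 and 3 of multiqueue.txt above can be sketched in C roughly as follows. This is only an illustrative userspace model under stated assumptions, not the kernel implementation: struct mq_band_config, mq_resolve_bands() and mq_band_to_queue() are hypothetical names that do not appear in the patch.

```c
#include <stdbool.h>

/* Hypothetical model of a qdisc's band setup; not a kernel structure. */
struct mq_band_config {
        unsigned int bands;     /* number of bands in use */
        bool multiqueue;        /* true if the multiqueue option was given */
};

/*
 * Resolve the requested band count against the device's Tx queue count.
 * Returns 0 on success, or -1 where the text says the qdisc will not load.
 */
static int mq_resolve_bands(struct mq_band_config *cfg,
                            unsigned int requested_bands,
                            bool multiqueue,
                            unsigned int dev_queues)
{
        if (multiqueue) {
                /* "multiqueue" with 0 bands: use one band per queue. */
                if (requested_bands == 0)
                        requested_bands = dev_queues;
                /* Bands and queues must match for a one-to-one mapping. */
                if (requested_bands != dev_queues)
                        return -1;
        }
        cfg->bands = requested_bands;
        cfg->multiqueue = multiqueue;
        return 0;
}

/* Queue that an skb classified into 'band' would be sent to. */
static unsigned int mq_band_to_queue(const struct mq_band_config *cfg,
                                     unsigned int band)
{
        /* Without the multiqueue option, all traffic uses queue 0. */
        return cfg->multiqueue ? band : 0;
}
```

With 4 bands on a 4-queue device and the multiqueue option set, band 1 maps to queue 1; a request for 3 bands on the same device is rejected.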
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
index ce1361f95243..37869295fc70 100644
--- a/Documentation/networking/netdevices.txt
+++ b/Documentation/networking/netdevices.txt
@@ -20,6 +20,30 @@ private data which gets freed when the network device is freed. If
 separately allocated data is attached to the network device
 (dev->priv) then it is up to the module exit handler to free that.
+MTU
+===
+Each network device has a Maximum Transfer Unit. The MTU does not
+include any link layer protocol overhead. Upper layer protocols must
+not pass a socket buffer (skb) to a device to transmit with more data
+than the MTU. For example, on Ethernet with the standard MTU of 1500
+bytes, the actual skb will contain up to 1514 bytes because of the
+Ethernet header. Devices should allow for the 4 byte VLAN header as
+well.
+Segmentation Offload (GSO, TSO) is an exception to this rule.  The
+upper layer protocol may pass a large socket buffer to the device
+transmit routine, and the device will break that up into separate
+packets based on the current MTU.
+MTU is symmetrical and applies both to receive and transmit. A device
+must be able to receive at least the maximum size packet allowed by
+the MTU. A network device may use the MTU as a mechanism to size receive
+buffers, but the device should allow packets with a VLAN header. With
+the standard Ethernet MTU of 1500 bytes, the device should allow up to
+1518 byte packets (1500 + 14 header + 4 tag).  The device may either:
+drop, truncate, or pass up oversize packets, but dropping oversize
+packets is preferred.
 struct net_device synchronization rules
 =======================================
@@ -43,16 +67,17 @@ dev->get_stats:
 dev->hard_start_xmit:
        Synchronization: netif_tx_lock spinlock.
        When the driver sets NETIF_F_LLTX in dev->features this will be
        called without holding netif_tx_lock. In this case the driver
        has to lock by itself when needed. It is recommended to use a try lock
-        for this and return -1 when the spin lock fails. 
+        for this and return NETDEV_TX_LOCKED when the spin lock fails.
        The locking there should also properly protect against 
-        set_multicast_list
-        Context: Process with BHs disabled or BH (timer).
-        Notes: netif_queue_stopped() is guaranteed false
-               Interrupts must be enabled when calling hard_start_xmit.
-                (Interrupts must also be enabled when enabling the BH handler.)
+        set_multicast_list.
+        Context: Process with BHs disabled or BH (timer),
+                 will be called with interrupts disabled by netconsole.
        Return codes: 
        o NETDEV_TX_OK everything ok. 
        o NETDEV_TX_BUSY Cannot transmit packet, try later 
@@ -74,4 +99,5 @@ dev->poll:
        Synchronization: __LINK_STATE_RX_SCHED bit in dev->state.  See
                dev_close code and comments in net/core/dev.c for more info.
        Context: softirq
+                 will be called with interrupts disabled by netconsole.
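
The frame-size arithmetic from the MTU section added to netdevices.txt above can be written out as a small C sketch. The constant and function names here are illustrative assumptions, not identifiers taken from the patch; the numbers (14 byte Ethernet header, 4 byte VLAN tag) are the ones the text itself uses.

```c
/* Overheads as stated in the MTU text; names are illustrative only. */
enum {
        EXAMPLE_ETH_HEADER_LEN = 14,    /* Ethernet header */
        EXAMPLE_VLAN_TAG_LEN   = 4      /* VLAN tag */
};

/* Largest untagged skb the device should expect to transmit for an MTU. */
static unsigned int max_tx_frame_len(unsigned int mtu)
{
        return mtu + EXAMPLE_ETH_HEADER_LEN;
}

/* Largest frame the device should accept on receive, VLAN tag included. */
static unsigned int max_rx_frame_len(unsigned int mtu)
{
        return mtu + EXAMPLE_ETH_HEADER_LEN + EXAMPLE_VLAN_TAG_LEN;
}
```

For the standard Ethernet MTU of 1500 bytes these give 1514 and 1518 bytes respectively, matching the figures in the text. Segmentation offload (GSO, TSO) is outside this sketch, since there the upper layer may hand the device a buffer larger than the MTU.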
author	Linus Torvalds <torvalds@woody.linux-foundation.org>	2007-07-12 16:31:22 -0400
committer	Linus Torvalds <torvalds@woody.linux-foundation.org>	2007-07-12 16:31:22 -0400
commit	e1bd2ac5a6b7a8b625e40c9e9f8b6dea4cf22f85 (patch)
tree	9366e9fb481da2c7195ca3f2bafeffebbf001363 /Documentation
parent	0b9062f6b57a87f22309c6b920a51aaa66ce2a13 (diff)
parent	15028aad00ddf241581fbe74a02ec89cbb28d35d (diff)

diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt index 281458b47d75..0599a0c7c026 100644 --- a/Documentation/feature-removal-schedule.txt +++ b/Documentation/feature-removal-schedule.txt
@@ -262,25 +262,6 @@ Who: Richard Purdie <rpurdie@rpsys.net>
262		262
263	---------------------------	263	---------------------------
264		264
265	What: Multipath cached routing support in ipv4
266	When: in 2.6.23
267	Why: Code was merged, then submitter immediately disappeared leaving
268	us with no maintainer and lots of bugs. The code should not have
269	been merged in the first place, and many aspects of it's
270	implementation are blocking more critical core networking
271	development. It's marked EXPERIMENTAL and no distribution
272	enables it because it cause obscure crashes due to unfixable bugs
273	(interfaces don't return errors so memory allocation can't be
274	handled, calling contexts of these interfaces make handling
275	errors impossible too because they get called after we've
276	totally commited to creating a route object, for example).
277	This problem has existed for years and no forward progress
278	has ever been made, and nobody steps up to try and salvage
279	this code, so we're going to finally just get rid of it.
280	Who: David S. Miller <davem@davemloft.net>
281
282	---------------------------
283
284	What: read_dev_chars(), read_conf_data{,_lpm}() (s390 common I/O layer)	265	What: read_dev_chars(), read_conf_data{,_lpm}() (s390 common I/O layer)
285	When: December 2007	266	When: December 2007
286	Why: These functions are a leftover from 2.4 times. They have several	267	Why: These functions are a leftover from 2.4 times. They have several
@@ -337,3 +318,11 @@ Who: Jean Delvare <khali@linux-fr.org>
337		318
338	---------------------------	319	---------------------------
339		320
		321	What: iptables SAME target
		322	When: 1.1. 2008
		323	Files: net/ipv4/netfilter/ipt_SAME.c, include/linux/netfilter_ipv4/ipt_SAME.h
		324	Why: Obsolete for multiple years now, NAT core provides the same behaviour.
		325	Unfixable broken wrt. 32/64 bit cleanness.
		326	Who: Patrick McHardy <kaber@trash.net>
		327
		328	---------------------------


diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index af6a63ab9026..09c184e41cf8 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt
@@ -874,8 +874,7 @@ accept_redirects - BOOLEAN
874	accept_source_route - INTEGER	874	accept_source_route - INTEGER
875	Accept source routing (routing extension header).	875	Accept source routing (routing extension header).
876		876
877	> 0: Accept routing header.	877	>= 0: Accept only routing header type 2.
878	= 0: Accept only routing header type 2.
879	< 0: Do not accept routing header.	878	< 0: Do not accept routing header.
880		879
881	Default: 0	880	Default: 0


diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt new file mode 100644 index 000000000000..2451f551c505 --- /dev/null +++ b/Documentation/networking/l2tp.txt
@@ -0,0 +1,169 @@
		1	This brief document describes how to use the kernel's PPPoL2TP driver
		2	to provide L2TP functionality. L2TP is a protocol that tunnels one or
		3	more PPP sessions over a UDP tunnel. It is commonly used for VPNs
		4	(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP
		5	network infrastructure.
		6
		7	Design
		8	======
		9
		10	The PPPoL2TP driver, drivers/net/pppol2tp.c, provides a mechanism by
		11	which PPP frames carried through an L2TP session are passed through
		12	the kernel's PPP subsystem. The standard PPP daemon, pppd, handles all
		13	PPP interaction with the peer. PPP network interfaces are created for
		14	each local PPP endpoint.
		15
		16	The L2TP protocol http://www.faqs.org/rfcs/rfc2661.html defines L2TP
		17	control and data frames. L2TP control frames carry messages between
		18	L2TP clients/servers and are used to setup / teardown tunnels and
		19	sessions. An L2TP client or server is implemented in userspace and
		20	will use a regular UDP socket per tunnel. L2TP data frames carry PPP
		21	frames, which may be PPP control or PPP data. The kernel's PPP
		22	subsystem arranges for PPP control frames to be delivered to pppd,
		23	while data frames are forwarded as usual.
		24
		25	Each tunnel and session within a tunnel is assigned a unique tunnel_id
		26	and session_id. These ids are carried in the L2TP header of every
		27	control and data packet. The pppol2tp driver uses them to lookup
		28	internal tunnel and/or session contexts. Zero tunnel / session ids are
		29	treated specially - zero ids are never assigned to tunnels or sessions
		30	in the network. In the driver, the tunnel context keeps a pointer to
		31	the tunnel UDP socket. The session context keeps a pointer to the
		32	PPPoL2TP socket, as well as other data that lets the driver interface
		33	to the kernel PPP subsystem.
		34
		35	Note that the pppol2tp kernel driver handles only L2TP data frames;
		36	L2TP control frames are simply passed up to userspace in the UDP
		37	tunnel socket. The kernel handles all datapath aspects of the
		38	protocol, including data packet resequencing (if enabled).
		39
		40	There are a number of requirements on the userspace L2TP daemon in
		41	order to use the pppol2tp driver.
		42
		43	1. Use a UDP socket per tunnel.
		44
		45	2. Create a single PPPoL2TP socket per tunnel bound to a special null
		46	session id. This is used only for communicating with the driver but
		47	must remain open while the tunnel is active. Opening this tunnel
		48	management socket causes the driver to mark the tunnel socket as an
		49	L2TP UDP encapsulation socket and flags it for use by the
		50	referenced tunnel id. This hooks up the UDP receive path via
		51	udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed
		52	in this special PPPoX socket.
		53
		54	3. Create a PPPoL2TP socket per L2TP session. This is typically done
		55	by starting pppd with the pppol2tp plugin and appropriate
		56	arguments. A PPPoL2TP tunnel management socket (Step 2) must be
		57	created before the first PPPoL2TP session socket is created.
		58
		59	When creating PPPoL2TP sockets, the application provides information
		60	to the driver about the socket in a socket connect() call. Source and
		61	destination tunnel and session ids are provided, as well as the file
		62	descriptor of a UDP socket. See struct pppol2tp_addr in
		63	include/linux/if_ppp.h. Note that zero tunnel / session ids are
		64	treated specially. When creating the per-tunnel PPPoL2TP management
		65	socket in Step 2 above, zero source and destination session ids are
		66	specified, which tells the driver to prepare the supplied UDP file
		67	descriptor for use as an L2TP tunnel socket.
		68
		69	Userspace may control behavior of the tunnel or session using
		70	setsockopt and ioctl on the PPPoX socket. The following socket
		71	options are supported:-
		72
		73	DEBUG - bitmask of debug message categories. See below.
		74	SENDSEQ - 0 => don't send packets with sequence numbers
		75	1 => send packets with sequence numbers
		76	RECVSEQ - 0 => receive packet sequence numbers are optional
		77	1 => drop receive packets without sequence numbers
		78	LNSMODE - 0 => act as LAC.
		79	1 => act as LNS.
		80	REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder.
		81
		82	Only the DEBUG option is supported by the special tunnel management
		83	PPPoX socket.
		84
		85	In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided
		86	to retrieve tunnel and session statistics from the kernel using the
		87	PPPoX socket of the appropriate tunnel or session.
		88
		89	Debugging
		90	=========
		91
		92	The driver supports a flexible debug scheme where kernel trace
		93	messages may be optionally enabled per tunnel and per session. Care is
		94	needed when debugging a live system since the messages are not
		95	rate-limited and a busy system could be swamped. Userspace uses
		96	setsockopt on the PPPoX socket to set a debug mask.
		97
		98	The following debug mask bits are available:
		99
		100	PPPOL2TP_MSG_DEBUG verbose debug (if compiled in)
		101	PPPOL2TP_MSG_CONTROL userspace - kernel interface
		102	PPPOL2TP_MSG_SEQ sequence numbers handling
		103	PPPOL2TP_MSG_DATA data packets
		104
		105	Sample Userspace Code
		106	=====================
		107
		108	1. Create tunnel management PPPoX socket
		109
		110	kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP);
		111	if (kernel_fd >= 0) {
		112	struct sockaddr_pppol2tp sax;
		113	struct sockaddr_in const *peer_addr;
		114
		115	peer_addr = l2tp_tunnel_get_peer_addr(tunnel);
		116	memset(&sax, 0, sizeof(sax));
		117	sax.sa_family = AF_PPPOX;
		118	sax.sa_protocol = PX_PROTO_OL2TP;
		119	sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */
		120	sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr;
		121	sax.pppol2tp.addr.sin_port = peer_addr->sin_port;
		122	sax.pppol2tp.addr.sin_family = AF_INET;
		123	sax.pppol2tp.s_tunnel = tunnel_id;
		124	sax.pppol2tp.s_session = 0; /* special case: mgmt socket */
		125	sax.pppol2tp.d_tunnel = 0;
		126	sax.pppol2tp.d_session = 0; /* special case: mgmt socket */
		127
		128	if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) {
		129	perror("connect failed");
		130	result = -errno;
		131	goto err;
		132	}
		133	}
		134
		135	2. Create session PPPoX data socket
		136
		137	struct sockaddr_pppol2tp sax;
		138	int fd;
		139
		140	/* Note, the target socket must be bound already, else it will not be ready */
		141	sax.sa_family = AF_PPPOX;
		142	sax.sa_protocol = PX_PROTO_OL2TP;
		143	sax.pppol2tp.fd = tunnel_fd;
		144	sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr;
		145	sax.pppol2tp.addr.sin_port = addr->sin_port;
		146	sax.pppol2tp.addr.sin_family = AF_INET;
		147	sax.pppol2tp.s_tunnel = tunnel_id;
		148	sax.pppol2tp.s_session = session_id;
		149	sax.pppol2tp.d_tunnel = peer_tunnel_id;
		150	sax.pppol2tp.d_session = peer_session_id;
		151
		152	/* session_fd is the fd of the session's PPPoL2TP socket.
		153	* tunnel_fd is the fd of the tunnel UDP socket.
		154	*/
		155	fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax));
		156	if (fd < 0 ) {
		157	return -errno;
		158	}
		159	return 0;
		160
		161	Miscellanous
		162	============
		163
		164	The PPPoL2TP driver was developed as part of the OpenL2TP project by
		165	Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server,
		166	designed from the ground up to have the L2TP datapath in the
		167	kernel. The project also implemented the pppol2tp plugin for pppd
		168	which allows pppd to use the kernel driver. Details can be found at
		169	http://openl2tp.sourceforge.net.


diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt new file mode 100644 index 000000000000..00b60cce2224 --- /dev/null +++ b/Documentation/networking/multiqueue.txt
@@ -0,0 +1,111 @@
		1
		2	HOWTO for multiqueue network device support
		3	===========================================
		4
		5	Section 1: Base driver requirements for implementing multiqueue support
		6	Section 2: Qdisc support for multiqueue devices
		7	Section 3: Brief howto using PRIO and RR for multiqueue devices
		8
		9
		10	Intro: Kernel support for multiqueue devices
		11	---------------------------------------------------------
		12
		13	Kernel support for multiqueue devices is only an API that is presented to the
		14	netdevice layer for base drivers to implement. This feature is part of the
		15	core networking stack, and all network devices will be running on the
		16	multiqueue-aware stack. If a base driver only has one queue, then these
		17	changes are transparent to that driver.
		18
		19
		20	Section 1: Base driver requirements for implementing multiqueue support
		21	-----------------------------------------------------------------------
		22
		23	Base drivers are required to use the new alloc_etherdev_mq() or
		24	alloc_netdev_mq() functions to allocate the subqueues for the device. The
		25	underlying kernel API will take care of the allocation and deallocation of
		26	the subqueue memory, as well as netdev configuration of where the queues
		27	exist in memory.
		28
		29	The base driver will also need to manage the queues as it does the global
		30	netdev->queue_lock today. Therefore base drivers should use the
		31	netif_{start|stop|wake}_subqueue() functions to manage each queue while the
		32	device is still operational. netdev->queue_lock is still used when the device
		33	comes online or when it's completely shut down (unregister_netdev(), etc.).
		34
		35	Finally, the base driver should indicate that it is a multiqueue device. The
		36	feature flag NETIF_F_MULTI_QUEUE should be added to the netdev->features
		37	bitmap on device initialization. Below is an example from e1000:
		38
		39	#ifdef CONFIG_E1000_MQ
		40	if ( (adapter->hw.mac.type == e1000_82571) ||
		41	(adapter->hw.mac.type == e1000_82572) ||
		42	(adapter->hw.mac.type == e1000_80003es2lan))
		43	netdev->features |= NETIF_F_MULTI_QUEUE;
		44	#endif
		45
		46
		47	Section 2: Qdisc support for multiqueue devices
		48	-----------------------------------------------
		49
		50	Currently two qdiscs support multiqueue devices: sch_prio and a new round-robin
		51	qdisc, sch_rr. The qdisc is responsible for classifying the skb's to
		52	bands and queues, and will store the queue mapping into skb->queue_mapping.
		53	Use this field in the base driver to determine which queue to send the skb
		54	to.
		55
		56	sch_rr has been added for hardware that doesn't want scheduling policies from
		57	software, so it's a straight round-robin qdisc. It uses the same syntax and
		58	classification priomap that sch_prio uses, so it should be intuitive to
		59	configure for people who've used sch_prio.
		60
		61	The PRIO qdisc naturally plugs into a multiqueue device. If PRIO has been
		62	built with NET_SCH_PRIO_MQ, then upon load, it will make sure the number of
		63	bands requested is equal to the number of queues on the hardware. If they
		64	are equal, it sets a one-to-one mapping up between the queues and bands. If
		65	they're not equal, it will not load the qdisc. RR behaves the same
		66	way. Once the association is made, any skb that is classified will have
		67	skb->queue_mapping set, which will allow the driver to properly queue skb's
		68	to multiple queues.
		69
		70
		71	Section 3: Brief howto using PRIO and RR for multiqueue devices
		72	---------------------------------------------------------------
		73
		74	The userspace command 'tc,' part of the iproute2 package, is used to configure
		75	qdiscs. To add the PRIO qdisc to your network device, assuming the device is
		76	called eth0, run the following command:
		77
		78	# tc qdisc add dev eth0 root handle 1: prio bands 4 multiqueue
		79
		80	This will create 4 bands, 0 being highest priority, and associate those bands
		81	to the queues on your NIC. Assuming eth0 has 4 Tx queues, the band mapping
		82	would look like:
		83
		84	band 0 => queue 0
		85	band 1 => queue 1
		86	band 2 => queue 2
		87	band 3 => queue 3
		88
		89	Traffic will begin flowing through each queue if your TOS values are assigning
		90	traffic across the various bands. For example, ssh traffic will always try to
		91	go out band 0 based on TOS -> Linux priority conversion (realtime traffic),
		92	so it will be sent out queue 0. ICMP traffic (pings) falls into the "normal"
		93	traffic classification, which is band 1. Therefore pings will be sent out
		94	queue 1 on the NIC.
		95
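The TOS-to-queue path described above can be modelled in user space. The
sketch below is not kernel code. It assumes the default PRIO priomap
(1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1) and a one-to-one band-to-queue mapping;
the helper names band_for_priority() and queue_for_priority() are made up
for illustration.

```c
#include <assert.h>

#define PRIO_MAX 15	/* Linux priorities 0..15 index the priomap */

/* Default PRIO priomap: Linux priority -> band (assumed, see above). */
static const unsigned char default_priomap[PRIO_MAX + 1] = {
	1, 2, 2, 2, 1, 2, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1
};

/* Hypothetical helper: band chosen for a given Linux priority. */
static int band_for_priority(unsigned int prio)
{
	return default_priomap[prio & PRIO_MAX];
}

/* Hypothetical helper: with a one-to-one mapping, queue == band. */
static int queue_for_priority(unsigned int prio, int num_queues)
{
	int band = band_for_priority(prio);

	/* A one-to-one mapping needs a queue for every band in use. */
	assert(band < num_queues);
	return band;
}
```

With this model, priority 6 (interactive, e.g. ssh) lands on queue 0 and
priority 0 (best effort, e.g. ping) lands on queue 1, matching the text.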
		96	Note the use of the multiqueue keyword. It is only available in versions of
		97	iproute2 that support multiqueue networking devices; if it is omitted when
		98	loading a qdisc onto a multiqueue device, the qdisc will load and operate the
		99	same as if it were loaded onto a single-queue device (i.e. it sends all
		100	traffic to queue 0).
		101
		102	As an alternative, the multiqueue option can be used with 0 bands
		103	specified. In this case, the qdisc will set the number of bands equal to
		104	the number of queues that the device reports, and bring the qdisc
		105	online.
		106
		107	The behavior of tc filters remains the same: they override TOS priority
		108	classification.
		109
		110
		111	Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com>
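The band-count rules from Sections 2 and 3 can be summarised as a small
user-space model. This is only a sketch of the behavior the text describes,
not the qdisc code itself: the function name mq_resolve_bands() is
hypothetical, and returning -EINVAL for a mismatch is an assumption.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical model of the multiqueue band-count rule:
 *   requested == 0          -> adopt the device's queue count
 *   requested == hw_queues  -> one-to-one band/queue mapping
 *   anything else           -> refuse to load the qdisc
 * Returns the number of bands, or -EINVAL (assumed) on a mismatch.
 */
static int mq_resolve_bands(int requested_bands, int hw_queues)
{
	assert(hw_queues > 0);

	if (requested_bands == 0)
		return hw_queues;
	if (requested_bands != hw_queues)
		return -EINVAL;
	return requested_bands;
}
```

For a device with 4 Tx queues, requesting 4 bands or 0 bands both yield 4
bands, while requesting 3 bands is rejected.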


diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
index ce1361f95243..37869295fc70 100644
--- a/Documentation/networking/netdevices.txt
+++ b/Documentation/networking/netdevices.txt
@@ -20,6 +20,30 @@ private data which gets freed when the network device is freed. If
20	separately allocated data is attached to the network device	20	separately allocated data is attached to the network device
21	(dev->priv) then it is up to the module exit handler to free that.	21	(dev->priv) then it is up to the module exit handler to free that.
22		22
		23	MTU
		24	===
		25	Each network device has a Maximum Transfer Unit. The MTU does not
		26	include any link layer protocol overhead. Upper layer protocols must
		27	not pass a socket buffer (skb) to a device to transmit with more data
		28	than the MTU. For example, on Ethernet, if the standard MTU of 1500
		29	bytes is used, the actual skb will contain up to 1514 bytes because of
		30	the Ethernet header. Devices should allow for the 4 byte VLAN header
		31	as well.
		32
		33	Segmentation Offload (GSO, TSO) is an exception to this rule. The
		34	upper layer protocol may pass a large socket buffer to the device
		35	transmit routine, and the device will break that up into separate
		36	packets based on the current MTU.
		37
		38	MTU is symmetrical and applies both to receive and transmit. A device
		39	must be able to receive at least the maximum size packet allowed by
		40	the MTU. A network device may use the MTU as a mechanism to size receive
		41	buffers, but the device should allow packets with a VLAN header. With
		42	the standard Ethernet MTU of 1500 bytes, the device should allow up to
		43	1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
		44	truncate, or pass up oversize packets, but dropping oversize
		45	packets is preferred.
		46
23		47
24	struct net_device synchronization rules	48	struct net_device synchronization rules
25	=======================================	49	=======================================
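The frame-size arithmetic in the MTU section can be checked with a few
lines of C. This is a minimal sketch: ETH_HLEN and VLAN_HLEN are defined
locally with the sizes quoted in the text (14 and 4 bytes), and the helper
names max_tx_frame(), min_rx_buffer() and mtu_selfcheck() are hypothetical.

```c
#include <assert.h>

#define ETH_HLEN	14	/* Ethernet header: 6 dst + 6 src + 2 type */
#define VLAN_HLEN	4	/* VLAN tag */

/* Largest untagged frame an upper layer may hand to the device. */
static unsigned int max_tx_frame(unsigned int mtu)
{
	return mtu + ETH_HLEN;
}

/* Smallest receive size that still admits a VLAN-tagged frame. */
static unsigned int min_rx_buffer(unsigned int mtu)
{
	return mtu + ETH_HLEN + VLAN_HLEN;
}

/* Sanity check mirroring the figures in the text. */
static void mtu_selfcheck(void)
{
	assert(max_tx_frame(1500) == 1514);
	assert(min_rx_buffer(1500) == 1518);
}
```

Segmentation offload is outside this model, since the skb handed to the
device may then exceed the MTU.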
@@ -43,16 +67,17 @@ dev->get_stats:
43		67
44	dev->hard_start_xmit:	68	dev->hard_start_xmit:
45	Synchronization: netif_tx_lock spinlock.	69	Synchronization: netif_tx_lock spinlock.
		70
46	When the driver sets NETIF_F_LLTX in dev->features this will be	71	When the driver sets NETIF_F_LLTX in dev->features this will be
47	called without holding netif_tx_lock. In this case the driver	72	called without holding netif_tx_lock. In this case the driver
48	has to lock by itself when needed. It is recommended to use a try lock	73	has to lock by itself when needed. It is recommended to use a try lock
49	for this and return -1 when the spin lock fails.	74	for this and return NETDEV_TX_LOCKED when the spin lock fails.
50	The locking there should also properly protect against	75	The locking there should also properly protect against
51	set_multicast_list	76	set_multicast_list.
52	Context: Process with BHs disabled or BH (timer).	77
53	Notes: netif_queue_stopped() is guaranteed false	78	Context: Process with BHs disabled or BH (timer),
54	Interrupts must be enabled when calling hard_start_xmit.	79	will be called with interrupts disabled by netconsole.
55	(Interrupts must also be enabled when enabling the BH handler.)	80
56	Return codes:	81	Return codes:
57	o NETDEV_TX_OK everything ok.	82	o NETDEV_TX_OK everything ok.
58	o NETDEV_TX_BUSY Cannot transmit packet, try later	83	o NETDEV_TX_BUSY Cannot transmit packet, try later
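The try-lock recommendation for NETIF_F_LLTX drivers can be illustrated
with a user-space model. This is only a sketch under stated assumptions: a
C11 atomic_flag stands in for the driver's private spinlock, the return
values are defined locally (NETDEV_TX_LOCKED as -1, in line with the "-1"
wording this hunk replaces), and model_hard_start_xmit() and frames_sent
are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>

#define NETDEV_TX_OK		0
#define NETDEV_TX_LOCKED	(-1)

/* Stand-in for a driver-private spinlock (user-space model only). */
static atomic_flag tx_lock = ATOMIC_FLAG_INIT;
static int frames_sent;

/* Hypothetical LLTX-style transmit routine. */
static int model_hard_start_xmit(void)
{
	/* Try lock: test_and_set returns the previous state of the flag. */
	if (atomic_flag_test_and_set(&tx_lock))
		return NETDEV_TX_LOCKED;	/* contended, caller retries */

	frames_sent++;				/* "transmit" under the lock */
	atomic_flag_clear(&tx_lock);
	return NETDEV_TX_OK;
}
```

The point of the pattern is that the transmit path never spins on its own
lock; on contention it reports NETDEV_TX_LOCKED and leaves the retry to
the caller.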
@@ -74,4 +99,5 @@ dev->poll:
74	Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See	99	Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See
75	dev_close code and comments in net/core/dev.c for more info.	100	dev_close code and comments in net/core/dev.c for more info.
76	Context: softirq	101	Context: softirq
		102	will be called with interrupts disabled by netconsole.
77		103