author: Linus Torvalds <torvalds@linux-foundation.org> 2013-05-01 17:08:52 -0400
committer: Linus Torvalds <torvalds@linux-foundation.org> 2013-05-01 17:08:52 -0400
commit: 73287a43cc79ca06629a88d1a199cd283f42456a
tree: acf4456e260115bea77ee31a29f10ce17f0db45c /Documentation
parent: 251df49db3327c64bf917bfdba94491fde2b4ee0
parent: 20074f357da4a637430aec2879c9d864c5d2c23c
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights (1721 non-merge commits, this has to be a record of some
  sort):

   1) Add 'random' mode to team driver, from Jiri Pirko and Eric
      Dumazet.

   2) Make it so that any driver that supports configuration of
      multiple MAC addresses can provide the forwarding database add
      and del calls by providing a default implementation and hooking
      that up if the driver doesn't have an explicit set of handlers.
      From Vlad Yasevich.

   3) Support GSO segmentation over tunnels and other encapsulating
      devices such as VXLAN, from Pravin B Shelar.

   4) Support L2 GRE tunnels in the flow dissector, from Michael
      Dalton.

   5) Implement Tail Loss Probe (TLP) detection in TCP, from Nandita
      Dukkipati.

   6) In the PHY layer, allow supporting wake-on-lan in situations
      where the PHY registers have to be written for it to be
      configured. Use it to support wake-on-lan in mv643xx_eth. From
      Michael Stapelberg.

   7) Significantly improve firewire IPV6 support, from YOSHIFUJI
      Hideaki.

   8) Allow multiple packets to be sent in a single transmission using
      network coding in batman-adv, from Martin Hundebøll.

   9) Add support for T5 cxgb4 chips, from Santosh Rastapur.

  10) Generalize the VXLAN forwarding tables so that there is more
      flexibility in configuring various aspects of the endpoints.
      From David Stevens.

  11) Support RSS and TSO in hardware over GRE tunnels in bnx2x
      driver, from Dmitry Kravkov.

  12) Zero copy support in nfnetlink_queue, from Eric Dumazet and
      Pablo Neira Ayuso.

  13) Start adding networking selftests.

  14) In situations of overload on the same AF_PACKET fanout socket,
      or per-cpu packet receive queue, minimize drop by distributing
      the load to other cpus/fanouts. From Willem de Bruijn and Eric
      Dumazet.

  15) Add support for new payload offset BPF instruction, from Daniel
      Borkmann.

  16) Convert several drivers over to module_platform_driver(), from
      Sachin Kamat.

  17) Provide a minimal BPF JIT image disassembler userspace tool,
      from Daniel Borkmann.

  18) Rewrite F-RTO implementation in TCP to match the final
      specification of it in RFC4138 and RFC5682. From Yuchung Cheng.

  19) Provide netlink socket diag of netlink sockets ("Yo dawg, I hear
      you like netlink, so I implemented netlink dumping of netlink
      sockets.") From Andrey Vagin.

  20) Remove ugly passing of rtnetlink attributes into rtnl_doit
      functions, from Thomas Graf.

  21) Allow userspace to be able to see if a configuration change
      occurs in the middle of an address or device list dump, from
      Nicolas Dichtel.

  22) Support RFC3168 ECN protection for ipv6 fragments, from Hannes
      Frederic Sowa.

  23) Increase accuracy of packet length used by packet scheduler,
      from Jason Wang.

  24) Beginning set of changes to make ipv4/ipv6 fragment handling
      more scalable and less susceptible to overload and locking
      contention, from Jesper Dangaard Brouer.

  25) Get rid of using non-type-safe NLMSG_* macros and use nlmsg_*()
      instead. From Hong Zhiguo.

  26) Optimize route usage in IPVS by avoiding reference counting
      where possible, from Julian Anastasov.

  27) Convert IPVS schedulers to RCU, also from Julian Anastasov.

  28) Support cpu fanouts in xt_NFQUEUE netfilter target, from Holger
      Eitzenberger.

  29) Network namespace support for nf_log, ebt_log, xt_LOG, ipt_ULOG,
      nfnetlink_log, and nfnetlink_queue. From Gao feng.

  30) Implement RFC3168 ECN protection, from Hannes Frederic Sowa.

  31) Support several new r8169 chips, from Hayes Wang.

  32) Support tokenized interface identifiers in ipv6, from Daniel
      Borkmann.

  33) Use usbnet_link_change() helper in USB net driver, from Ming
      Lei.

  34) Add 802.1ad vlan offload support, from Patrick McHardy.

  35) Support mmap() based netlink communication, also from Patrick
      McHardy.

  36) Support HW timestamping in mlx4 driver, from Amir Vadai.

  37) Rationalize AF_PACKET packet timestamping when transmitting,
      from Willem de Bruijn and Daniel Borkmann.

  38) Bring parity to what's provided by /proc/net/packet socket
      dumping and the info provided by netlink socket dumping of
      AF_PACKET sockets. From Nicolas Dichtel.

  39) Fix peeking beyond zero sized SKBs in AF_UNIX, from Benjamin
      Poirier"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1722 commits)
  filter: fix va_list build error
  af_unix: fix a fatal race with bit fields
  bnx2x: Prevent memory leak when cnic is absent
  bnx2x: correct reading of speed capabilities
  net: sctp: attribute printl with __printf for gcc fmt checks
  netlink: kconfig: move mmap i/o into netlink kconfig
  netpoll: convert mutex into a semaphore
  netlink: Fix skb ref counting.
  net_sched: act_ipt forward compat with xtables
  mlx4_en: fix a build error on 32bit arches
  Revert "bnx2x: allow nvram test to run when device is down"
  bridge: avoid OOPS if root port not found
  drivers: net: cpsw: fix kernel warn on cpsw irq enable
  sh_eth: use random MAC address if no valid one supplied
  3c509.c: call SET_NETDEV_DEV for all device types (ISA/ISAPnP/EISA)
  tg3: fix to append hardware time stamping flags
  unix/stream: fix peeking with an offset larger than data in queue
  unix/dgram: fix peeking with an offset larger than data in queue
  unix/dgram: peek beyond 0-sized skbs
  openvswitch: Remove unneeded ovs_netdev_get_ifindex()
  ...
Diffstat (limited to 'Documentation')
-rw-r--r--  Documentation/ABI/testing/sysfs-class-net-mesh                  8
-rw-r--r--  Documentation/DocBook/80211.tmpl                                2
-rw-r--r--  Documentation/cgroups/00-INDEX                                  2
-rw-r--r--  Documentation/cgroups/net_cls.txt                              34
-rw-r--r--  Documentation/devicetree/bindings/marvell.txt                   3
-rw-r--r--  Documentation/devicetree/bindings/net/can/atmel-can.txt        14
-rw-r--r--  Documentation/devicetree/bindings/net/cpsw.txt                 16
-rw-r--r--  Documentation/devicetree/bindings/net/dsa/dsa.txt              91
-rw-r--r--  Documentation/devicetree/bindings/net/marvell-orion-mdio.txt    4
-rw-r--r--  Documentation/networking/ieee802154.txt                         5
-rw-r--r--  Documentation/networking/ip-sysctl.txt                         53
-rw-r--r--  Documentation/networking/netlink_mmap.txt                     339
-rw-r--r--  Documentation/networking/packet_mmap.txt                      368
-rw-r--r--  Documentation/networking/stmmac.txt                            45
14 files changed, 924 insertions, 60 deletions
diff --git a/Documentation/ABI/testing/sysfs-class-net-mesh b/Documentation/ABI/testing/sysfs-class-net-mesh
index bc41da61608d..bdcd8b4e38f2 100644
--- a/Documentation/ABI/testing/sysfs-class-net-mesh
+++ b/Documentation/ABI/testing/sysfs-class-net-mesh
@@ -67,6 +67,14 @@ Description:
                 Defines the penalty which will be applied to an
                 originator message's tq-field on every hop.
 
+What:           /sys/class/net/<mesh_iface>/mesh/network_coding
+Date:           Nov 2012
+Contact:        Martin Hundeboll <martin@hundeboll.net>
+Description:
+                Controls whether Network Coding (using some magic
+                to send fewer wifi packets but still the same
+                content) is enabled or not.
+
 What:           /sys/class/net/<mesh_iface>/mesh/orig_interval
 Date:           May 2010
 Contact:        Marek Lindner <lindner_marek@yahoo.de>
diff --git a/Documentation/DocBook/80211.tmpl b/Documentation/DocBook/80211.tmpl
index 284ced7a228f..0f6a3edcd44b 100644
--- a/Documentation/DocBook/80211.tmpl
+++ b/Documentation/DocBook/80211.tmpl
@@ -437,7 +437,7 @@
       </section>
 !Finclude/net/mac80211.h ieee80211_get_buffered_bc
 !Finclude/net/mac80211.h ieee80211_beacon_get
-!Finclude/net/mac80211.h ieee80211_sta_eosp_irqsafe
+!Finclude/net/mac80211.h ieee80211_sta_eosp
 !Finclude/net/mac80211.h ieee80211_frame_release_type
 !Finclude/net/mac80211.h ieee80211_sta_ps_transition
 !Finclude/net/mac80211.h ieee80211_sta_ps_transition_ni
diff --git a/Documentation/cgroups/00-INDEX b/Documentation/cgroups/00-INDEX
index f5635a09c3f6..bc461b6425a7 100644
--- a/Documentation/cgroups/00-INDEX
+++ b/Documentation/cgroups/00-INDEX
@@ -18,6 +18,8 @@ memcg_test.txt
         - Memory Resource Controller; implementation details.
 memory.txt
         - Memory Resource Controller; design, accounting, interface, testing.
+net_cls.txt
+        - Network classifier cgroups details and usages.
 net_prio.txt
         - Network priority cgroups details and usages.
 resource_counter.txt
diff --git a/Documentation/cgroups/net_cls.txt b/Documentation/cgroups/net_cls.txt
new file mode 100644
index 000000000000..9face6bb578a
--- /dev/null
+++ b/Documentation/cgroups/net_cls.txt
@@ -0,0 +1,34 @@
Network classifier cgroup
-------------------------

The Network classifier cgroup provides an interface to
tag network packets with a class identifier (classid).

The Traffic Controller (tc) can be used to assign
different priorities to packets from different cgroups.

Creating a net_cls cgroups instance creates a net_cls.classid file.
This net_cls.classid value is initialized to 0.

You can write hexadecimal values to net_cls.classid; the format for these
values is 0xAAAABBBB; AAAA is the major handle number and BBBB
is the minor handle number.
Reading net_cls.classid yields a decimal result.

Example:
mkdir /sys/fs/cgroup/net_cls
mount -t cgroup -onet_cls net_cls /sys/fs/cgroup/net_cls
mkdir /sys/fs/cgroup/net_cls/0
echo 0x100001 >  /sys/fs/cgroup/net_cls/0/net_cls.classid
        - setting a 10:1 handle.

cat /sys/fs/cgroup/net_cls/0/net_cls.classid
1048577

configuring tc:
tc qdisc add dev eth0 root handle 10: htb

tc class add dev eth0 parent 10: classid 10:1 htb rate 40mbit
 - creating traffic class 10:1

tc filter add dev eth0 parent 10: protocol ip prio 10 handle 1: cgroup
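
The classid layout described above (0xAAAABBBB, with the major handle in the
upper 16 bits and the minor handle in the lower 16 bits) can be illustrated
with a small C sketch. The helper names are hypothetical and are not part of
any kernel or tc API; they only show the bit arithmetic.

```c
#include <stdint.h>

/* Hypothetical helper: pack a tc handle (major:minor) into the
 * 0xAAAABBBB layout used by net_cls.classid. */
static uint32_t classid_pack(uint16_t major, uint16_t minor)
{
	return ((uint32_t)major << 16) | minor;
}

/* Hypothetical helper: upper 16 bits hold the major handle. */
static uint16_t classid_major(uint32_t classid)
{
	return (uint16_t)(classid >> 16);
}

/* Hypothetical helper: lower 16 bits hold the minor handle. */
static uint16_t classid_minor(uint32_t classid)
{
	return (uint16_t)(classid & 0xffff);
}
```

With these helpers, classid_pack(0x10, 0x1) gives 0x100001, which is the
value 1048577 that the example above reads back in decimal.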
diff --git a/Documentation/devicetree/bindings/marvell.txt b/Documentation/devicetree/bindings/marvell.txt
index f1533d91953a..f7a0da6b4022 100644
--- a/Documentation/devicetree/bindings/marvell.txt
+++ b/Documentation/devicetree/bindings/marvell.txt
@@ -115,6 +115,9 @@ prefixed with the string "marvell,", for Marvell Technology Group Ltd.
      - compatible : "marvell,mv64360-eth-block"
      - reg : Offset and length of the register set for this block
 
+   Optional properties:
+     - clocks : Phandle to the clock control device and gate bit
+
    Example Discovery Ethernet block node:
      ethernet-block@2000 {
              #address-cells = <1>;
diff --git a/Documentation/devicetree/bindings/net/can/atmel-can.txt b/Documentation/devicetree/bindings/net/can/atmel-can.txt
new file mode 100644
index 000000000000..72cf0c5daff4
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/can/atmel-can.txt
@@ -0,0 +1,14 @@
* AT91 CAN *

Required properties:
  - compatible: Should be "atmel,at91sam9263-can" or "atmel,at91sam9x5-can"
  - reg: Should contain CAN controller registers location and length
  - interrupts: Should contain IRQ line for the CAN controller

Example:

        can0: can@f000c000 {
                compatible = "atmel,at91sam9x5-can";
                reg = <0xf000c000 0x300>;
                interrupts = <40 4 5>;
        };
diff --git a/Documentation/devicetree/bindings/net/cpsw.txt b/Documentation/devicetree/bindings/net/cpsw.txt
index ecfdf756d10f..4f2ca6b4a182 100644
--- a/Documentation/devicetree/bindings/net/cpsw.txt
+++ b/Documentation/devicetree/bindings/net/cpsw.txt
@@ -15,16 +15,22 @@ Required properties:
 - mac_control           : Specifies Default MAC control register content
                           for the specific platform
 - slaves                : Specifies number for slaves
-- cpts_active_slave     : Specifies the slave to use for time stamping
+- active_slave          : Specifies the slave to use for time stamping,
+                          ethtool and SIOCGMIIPHY
 - cpts_clock_mult       : Numerator to convert input clock ticks into nanoseconds
 - cpts_clock_shift      : Denominator to convert input clock ticks into nanoseconds
-- phy_id                : Specifies slave phy id
-- mac-address           : Specifies slave MAC address
 
 Optional properties:
 - ti,hwmods             : Must be "cpgmac0"
 - no_bd_ram             : Must be 0 or 1
 - dual_emac             : Specifies Switch to act as Dual EMAC
+
+Slave Properties:
+Required properties:
+- phy_id                : Specifies slave phy id
+- mac-address           : Specifies slave MAC address
+
+Optional properties:
 - dual_emac_res_vlan    : Specifies VID to be used to segregate the ports
 
 Note: "ti,hwmods" field is used to fetch the base address and irq
@@ -47,7 +53,7 @@ Examples:
                 rx_descs = <64>;
                 mac_control = <0x20>;
                 slaves = <2>;
-                cpts_active_slave = <0>;
+                active_slave = <0>;
                 cpts_clock_mult = <0x80000000>;
                 cpts_clock_shift = <29>;
                 cpsw_emac0: slave@0 {
@@ -73,7 +79,7 @@ Examples:
                 rx_descs = <64>;
                 mac_control = <0x20>;
                 slaves = <2>;
-                cpts_active_slave = <0>;
+                active_slave = <0>;
                 cpts_clock_mult = <0x80000000>;
                 cpts_clock_shift = <29>;
                 cpsw_emac0: slave@0 {
diff --git a/Documentation/devicetree/bindings/net/dsa/dsa.txt b/Documentation/devicetree/bindings/net/dsa/dsa.txt
new file mode 100644
index 000000000000..49f4f7ae3f51
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/dsa/dsa.txt
@@ -0,0 +1,91 @@
Marvell Distributed Switch Architecture Device Tree Bindings
------------------------------------------------------------

Required properties:
- compatible            : Should be "marvell,dsa"
- #address-cells        : Must be 2, first cell is the address on the MDIO bus
                          and second cell is the address in the switch tree.
                          Second cell is used only when cascading/chaining.
- #size-cells           : Must be 0
- dsa,ethernet          : Should be a phandle to a valid Ethernet device node
- dsa,mii-bus           : Should be a phandle to a valid MDIO bus device node

Optional properties:
- interrupts            : property with a value describing the switch
                          interrupt number (not supported by the driver)

A DSA node can contain multiple switch chips which are therefore child nodes of
the parent DSA node. The maximum number of allowed child nodes is 4
(DSA_MAX_SWITCHES).
Each of these switch child nodes should have the following required properties:

- reg                   : Describes the switch address on the MII bus
- #address-cells        : Must be 1
- #size-cells           : Must be 0

A switch may have multiple "port" child nodes.

Each port child node must have the following mandatory properties:
- reg                   : Describes the port address in the switch
- label                 : Describes the label associated with this port, special
                          labels are "cpu" to indicate a CPU port and "dsa" to
                          indicate an uplink/downlink port.

Note that a port labelled "dsa" will imply checking for the uplink phandle
described below.

Optional property:
- link                  : Should be a phandle to another switch's DSA port.
                          This property is only used when switches are being
                          chained/cascaded together.

Example:

        dsa@0 {
                compatible = "marvell,dsa";
                #address-cells = <2>;
                #size-cells = <0>;

                interrupts = <10>;
                dsa,ethernet = <&ethernet0>;
                dsa,mii-bus = <&mii_bus0>;

                switch@0 {
                        #address-cells = <1>;
                        #size-cells = <0>;
                        reg = <16 0>;   /* MDIO address 16, switch 0 in tree */

                        port@0 {
                                reg = <0>;
                                label = "lan1";
                        };

                        port@1 {
                                reg = <1>;
                                label = "lan2";
                        };

                        port@5 {
                                reg = <5>;
                                label = "cpu";
                        };

                        switch0uplink: port@6 {
                                reg = <6>;
                                label = "dsa";
                                link = <&switch1uplink>;
                        };
                };

                switch@1 {
                        #address-cells = <1>;
                        #size-cells = <0>;
                        reg = <17 1>;   /* MDIO address 17, switch 1 in tree */

                        switch1uplink: port@0 {
                                reg = <0>;
                                label = "dsa";
                                link = <&switch0uplink>;
                        };
                };
        };
diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
index 34e7aafa321c..9417e54c26c0 100644
--- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
+++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
@@ -9,6 +9,10 @@ Required properties:
 - compatible: "marvell,orion-mdio"
 - reg: address and length of the SMI register
 
+Optional properties:
+- interrupts: interrupt line number for the SMI error/done interrupt
+- clocks: Phandle to the clock control device and gate bit
+
 The child nodes of the MDIO driver are the individual PHY devices
 connected to this MDIO bus. They must have a "reg" property given the
 PHY address on the MDIO bus.
diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt
index 703cf4370c79..67a9cb259d40 100644
--- a/Documentation/networking/ieee802154.txt
+++ b/Documentation/networking/ieee802154.txt
@@ -71,8 +71,9 @@ submits skb to qdisc), so if you need something from that cb later, you should
 store info in the skb->data on your own.
 
 To hook the MLME interface you have to populate the ml_priv field of your
-net_device with a pointer to struct ieee802154_mlme_ops instance. All fields are
-required.
+net_device with a pointer to struct ieee802154_mlme_ops instance. The fields
+assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional.
+All other fields are required.
 
 We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c
 
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index dc2dc87d2557..f98ca633b528 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -29,7 +29,7 @@ route/max_size - INTEGER
 neigh/default/gc_thresh1 - INTEGER
         Minimum number of entries to keep.  Garbage collector will not
         purge entries if there are fewer than this number.
-        Default: 256
+        Default: 128
 
 neigh/default/gc_thresh3 - INTEGER
         Maximum number of neighbor entries allowed.  Increase this
@@ -175,14 +175,6 @@ tcp_congestion_control - STRING
         is inherited.
         [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ]
 
-tcp_cookie_size - INTEGER
-        Default size of TCP Cookie Transactions (TCPCT) option, that may be
-        overridden on a per socket basis by the TCPCT socket option.
-        Values greater than the maximum (16) are interpreted as the maximum.
-        Values greater than zero and less than the minimum (8) are interpreted
-        as the minimum.  Odd values are interpreted as the next even value.
-        Default: 0 (off).
-
 tcp_dsack - BOOLEAN
         Allows TCP to send "duplicate" SACKs.
 
@@ -190,7 +182,9 @@ tcp_early_retrans - INTEGER
         Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold
         for triggering fast retransmit when the amount of outstanding data is
         small and when no previously unsent data can be transmitted (such
-        that limited transmit could be used).
+        that limited transmit could be used). Also controls the use of
+        Tail loss probe (TLP) that converts RTOs occurring due to tail
+        losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01).
         Possible values:
                 0 disables ER
                 1 enables ER
@@ -198,7 +192,9 @@ tcp_early_retrans - INTEGER
                   by a fourth of RTT. This mitigates connection falsely
                   recovers when network has a small degree of reordering
                   (less than 3 packets).
-        Default: 2
+                3 enables delayed ER and TLP.
+                4 enables TLP only.
+        Default: 3
 
 tcp_ecn - INTEGER
         Control use of Explicit Congestion Notification (ECN) by TCP.
@@ -229,36 +225,13 @@ tcp_fin_timeout - INTEGER
         Default: 60 seconds
 
 tcp_frto - INTEGER
-        Enables Forward RTO-Recovery (F-RTO) defined in RFC4138.
+        Enables Forward RTO-Recovery (F-RTO) defined in RFC5682.
         F-RTO is an enhanced recovery algorithm for TCP retransmission
-        timeouts.  It is particularly beneficial in wireless environments
-        where packet loss is typically due to random radio interference
-        rather than intermediate router congestion.  F-RTO is sender-side
-        only modification. Therefore it does not require any support from
-        the peer.
-
-        If set to 1, basic version is enabled.  2 enables SACK enhanced
-        F-RTO if flow uses SACK.  The basic version can be used also when
-        SACK is in use though scenario(s) with it exists where F-RTO
-        interacts badly with the packet counting of the SACK enabled TCP
-        flow.
-
-tcp_frto_response - INTEGER
-        When F-RTO has detected that a TCP retransmission timeout was
-        spurious (i.e, the timeout would have been avoided had TCP set a
-        longer retransmission timeout), TCP has several options what to do
-        next. Possible values are:
-                0 Rate halving based; a smooth and conservative response,
-                  results in halved cwnd and ssthresh after one RTT
-                1 Very conservative response; not recommended because even
-                  though being valid, it interacts poorly with the rest of
-                  Linux TCP, halves cwnd and ssthresh immediately
-                2 Aggressive response; undoes congestion control measures
-                  that are now known to be unnecessary (ignoring the
-                  possibility of a lost retransmission that would require
-                  TCP to be more cautious), cwnd and ssthresh are restored
-                  to the values prior timeout
-        Default: 0 (rate halving based)
+        timeouts.  It is particularly beneficial in networks where the
+        RTT fluctuates (e.g., wireless). F-RTO is sender-side only
+        modification. It does not require any support from the peer.
+
+        By default it's enabled with a non-zero value. 0 disables F-RTO.
 
 tcp_keepalive_time - INTEGER
         How often TCP sends out keepalive messages when keepalive is enabled.
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
new file mode 100644
index 000000000000..1c2dab409625
--- /dev/null
+++ b/Documentation/networking/netlink_mmap.txt
@@ -0,0 +1,339 @@
This file documents how to use memory mapped I/O with netlink.

Author: Patrick McHardy <kaber@trash.net>

Overview
--------

Memory mapped netlink I/O can be used to increase throughput and decrease
overhead of unicast receive and transmit operations. Some netlink subsystems
require high throughput; these are mainly the netfilter subsystems
nfnetlink_queue and nfnetlink_log. It can also help speed up large
dump operations, for instance of the routing database.

Memory mapped netlink I/O uses two circular ring buffers, one for RX and one
for TX, which are mapped into the process's address space.

The RX ring is used by the kernel to directly construct netlink messages into
user-space memory without copying them as done with regular socket I/O.
Additionally, as long as the ring contains messages, no recvmsg() or poll()
syscalls have to be issued by user-space to get more messages.

The TX ring is used to process messages directly from user-space memory. The
kernel processes all messages contained in the ring using a single sendmsg()
call.

Usage overview
--------------

In order to use memory mapped netlink I/O, user-space needs three main changes:

- ring setup
- conversion of the RX path to get messages from the ring instead of recvmsg()
- conversion of the TX path to construct messages into the ring

Ring setup is done using setsockopt() to provide the ring parameters to the
kernel, then a call to mmap() to map the ring into the process's address space:

- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &params, sizeof(params));
- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &params, sizeof(params));
- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0)

Usage of either ring is optional, but even if only the RX ring is used the
mapping still needs to be writable in order to update the frame status after
processing.

Conversion of the reception path involves calling poll() on the file
descriptor. Once the socket is readable, the frames from the ring are
processed in order until no more messages are available, as indicated by
a status word in the frame header.

On the kernel side, in order to make use of memory mapped I/O on receive, the
originating netlink subsystem needs to support memory mapped I/O. Otherwise
it will use an allocated socket buffer as usual and the contents will be
copied to the ring on transmission, nullifying most of the performance gains.
Dumps of kernel databases automatically support memory mapped I/O.

Conversion of the transmit path involves changing message construction to
use memory from the TX ring instead of (usually) a buffer declared on the
stack and setting up the frame header appropriately. Optionally poll() can
be used to wait for free frames in the TX ring.

Structures and definitions for using memory mapped I/O are contained in
<linux/netlink.h>.

RX and TX rings
---------------

Each ring contains a number of contiguous memory blocks, containing frames of
fixed size dependent on the parameters used for ring setup.

Ring:   [ block 0 ]
                [ frame 0 ]
                [ frame 1 ]
        [ block 1 ]
                [ frame 2 ]
                [ frame 3 ]
        ...
        [ block n ]
                [ frame 2 * n ]
                [ frame 2 * n + 1 ]

The blocks are only visible to the kernel. From the point of view of user-space
the ring just contains the frames in a contiguous memory zone.

The ring parameters used for setting up the ring are defined as follows:

struct nl_mmap_req {
        unsigned int    nm_block_size;
        unsigned int    nm_block_nr;
        unsigned int    nm_frame_size;
        unsigned int    nm_frame_nr;
};

Frames are grouped into blocks, where each block is a contiguous region of memory
and holds nm_block_size / nm_frame_size frames. The total number of frames in
the ring is nm_frame_nr. The following invariants hold:

- frames_per_block = nm_block_size / nm_frame_size

- nm_frame_nr = frames_per_block * nm_block_nr

Some parameters are constrained, specifically:

- nm_block_size must be a multiple of the architecture's memory page size.
  The getpagesize() function can be used to get the page size.

- nm_frame_size must be equal to or larger than NL_MMAP_HDRLEN, IOW a frame
  must be able to hold at least the frame header

- nm_frame_size must be smaller than or equal to nm_block_size

- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT

- nm_frame_nr must equal the actual number of frames as specified above.

When the kernel can't allocate physically contiguous memory for a ring block,
it will fall back to using physically discontiguous memory. This might affect
performance negatively. In order to avoid this, the nm_frame_size parameter
should be chosen to be as small as possible for the required frame size and
the number of blocks should be increased instead.
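
The invariants and constraints above can be checked in user-space before
calling setsockopt(). The following is a minimal sketch, not part of the
kernel API: ring_req_valid() is a hypothetical helper, the structure is a
local copy of the definition above, and the page size, header length and
alignment are passed in by the caller (in a real program they would come
from getpagesize(), NL_MMAP_HDRLEN and NL_MMAP_MSG_ALIGNMENT). The checks
for zero sizes are extra guards added for the sketch.

```c
#include <stdbool.h>

/* Local copy of the ring parameter structure shown above. */
struct nl_mmap_req {
	unsigned int	nm_block_size;
	unsigned int	nm_block_nr;
	unsigned int	nm_frame_size;
	unsigned int	nm_frame_nr;
};

/* Hypothetical helper: returns true if req satisfies the documented
 * constraints for the given page size, header length and alignment. */
static bool ring_req_valid(const struct nl_mmap_req *req,
			   unsigned int page_size,
			   unsigned int hdrlen,
			   unsigned int alignment)
{
	unsigned int frames_per_block;

	/* Guards added for the sketch: avoid division by zero below. */
	if (page_size == 0 || alignment == 0 ||
	    req->nm_block_size == 0 || req->nm_frame_size == 0)
		return false;

	/* Block size must be a multiple of the page size. */
	if (req->nm_block_size % page_size != 0)
		return false;

	/* A frame must hold at least the header and fit in a block. */
	if (req->nm_frame_size < hdrlen ||
	    req->nm_frame_size > req->nm_block_size)
		return false;

	/* Frame size must respect the message alignment. */
	if (req->nm_frame_size % alignment != 0)
		return false;

	/* nm_frame_nr must equal frames_per_block * nm_block_nr. */
	frames_per_block = req->nm_block_size / req->nm_frame_size;
	return (unsigned long long)frames_per_block * req->nm_block_nr ==
	       req->nm_frame_nr;
}
```

For example, with a 4096 byte page, a block size of 65536, 64 blocks and a
frame size of 16384, each block holds 4 frames and nm_frame_nr must be 256.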
121
122Ring frames
123------------
124
125Each frames contain a frame header, consisting of a synchronization word and some
126meta-data, and the message itself.
127
128Frame: [ header message ]
129
130The frame header is defined as follows:
131
132struct nl_mmap_hdr {
133 unsigned int nm_status;
134 unsigned int nm_len;
135 __u32 nm_group;
136 /* credentials */
137 __u32 nm_pid;
138 __u32 nm_uid;
139 __u32 nm_gid;
140};
141
142- nm_status is used for synchronizing processing between the kernel and user-
143 space and specifies ownership of the frame as well as the operation to perform
144
145- nm_len contains the length of the message contained in the data area
146
147- nm_group specified the destination multicast group of message
148
149- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending
150 process. These values correspond to the data available using SOCK_PASSCRED in
151 the SCM_CREDENTIALS cmsg.
152
153The possible values in the status word are:
154
155- NL_MMAP_STATUS_UNUSED:
156 RX ring: frame belongs to the kernel and contains no message
157 for user-space. Approriate action is to invoke poll()
158 to wait for new messages.
159
160 TX ring: frame belongs to user-space and can be used for
161 message construction.
162
163- NL_MMAP_STATUS_RESERVED:
164 RX ring only: frame is currently used by the kernel for message
165 construction and contains no valid message yet.
166 Appropriate action is to invoke poll() to wait for
167 new messages.
168
169- NL_MMAP_STATUS_VALID:
170 RX ring: frame contains a valid message. Approriate action is
171 to process the message and release the frame back to
172 the kernel by setting the status to
173 NL_MMAP_STATUS_UNUSED or queue the frame by setting the
174 status to NL_MMAP_STATUS_SKIP.
175
176 TX ring: the frame contains a valid message from user-space to
177 be processed by the kernel. After completing processing
178 the kernel will release the frame back to user-space by
179 setting the status to NL_MMAP_STATUS_UNUSED.
180
181- NL_MMAP_STATUS_COPY:
182 RX ring only: a message is ready to be processed but could not be
183 stored in the ring, either because it exceeded the
184 frame size or because the originating subsystem does
185 not support memory mapped I/O. Appropriate action is
186 to invoke recvmsg() to receive the message and release
187 the frame back to the kernel by setting the status to
188 NL_MMAP_STATUS_UNUSED.
189
190- NL_MMAP_STATUS_SKIP:
191 RX ring only: user-space queued the message for later processing, but
192 processed some messages following it in the ring. The
193 kernel should skip this frame when looking for unused
194 frames.
195
196The data area of a frame begins at an offset of NL_MMAP_HDRLEN relative to the
197frame header.
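
The header layout and the RX status handling described above can be sketched
in a few lines of C. This is only an illustration under stated assumptions:
the struct, the SK_* status names and the SKETCH_* macros are local stand-ins
that mirror the definitions in this document, and both the status numbering
and the 4-byte alignment are assumed. A real program would use struct
nl_mmap_hdr, the NL_MMAP_STATUS_* values and NL_MMAP_HDRLEN from the kernel
headers instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of the frame header documented above (stand-in only). */
struct sketch_nl_mmap_hdr {
	unsigned int nm_status;
	unsigned int nm_len;
	uint32_t nm_group;
	uint32_t nm_pid;
	uint32_t nm_uid;
	uint32_t nm_gid;
};

/* Assumed numbering, for illustration only. */
enum sketch_status { SK_UNUSED, SK_RESERVED, SK_VALID, SK_COPY, SK_SKIP };

/* Assumed 4-byte alignment of the header length. */
#define SKETCH_ALIGNTO	((size_t)4)
#define SKETCH_ALIGN(n)	(((n) + SKETCH_ALIGNTO - 1) & ~(SKETCH_ALIGNTO - 1))
#define SKETCH_HDRLEN	SKETCH_ALIGN(sizeof(struct sketch_nl_mmap_hdr))

/* What user-space does with an RX frame, depending on its status word. */
enum sketch_rx_action {
	SK_RX_POLL,	/* nothing to read yet: wait in poll() */
	SK_RX_PROCESS,	/* message is in the frame's data area */
	SK_RX_RECVMSG,	/* message must be fetched with recvmsg() */
	SK_RX_HELD,	/* frame was queued by user-space for later */
};

/* The data area starts right after the aligned frame header. */
static void *sketch_frame_data(struct sketch_nl_mmap_hdr *hdr)
{
	assert(hdr != NULL);
	return (unsigned char *)hdr + SKETCH_HDRLEN;
}

static enum sketch_rx_action sketch_rx_action(unsigned int status)
{
	switch (status) {
	case SK_VALID:
		return SK_RX_PROCESS;
	case SK_COPY:
		return SK_RX_RECVMSG;
	case SK_SKIP:
		return SK_RX_HELD;
	default:
		/* UNUSED and RESERVED frames carry no message yet */
		return SK_RX_POLL;
	}
}
```

A receive loop would call sketch_rx_action() on hdr->nm_status and, for
SK_RX_PROCESS, read nm_len bytes starting at sketch_frame_data(hdr).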
198
199TX limitations
200--------------
201
202Kernel processing usually involves validation of the message received from
203user-space, then processing its contents. The kernel must ensure that
204user-space is not able to modify the message contents after they have been
205validated. In order to do so, the message is copied from the ring frame
206to an allocated buffer if either of these conditions is false:
207
208- only a single mapping of the ring exists
209- the file descriptor is not shared between processes
210
211This means that for threaded programs, the kernel will fall back to copying.
212
213Example
214-------
215
216Ring setup:
217
218 unsigned int block_size = 16 * getpagesize();
219 struct nl_mmap_req req = {
220 .nm_block_size = block_size,
221 .nm_block_nr = 64,
222 .nm_frame_size = 16384,
223 .nm_frame_nr = 64 * block_size / 16384,
224 };
225 unsigned int ring_size;
226 void *rx_ring, *tx_ring;
227
228 /* Configure ring parameters */
229 if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
230 exit(1);
231 if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
232 exit(1);
233
234 /* Calculate size of each individual ring */
235 ring_size = req.nm_block_nr * req.nm_block_size;
236
237 /* Map RX/TX rings. The TX ring is located after the RX ring */
238 rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE,
239 MAP_SHARED, fd, 0);
240 if (rx_ring == MAP_FAILED)
241 exit(1);
242 tx_ring = rx_ring + ring_size;
243
244Message reception:
245
246This example assumes some ring parameters of the ring setup are available.
247
248 unsigned int frame_offset = 0;
249 struct nl_mmap_hdr *hdr;
250 struct nlmsghdr *nlh;
251 unsigned char buf[16384];
252 ssize_t len;
253
254 while (1) {
255 struct pollfd pfds[1];
256
257 pfds[0].fd = fd;
258 pfds[0].events = POLLIN | POLLERR;
259 pfds[0].revents = 0;
260
261 if (poll(pfds, 1, -1) < 0 && errno != EINTR)
262 exit(1);
263
264 /* Check for errors. Error handling omitted */
265 if (pfds[0].revents & POLLERR)
266 <handle error>
267
268 /* If no new messages, poll again */
269 if (!(pfds[0].revents & POLLIN))
270 continue;
271
272 /* Process all frames */
273 while (1) {
274 /* Get next frame header */
275 hdr = rx_ring + frame_offset;
276
277 if (hdr->nm_status == NL_MMAP_STATUS_VALID) {
278 /* Regular memory mapped frame */
279 nlh = (void *)hdr + NL_MMAP_HDRLEN;
280 len = hdr->nm_len;
281
282 /* Release empty message immediately. May happen
283 * on error during message construction.
284 */
285 if (len == 0)
286 goto release;
287 } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) {
288 /* Frame queued to socket receive queue */
289 len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT);
290 if (len <= 0)
291 break;
292 nlh = (struct nlmsghdr *)buf;
293 } else
294 /* No more messages to process, continue polling */
295 break;
296
297 process_msg(nlh);
298release:
299 /* Release frame back to the kernel */
300 hdr->nm_status = NL_MMAP_STATUS_UNUSED;
301
302 /* Advance frame offset to next frame */
303 frame_offset = (frame_offset + frame_size) % ring_size;
304 }
305 }
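
The reception example takes frame_size and ring_size as given. A minimal
sketch of how they follow from the values passed at ring setup, and of how the
frame offset wraps around, is shown below. The struct and the function names
are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

/* Hypothetical container for the ring parameters chosen at setup time. */
struct sketch_ring_geom {
	unsigned int block_size;	/* nm_block_size */
	unsigned int block_nr;		/* nm_block_nr */
	unsigned int frame_size;	/* nm_frame_size */
};

/* Size in bytes of one ring (RX or TX). */
static unsigned int sketch_ring_size(const struct sketch_ring_geom *g)
{
	return g->block_size * g->block_nr;
}

/* Number of frames in one ring, i.e. the value used for nm_frame_nr. */
static unsigned int sketch_frame_nr(const struct sketch_ring_geom *g)
{
	assert(g->frame_size != 0);
	return sketch_ring_size(g) / g->frame_size;
}

/* Offset of the frame following the one at 'off', wrapping at the end. */
static unsigned int sketch_next_offset(const struct sketch_ring_geom *g,
				       unsigned int off)
{
	return (off + g->frame_size) % sketch_ring_size(g);
}
```

Assuming a 4096 byte page, the setup above yields a block size of 65536 bytes,
a ring size of 4194304 bytes and 256 frames of 16384 bytes each, so the offset
following the last frame wraps back to 0.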
306
307Message transmission:
308
309This example assumes some ring parameters of the ring setup are available.
310A single message is constructed and transmitted. To send multiple messages
311at once they would be constructed in consecutive frames before a final call
312to sendto().
313
314 unsigned int frame_offset = 0;
315 struct nl_mmap_hdr *hdr;
316 struct nlmsghdr *nlh;
317 struct sockaddr_nl addr = {
318 .nl_family = AF_NETLINK,
319 };
320
321 hdr = tx_ring + frame_offset;
322 if (hdr->nm_status != NL_MMAP_STATUS_UNUSED)
323 /* No frame available. Use poll() to avoid this. */
324 exit(1);
325
326 nlh = (void *)hdr + NL_MMAP_HDRLEN;
327
328 /* Build message */
329 build_message(nlh);
330
331 /* Fill frame header: length and status need to be set */
332 hdr->nm_len = nlh->nlmsg_len;
333 hdr->nm_status = NL_MMAP_STATUS_VALID;
334
335 if (sendto(fd, NULL, 0, 0, (struct sockaddr *)&addr, sizeof(addr)) < 0)
336 exit(1);
337
338 /* Advance frame offset to next frame */
339 frame_offset = (frame_offset + frame_size) % ring_size;
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index 94444b152fbc..23dd80e82b8e 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -685,14 +685,342 @@ int main(int argc, char **argp)
685}
686
687-------------------------------------------------------------------------------
688+ AF_PACKET TPACKET_V3 example
689-------------------------------------------------------------------------------
690
691AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame
692sizes by doing its own memory management. It is based on blocks where polling
693works on a per block basis instead of per ring as in TPACKET_V2 and its predecessor.
694
695It is said that TPACKET_V3 brings the following benefits:
696 *) ~15 - 20% reduction in CPU-usage
697 *) ~20% increase in packet capture rate
698 *) ~2x increase in packet density
699 *) Port aggregation analysis
700 *) Non static frame size to capture entire packet payload
701
702So it seems to be a good candidate to be used with packet fanout.
703
704Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile
705it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.):
706
707#include <stdio.h>
708#include <stdlib.h>
709#include <stdint.h>
710#include <string.h>
711#include <assert.h>
712#include <net/if.h>
713#include <arpa/inet.h>
714#include <netdb.h>
715#include <poll.h>
716#include <unistd.h>
717#include <signal.h>
718#include <inttypes.h>
719#include <sys/socket.h>
720#include <sys/mman.h>
721#include <linux/if_packet.h>
722#include <linux/if_ether.h>
723#include <linux/ip.h>
724
725#define BLOCK_SIZE (1 << 22)
726#define FRAME_SIZE 2048
727
728#define NUM_BLOCKS 64
729#define NUM_FRAMES ((BLOCK_SIZE * NUM_BLOCKS) / FRAME_SIZE)
730
731#define BLOCK_RETIRE_TOV_IN_MS 64
732#define BLOCK_PRIV_AREA_SZ 13
733
734#define ALIGN_8(x) (((x) + 8 - 1) & ~(8 - 1))
735
736#define BLOCK_STATUS(x) ((x)->h1.block_status)
737#define BLOCK_NUM_PKTS(x) ((x)->h1.num_pkts)
738#define BLOCK_O2FP(x) ((x)->h1.offset_to_first_pkt)
739#define BLOCK_LEN(x) ((x)->h1.blk_len)
740#define BLOCK_SNUM(x) ((x)->h1.seq_num)
741#define BLOCK_O2PRIV(x) ((x)->offset_to_priv)
742#define BLOCK_PRIV(x) ((void *) ((uint8_t *) (x) + BLOCK_O2PRIV(x)))
743#define BLOCK_HDR_LEN (ALIGN_8(sizeof(struct block_desc)))
744#define BLOCK_PLUS_PRIV(sz_pri) (BLOCK_HDR_LEN + ALIGN_8((sz_pri)))
745
746#ifndef likely
747# define likely(x) __builtin_expect(!!(x), 1)
748#endif
749#ifndef unlikely
750# define unlikely(x) __builtin_expect(!!(x), 0)
751#endif
752
753struct block_desc {
754 uint32_t version;
755 uint32_t offset_to_priv;
756 struct tpacket_hdr_v1 h1;
757};
758
759struct ring {
760 struct iovec *rd;
761 uint8_t *map;
762 struct tpacket_req3 req;
763};
764
765static unsigned long packets_total = 0, bytes_total = 0;
766static volatile sig_atomic_t sigint = 0;
767
768void sighandler(int num)
769{
770 sigint = 1;
771}
772
773static int setup_socket(struct ring *ring, char *netdev)
774{
775 int err, i, fd, v = TPACKET_V3;
776 struct sockaddr_ll ll;
777
778 fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
779 if (fd < 0) {
780 perror("socket");
781 exit(1);
782 }
783
784 err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v));
785 if (err < 0) {
786 perror("setsockopt");
787 exit(1);
788 }
789
790 memset(&ring->req, 0, sizeof(ring->req));
791 ring->req.tp_block_size = BLOCK_SIZE;
792 ring->req.tp_frame_size = FRAME_SIZE;
793 ring->req.tp_block_nr = NUM_BLOCKS;
794 ring->req.tp_frame_nr = NUM_FRAMES;
795 ring->req.tp_retire_blk_tov = BLOCK_RETIRE_TOV_IN_MS;
796 ring->req.tp_sizeof_priv = BLOCK_PRIV_AREA_SZ;
797 ring->req.tp_feature_req_word |= TP_FT_REQ_FILL_RXHASH;
798
799 err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req,
800 sizeof(ring->req));
801 if (err < 0) {
802 perror("setsockopt");
803 exit(1);
804 }
805
806 ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr,
807 PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED,
808 fd, 0);
809 if (ring->map == MAP_FAILED) {
810 perror("mmap");
811 exit(1);
812 }
813
814 ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd));
815 assert(ring->rd);
816 for (i = 0; i < ring->req.tp_block_nr; ++i) {
817 ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size);
818 ring->rd[i].iov_len = ring->req.tp_block_size;
819 }
820
821 memset(&ll, 0, sizeof(ll));
822 ll.sll_family = PF_PACKET;
823 ll.sll_protocol = htons(ETH_P_ALL);
824 ll.sll_ifindex = if_nametoindex(netdev);
825 ll.sll_hatype = 0;
826 ll.sll_pkttype = 0;
827 ll.sll_halen = 0;
828
829 err = bind(fd, (struct sockaddr *) &ll, sizeof(ll));
830 if (err < 0) {
831 perror("bind");
832 exit(1);
833 }
834
835 return fd;
836}
837
838#ifdef __checked
839static uint64_t prev_block_seq_num = 0;
840
841void assert_block_seq_num(struct block_desc *pbd)
842{
843 if (unlikely(prev_block_seq_num + 1 != BLOCK_SNUM(pbd))) {
844 printf("prev_block_seq_num:%"PRIu64", expected seq:%"PRIu64" != "
845 "actual seq:%"PRIu64"\n", prev_block_seq_num,
846 prev_block_seq_num + 1, (uint64_t) BLOCK_SNUM(pbd));
847 exit(1);
848 }
849
850 prev_block_seq_num = BLOCK_SNUM(pbd);
851}
852
853static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
854{
855 if (BLOCK_NUM_PKTS(pbd)) {
856 if (unlikely(bytes != BLOCK_LEN(pbd))) {
857 printf("block:%u with %upackets, expected len:%u != actual len:%u\n",
858 block_num, BLOCK_NUM_PKTS(pbd), bytes, BLOCK_LEN(pbd));
859 exit(1);
860 }
861 } else {
862 if (unlikely(BLOCK_LEN(pbd) != BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ))) {
863 printf("block:%u, expected len:%lu != actual len:%u\n",
864 block_num, BLOCK_HDR_LEN, BLOCK_LEN(pbd));
865 exit(1);
866 }
867 }
868}
869
870static void assert_block_header(struct block_desc *pbd, const int block_num)
871{
872 uint32_t block_status = BLOCK_STATUS(pbd);
873
874 if (unlikely((block_status & TP_STATUS_USER) == 0)) {
875 printf("block:%u, not in TP_STATUS_USER\n", block_num);
876 exit(1);
877 }
878
879 assert_block_seq_num(pbd);
880}
881#else
882static inline void assert_block_header(struct block_desc *pbd, const int block_num)
883{
884}
885static void assert_block_len(struct block_desc *pbd, uint32_t bytes, int block_num)
886{
887}
888#endif
889
890static void display(struct tpacket3_hdr *ppd)
891{
892 struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac);
893 struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN);
894
895 if (eth->h_proto == htons(ETH_P_IP)) {
896 struct sockaddr_in ss, sd;
897 char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST];
898
899 memset(&ss, 0, sizeof(ss));
900 ss.sin_family = PF_INET;
901 ss.sin_addr.s_addr = ip->saddr;
902 getnameinfo((struct sockaddr *) &ss, sizeof(ss),
903 sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST);
904
905 memset(&sd, 0, sizeof(sd));
906 sd.sin_family = PF_INET;
907 sd.sin_addr.s_addr = ip->daddr;
908 getnameinfo((struct sockaddr *) &sd, sizeof(sd),
909 dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST);
910
911 printf("%s -> %s, ", sbuff, dbuff);
912 }
913
914 printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash);
915}
916
917static void walk_block(struct block_desc *pbd, const int block_num)
918{
919 int num_pkts = BLOCK_NUM_PKTS(pbd), i;
920 unsigned long bytes = 0;
921 unsigned long bytes_with_padding = BLOCK_PLUS_PRIV(BLOCK_PRIV_AREA_SZ);
922 struct tpacket3_hdr *ppd;
923
924 assert_block_header(pbd, block_num);
925
926 ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + BLOCK_O2FP(pbd));
927 for (i = 0; i < num_pkts; ++i) {
928 bytes += ppd->tp_snaplen;
929 if (ppd->tp_next_offset)
930 bytes_with_padding += ppd->tp_next_offset;
931 else
932 bytes_with_padding += ALIGN_8(ppd->tp_snaplen + ppd->tp_mac);
933
934 display(ppd);
935
936 ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + ppd->tp_next_offset);
937 __sync_synchronize();
938 }
939
940 assert_block_len(pbd, bytes_with_padding, block_num);
941
942 packets_total += num_pkts;
943 bytes_total += bytes;
944}
945
946void flush_block(struct block_desc *pbd)
947{
948 BLOCK_STATUS(pbd) = TP_STATUS_KERNEL;
949 __sync_synchronize();
950}
951
952static void teardown_socket(struct ring *ring, int fd)
953{
954 munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr);
955 free(ring->rd);
956 close(fd);
957}
958
959int main(int argc, char **argp)
960{
961 int fd, err;
962 socklen_t len;
963 struct ring ring;
964 struct pollfd pfd;
965 unsigned int block_num = 0;
966 struct block_desc *pbd;
967 struct tpacket_stats_v3 stats;
968
969 if (argc != 2) {
970 fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]);
971 return EXIT_FAILURE;
972 }
973
974 signal(SIGINT, sighandler);
975
976 memset(&ring, 0, sizeof(ring));
977 fd = setup_socket(&ring, argp[argc - 1]);
978 assert(fd > 0);
979
980 memset(&pfd, 0, sizeof(pfd));
981 pfd.fd = fd;
982 pfd.events = POLLIN | POLLERR;
983 pfd.revents = 0;
984
985 while (likely(!sigint)) {
986 pbd = (struct block_desc *) ring.rd[block_num].iov_base;
987retry_block:
988 if ((BLOCK_STATUS(pbd) & TP_STATUS_USER) == 0) {
989 poll(&pfd, 1, -1);
990 goto retry_block;
991 }
992
993 walk_block(pbd, block_num);
994 flush_block(pbd);
995 block_num = (block_num + 1) % NUM_BLOCKS;
996 }
997
998 len = sizeof(stats);
999 err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len);
1000 if (err < 0) {
1001 perror("getsockopt");
1002 exit(1);
1003 }
1004
1005 fflush(stdout);
1006 printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n",
1007 stats.tp_packets, bytes_total, stats.tp_drops,
1008 stats.tp_freeze_q_cnt);
1009
1010 teardown_socket(&ring, fd);
1011 return 0;
1012}
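
The byte accounting done in walk_block() can be isolated into a small helper
to make the rule explicit: each packet contributes its tp_next_offset, and the
last packet of a block, whose tp_next_offset is 0, contributes its 8-byte
aligned snaplen plus MAC offset. The sketch below is only an illustration of
that rule; struct sketch_pkt and sketch_block_len() are hypothetical names and
not part of the example program.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round up to the next multiple of 8, as ALIGN_8() does above. */
#define SKETCH_ALIGN_8(x)	(((x) + 8u - 1u) & ~(8u - 1u))

/* Hypothetical per-packet summary holding the three header fields used. */
struct sketch_pkt {
	uint32_t next_offset;	/* tp_next_offset, 0 for the last packet */
	uint32_t snaplen;	/* tp_snaplen */
	uint32_t mac;		/* tp_mac */
};

/*
 * Expected block length: header (plus private area) length followed by
 * the span of every packet in the block.
 */
static uint32_t sketch_block_len(uint32_t hdr_len,
				 const struct sketch_pkt *pkts, size_t n)
{
	uint32_t total = hdr_len;
	size_t i;

	assert(pkts != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (pkts[i].next_offset)
			total += pkts[i].next_offset;
		else
			total += SKETCH_ALIGN_8(pkts[i].snaplen + pkts[i].mac);
	}
	return total;
}
```

With BLOCK_PRIV_AREA_SZ set to 13, the private area is padded to
SKETCH_ALIGN_8(13), i.e. 16 bytes, which is why the example compares against
BLOCK_PLUS_PRIV() rather than the raw sizes.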
1013
1014-------------------------------------------------------------------------------
1015+ PACKET_TIMESTAMP
1016-------------------------------------------------------------------------------
1017
1018The PACKET_TIMESTAMP setting determines the source of the timestamp in
1019the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your
1020NIC is capable of timestamping packets in hardware, you can request those
1021hardware timestamps to be used. Note: you may need to enable the generation
1022of hardware timestamps with SIOCSHWTSTAMP (see related information from
1023Documentation/networking/timestamping.txt).
1024
1025PACKET_TIMESTAMP accepts the same integer bit field as
1026SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE
@@ -704,8 +1032,36 @@ SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.
1032 req |= SOF_TIMESTAMPING_SYS_HARDWARE;
1033 setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));
1034
1035For the mmap(2)ed ring buffers, such timestamps are stored in the
1036tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine
1037what kind of timestamp has been reported, the tp_status field is binary OR'ed
1038with the following possible bits ...
1039
1040 TP_STATUS_TS_SYS_HARDWARE
1041 TP_STATUS_TS_RAW_HARDWARE
1042 TP_STATUS_TS_SOFTWARE
1043
1044... that are equivalent to their SOF_TIMESTAMPING_* counterparts. For the
1045RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set),
1046then this means that a software fallback was invoked *within* PF_PACKET's
1047processing code (less precise).
1048
1049Getting timestamps for the TX_RING works as follows: i) fill the ring frames,
1050ii) call sendto() e.g. in blocking mode, iii) wait for the status of the relevant
1051frames to be updated, i.e. the frames handed back to the application, iv) walk
1052through the frames to pick up the individual hw/sw timestamps.
1053
1054Only (!) if transmit timestamping is enabled, then these bits are combined
1055with binary | with TP_STATUS_AVAILABLE, so you must check for that in your
1056application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING))
1057in a first step to see if the frame belongs to the application, and then
1058one can extract the type of timestamp in a second step from tp_status)!
1059
1060If you don't care about them, thus having it disabled, checking for
1061TP_STATUS_AVAILABLE or TP_STATUS_WRONG_FORMAT is sufficient. If in the
1062TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec
1063members do not contain a valid value. For TX_RINGs, by default no timestamp
1064is generated!
1065
1066See include/linux/net_tstamp.h and Documentation/networking/timestamping
1067for more information on hardware timestamps.
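
The two-step TX_RING check described above (first ownership, then timestamp
type) might look roughly like the following sketch for a TPACKET_V2 frame. The
sketch_* functions are hypothetical helpers, and the order in which the
timestamp bits are tested is an assumption made for this illustration, not
something the text prescribes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Fallback in case the header in use no longer exports this bit. */
#ifndef TP_STATUS_TS_SYS_HARDWARE
#define TP_STATUS_TS_SYS_HARDWARE	(1 << 30)
#endif

/* Step 1: the frame is back with the application once the kernel is done. */
static int sketch_tx_frame_is_ours(uint32_t tp_status)
{
	return !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING));
}

/* Step 2: return the timestamp type bit that is set, or 0 if none is. */
static uint32_t sketch_ts_bit(uint32_t tp_status)
{
	if (tp_status & TP_STATUS_TS_RAW_HARDWARE)
		return TP_STATUS_TS_RAW_HARDWARE;
	if (tp_status & TP_STATUS_TS_SYS_HARDWARE)
		return TP_STATUS_TS_SYS_HARDWARE;
	if (tp_status & TP_STATUS_TS_SOFTWARE)
		return TP_STATUS_TS_SOFTWARE;
	return 0;
}

/*
 * Returns -1 if the kernel still owns the frame, 0 if the frame carries
 * no valid timestamp, and 1 if *sec and *nsec were filled in.
 */
static int sketch_tx_timestamp(const struct tpacket2_hdr *hdr,
			       uint32_t *sec, uint32_t *nsec)
{
	assert(hdr != NULL && sec != NULL && nsec != NULL);

	if (!sketch_tx_frame_is_ours(hdr->tp_status))
		return -1;
	if (!sketch_ts_bit(hdr->tp_status))
		return 0;

	*sec = hdr->tp_sec;
	*nsec = hdr->tp_nsec;
	return 1;
}
```

An application would call sketch_tx_timestamp() on each frame after sendto()
has returned and only trust tp_sec and tp_nsec when the result is 1.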
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index f9fa6db40a52..654d2e55c8cb 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -1,6 +1,6 @@
1 STMicroelectronics 10/100/1000 Synopsys Ethernet driver
2
3Copyright (C) 2007-2013 STMicroelectronics Ltd
4Author: Giuseppe Cavallaro <peppe.cavallaro@st.com>
5
6This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
@@ -10,7 +10,7 @@ Currently this network device driver is for all STM embedded MAC/GMAC
10(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XILINX XC2V3000
11FF1152AMT0221 D1215994A VIRTEX FPGA board.
12
13DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether
14MAC 10/100 Universal version 4.0 have been used for developing this driver.
15
16This driver supports both the platform bus and PCI.
@@ -32,6 +32,8 @@ The kernel configuration option is STMMAC_ETH:
32 watchdog: transmit timeout (in milliseconds);
33 flow_ctrl: Flow control ability [on/off];
34 pause: Flow Control Pause Time;
35 eee_timer: tx EEE timer;
36 chain_mode: select chain mode instead of ring.
37
383) Command line options
39Driver parameters can also be passed on the command line by using:
@@ -164,12 +166,12 @@ Where:
166 o bus_setup: perform HW setup of the bus. For example, on some ST platforms
167 this field is used to configure the AMBA bridge to generate more
168 efficient STBus traffic.
169 o init/exit: callbacks used for calling a custom initialization;
170 this is sometimes necessary on some platforms (e.g. ST boxes)
171 where the HW needs to have set some PIO lines or system cfg
172 registers.
173 o custom_cfg/custom_data: this is a custom configuration that can be passed
174 while initializing the resources.
175 o bsp_priv: another private pointer.
176
177For the MDIO bus we have:
@@ -273,6 +275,8 @@ reset procedure etc).
275 o norm_desc.c: functions for handling normal descriptors;
276 o chain_mode.c/ring_mode.c: functions to manage RING/CHAINED modes;
277 o mmc_core.c/mmc.h: Management MAC Counters;
278 o stmmac_hwtstamp.c: HW timestamp support for PTP
279 o stmmac_ptp.c: PTP 1588 clock
280
2815) Debug Information
282
@@ -326,6 +330,35 @@ To enter in Tx LPI mode the driver needs to have a software timer
330that enables and disables the LPI mode when there is nothing to be
331transmitted.
332
3337) Extended descriptors
334The extended descriptors give us information about the receive Ethernet payload
335when it is carrying PTP packets or TCP/UDP/ICMP over IP.
336These are not available on GMAC Synopsys chips older than 3.50.
337At probe time the driver will decide if these can actually be used.
338This support is also mandatory for PTPv2 because the extra descriptors 6 and 7
339are used for saving the hardware timestamps.
340
3418) Precision Time Protocol (PTP)
342The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP),
343which enables precise synchronization of clocks in measurement and
344control systems implemented with technologies such as network
345communication.
346
347In addition to the basic timestamp features mentioned in IEEE 1588-2002,
348new GMAC cores support the advanced timestamp features of
349IEEE 1588-2008, which can be enabled when configuring the kernel.
350
3519) SGMII/RGMII support
352New GMAC devices provide their own way to manage RGMII/SGMII.
353This information is available at run-time by looking at the
354HW capability register. This means that the stmmac can manage
355auto-negotiation and link status w/o using the PHYLIB stuff.
356In fact, the HW provides a subset of extended registers to
357restart the ANE, verify Full/Half duplex mode and Speed.
358Also thanks to these registers it is possible to look at the
359Auto-negotiated Link Partner Ability.
360
36110) TODO:
362 o XGMAC is not supported.
363 o Complete the TBI & RTBI support.
364 o Extended VLAN support for 3.70a SYNP GMAC.