author    Linus Torvalds <torvalds@linux-foundation.org> 2014-04-02 23:53:45 -0400
committer Linus Torvalds <torvalds@linux-foundation.org> 2014-04-02 23:53:45 -0400
commit    cd6362befe4cc7bf589a5236d2a780af2d47bcc9
tree      3bd4e13ec3f92a00dc4f6c3d65e820b54dbfe46e /Documentation/networking
parent    0f1b1e6d73cb989ce2c071edc57deade3b084dfe
parent    b1586f099ba897542ece36e8a23c1a62907261ef
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Here is my initial pull request for the networking subsystem during
  this merge window:

   1) Support for ESN in AH (RFC 4302) from Fan Du.

   2) Add full kernel doc for ethtool command structures, from Ben
      Hutchings.

   3) Add BCM7xxx PHY driver, from Florian Fainelli.

   4) Export computed TCP rate information in netlink socket dumps, from
      Eric Dumazet.

   5) Allow IPSEC SA to be dumped partially using a filter, from Nicolas
      Dichtel.

   6) Convert many drivers to pci_enable_msix_range(), from Alexander
      Gordeev.

   7) Record SKB timestamps more efficiently, from Eric Dumazet.

   8) Switch to microsecond resolution for TCP round trip times, also
      from Eric Dumazet.

   9) Clean up and fix 6lowpan fragmentation handling by making use of
      the existing inet_frag api for its implementation.

  10) Add TX grant mapping to xen-netback driver, from Zoltan Kiss.

  11) Auto size SKB lengths when composing netlink messages based upon
      past message sizes used, from Eric Dumazet.

  12) qdisc dumps can take a long time, add a cond_resched(), from Eric
      Dumazet.

  13) Sanitize netpoll core and drivers wrt. SKB handling semantics.
      Get rid of never-used-in-tree netpoll RX handling. From Eric W
      Biederman.

  14) Support inter-address-family and namespace changing in VTI tunnel
      driver(s). From Steffen Klassert.

  15) Add Altera TSE driver, from Vince Bridgers.

  16) Optimizing csum_replace2() so that it doesn't adjust the checksum
      by checksumming the entire header, from Eric Dumazet.

  17) Expand BPF internal implementation for faster interpreting, more
      direct translations into JIT'd code, and much cleaner uses of BPF
      filtering in non-socket contexts. From Daniel Borkmann and Alexei
      Starovoitov"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1976 commits)
  netpoll: Use skb_irq_freeable to make zap_completion_queue safe.
  net: Add a test to see if a skb is freeable in irq context
  qlcnic: Fix build failure due to undefined reference to `vxlan_get_rx_port'
  net: ptp: move PTP classifier in its own file
  net: sxgbe: make "core_ops" static
  net: sxgbe: fix logical vs bitwise operation
  net: sxgbe: sxgbe_mdio_register() frees the bus
  Call efx_set_channels() before efx->type->dimension_resources()
  xen-netback: disable rogue vif in kthread context
  net/mlx4: Set proper build dependancy with vxlan
  be2net: fix build dependency on VxLAN
  mac802154: make csma/cca parameters per-wpan
  mac802154: allow only one WPAN to be up at any given time
  net: filter: minor: fix kdoc in __sk_run_filter
  netlink: don't compare the nul-termination in nla_strcmp
  can: c_can: Avoid led toggling for every packet.
  can: c_can: Simplify TX interrupt cleanup
  can: c_can: Store dlc private
  can: c_can: Reduce register access
  can: c_can: Make the code readable
  ...
Diffstat (limited to 'Documentation/networking')
-rw-r--r--  Documentation/networking/altera_tse.txt    263
-rw-r--r--  Documentation/networking/bonding.txt        96
-rw-r--r--  Documentation/networking/can.txt             2
-rw-r--r--  Documentation/networking/filter.txt        125
-rw-r--r--  Documentation/networking/gianfar.txt        30
-rw-r--r--  Documentation/networking/igb.txt            48
-rw-r--r--  Documentation/networking/phy.txt            11
-rw-r--r--  Documentation/networking/pktgen.txt         24
-rw-r--r--  Documentation/networking/rxrpc.txt          81
-rw-r--r--  Documentation/networking/tcp.txt             2
-rw-r--r--  Documentation/networking/timestamping.txt    6
11 files changed, 569 insertions, 119 deletions
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt
new file mode 100644
index 000000000000..3f24df8c6e65
--- /dev/null
+++ b/Documentation/networking/altera_tse.txt
@@ -0,0 +1,263 @@
+       Altera Triple-Speed Ethernet MAC driver
+
+Copyright (C) 2008-2014 Altera Corporation
+
+This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
+using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
+platform bus to obtain component resources. The designs used to test this
+driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board,
+and tested with ARM and NIOS processor hosts separately. The anticipated use
+cases are simple communications between an embedded system and an external peer
+for status and simple configuration of the embedded system.
+
+For more information visit www.altera.com and www.rocketboards.org. Support
+forums for the driver may be found on www.rocketboards.org, and a design used
+to test this driver may be found there as well. Support is also available from
+the maintainer of this driver, found in MAINTAINERS.
+
+The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP
+components that can be assembled and built into an FPGA using the Altera
+Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that
+this driver was tested against. The sopc2dts tool is used to create the
+device tree for the driver, and may be found at rocketboards.org.
+
+The driver probe function examines the device tree and determines if the
+Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The
+probe function then installs the appropriate set of DMA routines for
+initialization, transmit setup, receive, and interrupt handling for the
+respective configuration.
+
+The SGDMA component is to be deprecated in the near future (over the next 1-2
+years as of this writing in early 2014) in favor of the MSGDMA component.
+SGDMA support is included for existing designs and reference in case a
+developer wishes to support their own soft DMA logic and driver support. Any
+new designs should not use the SGDMA.
+
+The SGDMA supports only a single transmit or receive operation at a time, and
+therefore will not perform as well compared to the MSGDMA soft IP. Please
+visit www.altera.com for known, documented SGDMA errata.
+
+Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time.
+Scatter-gather DMA will be added to a future maintenance update to this
+driver.
+
+Jumbo frames are not supported at this time.
+
+The driver limits PHY operations to 10/100Mbps, and has not yet been fully
+tested for 1Gbps. This support will be added in a future maintenance update.
+
+1) Kernel Configuration
+The kernel configuration option is ALTERA_TSE:
+ Device Drivers ---> Network device support ---> Ethernet driver support --->
+ Altera Triple-Speed Ethernet MAC support (ALTERA_TSE)
+
+2) Driver parameters list:
+        debug: message level (0: no output, 16: all);
+        dma_rx_num: Number of descriptors in the RX list (default is 64);
+        dma_tx_num: Number of descriptors in the TX list (default is 64).
+
+3) Command line options
+Driver parameters can also be passed on the kernel command line by using:
+        altera_tse=dma_rx_num:128,dma_tx_num:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+When the driver's transmit routine is called by the kernel, it sets up a
+transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
+MSGDMA), and initiates a transmit operation. Once the transmit is complete, an
+interrupt is driven by the transmit DMA logic. The driver handles the transmit
+completion in the context of the interrupt handling chain by recycling the
+resources required to send and track the requested transmit operation.
+
+4.2) Receive process
+The driver will post receive buffers to the receive DMA logic during driver
+initialization. Receive buffers may or may not be queued depending upon the
+underlying DMA logic (MSGDMA is able to queue receive buffers, SGDMA is not
+able to queue receive buffers to the SGDMA receive logic). When a packet is
+received, the DMA logic generates an interrupt. The driver handles a receive
+interrupt by obtaining the DMA receive logic status, reaping receive
+completions until no more receive completions are available.
+
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for receive operations. Interrupt mitigation is not yet supported
+for transmit operations, but will be added in a future maintenance release.
+
+4.4) Ethtool support
+Ethtool is supported. Driver statistics and internal errors can be read using
+the ethtool -S ethX command. Registers can be dumped using ethtool -d ethX.
+
+4.5) PHY Support
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.6) List of source files:
+ o Kconfig
+ o Makefile
+ o altera_tse_main.c: main network device driver
+ o altera_tse_ethtool.c: ethtool support
+ o altera_tse.h: private driver structure and common definitions
+ o altera_msgdma.h: MSGDMA implementation function definitions
+ o altera_sgdma.h: SGDMA implementation function definitions
+ o altera_msgdma.c: MSGDMA implementation
+ o altera_sgdma.c: SGDMA implementation
+ o altera_sgdmahw.h: SGDMA register and descriptor definitions
+ o altera_msgdmahw.h: MSGDMA register and descriptor definitions
+ o altera_utils.c: Driver utility functions
+ o altera_utils.h: Driver utility function definitions
+
+5) Debug Information
+
+The driver exports debug information such as internal statistics,
+MAC and DMA registers etc.
+
+A user may use the ethtool support to get statistics:
+e.g. using: ethtool -S ethX (that shows the statistics counters)
+or see the MAC registers: e.g. using: ethtool -d ethX
+
+The developer can also use the "debug" module parameter to get
+further debug information.
+
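The statistics counters mentioned above can also be collected from a script. The following is a minimal sketch, assuming that `ethtool -S` prints one `name: value` pair per line after a header line; the function names `parse_ethtool_stats` and `read_stats` and the sample text are illustrative and not part of the driver.

```python
import subprocess


def parse_ethtool_stats(text):
    """Turn the 'name: value' lines of `ethtool -S` output into a dict."""
    stats = {}
    for line in text.splitlines():
        # Split on the last colon so the header line ("NIC statistics:")
        # yields an empty value and is skipped.
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def read_stats(iface):
    """Run `ethtool -S <iface>` and parse its output (needs ethtool installed)."""
    result = subprocess.run(
        ["ethtool", "-S", iface], check=True, capture_output=True, text=True
    )
    return parse_ethtool_stats(result.stdout)


# Illustrative sample only; real counter values depend on the interface.
SAMPLE = """NIC statistics:
     tx_packets: 1200
     rx_packets: 3400
     rx_crc_errors: 2
"""
```

Calling `read_stats("eth0")` on a system with the driver loaded would return the counters as integers keyed by the names listed in the statistics section of this file.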
+6) Statistics Support
+
+The controller and driver support a mix of IEEE standard defined statistics,
+RFC defined statistics, and driver or Altera defined statistics. The four
+specifications containing the standard definitions for these statistics are
+as follows:
+
+ o IEEE 802.3-2012 - IEEE Standard for Ethernet.
+ o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
+ o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
+ o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
+
+The statistics supported by the TSE and the device driver are as follows:
+
+"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.2. This statistic is the count of frames that are successfully
+transmitted.
+
+"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.5. This statistic is the count of frames that are successfully
+received. This count does not include any error packets such as CRC errors,
+length errors, or alignment errors.
+
+"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE
+802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are
+an integral number of bytes in length and do not pass the CRC test as the frame
+is received.
+
+"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012,
+Section 5.2.2.1.7. This statistic is the count of frames that are not an
+integral number of bytes in length and do not pass the CRC test as the frame is
+received.
+
+"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.8. This statistic is the count of data and pad bytes
+successfully transmitted from the interface.
+
+"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.14. This statistic is the count of data and pad bytes
+successfully received by the controller.
+
+"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE
+802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames
+transmitted from the network controller.
+
+"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE
+802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames
+received by the network controller.
+
+"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is
+a count of the number of packets received containing errors that prevented the
+packet from being delivered to a higher level protocol.
+
+"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic
+is a count of the number of packets that could not be transmitted due to errors.
+
+"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were not addressed
+to the broadcast address or a multicast group.
+
+"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+a multicast address group.
+
+"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+the broadcast address.
+
+"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This
+statistic is the number of outbound packets not transmitted even though an
+error was not detected. An example of a reason this might occur is to free up
+internal buffer space.
+
+"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were not addressed to
+a multicast group or broadcast address.
+
+"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+multicast group.
+
+"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+broadcast address.
+
+"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819.
+This statistic counts the number of packets dropped due to lack of internal
+controller resources.
+
+"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819.
+This statistic counts the total number of bytes received by the controller,
+including error and discarded packets.
+
+"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819.
+This statistic counts the total number of packets received by the controller,
+including error, discarded, unicast, multicast, and broadcast packets.
+
+"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets received less
+than 64 bytes long.
+
+"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets greater than 1518
+bytes long.
+
+"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819.
+This statistic counts the total number of packets received that were 64 octets
+in length.
+
+"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC
+2819. This statistic counts the total number of packets received that were
+between 65 and 127 octets in length inclusive.
+
+"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 128 and 255 octets in length inclusive.
+
+"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 256 and 511 octets in length inclusive.
+
+"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 512 and 1023 octets in length inclusive.
+
+"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets defined
+in RFC 2819. This statistic is the total number of packets received that were
+between 1024 and 1518 octets in length inclusive.
+
+"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the
+Altera TSE. This statistic counts the number of received good and errored
+frames between the length of 1519 and the maximum frame length configured
+in the frm_length register. See the Altera TSE User Guide for more details.
+
+"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This
+statistic is the total number of packets received that were longer than 1518
+octets, and had either a bad CRC with an integral number of octets (CRC Error)
+or a bad CRC with a non-integral number of octets (Alignment Error).
+
+"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This
+statistic is the total number of packets received that were less than 64 octets
+in length and had either a bad CRC with an integral number of octets (CRC
+error) or a bad CRC with a non-integral number of octets (Alignment Error).
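As an illustration of how the RFC 2819 size counters above might be used, the sketch below turns them into a received-frame size distribution. It takes a dictionary of counter values keyed by the statistic names from this file; the helper names `size_distribution` and `rx_error_ratio` are hypothetical, and dividing rx_errors by rx_total_packets is only one possible way to express an error ratio, not something the driver defines.

```python
# Size-bucket counters named in the statistics list, smallest first.
SIZE_BUCKETS = [
    "rx_64_bytes",
    "rx_65_127_bytes",
    "rx_128_255_bytes",
    "rx_256_511_bytes",
    "rx_512_1023_bytes",
    "rx_1024_1518_bytes",
    "rx_gte_1519_bytes",
]


def size_distribution(stats):
    """Return the fraction of received frames that fell into each size bucket."""
    counts = {name: stats.get(name, 0) for name in SIZE_BUCKETS}
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in SIZE_BUCKETS}
    return {name: count / total for name, count in counts.items()}


def rx_error_ratio(stats):
    """Errored receive packets as a share of all packets seen (assumed metric)."""
    total = stats.get("rx_total_packets", 0)
    return stats.get("rx_errors", 0) / total if total else 0.0
```

Missing counters are treated as zero so the helpers can be applied to partial dumps.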
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 5cdb22971d19..a383c00392d0 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -270,16 +270,15 @@ arp_ip_target
 arp_validate
 
         Specifies whether or not ARP probes and replies should be
-        validated in the active-backup mode. This causes the ARP
-        monitor to examine the incoming ARP requests and replies, and
-        only consider a slave to be up if it is receiving the
-        appropriate ARP traffic.
+        validated in any mode that supports arp monitoring, or whether
+        non-ARP traffic should be filtered (disregarded) for link
+        monitoring purposes.
 
         Possible values are:
 
         none or 0
 
-                No validation is performed. This is the default.
+                No validation or filtering is performed.
 
         active or 1
 
@@ -293,31 +292,68 @@ arp_validate
 
                 Validation is performed for all slaves.
 
-        For the active slave, the validation checks ARP replies to
-        confirm that they were generated by an arp_ip_target. Since
-        backup slaves do not typically receive these replies, the
-        validation performed for backup slaves is on the ARP request
-        sent out via the active slave. It is possible that some
-        switch or network configurations may result in situations
-        wherein the backup slaves do not receive the ARP requests; in
-        such a situation, validation of backup slaves must be
-        disabled.
-
-        The validation of ARP requests on backup slaves is mainly
-        helping bonding to decide which slaves are more likely to
-        work in case of the active slave failure, it doesn't really
-        guarantee that the backup slave will work if it's selected
-        as the next active slave.
-
-        This option is useful in network configurations in which
-        multiple bonding hosts are concurrently issuing ARPs to one or
-        more targets beyond a common switch. Should the link between
-        the switch and target fail (but not the switch itself), the
-        probe traffic generated by the multiple bonding instances will
-        fool the standard ARP monitor into considering the links as
-        still up. Use of the arp_validate option can resolve this, as
-        the ARP monitor will only consider ARP requests and replies
-        associated with its own instance of bonding.
+        filter or 4
+
+                Filtering is applied to all slaves. No validation is
+                performed.
+
+        filter_active or 5
+
+                Filtering is applied to all slaves, validation is performed
+                only for the active slave.
+
+        filter_backup or 6
+
+                Filtering is applied to all slaves, validation is performed
+                only for backup slaves.
+
+        Validation:
+
+                Enabling validation causes the ARP monitor to examine the incoming
+                ARP requests and replies, and only consider a slave to be up if it
+                is receiving the appropriate ARP traffic.
+
+                For an active slave, the validation checks ARP replies to confirm
+                that they were generated by an arp_ip_target. Since backup slaves
+                do not typically receive these replies, the validation performed
+                for backup slaves is on the broadcast ARP request sent out via the
+                active slave. It is possible that some switch or network
+                configurations may result in situations wherein the backup slaves
+                do not receive the ARP requests; in such a situation, validation
+                of backup slaves must be disabled.
+
+                The validation of ARP requests on backup slaves is mainly helping
+                bonding to decide which slaves are more likely to work in case of
+                the active slave failure, it doesn't really guarantee that the
+                backup slave will work if it's selected as the next active slave.
+
+                Validation is useful in network configurations in which multiple
+                bonding hosts are concurrently issuing ARPs to one or more targets
+                beyond a common switch. Should the link between the switch and
+                target fail (but not the switch itself), the probe traffic
+                generated by the multiple bonding instances will fool the standard
+                ARP monitor into considering the links as still up. Use of
+                validation can resolve this, as the ARP monitor will only consider
+                ARP requests and replies associated with its own instance of
+                bonding.
+
+        Filtering:
+
+                Enabling filtering causes the ARP monitor to only use incoming ARP
+                packets for link availability purposes. Arriving packets that are
+                not ARPs are delivered normally, but do not count when determining
+                if a slave is available.
+
+                Filtering operates by only considering the reception of ARP
+                packets (any ARP packet, regardless of source or destination) when
+                determining if a slave has received traffic for link availability
+                purposes.
+
+                Filtering is useful in network configurations in which significant
+                levels of third party broadcast traffic would fool the standard
+                ARP monitor into considering the links as still up. Use of
+                filtering can resolve this, as only ARP traffic is considered for
+                link availability purposes.
 
         This option was added in bonding version 3.1.0.
 
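The arp_validate values in the hunk above combine two independent behaviours, validation and filtering. The sketch below decodes a setting into those behaviours. It assumes the numeric values act as bit flags (1 for active-slave validation, 2 for backup-slave validation, 4 for filtering), which is consistent with the values listed but is not stated in the text; the entries "backup or 2" and "all or 3" come from the part of bonding.txt that this hunk does not show. The function name `decode_arp_validate` is illustrative.

```python
ARP_VALIDATE_NAMES = {
    "none": 0,
    "active": 1,
    "backup": 2,   # not visible in the hunk above
    "all": 3,      # not visible in the hunk above
    "filter": 4,
    "filter_active": 5,
    "filter_backup": 6,
}


def decode_arp_validate(value):
    """Map an arp_validate name or number to the behaviours it enables."""
    if isinstance(value, str):
        value = ARP_VALIDATE_NAMES[value]
    if value not in ARP_VALIDATE_NAMES.values():
        raise ValueError("unsupported arp_validate value: %r" % (value,))
    return {
        # Assumed bit layout; see the note above.
        "validate_active": bool(value & 1),
        "validate_backup": bool(value & 2),
        "filter": bool(value & 4),
    }
```

For example, `decode_arp_validate("filter_backup")` reports filtering on all slaves with validation only for backup slaves, matching the description of value 6.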
diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt
index 0cbe6ec22d6f..2fa44cbe81b7 100644
--- a/Documentation/networking/can.txt
+++ b/Documentation/networking/can.txt
@@ -1017,7 +1017,7 @@ solution for a couple of reasons:
         in case of a bus-off condition after the specified delay time
         in milliseconds. By default it's off.
 
-    "bitrate 125000 sample_point 0.875"
+    "bitrate 125000 sample-point 0.875"
         Shows the real bit-rate in bits/sec and the sample-point in the
         range 0.000..0.999. If the calculation of bit-timing parameters
         is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index a06b48d2f5cc..81f940f4e884 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -546,6 +546,130 @@ ffffffffa0069c8f + <x>:
 For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful
 toolchain for developing and testing the kernel's JIT compiler.
 
+BPF kernel internals
+--------------------
+Internally, for the kernel interpreter, a different BPF instruction set
+format with similar underlying principles from BPF described in previous
+paragraphs is being used. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that a better performance can be achieved (more details later).
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized BPF code through
+a BPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into BPF with an optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> BPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the extended interpreter. For in-kernel handlers, this all works
+transparently by using sk_unattached_filter_create() for setting up the
+filter, and sk_unattached_filter_destroy() for destroying it. The macro
+SK_RUN_FILTER(filter, ctx) transparently invokes the right BPF function to
+run the filter. 'filter' is a pointer to struct sk_filter that we got from
+sk_unattached_filter_create(), and 'ctx' the given context (e.g. skb pointer).
+All constraints and restrictions from sk_chk_filter() apply before a
+conversion to the new layout is being done behind the scenes!
+
+Currently, for JITing, the user BPF format is being used and current BPF JIT
+compilers reused whenever possible. In other words, we do not (yet!) perform
+a JIT compilation in the new layout, however, future work will successively
+migrate traditional JIT compilers into the new instruction format as well, so
+that they will profit from the very same benefits. Thus, when speaking about
+JIT in the following, a JIT compiler (TBD) for the new instruction format is
+meant in this context.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+  The old format had two registers A and X, and a hidden frame pointer. The
+  new layout extends this to be 10 internal registers and a read-only frame
+  pointer. Since 64-bit CPUs are passing arguments to functions via registers
+  the number of args from BPF program to in-kernel function is restricted
+  to 5 and one register is used to accept return value from an in-kernel
+  function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+  sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+  registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+  Therefore, BPF calling convention is defined as:
+
+    * R0	- return value from in-kernel function
+    * R1 - R5	- arguments from BPF program to in-kernel function
+    * R6 - R9	- callee saved registers that in-kernel function will preserve
+    * R10	- read-only frame pointer to access stack
+
+  Thus, all BPF registers map one to one to HW registers on x86_64, aarch64,
+  etc, and BPF calling convention maps directly to ABIs used by the kernel on
+  64-bit architectures.
+
+  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
+  and may let more complex programs be interpreted.
+
+  R0 - R5 are scratch registers and BPF program needs to spill/fill them if
+  necessary across calls. Note that there is only one BPF program (== one BPF
+  main routine) and it cannot call other BPF functions, it can only call
+  predefined in-kernel functions, though.
+
+- Register width increases from 32-bit to 64-bit:
+
+  Still, the semantics of the original 32-bit ALU operations are preserved
+  via 32-bit subregisters. All BPF registers are 64-bit with 32-bit lower
+  subregisters that zero-extend into 64-bit if they are being written to.
+  That behavior maps directly to x86_64 and arm64 subregister definition, but
+  makes other JITs more difficult.
+
+  32-bit architectures run 64-bit internal BPF programs via interpreter.
+  Their JITs may convert BPF programs that only use 32-bit subregisters into
+  native instruction set and let the rest be interpreted.
+
+  Operation is 64-bit, because on 64-bit architectures, pointers are also
+  64-bit wide, and we want to pass 64-bit values in/out of kernel functions,
+  so 32-bit BPF registers would otherwise require defining a register-pair
+  ABI. A direct BPF register to HW register mapping would then not be
+  possible, and JIT would need to do combine/split/move operations for every
+  register in and out of the function, which is complex, bug prone and slow.
+  Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+  While the original design has constructs such as "if (cond) jump_true;
+  else jump_false;", they are being replaced with alternative constructs like
+  "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+  calls from/to other kernel functions:
+
+  After a kernel function call, R1 - R5 are reset to unreadable and R0 has a
+  return type of the function. Since R6 - R9 are callee saved, their state is
+  preserved across the call.
+
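The jt/jf to jt/fall-through change in the list above can be illustrated with a small lowering function. This is a simplified sketch that works on symbolic targets rather than real instruction offsets, and it is not the kernel's conversion code; the tuple forms `jmp_if`, `jmp_if_not` and `jmp` and the function name `lower_conditional` are hypothetical.

```python
def lower_conditional(cond, true_target, false_target, next_insn):
    """Rewrite 'if cond goto T else goto F' without a separate false target.

    next_insn is the instruction that follows the branch, i.e. where
    execution continues when the branch falls through.
    """
    if false_target == next_insn:
        # The false case is already the fall-through path.
        return [("jmp_if", cond, true_target)]
    if true_target == next_insn:
        # Invert the condition so the true case becomes the fall-through.
        return [("jmp_if_not", cond, false_target)]
    # General case: a conditional jump followed by an unconditional one.
    return [("jmp_if", cond, true_target), ("jmp", false_target)]
```

Only the general case costs an extra instruction; when one of the two targets is the next instruction, a single conditional jump is enough.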
+Also in the new design, BPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and the new format are two operand instructions,
+which helps to do one-to-one mapping between BPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic,
+its content is defined by a specific use case. For seccomp register R1 points
+to seccomp_data, for converted BPF filters R1 points to a skb.
+
+A program that is translated internally consists of the following elements:
+
+  op:16, jt:8, jf:8, k:32    ==>    op:8, a_reg:4, x_reg:4, off:16, imm:32
+
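Both layouts shown above add up to 64 bits per instruction. The sketch below packs and unpacks the new layout (op:8, a_reg:4, x_reg:4, off:16, imm:32) with Python's struct module. Several details are assumptions made for illustration: little-endian byte order, a_reg stored in the low nibble of the register byte, and off and imm treated as signed. The names `encode_insn` and `decode_insn` are hypothetical.

```python
import struct

# One byte opcode, one byte holding both 4-bit registers,
# a 16-bit offset and a 32-bit immediate: 8 bytes in total.
_LAYOUT = struct.Struct("<BBhi")


def encode_insn(op, a_reg, x_reg, off, imm):
    """Pack one instruction of the new format into 8 bytes."""
    if not (0 <= a_reg < 16 and 0 <= x_reg < 16):
        raise ValueError("register numbers must fit in 4 bits")
    return _LAYOUT.pack(op, a_reg | (x_reg << 4), off, imm)


def decode_insn(raw):
    """Unpack 8 bytes into (op, a_reg, x_reg, off, imm)."""
    op, regs, off, imm = _LAYOUT.unpack(raw)
    return op, regs & 0x0F, regs >> 4, off, imm
```

Because the new format is the same size as the original one, a converted program occupies the same number of bytes per instruction as its classic BPF source, although the number of instructions may differ.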
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
+
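The first safety step described above, a depth-first search that rejects loops, can be sketched on a toy program model. This is not the kernel's checker: instructions are modelled as tuples whose first element is "exit", "jmp" (unconditional, with a target index), "jcond" (target index or fall-through), or anything else for a plain instruction that falls through. The names `successors` and `has_loop` are hypothetical.

```python
def successors(prog, i):
    """Indices of the instructions that can execute right after prog[i]."""
    kind = prog[i][0]
    if kind == "exit":
        return []
    if kind == "jmp":
        return [prog[i][1]]
    if kind == "jcond":
        return [prog[i][1], i + 1]
    return [i + 1]


def has_loop(prog):
    """Depth-first search from insn 0; True if any path revisits its own ancestor."""
    if not prog:
        return False
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    color = [WHITE] * len(prog)
    color[0] = GREY
    stack = [(0, iter(successors(prog, 0)))]
    while stack:
        node, pending = stack[-1]
        for nxt in pending:
            if not 0 <= nxt < len(prog):
                raise ValueError("control flow leaves the program")
            if color[nxt] == GREY:
                return True               # back edge: a loop exists
            if color[nxt] == WHITE:
                color[nxt] = GREY
                stack.append((nxt, iter(successors(prog, nxt))))
                break
        else:
            # All successors explored; this node is off the current path.
            color[node] = BLACK
            stack.pop()
    return False
```

The search is iterative so that a program at the 4096 instruction limit does not run into Python's recursion limit. Forward jumps that merge again are accepted; only an edge back to an instruction on the current path counts as a loop.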
 Misc
 ----
 
@@ -561,3 +685,4 @@ the underlying architecture.
 
 Jay Schulist <jschlst@samba.org>
 Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt
index ad474ea07d07..ba1daea7f2e4 100644
--- a/Documentation/networking/gianfar.txt
+++ b/Documentation/networking/gianfar.txt
@@ -1,38 +1,8 @@
 The Gianfar Ethernet Driver
-Sysfs File description
 
 Author: Andy Fleming <afleming@freescale.com>
 Updated: 2005-07-28
 
-SYSFS
-
-Several of the features of the gianfar driver are controlled
-through sysfs files. These are:
-
-bd_stash:
-To stash RX Buffer Descriptors in the L2, echo 'on' or '1' to
-bd_stash, echo 'off' or '0' to disable
-
-rx_stash_len:
-To stash the first n bytes of the packet in L2, echo the number
-of bytes to buf_stash_len. echo 0 to disable.
-
-WARNING: You could really screw these up if you set them too low or high!
-fifo_threshold:
-To change the number of bytes the controller needs in the
-fifo before it starts transmission, echo the number of bytes to
-fifo_thresh. Range should be 0-511.
-
-fifo_starve:
-When the FIFO has less than this many bytes during a transmit, it
-enters starve mode, and increases the priority of TX memory
-transactions. To change, echo the number of bytes to
-fifo_starve. Range should be 0-511.
-
-fifo_starve_off:
-Once in starve mode, the FIFO remains there until it has this
-many bytes. To change, echo the number of bytes to
-fifo_starve_off. Range should be 0-511.
 
 CHECKSUM OFFLOADING
 
diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt
index 4ebbd659256f..43d3549366a0 100644
--- a/Documentation/networking/igb.txt
+++ b/Documentation/networking/igb.txt
@@ -36,54 +36,6 @@ Default Value: 0
36This parameter adds support for SR-IOV. It causes the driver to spawn up to 36This parameter adds support for SR-IOV. It causes the driver to spawn up to
37max_vfs worth of virtual function. 37max_vfs worth of virtual function.
38 38
39QueuePairs
40----------
41Valid Range: 0-1
42Default Value: 1 (TX and RX will be paired onto one interrupt vector)
43
44If set to 0, when MSI-X is enabled, the TX and RX will attempt to occupy
45separate vectors.
46
47This option can be overridden to 1 if there are not sufficient interrupts
48available. This can occur if any combination of RSS, VMDQ, and max_vfs
49results in more than 4 queues being used.
50
51Node
52----
53Valid Range: 0-n
54Default Value: -1 (off)
55
56 0 - n: where n is the number of the NUMA node that should be used to
57 allocate memory for this adapter port.
58 -1: uses the driver default of allocating memory on whichever processor is
59 running insmod/modprobe.
60
61 The Node parameter will allow you to pick which NUMA node you want to have
62 the adapter allocate memory from. All driver structures, in-memory queues,
63 and receive buffers will be allocated on the node specified. This parameter
64 is only useful when interrupt affinity is specified, otherwise some portion
65 of the time the interrupt could run on a different core than the memory is
66 allocated on, causing slower memory access and impacting throughput, CPU, or
67 both.
68
69EEE
70---
71Valid Range: 0-1
72Default Value: 1 (enabled)
73
74 A link between two EEE-compliant devices will result in periodic bursts of
75 data followed by long periods where in the link is in an idle state. This Low
76 Power Idle (LPI) state is supported in both 1Gbps and 100Mbps link speeds.
77 NOTE: EEE support requires autonegotiation.
78
79DMAC
80----
81Valid Range: 0-1
82Default Value: 1 (enabled)
83 Enables or disables DMA Coalescing feature.
84
85
86
87Additional Configurations 39Additional Configurations
88========================= 40=========================
89 41
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index ebf270719402..3544c98401fd 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -48,7 +48,7 @@ The MDIO bus
48 time, so it is safe for them to block, waiting for an interrupt to signal 48 time, so it is safe for them to block, waiting for an interrupt to signal
49 the operation is complete 49 the operation is complete
50 50
51 2) A reset function is necessary. This is used to return the bus to an 51 2) A reset function is optional. This is used to return the bus to an
52 initialized state. 52 initialized state.
53 53
54 3) A probe function is needed. This function should set up anything the bus 54 3) A probe function is needed. This function should set up anything the bus
@@ -253,16 +253,25 @@ Writing a PHY driver
253 253
254 Each driver consists of a number of function pointers: 254 Each driver consists of a number of function pointers:
255 255
256 soft_reset: perform a PHY software reset
256 config_init: configures PHY into a sane state after a reset. 257 config_init: configures PHY into a sane state after a reset.
257 For instance, a Davicom PHY requires descrambling disabled. 258 For instance, a Davicom PHY requires descrambling disabled.
258 probe: Allocate phy->priv, optionally refuse to bind. 259 probe: Allocate phy->priv, optionally refuse to bind.
259 PHY may not have been reset or had fixups run yet. 260 PHY may not have been reset or had fixups run yet.
260 suspend/resume: power management 261 suspend/resume: power management
261 config_aneg: Changes the speed/duplex/negotiation settings 262 config_aneg: Changes the speed/duplex/negotiation settings
263 aneg_done: Determines the auto-negotiation result
262 read_status: Reads the current speed/duplex/negotiation settings 264 read_status: Reads the current speed/duplex/negotiation settings
263 ack_interrupt: Clear a pending interrupt 265 ack_interrupt: Clear a pending interrupt
266 did_interrupt: Checks if the PHY generated an interrupt
264 config_intr: Enable or disable interrupts 267 config_intr: Enable or disable interrupts
265 remove: Does any driver take-down 268 remove: Does any driver take-down
269 ts_info: Queries about the HW timestamping status
270 hwtstamp: Set the PHY HW timestamping configuration
271 rxtstamp: Requests a receive timestamp at the PHY level for a 'skb'
272 txtstamp: Requests a transmit timestamp at the PHY level for a 'skb'
273 set_wol: Enable Wake-on-LAN at the PHY level
274 get_wol: Get the Wake-on-LAN status at the PHY level
266 275
267 Of these, only config_aneg and read_status are required to be 276 Of these, only config_aneg and read_status are required to be
268 assigned by the driver code. The rest are optional. Also, it is 277 assigned by the driver code. The rest are optional. Also, it is
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 5a61a240a652..0e30c7845b2b 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -102,13 +102,18 @@ Examples:
102 The 'minimum' MAC is what you set with dstmac. 102 The 'minimum' MAC is what you set with dstmac.
103 103
104 pgset "flag [name]" Set a flag to determine behaviour. Current flags 104 pgset "flag [name]" Set a flag to determine behaviour. Current flags
105 are: IPSRC_RND #IP Source is random (between min/max), 105 are: IPSRC_RND # IP source is random (between min/max)
106 IPDST_RND, UDPSRC_RND, 106 IPDST_RND # IP destination is random
107 UDPDST_RND, MACSRC_RND, MACDST_RND 107 UDPSRC_RND, UDPDST_RND,
108 MACSRC_RND, MACDST_RND
109 TXSIZE_RND, IPV6,
108 MPLS_RND, VID_RND, SVID_RND 110 MPLS_RND, VID_RND, SVID_RND
111 FLOW_SEQ,
109 QUEUE_MAP_RND # queue map random 112 QUEUE_MAP_RND # queue map random
110 QUEUE_MAP_CPU # queue map mirrors smp_processor_id() 113 QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
111 IPSEC # Make IPsec encapsulation for packet 114 UDPCSUM,
115 IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
116 NODE_ALLOC # node specific memory allocation
112 117
113 pgset spi SPI_VALUE Set specific SA used to transform packet. 118 pgset spi SPI_VALUE Set specific SA used to transform packet.
114 119
@@ -233,13 +238,22 @@ udp_dst_max
233 238
234flag 239flag
235 IPSRC_RND 240 IPSRC_RND
236 TXSIZE_RND
237 IPDST_RND 241 IPDST_RND
238 UDPSRC_RND 242 UDPSRC_RND
239 UDPDST_RND 243 UDPDST_RND
240 MACSRC_RND 244 MACSRC_RND
241 MACDST_RND 245 MACDST_RND
246 TXSIZE_RND
247 IPV6
248 MPLS_RND
249 VID_RND
250 SVID_RND
251 FLOW_SEQ
252 QUEUE_MAP_RND
253 QUEUE_MAP_CPU
254 UDPCSUM
242 IPSEC 255 IPSEC
256 NODE_ALLOC
243 257
244dst_min 258dst_min
245dst_max 259dst_max
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index b89bc82eed46..16a924c486bf 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -27,6 +27,8 @@ Contents of this document:
27 27
28 (*) AF_RXRPC kernel interface. 28 (*) AF_RXRPC kernel interface.
29 29
30 (*) Configurable parameters.
31
30 32
31======== 33========
32OVERVIEW 34OVERVIEW
@@ -864,3 +866,82 @@ The kernel interface functions are as follows:
864 866
865 This is used to allocate a null RxRPC key that can be used to indicate 867 This is used to allocate a null RxRPC key that can be used to indicate
866 anonymous security for a particular domain. 868 anonymous security for a particular domain.
869
870
871=======================
872CONFIGURABLE PARAMETERS
873=======================
874
875The RxRPC protocol driver has a number of configurable parameters that can be
876adjusted through sysctls in /proc/net/rxrpc/:
877
878 (*) req_ack_delay
879
880 The amount of time in milliseconds after receiving a packet with the
881 request-ack flag set before we honour the flag and actually send the
882 requested ack.
883
884 Usually the other side won't stop sending packets until the advertised
885 reception window is full (to a maximum of 255 packets), so delaying the
886 ACK permits several packets to be ACK'd in one go.
887
888 (*) soft_ack_delay
889
890 The amount of time in milliseconds after receiving a new packet before we
891 generate a soft-ACK to tell the sender that it doesn't need to resend.
892
893 (*) idle_ack_delay
894
895 The amount of time in milliseconds after all the packets currently in the
896 received queue have been consumed before we generate a hard-ACK to tell
897 the sender it can free its buffers, assuming no other reason occurs that
898 we would send an ACK.
899
900 (*) resend_timeout
901
902 The amount of time in milliseconds after transmitting a packet before we
903 transmit it again, assuming no ACK is received from the receiver telling
904 us they got it.
905
906 (*) max_call_lifetime
907
908 The maximum amount of time in seconds that a call may be in progress
909 before we preemptively kill it.
910
911 (*) dead_call_expiry
912
913 The amount of time in seconds before we remove a dead call from the call
914 list. Dead calls are kept around for a little while for the purpose of
915 repeating ACK and ABORT packets.
916
917 (*) connection_expiry
918
919 The amount of time in seconds after a connection was last used before we
920 remove it from the connection list. Whilst a connection is in existence,
921 it serves as a placeholder for negotiated security; when it is deleted,
922 the security must be renegotiated.
923
924 (*) transport_expiry
925
926 The amount of time in seconds after a transport was last used before we
927 remove it from the transport list. Whilst a transport is in existence, it
928 serves to anchor the peer data and keeps the connection ID counter.
929
930 (*) rxrpc_rx_window_size
931
932 The size of the receive window in packets. This is the maximum number of
933 unconsumed received packets we're willing to hold in memory for any
934 particular call.
935
936 (*) rxrpc_rx_mtu
937
938 The maximum packet MTU size that we're willing to receive in bytes. This
939 indicates to the peer whether we're willing to accept jumbo packets.
940
941 (*) rxrpc_rx_jumbo_max
942
943 The maximum number of packets that we're willing to accept in a jumbo
944 packet. Non-terminal packets in a jumbo packet must contain a four byte
945 header plus exactly 1412 bytes of data. The terminal packet must contain
946 a four byte header plus any amount of data. In any event, a jumbo packet
947 may not exceed rxrpc_rx_mtu in size.
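
The jumbo packet size rule at the end of the rxrpc.txt hunk above can be worked through with a little arithmetic. The sketch below takes the description at face value (a four byte header per packet, exactly 1412 data bytes in every non-terminal packet) and ignores any outer headers; the function and macro names are hypothetical, and the limits are passed in as plain arguments rather than read from the sysctls.

```c
#include <assert.h>
#include <stddef.h>

#define JUMBO_SUBHDR_LEN  4	/* four byte header per packet (from the text) */
#define JUMBO_SUBPKT_DATA 1412	/* data bytes in each non-terminal packet */

/*
 * Size of a jumbo packet made of 'npkts' packets, where the terminal
 * packet carries 'last_len' bytes of data.
 */
static inline size_t jumbo_size(unsigned int npkts, size_t last_len)
{
	assert(npkts >= 1);
	return (size_t)(npkts - 1) * (JUMBO_SUBHDR_LEN + JUMBO_SUBPKT_DATA)
		+ JUMBO_SUBHDR_LEN + last_len;
}

/*
 * Returns 1 if such a jumbo packet fits both limits: at most
 * 'rx_jumbo_max' packets and at most 'rx_mtu' bytes in total.
 */
static inline int jumbo_acceptable(unsigned int npkts, size_t last_len,
				   unsigned int rx_jumbo_max, size_t rx_mtu)
{
	return npkts >= 1 && npkts <= rx_jumbo_max &&
	       jumbo_size(npkts, last_len) <= rx_mtu;
}
```

For example, four packets with a full 1412 byte terminal packet come to 3 * 1416 + 4 + 1412 = 5664 bytes under these assumptions, so they would only be acceptable if the receive MTU limit is at least that large.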
diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt
index 7d11bb5dc30a..bdc4c0db51e1 100644
--- a/Documentation/networking/tcp.txt
+++ b/Documentation/networking/tcp.txt
@@ -30,7 +30,7 @@ A congestion control mechanism can be registered through functions in
30tcp_cong.c. The functions used by the congestion control mechanism are 30tcp_cong.c. The functions used by the congestion control mechanism are
31registered via passing a tcp_congestion_ops struct to 31registered via passing a tcp_congestion_ops struct to
32tcp_register_congestion_control. As a minimum name, ssthresh, 32tcp_register_congestion_control. As a minimum name, ssthresh,
33cong_avoid, min_cwnd must be valid. 33cong_avoid must be valid.
34 34
35Private data for a congestion control mechanism is stored in tp->ca_priv. 35Private data for a congestion control mechanism is stored in tp->ca_priv.
36tcp_ca(tp) returns a pointer to this space. This is preallocated space - it 36tcp_ca(tp) returns a pointer to this space. This is preallocated space - it
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 048c92b487f6..bc3554124903 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -202,6 +202,9 @@ Time stamps for outgoing packets are to be generated as follows:
202 and not free the skb. A driver not supporting hardware time stamping doesn't 202 and not free the skb. A driver not supporting hardware time stamping doesn't
203 do that. A driver must never touch sk_buff::tstamp! It is used to store 203 do that. A driver must never touch sk_buff::tstamp! It is used to store
204 software generated time stamps by the network subsystem. 204 software generated time stamps by the network subsystem.
205- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware
206 as possible. skb_tx_timestamp() provides a software time stamp if requested
207 and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
205- As soon as the driver has sent the packet and/or obtained a 208- As soon as the driver has sent the packet and/or obtained a
206 hardware time stamp for it, it passes the time stamp back by 209 hardware time stamp for it, it passes the time stamp back by
207 calling skb_hwtstamp_tx() with the original skb, the raw 210 calling skb_hwtstamp_tx() with the original skb, the raw
@@ -212,6 +215,3 @@ Time stamps for outgoing packets are to be generated as follows:
212 this would occur at a later time in the processing pipeline than other 215 this would occur at a later time in the processing pipeline than other
213 software time stamping and therefore could lead to unexpected deltas 216 software time stamping and therefore could lead to unexpected deltas
214 between time stamps. 217 between time stamps.
215- If the driver did not set the SKBTX_IN_PROGRESS flag (see above), then
216 dev_hard_start_xmit() checks whether software time stamping
217 is wanted as fallback and potentially generates the time stamp.
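
The fallback rule added in the timestamping.txt hunk above (a software time stamp is provided only if one was requested and SKBTX_IN_PROGRESS is not set) can be modelled in user space as a simple predicate. This is a sketch of the decision only, not kernel code: the flag names and bit values are hypothetical stand-ins, and the real definitions in the kernel headers may differ.

```c
#include <stdbool.h>

/*
 * Hypothetical flag bits for this model. They only mirror the roles
 * described in the text; they are not the kernel's SKBTX_* values.
 */
enum {
	MODEL_TX_SW_REQUESTED = 1 << 0,	/* a software time stamp was requested */
	MODEL_TX_IN_PROGRESS  = 1 << 1,	/* driver will deliver a hardware time
					 * stamp (cf. SKBTX_IN_PROGRESS) */
};

/*
 * Returns true when the software fallback applies: a time stamp was
 * requested and the driver did not announce a hardware one.
 */
static inline bool model_sw_tstamp_needed(unsigned int tx_flags)
{
	return (tx_flags & MODEL_TX_SW_REQUESTED) &&
	       !(tx_flags & MODEL_TX_IN_PROGRESS);
}
```

The point of moving this check next to the hand-off to hardware is that the software time stamp is then taken as late as possible, which keeps it closer to the moment the packet actually leaves.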