author | Netanel Belgazal <netanel@annapurnalabs.com> | 2016-08-10 07:03:22 -0400 |
---|---|---|
committer | David S. Miller <davem@davemloft.net> | 2016-08-12 20:12:08 -0400 |
commit | 1738cd3ed342294360d6a74d4e58800004bff854 (patch) | |
tree | 5de305a5b5832b3f926541f997770278dc0ac322 /Documentation/networking | |
parent | 4330ea798f087a5f1e1dc6bbabe2eab18a2b3b92 (diff) |
net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)
This is a driver for the ENA family of networking devices.
Signed-off-by: Netanel Belgazal <netanel@annapurnalabs.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/networking/ena.txt | 305 |
2 files changed, 307 insertions, 0 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 415154a487d0..a7697783ac4c 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX | |||
@@ -74,6 +74,8 @@ dns_resolver.txt | |||
74 | - The DNS resolver module allows kernel servies to make DNS queries. | 74 | - The DNS resolver module allows kernel servies to make DNS queries. |
75 | driver.txt | 75 | driver.txt |
76 | - Softnet driver issues. | 76 | - Softnet driver issues. |
77 | ena.txt | ||
78 | - info on Amazon's Elastic Network Adapter (ENA) | ||
77 | e100.txt | 79 | e100.txt |
78 | - info on Intel's EtherExpress PRO/100 line of 10/100 boards | 80 | - info on Intel's EtherExpress PRO/100 line of 10/100 boards |
79 | e1000.txt | 81 | e1000.txt |
diff --git a/Documentation/networking/ena.txt b/Documentation/networking/ena.txt new file mode 100644 index 000000000000..2b4b6f57e549 --- /dev/null +++ b/Documentation/networking/ena.txt | |||
@@ -0,0 +1,305 @@ | |||
1 | Linux kernel driver for Elastic Network Adapter (ENA) family: | ||
2 | ============================================================= | ||
3 | |||
4 | Overview: | ||
5 | ========= | ||
6 | ENA is a networking interface designed to make good use of modern CPU | ||
7 | features and system architectures. | ||
8 | |||
9 | The ENA device exposes a lightweight management interface with a | ||
10 | minimal set of memory mapped registers and extendable command set | ||
11 | through an Admin Queue. | ||
12 | |||
13 | The driver supports a range of ENA devices, is link-speed independent | ||
14 | (i.e., the same driver is used for 10GbE, 25GbE, 40GbE, etc.), and has | ||
15 | a negotiated and extendable feature set. | ||
16 | |||
17 | Some ENA devices support SR-IOV. This driver is used for both the | ||
18 | SR-IOV Physical Function (PF) and Virtual Function (VF) devices. | ||
19 | |||
20 | ENA devices enable high speed and low overhead network traffic | ||
21 | processing by providing multiple Tx/Rx queue pairs (the maximum number | ||
22 | is advertised by the device via the Admin Queue), a dedicated MSI-X | ||
23 | interrupt vector per Tx/Rx queue pair, adaptive interrupt moderation, | ||
24 | and CPU cacheline optimized data placement. | ||
25 | |||
26 | The ENA driver supports industry standard TCP/IP offload features such | ||
27 | as checksum offload and TCP transmit segmentation offload (TSO). | ||
28 | Receive-side scaling (RSS) is supported for multi-core scaling. | ||
29 | |||
30 | The ENA driver and its corresponding devices implement health | ||
31 | monitoring mechanisms such as watchdog, enabling the device and driver | ||
32 | to recover in a manner transparent to the application, as well as | ||
33 | debug logs. | ||
34 | |||
35 | Some of the ENA devices support a working mode called Low-latency | ||
36 | Queue (LLQ), which saves several more microseconds. | ||
37 | |||
38 | Supported PCI vendor ID/device IDs: | ||
39 | =================================== | ||
40 | 1d0f:0ec2 - ENA PF | ||
41 | 1d0f:1ec2 - ENA PF with LLQ support | ||
42 | 1d0f:ec20 - ENA VF | ||
43 | 1d0f:ec21 - ENA VF with LLQ support | ||
44 | |||
45 | ENA Source Code Directory Structure: | ||
46 | ==================================== | ||
47 | ena_com.[ch] - Management communication layer. This layer is | ||
48 | responsible for handling all the management | ||
49 | (admin) communication between the device and the | ||
50 | driver. | ||
51 | ena_eth_com.[ch] - Tx/Rx data path. | ||
52 | ena_admin_defs.h - Definition of ENA management interface. | ||
53 | ena_eth_io_defs.h - Definition of ENA data path interface. | ||
54 | ena_common_defs.h - Common definitions for ena_com layer. | ||
55 | ena_regs_defs.h - Definition of ENA PCI memory-mapped (MMIO) registers. | ||
56 | ena_netdev.[ch] - Main Linux kernel driver. | ||
57 | ena_sysfs.[ch] - Sysfs files. | ||
58 | ena_ethtool.c - ethtool callbacks. | ||
59 | ena_pci_id_tbl.h - Supported device IDs. | ||
60 | |||
61 | Management Interface: | ||
62 | ===================== | ||
63 | ENA management interface is exposed by means of: | ||
64 | - PCIe Configuration Space | ||
65 | - Device Registers | ||
66 | - Admin Queue (AQ) and Admin Completion Queue (ACQ) | ||
67 | - Asynchronous Event Notification Queue (AENQ) | ||
68 | |||
69 | ENA device MMIO Registers are accessed only during driver | ||
70 | initialization and are not involved in further normal device | ||
71 | operation. | ||
72 | |||
73 | AQ is used for submitting management commands, and the | ||
74 | results/responses are reported asynchronously through ACQ. | ||
75 | |||
76 | ENA introduces a very small set of management commands with room for | ||
77 | vendor-specific extensions. Most of the management operations are | ||
78 | framed in a generic Get/Set feature command. | ||
79 | |||
80 | The following admin queue commands are supported: | ||
81 | - Create I/O submission queue | ||
82 | - Create I/O completion queue | ||
83 | - Destroy I/O submission queue | ||
84 | - Destroy I/O completion queue | ||
85 | - Get feature | ||
86 | - Set feature | ||
87 | - Configure AENQ | ||
88 | - Get statistics | ||
89 | |||
90 | Refer to ena_admin_defs.h for the list of supported Get/Set Feature | ||
91 | properties. | ||
92 | |||
93 | The Asynchronous Event Notification Queue (AENQ) is a uni-directional | ||
94 | queue used by the ENA device to send to the driver events that cannot | ||
95 | be reported using ACQ. AENQ events are subdivided into groups. Each | ||
96 | group may have multiple syndromes, as shown below: | ||
97 | |||
98 | The events are: | ||
99 | Group                Syndrome | ||
100 | Link state change   - X - | ||
101 | Fatal error         - X - | ||
102 | Notification        Suspend traffic | ||
103 | Notification        Resume traffic | ||
104 | Keep-Alive          - X - | ||
105 | |||
106 | ACQ and AENQ share the same MSI-X vector. | ||
107 | |||
108 | Keep-Alive is a special mechanism that allows monitoring of the | ||
109 | device's health. The driver maintains a watchdog (WD) handler which, | ||
110 | if fired, logs the current state and statistics then resets and | ||
111 | restarts the ENA device and driver. A Keep-Alive event is delivered by | ||
112 | the device every second. The driver re-arms the WD upon reception of a | ||
113 | Keep-Alive event. A missed Keep-Alive event causes the WD handler to | ||
114 | fire. | ||
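
The keep-alive scheme above can be illustrated with a small user-space model. This is only a sketch: the class name, the `tick()` polling call and the 3-second timeout are assumptions made for the example and are not taken from the driver.

```python
class KeepAliveWatchdog:
    """Toy model of a watchdog that is re-armed by keep-alive events."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s      # hypothetical timeout, not the driver's value
        self.last_keep_alive = 0.0
        self.resets = 0

    def on_keep_alive(self, now):
        # Re-arm: remember when the device last reported that it is alive.
        self.last_keep_alive = now

    def tick(self, now):
        # Periodic check; fires when no keep-alive arrived within the timeout.
        if now - self.last_keep_alive > self.timeout_s:
            self.resets += 1            # stands in for "log state, reset, restart"
            self.last_keep_alive = now  # start counting again after the reset
            return True
        return False


wd = KeepAliveWatchdog(timeout_s=3.0)
for t in (1.0, 2.0, 3.0):               # the device sends one event per second
    wd.on_keep_alive(t)
    wd.tick(t + 0.5)                    # still within the timeout, nothing fires
fired = wd.tick(7.0)                    # events stopped after t = 3.0
```

In this model the handler fires only once the silence exceeds the timeout; how many missed events the real driver tolerates is not stated in the text.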
115 | |||
116 | Data Path Interface: | ||
117 | ==================== | ||
118 | I/O operations are based on Tx and Rx Submission Queues (Tx SQ and Rx | ||
119 | SQ, respectively). Each SQ has a completion queue (CQ) associated | ||
120 | with it. | ||
121 | |||
122 | The SQs and CQs are implemented as descriptor rings in contiguous | ||
123 | physical memory. | ||
124 | |||
125 | The ENA driver supports two Queue Operation modes for Tx SQs: | ||
126 | - Regular mode | ||
127 | * In this mode the Tx SQs reside in the host's memory. The ENA | ||
128 | device fetches the ENA Tx descriptors and packet data from host | ||
129 | memory. | ||
130 | - Low Latency Queue (LLQ) mode or "push-mode". | ||
131 | * In this mode the driver pushes the transmit descriptors and the | ||
132 | first 128 bytes of the packet directly to the ENA device memory | ||
133 | space. The rest of the packet payload is fetched by the | ||
134 | device. For this operation mode, the driver uses a dedicated PCI | ||
135 | device memory BAR, which is mapped with write-combine capability. | ||
136 | |||
137 | The Rx SQs support only the regular mode. | ||
138 | |||
139 | Note: Not all ENA devices support LLQ, and this feature is negotiated | ||
140 | with the device upon initialization. If the ENA device does not | ||
141 | support LLQ mode, the driver falls back to the regular mode. | ||
142 | |||
143 | The driver supports multi-queue for both Tx and Rx. This has various | ||
144 | benefits: | ||
145 | - Reduced CPU/thread/process contention on a given Ethernet interface. | ||
146 | - Cache miss rate on completion is reduced, particularly for data | ||
147 | cache lines that hold the sk_buff structures. | ||
148 | - Increased process-level parallelism when handling received packets. | ||
149 | - Increased data cache hit rate, by steering kernel processing of | ||
150 | packets to the CPU, where the application thread consuming the | ||
151 | packet is running. | ||
152 | - In hardware interrupt re-direction. | ||
153 | |||
154 | Interrupt Modes: | ||
155 | ================ | ||
156 | The driver assigns a single MSI-X vector per queue pair (for both Tx | ||
157 | and Rx directions). The driver assigns an additional dedicated MSI-X vector | ||
158 | for management (for ACQ and AENQ). | ||
159 | |||
160 | Management interrupt registration is performed when the Linux kernel | ||
161 | probes the adapter, and it is de-registered when the adapter is | ||
162 | removed. I/O queue interrupt registration is performed when the Linux | ||
163 | interface of the adapter is opened, and it is de-registered when the | ||
164 | interface is closed. | ||
165 | |||
166 | The management interrupt is named: | ||
167 | ena-mgmnt@pci:<PCI domain:bus:slot.function> | ||
168 | and for each queue pair, an interrupt is named: | ||
169 | <interface name>-Tx-Rx-<queue index> | ||
170 | |||
171 | The ENA device operates in auto-mask and auto-clear interrupt | ||
172 | modes. That is, once an MSI-X interrupt is delivered to the host, its Cause bit is | ||
173 | automatically cleared and the interrupt is masked. The interrupt is | ||
174 | unmasked by the driver after NAPI processing is complete. | ||
175 | |||
176 | Interrupt Moderation: | ||
177 | ===================== | ||
178 | The ENA driver and device can operate in conventional or adaptive interrupt | ||
179 | moderation mode. | ||
180 | |||
181 | In conventional mode the driver instructs the device to postpone interrupt | ||
182 | posting according to a static interrupt delay value. The interrupt delay | ||
183 | value can be configured through ethtool(8). The following ethtool | ||
184 | parameters are supported by the driver: tx-usecs, rx-usecs. | ||
185 | |||
186 | In adaptive interrupt moderation mode the interrupt delay value is | ||
187 | updated by the driver dynamically and adjusted every NAPI cycle | ||
188 | according to the traffic nature. | ||
189 | |||
190 | By default the ENA driver applies adaptive coalescing on Rx traffic and | ||
191 | conventional coalescing on Tx traffic. | ||
192 | |||
193 | Adaptive coalescing can be switched on/off through the ethtool(8) | ||
194 | adaptive-rx on|off parameter. | ||
195 | |||
196 | The driver chooses the interrupt delay value according to the number of | ||
197 | bytes and packets received between interrupt unmasking and interrupt | ||
198 | posting. The driver uses an interrupt delay table that subdivides the | ||
199 | range of received bytes/packets into 5 levels and assigns an interrupt | ||
200 | delay value to each level. | ||
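
A table lookup of this kind could be sketched as follows. Only the number of levels (5) comes from the text. The byte thresholds and delay values are invented for illustration, and the sketch looks at bytes only, whereas the driver considers both bytes and packets.

```python
from bisect import bisect_right

# Hypothetical table: four boundaries split the byte range into 5 levels.
BYTE_LIMITS = [1_500, 15_000, 150_000, 1_500_000]
DELAY_USECS = [0, 16, 32, 64, 128]      # one invented delay per level


def interrupt_delay(bytes_seen):
    """Return (level, delay in usecs) for the bytes seen since unmasking."""
    level = bisect_right(BYTE_LIMITS, bytes_seen)
    return level, DELAY_USECS[level]


light = interrupt_delay(0)
heavy = interrupt_delay(2_000_000)
```

Recomputing the level on every NAPI cycle, as the text describes, would let the delay follow the traffic: short delays for sparse traffic and longer delays under sustained load.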
201 | |||
202 | The user can enable/disable adaptive moderation, modify the interrupt | ||
203 | delay table and restore its default values through sysfs. | ||
204 | |||
205 | The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK | ||
206 | and can be configured by the ETHTOOL_STUNABLE command of the | ||
207 | SIOCETHTOOL ioctl. | ||
208 | |||
209 | SKB: | ||
210 | The driver allocates the SKB for each received frame during Rx handling | ||
211 | in NAPI context. The allocation method depends on the size of the packet. | ||
212 | If the frame length is larger than rx_copybreak, napi_get_frags() | ||
213 | is used; otherwise netdev_alloc_skb_ip_align() is used, the buffer | ||
214 | content is copied (by CPU) to the SKB, and the buffer is recycled. | ||
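
The copybreak decision can be modelled in a few lines. This is a user-space sketch, not driver code: `build_skb()`, the dictionary standing in for an SKB, the `free_bufs` list and the threshold of 256 bytes are all assumptions (the text names ENA_DEFAULT_RX_COPYBREAK without giving its value).

```python
def build_skb(rx_buf, length, free_bufs, rx_copybreak=256):
    """Return a dict standing in for an SKB built from one Rx buffer."""
    if length > rx_copybreak:
        # Large frame: the Rx buffer itself is attached to the SKB (frags path),
        # so it leaves the driver and is not recycled here.
        return {"path": "frags", "data": memoryview(rx_buf)[:length]}
    # Small frame: copy the payload, then recycle the Rx buffer for later use.
    skb = {"path": "copy", "data": bytes(rx_buf[:length])}
    free_bufs.append(rx_buf)
    return skb


free_bufs = []
small_buf = bytearray(2048)
small = build_skb(small_buf, 64, free_bufs)
large = build_skb(bytearray(2048), 1500, free_bufs)
```

The trade-off is a short CPU copy for small frames in exchange for avoiding an unmap and a fresh buffer allocation.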
215 | |||
216 | Statistics: | ||
217 | =========== | ||
218 | The user can obtain ENA device and driver statistics using ethtool. | ||
219 | The driver can collect regular or extended statistics (including | ||
220 | per-queue stats) from the device. | ||
221 | |||
222 | In addition the driver logs the stats to syslog upon device reset. | ||
223 | |||
224 | MTU: | ||
225 | ==== | ||
226 | The driver supports an arbitrarily large MTU with a maximum that is | ||
227 | negotiated with the device. The driver configures MTU using the | ||
228 | SetFeature command (ENA_ADMIN_MTU property). The user can change MTU | ||
229 | via ip(8) and similar legacy tools. | ||
230 | |||
231 | Stateless Offloads: | ||
232 | =================== | ||
233 | The ENA driver supports: | ||
234 | - TSO over IPv4/IPv6 | ||
235 | - TSO with ECN | ||
236 | - IPv4 header checksum offload | ||
237 | - TCP/UDP over IPv4/IPv6 checksum offloads | ||
238 | |||
239 | RSS: | ||
240 | ==== | ||
241 | - The ENA device supports RSS that allows flexible Rx traffic | ||
242 | steering. | ||
243 | - Toeplitz and CRC32 hash functions are supported. | ||
244 | - Different combinations of L2/L3/L4 fields can be configured as | ||
245 | inputs for hash functions. | ||
246 | - The driver configures RSS settings using the AQ SetFeature command | ||
247 | (ENA_ADMIN_RSS_HASH_FUNCTION, ENA_ADMIN_RSS_HASH_INPUT and | ||
248 | ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG properties). | ||
249 | - If the NETIF_F_RXHASH flag is set, the 32-bit result of the hash | ||
250 | function delivered in the Rx CQ descriptor is set in the received | ||
251 | SKB. | ||
252 | - The user can provide a hash key, hash function, and configure the | ||
253 | indirection table through ethtool(8). | ||
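
For readers unfamiliar with Toeplitz hashing, the following sketch shows the general technique together with an indirection-table lookup. It does not describe the ENA hardware: the 40-byte key is the sample key widely published in RSS documentation, and the field order of the hash input, the table size (128) and the queue count (8) are assumptions.

```python
import socket
import struct

# Widely published sample RSS key (40 bytes); not an ENA default.
SAMPLE_KEY = bytes.fromhex(
    "6d5a56da255b0ec24167253d43a38fb0d0ca2bcbae7b30b4"
    "77cb2da38030f20c6a42b73bbeac01fa"
)


def toeplitz_hash(key, data):
    """XOR together the 32-bit key windows selected by the set bits of data."""
    if len(key) < len(data) + 4:
        raise ValueError("key must be at least 4 bytes longer than the input")
    key_bits = int.from_bytes(key, "big")
    key_len = len(key) * 8
    result = 0
    for i in range(len(data) * 8):
        byte, bit = divmod(i, 8)
        if data[byte] & (0x80 >> bit):
            # Window of 32 key bits starting at bit position i.
            result ^= (key_bits >> (key_len - 32 - i)) & 0xFFFFFFFF
    return result


def ipv4_tcp_input(src_ip, dst_ip, src_port, dst_port):
    # Assumed field order: source address, destination address, ports.
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!HH", src_port, dst_port))


NUM_QUEUES = 8                                        # hypothetical
INDIRECTION_TABLE = [i % NUM_QUEUES for i in range(128)]


def select_queue(flow_hash):
    return INDIRECTION_TABLE[flow_hash % len(INDIRECTION_TABLE)]


flow = ipv4_tcp_input("192.0.2.1", "198.51.100.7", 40000, 443)
rx_queue = select_queue(toeplitz_hash(SAMPLE_KEY, flow))
```

Because the hash depends only on the key and the selected header fields, all packets of one flow map to the same table entry and therefore to the same Rx queue.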
254 | |||
255 | DATA PATH: | ||
256 | ========== | ||
257 | Tx: | ||
258 | --- | ||
259 | ena_start_xmit() is called by the stack. This function does the following: | ||
260 | - Maps data buffers (skb->data and frags). | ||
261 | - Populates ena_buf for the push buffer (if the driver and device are | ||
262 | in push mode.) | ||
263 | - Prepares ENA bufs for the remaining frags. | ||
264 | - Allocates a new request ID from the empty req_id ring. The request | ||
265 | ID is the index of the packet in the Tx info. This is used for | ||
266 | out-of-order TX completions. | ||
267 | - Adds the packet to the proper place in the Tx ring. | ||
268 | - Calls ena_com_prepare_tx(), an ENA communication layer function that converts | ||
269 | the ena_bufs to ENA descriptors (and adds meta ENA descriptors as | ||
270 | needed.) | ||
271 | * This function also copies the ENA descriptors and the push buffer | ||
272 | to the Device memory space (if in push mode.) | ||
273 | - Writes doorbell to the ENA device. | ||
274 | - When the ENA device finishes sending the packet, a completion | ||
275 | interrupt is raised. | ||
276 | - The interrupt handler schedules NAPI. | ||
277 | - The ena_clean_tx_irq() function is called. This function handles the | ||
278 | completion descriptors generated by the ENA, with a single | ||
279 | completion descriptor per completed packet. | ||
280 | * req_id is retrieved from the completion descriptor. The tx_info of | ||
281 | the packet is retrieved via the req_id. The data buffers are | ||
282 | unmapped and req_id is returned to the empty req_id ring. | ||
283 | * The function stops when the completion descriptors are completed or | ||
284 | the budget is reached. | ||
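
The request-ID bookkeeping in the Tx steps above can be sketched as a free-ID ring plus a per-ID info table. This is a simplified user-space model: the class and method names are hypothetical, DMA mapping and descriptors are left out, and returning None on a full ring is an assumption about how a caller would be told to back off.

```python
from collections import deque


class TxRing:
    """Toy model of req_id allocation and out-of-order Tx completion."""

    def __init__(self, size):
        self.free_req_ids = deque(range(size))   # the "empty req_id ring"
        self.tx_info = [None] * size             # indexed by req_id

    def xmit(self, skb):
        if not self.free_req_ids:
            return None                          # no free ID: ring is full
        req_id = self.free_req_ids.popleft()
        self.tx_info[req_id] = skb               # remember what to release later
        return req_id

    def clean(self, completed_req_ids, budget):
        """Process at most `budget` completions, in the order they arrive."""
        done = []
        for req_id in completed_req_ids[:budget]:
            done.append(self.tx_info[req_id])    # look up the packet by req_id
            self.tx_info[req_id] = None
            self.free_req_ids.append(req_id)     # hand the ID back to the ring
        return done


ring = TxRing(4)
ids = [ring.xmit(f"pkt{i}") for i in range(4)]
overflow = ring.xmit("pkt4")                     # ring is full at this point
done = ring.clean([2, 0, 3], budget=2)           # completions arrive out of order
reused = ring.xmit("pkt4")                       # gets the first returned ID
```

Since each completion carries its req_id, the order in which the device reports completions does not need to match the submission order.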
285 | |||
286 | Rx: | ||
287 | --- | ||
288 | - A packet is received from the ENA device. | ||
289 | - The interrupt handler schedules NAPI. | ||
290 | - The ena_clean_rx_irq() function is called. This function calls | ||
291 | ena_com_rx_pkt(), an ENA communication layer function, which returns the | ||
292 | number of descriptors used for a new unhandled packet, and zero if | ||
293 | no new packet is found. | ||
294 | - Then it calls the ena_rx_skb() function. | ||
295 | - ena_rx_skb() checks packet length: | ||
296 | * If the packet is small (len < rx_copybreak), the driver allocates | ||
297 | a SKB for the new packet, and copies the packet payload into the | ||
298 | SKB data buffer. | ||
299 | - In this way the original data buffer is not passed to the stack | ||
300 | and is reused for future Rx packets. | ||
301 | * Otherwise the function unmaps the Rx buffer, then allocates the | ||
302 | new SKB structure and hooks the Rx buffer to the SKB frags. | ||
303 | - The new SKB is updated with the necessary information (protocol, | ||
304 | checksum hw verify result, etc.), and then passed to the network | ||
305 | stack, using the NAPI interface function napi_gro_receive(). | ||