author | Rusty Russell <rusty@rustcorp.com.au> | 2013-03-12 01:07:33 -0400 |
---|---|---|
committer | Rusty Russell <rusty@rustcorp.com.au> | 2013-03-12 01:15:18 -0400 |
commit | 29266e2e29f1f87b93321e56812f9fb16f91cb6d (patch) | |
tree | 794a50dd151a4b0a5eeaa41e38232817958d2004 | |
parent | 9d9598b81c5c05495009e81ac0508ec8d1558015 (diff) |
Remove Documentation/virtual/virtio-spec.txt
We haven't been keeping it in sync, so just remove it.
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
-rw-r--r-- | Documentation/virtual/00-INDEX | 3 | ||||
-rw-r--r-- | Documentation/virtual/virtio-spec.txt | 3210 |
2 files changed, 0 insertions, 3213 deletions
diff --git a/Documentation/virtual/00-INDEX b/Documentation/virtual/00-INDEX
index 924bd462675e..e952d30bbf0f 100644
--- a/Documentation/virtual/00-INDEX
+++ b/Documentation/virtual/00-INDEX
@@ -6,6 +6,3 @@ kvm/ | |||
6 | - Kernel Virtual Machine. See also http://linux-kvm.org | 6 | - Kernel Virtual Machine. See also http://linux-kvm.org |
7 | uml/ | 7 | uml/ |
8 | - User Mode Linux, builds/runs Linux kernel as a userspace program. | 8 | - User Mode Linux, builds/runs Linux kernel as a userspace program. |
9 | virtio.txt | ||
10 | - Text version of draft virtio spec. | ||
11 | See http://ozlabs.org/~rusty/virtio-spec | ||
diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt
deleted file mode 100644
index 0d6ec85481cb..000000000000
--- a/Documentation/virtual/virtio-spec.txt
+++ /dev/null
@@ -1,3210 +0,0 @@ | |||
1 | [Generated file: see http://ozlabs.org/~rusty/virtio-spec/] | ||
2 | Virtio PCI Card Specification | ||
3 | v0.9.5 DRAFT | ||
4 | - | ||
5 | |||
6 | Rusty Russell <rusty@rustcorp.com.au> IBM Corporation (Editor) | ||
7 | |||
8 | 2012 May 7. | ||
9 | |||
10 | Purpose and Description | ||
11 | |||
12 | This document describes the specifications of the “virtio” family | ||
13 | of PCI devices. These devices | ||
14 | are found in virtual environments, | ||
15 | yet by design they are not all that different from physical PCI | ||
16 | devices, and this document treats them as such. This allows the | ||
17 | guest to use standard PCI drivers and discovery mechanisms. | ||
18 | |||
19 | The purpose of virtio and this specification is that virtual | ||
20 | environments and guests should have a straightforward, efficient, | ||
21 | standard and extensible mechanism for virtual devices, rather | ||
22 | than boutique per-environment or per-OS mechanisms. | ||
23 | |||
24 | Straightforward: Virtio PCI devices use normal PCI mechanisms | ||
25 | of interrupts and DMA which should be familiar to any device | ||
26 | driver author. There is no exotic page-flipping or COW | ||
27 | mechanism: it's just a PCI device.[footnote: | ||
28 | This lack of page-sharing implies that the implementation of the | ||
29 | device (e.g. the hypervisor or host) needs full access to the | ||
30 | guest memory. Communication with untrusted parties (i.e. | ||
31 | inter-guest communication) requires copying. | ||
32 | ] | ||
33 | |||
34 | Efficient: Virtio PCI devices consist of rings of descriptors | ||
35 | for input and output, which are neatly separated to avoid cache | ||
36 | effects from both guest and device writing to the same cache | ||
37 | lines. | ||
38 | |||
39 | Standard: Virtio PCI makes no assumptions about the environment | ||
40 | in which it operates, beyond supporting PCI. In fact the virtio | ||
41 | devices specified in the appendices do not require PCI at all: | ||
42 | they have been implemented on non-PCI buses.[footnote: | ||
43 | The Linux implementation further separates the PCI virtio code | ||
44 | from the specific virtio drivers: these drivers are shared with | ||
45 | the non-PCI implementations (currently lguest and S/390). | ||
46 | ] | ||
47 | |||
48 | Extensible: Virtio PCI devices contain feature bits which are | ||
49 | acknowledged by the guest operating system during device setup. | ||
50 | This allows forwards and backwards compatibility: the device | ||
51 | offers all the features it knows about, and the driver | ||
52 | acknowledges those it understands and wishes to use. | ||
53 | |||
54 | Virtqueues | ||
55 | |||
56 | The mechanism for bulk data transport on virtio PCI devices is | ||
57 | pretentiously called a virtqueue. Each device can have zero or | ||
58 | more virtqueues: for example, the network device has one for | ||
59 | transmit and one for receive. | ||
60 | |||
61 | Each virtqueue occupies two or more physically-contiguous pages | ||
62 | (defined, for the purposes of this specification, as 4096 bytes), | ||
63 | and consists of three parts: | ||
64 | |||
65 | |||
66 | +-------------------+-----------------------------------+-----------+ | ||
67 | | Descriptor Table | Available Ring (padding) | Used Ring | | ||
68 | +-------------------+-----------------------------------+-----------+ | ||
69 | |||
70 | |||
71 | When the driver wants to send a buffer to the device, it fills in | ||
72 | a slot in the descriptor table (or chains several together), and | ||
73 | writes the descriptor index into the available ring. It then | ||
74 | notifies the device. When the device has finished a buffer, it | ||
75 | writes the descriptor into the used ring, and sends an interrupt. | ||
76 | |||
77 | Specification | ||
78 | |||
79 | PCI Discovery | ||
80 | |||
81 | Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 | ||
82 | through 0x103F inclusive is a virtio device[footnote: | ||
83 | The actual value within this range is ignored | ||
84 | ]. The device must also have a Revision ID of 0 to match this | ||
85 | specification. | ||
86 | |||
87 | The Subsystem Device ID indicates which virtio device is | ||
88 | supported by the device. The Subsystem Vendor ID should reflect | ||
89 | the PCI Vendor ID of the environment (it's currently only used | ||
90 | for informational purposes by the guest). | ||
91 | |||
92 | |||
93 | +----------------------+--------------------+---------------+ | ||
94 | | Subsystem Device ID | Virtio Device | Specification | | ||
95 | +----------------------+--------------------+---------------+ | ||
96 | +----------------------+--------------------+---------------+ | ||
97 | | 1 | network card | Appendix C | | ||
98 | +----------------------+--------------------+---------------+ | ||
99 | | 2 | block device | Appendix D | | ||
100 | +----------------------+--------------------+---------------+ | ||
101 | | 3 | console | Appendix E | | ||
102 | +----------------------+--------------------+---------------+ | ||
103 | | 4 | entropy source | Appendix F | | ||
104 | +----------------------+--------------------+---------------+ | ||
105 | | 5 | memory ballooning | Appendix G | | ||
106 | +----------------------+--------------------+---------------+ | ||
107 | | 6 | ioMemory | - | | ||
108 | +----------------------+--------------------+---------------+ | ||
109 | | 7 | rpmsg | Appendix H | | ||
110 | +----------------------+--------------------+---------------+ | ||
111 | | 8 | SCSI host | Appendix I | | ||
112 | +----------------------+--------------------+---------------+ | ||
113 | | 9 | 9P transport | - | | ||
114 | +----------------------+--------------------+---------------+ | ||
115 | | 10 | mac80211 wlan | - | | ||
116 | +----------------------+--------------------+---------------+ | ||
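
The discovery rule above can be sketched in C as follows. This is a minimal illustration, not code from the specification: the macro and function names are hypothetical, and the caller is assumed to have already read the Vendor ID, Device ID and Revision ID from PCI configuration space.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the PCI Discovery text; the names are hypothetical. */
#define VIRTIO_PCI_VENDOR_ID     0x1AF4
#define VIRTIO_PCI_DEVICE_ID_MIN 0x1000
#define VIRTIO_PCI_DEVICE_ID_MAX 0x103F
#define VIRTIO_PCI_REVISION      0

/* Hypothetical helper: true if the IDs identify a virtio device that
 * matches this revision of the specification. */
static bool is_virtio_pci_device(uint16_t vendor, uint16_t device,
                                 uint8_t revision)
{
    return vendor == VIRTIO_PCI_VENDOR_ID &&
           device >= VIRTIO_PCI_DEVICE_ID_MIN &&
           device <= VIRTIO_PCI_DEVICE_ID_MAX &&
           revision == VIRTIO_PCI_REVISION;
}
```

Once a device matches, the Subsystem Device ID (not the Device ID) selects the device type from the table above.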
117 | |||
118 | |||
119 | Device Configuration | ||
120 | |||
121 | To configure the device, we use the first I/O region of the PCI | ||
122 | device. This contains a virtio header followed by a | ||
123 | device-specific region. | ||
124 | |||
125 | There may be different widths of accesses to the I/O region; the “ | ||
126 | natural” access method for each field in the virtio header must | ||
127 | be used (i.e. 32-bit accesses for 32-bit fields, etc), but the | ||
128 | device-specific region can be accessed using any width accesses, | ||
129 | and should obtain the same results. | ||
130 | |||
131 | Note that this is possible because while the virtio header is PCI | ||
132 | (i.e. little) endian, the device-specific region is encoded in | ||
133 | the native endian of the guest (where such distinction is | ||
134 | applicable). | ||
135 | |||
136 | Device Initialization Sequence<sub:Device-Initialization-Sequence> | ||
137 | |||
138 | We start with an overview of device initialization, then expand | ||
139 | on the details of the device and how each step is performed. | ||
140 | |||
141 | 1. Reset the device. This is not required on initial start up. | ||
142 | |||
143 | 2. The ACKNOWLEDGE status bit is set: we have noticed the device. | ||
144 | |||
145 | 3. The DRIVER status bit is set: we know how to drive the device. | ||
146 | |||
147 | 4. Device-specific setup, including reading the Device Feature | ||
148 | Bits, discovery of virtqueues for the device, optional MSI-X | ||
149 | setup, and reading and possibly writing the virtio | ||
150 | configuration space. | ||
151 | |||
152 | 5. The subset of Device Feature Bits understood by the driver is | ||
153 | written to the device. | ||
154 | |||
155 | 6. The DRIVER_OK status bit is set. | ||
156 | |||
157 | 7. The device can now be used (i.e. buffers added to the | ||
158 | virtqueues)[footnote: | ||
159 | Historically, drivers have used the device before steps 5 and 6. | ||
160 | This is only allowed if the driver does not use any features | ||
161 | which would alter this early use of the device. | ||
162 | ] | ||
163 | |||
164 | If any of these steps go irrecoverably wrong, the guest should | ||
165 | set the FAILED status bit to indicate that it has given up on the | ||
166 | device (it can reset the device later to restart if desired). | ||
167 | |||
168 | We now cover the fields required for general setup in detail. | ||
169 | |||
170 | Virtio Header | ||
171 | |||
172 | The virtio header looks as follows: | ||
173 | |||
174 | |||
175 | +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ | ||
176 | | Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 | | ||
177 | +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ | ||
178 | | Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R | | ||
179 | +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ | ||
180 | | Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR | | ||
181 | | || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status | | ||
182 | +------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+ | ||
183 | |||
184 | |||
185 | If MSI-X is enabled for the device, two additional fields | ||
186 | immediately follow this header:[footnote: | ||
187 | i.e. once you enable MSI-X on the device, the other fields move. | ||
188 | If you turn it off again, they move back! | ||
189 | ] | ||
190 | |||
191 | |||
192 | +------------++----------------+--------+ | ||
193 | | Bits || 16 | 16 | | ||
194 | +------------++----------------+--------+ | ||
195 | +------------++----------------+--------+ | ||
196 | | Read/Write || R+W | R+W | | ||
197 | +------------++----------------+--------+ | ||
198 | | Purpose || Configuration | Queue | | ||
199 | | (MSI-X) || Vector | Vector | | ||
200 | +------------++----------------+--------+ | ||
201 | |||
202 | |||
203 | Immediately following these general headers, there may be | ||
204 | device-specific headers: | ||
205 | |||
206 | |||
207 | +------------++--------------------+ | ||
208 | | Bits || Device Specific | | ||
209 | +------------++--------------------+ | ||
210 | +------------++--------------------+ | ||
211 | | Read/Write || Device Specific | | ||
212 | +------------++--------------------+ | ||
213 | | Purpose || Device Specific... | | ||
214 | | || | | ||
215 | +------------++--------------------+ | ||
216 | |||
217 | |||
218 | Device Status | ||
219 | |||
220 | The Device Status field is updated by the guest to indicate its | ||
221 | progress. This provides a simple low-level diagnostic: it's most | ||
222 | useful to imagine the status bits hooked up to traffic lights on the console | ||
223 | indicating the status of each device. | ||
224 | |||
225 | The device can be reset by writing a 0 to this field, otherwise | ||
226 | at least one bit should be set: | ||
227 | |||
228 | ACKNOWLEDGE (1) Indicates that the guest OS has found the | ||
229 | device and recognized it as a valid virtio device. | ||
230 | |||
231 | DRIVER (2) Indicates that the guest OS knows how to drive the | ||
232 | device. Under Linux, drivers can be loadable modules so there | ||
233 | may be a significant (or infinite) delay before setting this | ||
234 | bit. | ||
235 | |||
236 | DRIVER_OK (4) Indicates that the driver is set up and ready to | ||
237 | drive the device. | ||
238 | |||
239 | FAILED (128) Indicates that something went wrong in the guest, | ||
240 | and it has given up on the device. This could be an internal | ||
241 | error, or the driver didn't like the device for some reason, or | ||
242 | even a fatal error during device operation. The device must be | ||
243 | reset before attempting to re-initialize. | ||
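
As a rough sketch, the status values above can be written as C constants. The names below are hypothetical, and the helper assumes that status bits accumulate (each new bit is ORed into the current value), which the text implies but does not state outright.

```c
#include <stdint.h>

/* Device Status bit values from the list above; names are hypothetical. */
#define VSTATUS_RESET       0
#define VSTATUS_ACKNOWLEDGE 1
#define VSTATUS_DRIVER      2
#define VSTATUS_DRIVER_OK   4
#define VSTATUS_FAILED      128

/* Hypothetical helper: the value to write to the Device Status field
 * after the guest sets one more bit. Assumes bits accumulate. */
static uint8_t vstatus_add(uint8_t status, uint8_t bit)
{
    return (uint8_t)(status | bit);
}
```

Under that assumption, a driver that completes initialization writes 1, then 3, then 7 to the Device Status field.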
244 | |||
245 | Feature Bits<sub:Feature-Bits> | ||
246 | |||
247 | The first configuration field indicates the features that the | ||
248 | device supports. The bits are allocated as follows: | ||
249 | |||
250 | 0 to 23 Feature bits for the specific device type | ||
251 | |||
252 | 24 to 31 Feature bits reserved for extensions to the queue and | ||
253 | feature negotiation mechanisms | ||
254 | |||
255 | For example, feature bit 0 for a network device (i.e. Subsystem | ||
256 | Device ID 1) indicates that the device supports checksumming of | ||
257 | packets. | ||
258 | |||
259 | The feature bits are negotiated: the device lists all the | ||
260 | features it understands in the Device Features field, and the | ||
261 | guest writes the subset that it understands into the Guest | ||
262 | Features field. The only way to renegotiate is to reset the | ||
263 | device. | ||
264 | |||
265 | In particular, new fields in the device configuration header are | ||
266 | indicated by offering a feature bit, so the guest can check | ||
267 | before accessing that part of the configuration space. | ||
268 | |||
269 | This allows for forwards and backwards compatibility: if the | ||
270 | device is enhanced with a new feature bit, older guests will not | ||
271 | write that feature bit back to the Guest Features field and it | ||
272 | can go into backwards compatibility mode. Similarly, if a guest | ||
273 | is enhanced with a feature that the device doesn't support, it | ||
274 | will not see that feature bit in the Device Features field and | ||
275 | can go into backwards compatibility mode (or, for poor | ||
276 | implementations, set the FAILED Device Status bit). | ||
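
The negotiation described above amounts to a bitwise intersection. The following sketch assumes the driver keeps a mask of the feature bits it understands; both function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: the value the guest writes to the Guest Features
 * field, given what the device offers and what the driver understands. */
static uint32_t negotiate_features(uint32_t device_features,
                                   uint32_t driver_supported)
{
    return device_features & driver_supported;
}

/* Hypothetical helper: test one negotiated feature bit (bit must be 0..31). */
static int has_feature(uint32_t features, unsigned int bit)
{
    return (int)((features >> bit) & 1u);
}
```

A feature bit the driver does not know is simply never written back, which is what lets an older guest run against a newer device.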
277 | |||
278 | Configuration/Queue Vectors | ||
279 | |||
280 | When MSI-X capability is present and enabled in the device | ||
281 | (through standard PCI configuration space) 4 bytes at byte offset | ||
282 | 20 are used to map configuration change and queue interrupts to | ||
283 | MSI-X vectors. In this case, the ISR Status field is unused, and | ||
284 | device specific configuration starts at byte offset 24 in virtio | ||
285 | header structure. When MSI-X capability is not enabled, device | ||
286 | specific configuration starts at byte offset 20 in virtio header. | ||
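
The byte offsets implied by the header tables can be sketched as constants. The enumerator names are hypothetical; the offsets follow from the field widths in the virtio header table and agree with the offsets 20 and 24 given in the text.

```c
#include <stdbool.h>

/* Byte offsets within the first I/O region, derived from the field
 * widths in the virtio header table. Names are hypothetical. */
enum {
    VHDR_DEVICE_FEATURES = 0,   /* 32 bits, R   */
    VHDR_GUEST_FEATURES  = 4,   /* 32 bits, R+W */
    VHDR_QUEUE_ADDRESS   = 8,   /* 32 bits, R+W */
    VHDR_QUEUE_SIZE      = 12,  /* 16 bits, R   */
    VHDR_QUEUE_SELECT    = 14,  /* 16 bits, R+W */
    VHDR_QUEUE_NOTIFY    = 16,  /* 16 bits, R+W */
    VHDR_DEVICE_STATUS   = 18,  /*  8 bits, R+W */
    VHDR_ISR_STATUS      = 19,  /*  8 bits, R   */
    VHDR_MSIX_CONFIG_VEC = 20,  /* 16 bits, only when MSI-X is enabled */
    VHDR_MSIX_QUEUE_VEC  = 22   /* 16 bits, only when MSI-X is enabled */
};

/* Hypothetical helper: where the device-specific region starts. */
static unsigned int device_config_offset(bool msix_enabled)
{
    return msix_enabled ? 24u : 20u;
}
```

Because the device-specific region moves when MSI-X is toggled, a driver would need to recompute this offset rather than cache it across that change.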
287 | |||
288 | Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of | ||
289 | Configuration/Queue Vector registers, maps interrupts triggered | ||
290 | by the configuration change/selected queue events respectively to | ||
291 | the corresponding MSI-X vector. To disable interrupts for a | ||
292 | specific event type, unmap it by writing a special NO_VECTOR | ||
293 | value: | ||
294 | |||
295 | /* Vector value used to disable MSI for queue */ | ||
296 | |||
297 | #define VIRTIO_MSI_NO_VECTOR 0xffff | ||
298 | |||
299 | Reading these registers returns vector mapped to a given event, | ||
300 | or NO_VECTOR if unmapped. All queue and configuration change | ||
301 | events are unmapped by default. | ||
302 | |||
303 | Note that mapping an event to vector might require allocating | ||
304 | internal device resources, and might fail. Devices report such | ||
305 | failures by returning the NO_VECTOR value when the relevant | ||
306 | Vector field is read. After mapping an event to vector, the | ||
307 | driver must verify success by reading the Vector field value: on | ||
308 | success, the previously written value is returned, and on | ||
309 | failure, NO_VECTOR is returned. If a mapping failure is detected, | ||
310 | the driver can retry mapping with fewer vectors, or disable MSI-X. | ||
311 | |||
312 | Virtqueue Configuration<sec:Virtqueue-Configuration> | ||
313 | |||
314 | As a device can have zero or more virtqueues for bulk data | ||
315 | transport (for example, the network driver has two), the driver | ||
316 | needs to configure them as part of the device-specific | ||
317 | configuration. | ||
318 | |||
319 | This is done as follows, for each virtqueue a device has: | ||
320 | |||
321 | Write the virtqueue index (first queue is 0) to the Queue | ||
322 | Select field. | ||
323 | |||
324 | Read the virtqueue size from the Queue Size field, which is | ||
325 | always a power of 2. This controls how big the virtqueue is | ||
326 | (see below). If this field is 0, the virtqueue does not exist. | ||
327 | |||
328 | Allocate and zero virtqueue in contiguous physical memory, on a | ||
329 | 4096 byte alignment. Write the physical address, divided by | ||
330 | 4096 to the Queue Address field.[footnote: | ||
331 | The 4096 is based on the x86 page size, but it's also large | ||
332 | enough to ensure that the separate parts of the virtqueue are on | ||
333 | separate cache lines. | ||
334 | ] | ||
335 | |||
336 | Optionally, if MSI-X capability is present and enabled on the | ||
337 | device, select a vector to use to request interrupts triggered | ||
338 | by virtqueue events. Write the MSI-X Table entry number | ||
339 | corresponding to this vector in Queue Vector field. Read the | ||
340 | Queue Vector field: on success, previously written value is | ||
341 | returned; on failure, NO_VECTOR value is returned. | ||
342 | |||
343 | The Queue Size field controls the total number of bytes required | ||
344 | for the virtqueue according to the following formula: | ||
345 | |||
346 | #define ALIGN(x) (((x) + 4095) & ~4095) | ||
347 | |||
348 | static inline unsigned vring_size(unsigned int qsz) | ||
349 | |||
350 | { | ||
351 | |||
352 | return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2 | ||
353 | + qsz)) | ||
354 | |||
355 | + ALIGN(sizeof(struct vring_used_elem)*qsz); | ||
356 | |||
357 | } | ||
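
As a worked example, here is a self-contained version of the formula above. It restates the two structures from this document using <stdint.h> types so that it compiles on its own; the VRING_ALIGN name is an assumption made to avoid clashing with other ALIGN macros.

```c
#include <stddef.h>
#include <stdint.h>

/* Structures restated from this document (16 and 8 bytes respectively). */
struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct vring_used_elem {
    uint32_t id;
    uint32_t len;
};

/* Round up to the next 4096-byte boundary. */
#define VRING_ALIGN(x) (((x) + (size_t)4095) & ~(size_t)4095)

/* Same formula as in the text: descriptor table plus available ring,
 * padded to 4096, followed by the used ring entries, padded to 4096. */
static size_t vring_size(unsigned int qsz)
{
    return VRING_ALIGN(sizeof(struct vring_desc) * qsz
                       + sizeof(uint16_t) * (2 + qsz))
         + VRING_ALIGN(sizeof(struct vring_used_elem) * qsz);
}
```

For a Queue Size of 256, the first part is 4096 + 516 = 4612 bytes, which rounds up to 8192, and the second part is 2048 bytes, which rounds up to 4096, giving 12288 bytes (three pages) in total.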
358 | |||
359 | This currently wastes some space with padding, but also allows | ||
360 | future extensions. The virtqueue layout structure looks like this | ||
361 | (qsz is the Queue Size field, which is a variable, so this code | ||
362 | won't compile): | ||
363 | |||
364 | struct vring { | ||
365 | |||
366 | /* The actual descriptors (16 bytes each) */ | ||
367 | |||
368 | struct vring_desc desc[qsz]; | ||
369 | |||
370 | |||
371 | |||
372 | /* A ring of available descriptor heads with free-running | ||
373 | index. */ | ||
374 | |||
375 | struct vring_avail avail; | ||
376 | |||
377 | |||
378 | |||
379 | // Padding to the next 4096 boundary. | ||
380 | |||
381 | char pad[]; | ||
382 | |||
383 | |||
384 | |||
385 | // A ring of used descriptor heads with free-running index. | ||
386 | |||
387 | struct vring_used used; | ||
388 | |||
389 | }; | ||
390 | |||
391 | A Note on Virtqueue Endianness | ||
392 | |||
393 | Note that the endian of these fields and everything else in the | ||
394 | virtqueue is the native endian of the guest, not little-endian as | ||
395 | PCI normally is. This makes for simpler guest code, and it is | ||
396 | assumed that the host already has to be deeply aware of the guest | ||
397 | endian so such an “endian-aware” device is not a significant | ||
398 | issue. | ||
399 | |||
400 | Descriptor Table | ||
401 | |||
402 | The descriptor table refers to the buffers the guest is using for | ||
403 | the device. The addresses are physical addresses, and the buffers | ||
404 | can be chained via the next field. Each descriptor describes a | ||
405 | buffer which is read-only or write-only, but a chain of | ||
406 | descriptors can contain both read-only and write-only buffers. | ||
407 | |||
408 | No descriptor chain may be more than 2^32 bytes long in total. | ||
 | struct vring_desc { | ||
409 | |||
410 | /* Address (guest-physical). */ | ||
411 | |||
412 | u64 addr; | ||
413 | |||
414 | /* Length. */ | ||
415 | |||
416 | u32 len; | ||
417 | |||
418 | /* This marks a buffer as continuing via the next field. */ | ||
419 | |||
420 | #define VRING_DESC_F_NEXT 1 | ||
421 | |||
422 | /* This marks a buffer as write-only (otherwise read-only). */ | ||
423 | |||
424 | #define VRING_DESC_F_WRITE 2 | ||
425 | |||
426 | /* This means the buffer contains a list of buffer descriptors. | ||
427 | */ | ||
428 | |||
429 | #define VRING_DESC_F_INDIRECT 4 | ||
430 | |||
431 | /* The flags as indicated above. */ | ||
432 | |||
433 | u16 flags; | ||
434 | |||
435 | /* Next field if flags & NEXT */ | ||
436 | |||
437 | u16 next; | ||
438 | |||
439 | }; | ||
440 | |||
441 | The number of descriptors in the table is specified by the Queue | ||
442 | Size field for this virtqueue. | ||
443 | |||
444 | <sub:Indirect-Descriptors>Indirect Descriptors | ||
445 | |||
446 | Some devices benefit by concurrently dispatching a large number | ||
447 | of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be | ||
448 | used to allow this (see [cha:Reserved-Feature-Bits]). To increase | ||
449 | ring capacity it is possible to store a table of indirect | ||
450 | descriptors anywhere in memory, and insert a descriptor in main | ||
451 | virtqueue (with flags&INDIRECT on) that refers to memory buffer | ||
452 | containing this indirect descriptor table; fields addr and len | ||
453 | refer to the indirect table address and length in bytes, | ||
454 | respectively. The indirect table layout structure looks like this | ||
455 | (len is the length of the descriptor that refers to this table, | ||
456 | which is a variable, so this code won't compile): | ||
457 | |||
458 | struct indirect_descriptor_table { | ||
459 | |||
460 | /* The actual descriptors (16 bytes each) */ | ||
461 | |||
462 | struct vring_desc desc[len / 16]; | ||
463 | |||
464 | }; | ||
465 | |||
466 | The first indirect descriptor is located at start of the indirect | ||
467 | descriptor table (index 0), additional indirect descriptors are | ||
468 | chained by next field. An indirect descriptor without next field | ||
469 | (with flags&NEXT off) signals the end of the indirect descriptor | ||
470 | table, and transfers control back to the main virtqueue. An | ||
471 | indirect descriptor can not refer to another indirect descriptor | ||
472 | table (flags&INDIRECT must be off). A single indirect descriptor | ||
473 | table can include both read-only and write-only descriptors; | ||
474 | write-only flag (flags&WRITE) in the descriptor that refers to it | ||
475 | is ignored. | ||
476 | |||
477 | Available Ring | ||
478 | |||
479 | The available ring refers to what descriptors we are offering the | ||
480 | device: it refers to the head of a descriptor chain. The “flags” | ||
481 | field is currently 0 or 1: 1 indicating that we do not need an | ||
482 | interrupt when the device consumes a descriptor from the | ||
483 | available ring. Alternatively, the guest can ask the device to | ||
484 | delay interrupts until an entry with an index specified by the “ | ||
485 | used_event” field is written in the used ring (equivalently, | ||
486 | until the idx field in the used ring reaches the value | ||
487 | used_event + 1). The method employed by the device is controlled | ||
488 | by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] | ||
489 | ). This interrupt suppression is merely an optimization; it may | ||
490 | not suppress interrupts entirely. | ||
491 | |||
492 | The “idx” field indicates where we would put the next descriptor | ||
493 | entry (modulo the ring size). This starts at 0, and increases. | ||
494 | |||
495 | struct vring_avail { | ||
496 | |||
497 | #define VRING_AVAIL_F_NO_INTERRUPT 1 | ||
498 | |||
499 | u16 flags; | ||
500 | |||
501 | u16 idx; | ||
502 | |||
503 | u16 ring[qsz]; /* qsz is the Queue Size field read from device | ||
504 | */ | ||
505 | |||
506 | u16 used_event; | ||
507 | |||
508 | }; | ||
509 | |||
510 | Used Ring | ||
511 | |||
512 | The used ring is where the device returns buffers once it is done | ||
513 | with them. The flags field can be used by the device to hint that | ||
514 | no notification is necessary when the guest adds to the available | ||
515 | ring. Alternatively, the “avail_event” field can be used by the | ||
516 | device to hint that no notification is necessary until an entry | ||
517 | with an index specified by the “avail_event” is written in the | ||
518 | available ring (equivalently, until the idx field in the | ||
519 | available ring reaches the value avail_event + 1). The method | ||
520 | employed by the device is controlled by the guest through the | ||
521 | VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits] | ||
522 | ). [footnote: | ||
523 | These fields are kept here because this is the only part of the | ||
524 | virtqueue written by the device | ||
525 | ]. | ||
526 | |||
527 | Each entry in the ring is a pair: the head entry of the | ||
528 | descriptor chain describing the buffer (this matches an entry | ||
529 | placed in the available ring by the guest earlier), and the total number | ||
530 | of bytes written into the buffer. The latter is extremely useful | ||
531 | for guests using untrusted buffers: if you do not know exactly | ||
532 | how much has been written by the device, you usually have to zero | ||
533 | the buffer to ensure no data leakage occurs. | ||
534 | |||
535 | /* u32 is used here for ids for padding reasons. */ | ||
536 | |||
537 | struct vring_used_elem { | ||
538 | |||
539 | /* Index of start of used descriptor chain. */ | ||
540 | |||
541 | u32 id; | ||
542 | |||
543 | /* Total length of the descriptor chain which was used | ||
544 | (written to) */ | ||
545 | |||
546 | u32 len; | ||
547 | |||
548 | }; | ||
549 | |||
550 | |||
551 | |||
552 | struct vring_used { | ||
553 | |||
554 | #define VRING_USED_F_NO_NOTIFY 1 | ||
555 | |||
556 | u16 flags; | ||
557 | |||
558 | u16 idx; | ||
559 | |||
560 | struct vring_used_elem ring[qsz]; | ||
561 | |||
562 | u16 avail_event; | ||
563 | |||
564 | }; | ||
565 | |||
566 | Helpers for Managing Virtqueues | ||
567 | |||
568 | The Linux Kernel Source code contains the definitions above and | ||
569 | helper routines in a more usable form, in | ||
570 | include/linux/virtio_ring.h. This was explicitly licensed by IBM | ||
571 | and Red Hat under the (3-clause) BSD license so that it can be | ||
572 | freely used by all other projects, and is reproduced (with slight | ||
573 | variation to remove Linux assumptions) in Appendix A. | ||
574 | |||
575 | Device Operation<sec:Device-Operation> | ||
576 | |||
577 | There are two parts to device operation: supplying new buffers to | ||
578 | the device, and processing used buffers from the device. As an | ||
579 | example, the virtio network device has two virtqueues: the | ||
580 | transmit virtqueue and the receive virtqueue. The driver adds | ||
581 | outgoing (read-only) packets to the transmit virtqueue, and then | ||
582 | frees them after they are used. Similarly, incoming (write-only) | ||
583 | buffers are added to the receive virtqueue, and processed after | ||
584 | they are used. | ||
585 | |||
586 | Supplying Buffers to The Device | ||
587 | |||
588 | Actual transfer of buffers from the guest OS to the device | ||
589 | operates as follows: | ||
590 | |||
591 | 1. Place the buffer(s) into free descriptor(s). | ||
592 | |||
593 | If there are no free descriptors, the guest may choose to | ||
594 | notify the device even if notifications are suppressed (to | ||
595 | reduce latency).[footnote: | ||
596 | The Linux drivers do this only for read-only buffers: for | ||
597 | write-only buffers, it is assumed that the driver is merely | ||
598 | trying to keep the receive buffer ring full, and no notification | ||
599 | of this expected condition is necessary. | ||
600 | ] | ||
601 | |||
602 | 2. Place the id of the buffer in the next ring entry of the | ||
603 | available ring. | ||
604 | |||
605 | 3. The steps (1) and (2) may be performed repeatedly if batching | ||
606 | is possible. | ||
607 | |||
608 | 4. A memory barrier should be executed to ensure the device sees | ||
609 | the updated descriptor table and available ring before the next | ||
610 | step. | ||
611 | |||
612 | 5. The available “idx” field should be increased by the number of | ||
613 | entries added to the available ring. | ||
614 | |||
615 | 6. A memory barrier should be executed to ensure that we update | ||
616 | the idx field before checking for notification suppression. | ||
617 | |||
618 | 7. If notifications are not suppressed, the device should be | ||
619 | notified of the new buffers. | ||
620 | |||
621 | Note that the above procedure does not take precautions against the | ||
622 | available ring buffer wrapping around: this is not possible since | ||
623 | the ring buffer is the same size as the descriptor table, so step | ||
624 | (1) will prevent such a condition. | ||
625 | |||
626 | In addition, the maximum queue size is 32768 (it must be a power | ||
627 | of 2 which fits in 16 bits), so the 16-bit “idx” value can always | ||
628 | distinguish between a full and empty buffer. | ||
629 | |||
630 | Here is a description of each stage in more detail. | ||
631 | |||
632 | Placing Buffers Into The Descriptor Table | ||
633 | |||
634 | A buffer consists of zero or more read-only physically-contiguous | ||
635 | elements followed by zero or more physically-contiguous | ||
636 | write-only elements (it must have at least one element). This | ||
637 | algorithm maps it into the descriptor table: | ||
638 | |||
639 | for each buffer element, b: | ||
640 | |||
641 | Get the next free descriptor table entry, d | ||
642 | |||
643 | Set d.addr to the physical address of the start of b | ||
644 | |||
645 | Set d.len to the length of b. | ||
646 | |||
647 | If b is write-only, set d.flags to VRING_DESC_F_WRITE, | ||
648 | otherwise 0. | ||
649 | |||
650 | If there is a buffer element after this: | ||
651 | |||
652 | Set d.next to the index of the next free descriptor element. | ||
653 | |||
654 | Set the VRING_DESC_F_NEXT bit in d.flags. | ||
655 | |||
656 | In practice, the d.next fields are usually used to chain free | ||
657 | descriptors, and a separate count kept to check there are enough | ||
658 | free descriptors before beginning the mappings. | ||
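
The mapping algorithm above, together with the free-list convention just described, might look like the following sketch. The buf_elem and desc_pool types and both function names are hypothetical bookkeeping invented for this example; only struct vring_desc and the two flag values come from this document. The caller is assumed to list read-only elements before write-only ones.

```c
#include <stdint.h>

#define VRING_DESC_F_NEXT  1
#define VRING_DESC_F_WRITE 2

struct vring_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

/* Hypothetical: one physically-contiguous buffer element. */
struct buf_elem {
    uint64_t phys_addr;
    uint32_t len;
    int write_only;
};

/* Hypothetical driver-side bookkeeping: free descriptors are chained
 * through their next fields, with a separate count of free entries. */
struct desc_pool {
    struct vring_desc *desc;
    uint16_t free_head;
    uint16_t num_free;
};

static void desc_pool_init(struct desc_pool *p, struct vring_desc *desc,
                           uint16_t qsz)
{
    uint16_t i;

    for (i = 0; i < qsz; i++)
        desc[i].next = (uint16_t)(i + 1);
    p->desc = desc;
    p->free_head = 0;
    p->num_free = qsz;
}

/* Map n elements into the table. Returns the head descriptor index,
 * or -1 if the buffer is empty or there are too few free descriptors. */
static int map_buffer(struct desc_pool *p, const struct buf_elem *elems,
                      unsigned int n)
{
    uint16_t head, d;
    unsigned int i;

    if (n == 0 || n > p->num_free)
        return -1;

    head = d = p->free_head;
    for (i = 0; i < n; i++) {
        struct vring_desc *e = &p->desc[d];
        uint16_t next_free = e->next;

        e->addr = elems[i].phys_addr;
        e->len = elems[i].len;
        e->flags = elems[i].write_only ? VRING_DESC_F_WRITE : 0;
        /* e->next already names the next free entry, which becomes
         * the next descriptor in this chain. */
        if (i + 1 < n)
            e->flags |= VRING_DESC_F_NEXT;
        d = next_free;
    }
    p->free_head = d;
    p->num_free = (uint16_t)(p->num_free - n);
    return head;
}
```

Checking the free count before touching the table means a failed call leaves the descriptor table unchanged.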
659 | |||
660 | Updating The Available Ring | ||
661 | |||
662 | The head of the buffer we mapped is the first d in the algorithm | ||
663 | above. A naive implementation would do the following: | ||
664 | |||
665 | avail->ring[avail->idx % qsz] = head; | ||
666 | |||
667 | However, in general we can add many descriptors before we update | ||
668 | the “idx” field (at which point they become visible to the | ||
669 | device), so we keep a counter of how many we've added: | ||
670 | |||
671 | avail->ring[(avail->idx + added++) % qsz] = head; | ||
672 | |||
673 | Updating The Index Field | ||
674 | |||
675 | Once the idx field of the virtqueue is updated, the device will | ||
676 | be able to access the descriptor entries we've created and the | ||
677 | memory they refer to. This is why a memory barrier is generally | ||
678 | used before the idx update, to ensure the device sees the most | ||
679 | up-to-date copy. | ||
680 | |||
681 | The idx field always increments, and we let it wrap naturally at | ||
682 | 65536: | ||
683 | |||
684 | avail->idx += added; | ||
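
The last two steps can be combined into a small sketch: queue several heads, then publish them with one idx update. The struct avail_view type and both function names are assumptions for this example, and the C11 atomic_thread_fence() call merely stands in for whatever barrier the platform actually requires between guest and device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical driver-side view of the available ring. */
struct avail_view {
    volatile uint16_t *idx;   /* points at avail->idx */
    volatile uint16_t *ring;  /* points at avail->ring[0] */
    uint16_t qsz;             /* queue size, a power of 2 */
    uint16_t added;           /* heads queued since the last publish */
};

/* Queue one head; the device cannot see it until avail_publish(). */
static void avail_add(struct avail_view *a, uint16_t head)
{
    a->ring[(uint16_t)(*a->idx + a->added++) % a->qsz] = head;
}

/* Make all queued heads visible with a single idx update. */
static void avail_publish(struct avail_view *a)
{
    /* Descriptor and ring writes must be visible before idx moves. */
    atomic_thread_fence(memory_order_seq_cst);
    *a->idx = (uint16_t)(*a->idx + a->added);  /* wraps naturally at 65536 */
    a->added = 0;
}
```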
685 | |||
686 | <sub:Notifying-The-Device>Notifying The Device | ||
687 | |||
688 | Device notification occurs by writing the 16-bit virtqueue index | ||
689 | of this virtqueue to the Queue Notify field of the virtio header | ||
690 | in the first I/O region of the PCI device. This can be expensive, | ||
691 | however, so the device can suppress such notifications if it | ||
692 | doesn't need them. We have to be careful to expose the new idx | ||
693 | value before checking the suppression flag: it's OK to notify | ||
694 | gratuitously, but not to omit a required notification. So again, | ||
695 | we use a memory barrier here before reading the flags or the | ||
696 | avail_event field. | ||
697 | |||
698 | If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if | ||
699 | the VRING_USED_F_NO_NOTIFY flag is not set, we go ahead and write to | ||
700 | the PCI configuration space. | ||
701 | |||
702 | If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the | ||
703 | avail_event field in the available ring structure. If the | ||
704 | available index crossed the avail_event field value since the | ||
705 | last notification, we go ahead and write to the PCI configuration | ||
706 | space. The avail_event field wraps naturally at 65536 as well: | ||
707 | |||
708 | (u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx) | ||
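
A minimal sketch of the driver's decision, covering both the flag-based and the event-index cases. The function should_notify and its parameter names are assumptions for illustration; only the comparison itself is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

#define VRING_USED_F_NO_NOTIFY 1

/* Same comparison as the expression in the text. */
static int need_event(uint16_t event_idx, uint16_t new_idx, uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}

/* Hypothetical driver helper: decide whether to write Queue Notify.
 * used_flags and avail_event are the values read from the used ring
 * after the memory barrier; old_idx is the avail idx at the time of
 * the previous notification check, new_idx the value just published. */
static int should_notify(int event_idx_negotiated, uint16_t used_flags,
                         uint16_t avail_event, uint16_t new_idx,
                         uint16_t old_idx)
{
    if (!event_idx_negotiated)
        return !(used_flags & VRING_USED_F_NO_NOTIFY);
    return need_event(avail_event, new_idx, old_idx);
}
```

Because every operand is reduced to 16 bits, the comparison remains correct when the indices wrap between old_idx and new_idx.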
709 | |||
710 | <sub:Receiving-Used-Buffers>Receiving Used Buffers From The | ||
711 | Device | ||
712 | |||
713 | Once the device has used a buffer (read from or written to it, or | ||
714 | parts of both, depending on the nature of the virtqueue and the | ||
715 | device), it sends an interrupt, following an algorithm very | ||
716 | similar to the algorithm used for the driver to send the device a | ||
717 | buffer: | ||
718 | |||
719 | Write the head descriptor number to the next field in the used | ||
720 | ring. | ||
721 | |||
722 | Update the used ring idx. | ||
723 | |||
724 | Determine whether an interrupt is necessary: | ||
725 | |||
726 | If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: check | ||
727 | if the VRING_AVAIL_F_NO_INTERRUPT flag is not set in | ||
728 | avail->flags | ||
729 | |||
730 | If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check | ||
731 | whether the used index crossed the used_event field value | ||
732 | since the last update. The used_event field wraps naturally | ||
733 | at 65536 as well: (u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx) | ||
734 | |||
735 | If an interrupt is necessary: | ||
736 | |||
737 | If MSI-X capability is disabled: | ||
738 | |||
739 | Set the lower bit of the ISR Status field for the device. | ||
740 | |||
741 | Send the appropriate PCI interrupt for the device. | ||
742 | |||
743 | If MSI-X capability is enabled: | ||
744 | |||
745 | Request the appropriate MSI-X interrupt message for the | ||
746 | device; the Queue Vector field sets the MSI-X Table entry | ||
747 | number. | ||
748 | |||
749 | If the Queue Vector field value is NO_VECTOR, no interrupt | ||
750 | message is requested for this event. | ||
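
The device-side decision above can be sketched as a pure function. All names here are hypothetical, and the value 0xffff for NO_VECTOR is an assumption of this example, since this section only names the constant.

```c
#include <assert.h>
#include <stdint.h>

#define VRING_AVAIL_F_NO_INTERRUPT 1
#define NO_VECTOR 0xffff  /* assumed value; this section only names it */

enum vq_irq { VQ_IRQ_SKIP, VQ_IRQ_LEGACY, VQ_IRQ_MSIX };

/* Hypothetical device helper, called after the used ring idx moved
 * from old_idx to new_idx. */
static enum vq_irq device_irq_action(int event_idx_negotiated,
                                     uint16_t avail_flags, uint16_t used_event,
                                     uint16_t new_idx, uint16_t old_idx,
                                     int msix_enabled, uint16_t queue_vector)
{
    int needed;

    if (!event_idx_negotiated)
        needed = !(avail_flags & VRING_AVAIL_F_NO_INTERRUPT);
    else
        needed = (uint16_t)(new_idx - used_event - 1) <
                 (uint16_t)(new_idx - old_idx);

    if (!needed)
        return VQ_IRQ_SKIP;
    if (!msix_enabled)
        return VQ_IRQ_LEGACY;  /* set ISR low bit, then raise the PCI interrupt */
    return queue_vector == NO_VECTOR ? VQ_IRQ_SKIP : VQ_IRQ_MSIX;
}
```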
751 | |||
752 | The guest interrupt handler should: | ||
753 | |||
754 | If MSI-X capability is disabled: read the ISR Status field, | ||
755 | which will reset it to zero. If the lower bit is zero, the | ||
756 | interrupt was not for this device. Otherwise, the guest driver | ||
757 | should look through the used rings of each virtqueue for the | ||
758 | device, to see if any progress has been made by the device | ||
759 | which requires servicing. | ||
760 | |||
761 | If MSI-X capability is enabled: look through the used rings of | ||
762 | each virtqueue mapped to the specific MSI-X vector for the | ||
763 | device, to see if any progress has been made by the device | ||
764 | which requires servicing. | ||
765 | |||
766 | For each ring, the guest should then disable interrupts by setting | ||
767 | the VRING_AVAIL_F_NO_INTERRUPT flag in the avail structure, if | ||
768 | required. It can then process used ring entries, finally enabling | ||
769 | interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or | ||
770 | updating the used_event field in the available structure. The | ||
771 | guest should then execute a memory barrier, and then recheck the | ||
772 | ring empty condition. This is necessary to handle the case where, | ||
773 | after the last check and before enabling interrupts, an interrupt | ||
774 | has been suppressed by the device: | ||
775 | |||
776 | vring_disable_interrupts(vq); | ||
777 | |||
778 | for (;;) { | ||
779 |     /* Nothing new in the used ring? */ | ||
780 |     if (vq->last_seen_used == vring->used->idx) { | ||
781 | |||
782 |         vring_enable_interrupts(vq); | ||
783 | |||
784 |         mb(); | ||
785 |         /* Recheck: the device may have added a buffer meanwhile. */ | ||
786 |         if (vq->last_seen_used == vring->used->idx) | ||
787 | |||
788 |             break; | ||
789 | |||
790 |         vring_disable_interrupts(vq); | ||
791 |     } | ||
792 |     struct vring_used_elem *e = | ||
793 |         &vring->used->ring[vq->last_seen_used%vsz]; | ||
794 | |||
795 |     process_buffer(e); | ||
796 | |||
797 |     vq->last_seen_used++; | ||
798 | |||
799 | } | ||
800 | |||
801 | Dealing With Configuration Changes<sub:Dealing-With-Configuration> | ||
802 | |||
803 | Some virtio PCI devices can change the device configuration | ||
804 | state, as reflected in the virtio header in the PCI configuration | ||
805 | space. In this case: | ||
806 | |||
807 | If MSI-X capability is disabled: an interrupt is delivered and | ||
808 | the second lowest bit is set in the ISR Status field to | ||
809 | indicate that the driver should re-examine the configuration | ||
810 | space. Note that a single interrupt can indicate both that one | ||
811 | or more virtqueues have been used and that the configuration | ||
812 | space has changed: even if the config bit is set, virtqueues | ||
813 | must be scanned. | ||
814 | |||
815 | If MSI-X capability is enabled: an interrupt message is | ||
816 | requested. The Configuration Vector field sets the MSI-X Table | ||
817 | entry number to use. If the Configuration Vector field value is | ||
818 | NO_VECTOR, no interrupt message is requested for this event. | ||
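
For the non-MSI-X case, a handler might decode the ISR Status value as sketched below. This assumes bit 0 marks virtqueue activity and bit 1 a configuration change (the value the Linux driver calls VIRTIO_PCI_ISR_CONFIG); the macro, struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ISR_QUEUE  0x1  /* assumed: bit 0, a virtqueue was used */
#define ISR_CONFIG 0x2  /* assumed: bit 1, configuration changed */

/* What the handler should do after one read of ISR Status. */
struct isr_work {
    int ours;           /* interrupt was raised by this device */
    int scan_queues;    /* look through the used rings */
    int reread_config;  /* re-examine the configuration space */
};

static struct isr_work decode_isr(uint8_t isr)
{
    struct isr_work w;

    w.ours = (isr & (ISR_QUEUE | ISR_CONFIG)) != 0;
    /* Even if only the config bit is set, virtqueues must be scanned. */
    w.scan_queues = w.ours;
    w.reread_config = (isr & ISR_CONFIG) != 0;
    return w;
}
```

Since reading ISR Status resets it to zero, the value has to be read once and decoded from the saved copy, as this function's signature suggests.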
819 | |||
820 | Creating New Device Types | ||
821 | |||
822 | Various considerations are necessary when creating a new device | ||
823 | type: | ||
824 | |||
825 | How Many Virtqueues? | ||
826 | |||
827 | It is possible that a very simple device will operate entirely | ||
828 | through its configuration space, but most will need at least one | ||
829 | virtqueue in which it will place requests. A device with both | ||
830 | input and output (eg. console and network devices described here) | ||
831 | needs two queues: one which the driver fills with buffers to | ||
832 | receive input, and one in which the driver places buffers to | ||
833 | transmit output. | ||
834 | |||
835 | What Configuration Space Layout? | ||
836 | |||
837 | Configuration space is generally used for rarely-changing or | ||
838 | initialization-time parameters. But it is a limited resource, so | ||
839 | it might be better to use a virtqueue to update configuration | ||
840 | information (the network device does this for filtering, | ||
841 | otherwise the table in the config space could potentially be very | ||
842 | large). | ||
843 | |||
844 | Note that this space is generally the guest's native endian, | ||
845 | rather than PCI's little-endian. | ||
846 | |||
847 | What Device Number? | ||
848 | |||
849 | Currently device numbers are assigned quite freely: a simple | ||
850 | request mail to the author of this document or the Linux | ||
851 | virtualization mailing list[footnote: | ||
852 | |||
853 | https://lists.linux-foundation.org/mailman/listinfo/virtualization | ||
854 | ] will be sufficient to secure a unique one. | ||
855 | |||
856 | Meanwhile for experimental drivers, use 65535 and work backwards. | ||
857 | |||
858 | How many MSI-X vectors? | ||
859 | |||
860 | Using the optional MSI-X capability, devices can speed up | ||
861 | interrupt processing by removing the need for the guest driver to | ||
862 | read the ISR Status register (which might be an expensive operation), | ||
863 | reducing interrupt sharing between devices and queues within the | ||
864 | device, and handling interrupts from multiple CPUs. However, some | ||
865 | systems impose a limit (which might be as low as 256) on the | ||
866 | total number of MSI-X vectors that can be allocated to all | ||
867 | devices. Devices and/or device drivers should take this into | ||
868 | account, limiting the number of vectors used unless the device is | ||
869 | expected to cause a high volume of interrupts. Devices can | ||
870 | control the number of vectors used by limiting the MSI-X Table | ||
871 | Size or not presenting MSI-X capability in PCI configuration | ||
872 | space. Drivers can control this by mapping events to as small a | ||
873 | number of vectors as possible, or disabling MSI-X capability | ||
874 | altogether. | ||
875 | |||
876 | Message Framing | ||
877 | |||
878 | The descriptors used for a buffer should not affect the semantics | ||
879 | of the message, except for the total length of the buffer. For | ||
880 | example, a network buffer consists of a 10 byte header followed | ||
881 | by the network packet. Whether this is presented in the ring | ||
882 | descriptor chain as (say) a 10 byte buffer and a 1514 byte | ||
883 | buffer, or a single 1524 byte buffer, or even three buffers, | ||
884 | should have no effect. | ||
885 | |||
886 | In particular, no implementation should use the descriptor | ||
887 | boundaries to determine the size of any header in a request.[footnote: | ||
888 | The current qemu device implementations mistakenly insist that | ||
889 | the first descriptor cover the header in these cases exactly, so | ||
890 | a cautious driver should arrange it so. | ||
891 | ] | ||
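
One way to respect this rule is to gather the header by byte count rather than by descriptor. The sketch below assumes the descriptor chain has already been translated into an array of memory segments; struct seg and gather_header are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one descriptor's memory after address translation. */
struct seg { const uint8_t *base; size_t len; };

/* Copy the first hdr_len bytes of the message into hdr, wherever the
 * descriptor boundaries fall. Returns 0 on success, or -1 if the whole
 * buffer is shorter than hdr_len. */
static int gather_header(const struct seg *s, size_t nseg,
                         void *hdr, size_t hdr_len)
{
    uint8_t *out = hdr;
    size_t i, done = 0;

    for (i = 0; i < nseg && done < hdr_len; i++) {
        size_t want = hdr_len - done;
        size_t take = s[i].len < want ? s[i].len : want;

        memcpy(out + done, s[i].base, take);
        done += take;
    }
    return done == hdr_len ? 0 : -1;
}
```

With this approach a 10 byte header is recovered identically whether the buffer arrives as one descriptor or as several.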
892 | |||
893 | Device Improvements | ||
894 | |||
895 | Any change to configuration space, or new virtqueues, or | ||
896 | behavioural changes, should be indicated by negotiation of a new | ||
897 | feature bit. This establishes clarity[footnote: | ||
898 | Even if it does mean documenting design or implementation | ||
899 | mistakes! | ||
900 | ] and avoids future expansion problems. | ||
901 | |||
902 | Clusters of functionality which are always implemented together | ||
903 | can use a single bit, but if one feature makes sense without the | ||
904 | others they should not be gratuitously grouped together to | ||
905 | conserve feature bits. We can always extend the spec when the | ||
906 | first person needs more than 24 feature bits for their device. | ||
907 | |||
910 | Appendix A: virtio_ring.h | ||
911 | |||
912 | #ifndef VIRTIO_RING_H | ||
913 | |||
914 | #define VIRTIO_RING_H | ||
915 | #include <stdint.h> | ||
916 | /* An interface for efficient virtio implementation. | ||
917 | |||
918 | * | ||
919 | |||
920 | * This header is BSD licensed so anyone can use the definitions | ||
921 | |||
922 | * to implement compatible drivers/servers. | ||
923 | |||
924 | * | ||
925 | |||
926 | * Copyright 2007, 2009, IBM Corporation | ||
927 | |||
928 | * Copyright 2011, Red Hat, Inc | ||
929 | |||
930 | * All rights reserved. | ||
931 | |||
932 | * | ||
933 | |||
934 | * Redistribution and use in source and binary forms, with or | ||
935 | without | ||
936 | |||
937 | * modification, are permitted provided that the following | ||
938 | conditions | ||
939 | |||
940 | * are met: | ||
941 | |||
942 | * 1. Redistributions of source code must retain the above | ||
943 | copyright | ||
944 | |||
945 | * notice, this list of conditions and the following | ||
946 | disclaimer. | ||
947 | |||
948 | * 2. Redistributions in binary form must reproduce the above | ||
949 | copyright | ||
950 | |||
951 | * notice, this list of conditions and the following | ||
952 | disclaimer in the | ||
953 | |||
954 | * documentation and/or other materials provided with the | ||
955 | distribution. | ||
956 | |||
957 | * 3. Neither the name of IBM nor the names of its contributors | ||
958 | |||
959 | * may be used to endorse or promote products derived from | ||
960 | this software | ||
961 | |||
962 | * without specific prior written permission. | ||
963 | |||
964 | * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND | ||
965 | CONTRIBUTORS ``AS IS'' AND | ||
966 | |||
967 | * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED | ||
968 | TO, THE | ||
969 | |||
970 | * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A | ||
971 | PARTICULAR PURPOSE | ||
972 | |||
973 | * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE | ||
974 | LIABLE | ||
975 | |||
976 | * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR | ||
977 | CONSEQUENTIAL | ||
978 | |||
979 | * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF | ||
980 | SUBSTITUTE GOODS | ||
981 | |||
982 | * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS | ||
983 | INTERRUPTION) | ||
984 | |||
985 | * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN | ||
986 | CONTRACT, STRICT | ||
987 | |||
988 | * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING | ||
989 | IN ANY WAY | ||
990 | |||
991 | * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE | ||
992 | POSSIBILITY OF | ||
993 | |||
994 | * SUCH DAMAGE. | ||
995 | |||
996 | */ | ||
997 | |||
998 | |||
999 | |||
1000 | /* This marks a buffer as continuing via the next field. */ | ||
1001 | |||
1002 | #define VRING_DESC_F_NEXT 1 | ||
1003 | |||
1004 | /* This marks a buffer as write-only (otherwise read-only). */ | ||
1005 | |||
1006 | #define VRING_DESC_F_WRITE 2 | ||
1007 | |||
1008 | |||
1009 | |||
1010 | /* The Host uses this in used->flags to advise the Guest: don't | ||
1011 | kick me | ||
1012 | |||
1013 | * when you add a buffer. It's unreliable, so it's simply an | ||
1014 | |||
1015 | * optimization. Guest will still kick if it's out of buffers. | ||
1016 | */ | ||
1017 | |||
1018 | #define VRING_USED_F_NO_NOTIFY 1 | ||
1019 | |||
1020 | /* The Guest uses this in avail->flags to advise the Host: don't | ||
1021 | |||
1022 | * interrupt me when you consume a buffer. It's unreliable, so | ||
1023 | it's | ||
1024 | |||
1025 | * simply an optimization. */ | ||
1026 | |||
1027 | #define VRING_AVAIL_F_NO_INTERRUPT 1 | ||
1028 | |||
1029 | |||
1030 | |||
1031 | /* Virtio ring descriptors: 16 bytes. | ||
1032 | |||
1033 | * These can chain together via "next". */ | ||
1034 | |||
1035 | struct vring_desc { | ||
1036 | |||
1037 | /* Address (guest-physical). */ | ||
1038 | |||
1039 | uint64_t addr; | ||
1040 | |||
1041 | /* Length. */ | ||
1042 | |||
1043 | uint32_t len; | ||
1044 | |||
1045 | /* The flags as indicated above. */ | ||
1046 | |||
1047 | uint16_t flags; | ||
1048 | |||
1049 | /* We chain unused descriptors via this, too */ | ||
1050 | |||
1051 | uint16_t next; | ||
1052 | |||
1053 | }; | ||
1054 | |||
1055 | |||
1056 | |||
1057 | struct vring_avail { | ||
1058 | |||
1059 | uint16_t flags; | ||
1060 | |||
1061 | uint16_t idx; | ||
1062 | |||
1063 | uint16_t ring[]; | ||
1064 | |||
1065 | /* Follows ring[num] only if VIRTIO_F_RING_EVENT_IDX: uint16_t used_event; */ | ||
1066 | |||
1067 | }; | ||
1068 | |||
1069 | |||
1070 | |||
1071 | /* u32 is used here for ids for padding reasons. */ | ||
1072 | |||
1073 | struct vring_used_elem { | ||
1074 | |||
1075 | /* Index of start of used descriptor chain. */ | ||
1076 | |||
1077 | uint32_t id; | ||
1078 | |||
1079 | /* Total length of the descriptor chain which was written | ||
1080 | to. */ | ||
1081 | |||
1082 | uint32_t len; | ||
1083 | |||
1084 | }; | ||
1085 | |||
1086 | |||
1087 | |||
1088 | struct vring_used { | ||
1089 | |||
1090 | uint16_t flags; | ||
1091 | |||
1092 | uint16_t idx; | ||
1093 | |||
1094 | struct vring_used_elem ring[]; | ||
1095 | |||
1096 | /* Follows ring[num] only if VIRTIO_F_RING_EVENT_IDX: uint16_t avail_event; */ | ||
1097 | |||
1098 | }; | ||
1099 | |||
1100 | |||
1101 | |||
1102 | struct vring { | ||
1103 | |||
1104 | unsigned int num; | ||
1105 | |||
1106 | |||
1107 | |||
1108 | struct vring_desc *desc; | ||
1109 | |||
1110 | struct vring_avail *avail; | ||
1111 | |||
1112 | struct vring_used *used; | ||
1113 | |||
1114 | }; | ||
1115 | |||
1116 | |||
1117 | |||
1118 | /* The standard layout for the ring is a continuous chunk of | ||
1119 | memory which | ||
1120 | |||
1121 | * looks like this. We assume num is a power of 2. | ||
1122 | |||
1123 | * | ||
1124 | |||
1125 | * struct vring { | ||
1126 | |||
1127 | * // The actual descriptors (16 bytes each) | ||
1128 | |||
1129 | * struct vring_desc desc[num]; | ||
1130 | |||
1131 | * | ||
1132 | |||
1133 | * // A ring of available descriptor heads with free-running | ||
1134 | index. | ||
1135 | |||
1136 | * __u16 avail_flags; | ||
1137 | |||
1138 | * __u16 avail_idx; | ||
1139 | |||
1140 | * __u16 available[num]; | ||
1141 | |||
1142 | * | ||
1143 | |||
1144 | * // Padding to the next align boundary. | ||
1145 | |||
1146 | * char pad[]; | ||
1147 | |||
1148 | * | ||
1149 | |||
1150 | * // A ring of used descriptor heads with free-running | ||
1151 | index. | ||
1152 | |||
1153 | * __u16 used_flags; | ||
1154 | |||
1155 | * __u16 used_idx; | ||
1156 | |||
1157 | * struct vring_used_elem used[num]; | ||
1158 | |||
1159 | * }; | ||
1160 | |||
1161 | * Note: for virtio PCI, align is 4096. | ||
1162 | |||
1163 | */ | ||
1164 | |||
1165 | static inline void vring_init(struct vring *vr, unsigned int num, | ||
1166 | void *p, | ||
1167 | |||
1168 | unsigned long align) | ||
1169 | |||
1170 | { | ||
1171 | |||
1172 | vr->num = num; | ||
1173 | |||
1174 | vr->desc = p; | ||
1175 | |||
1176 | vr->avail = (void *)((char *)p + num*sizeof(struct vring_desc)); | ||
1177 | |||
1178 | vr->used = (void *)(((unsigned long)&vr->avail->ring[num] | ||
1179 | |||
1180 | + align-1) | ||
1181 | |||
1182 | & ~(align - 1)); | ||
1183 | |||
1184 | } | ||
1185 | |||
1186 | |||
1187 | |||
1188 | static inline unsigned vring_size(unsigned int num, unsigned long | ||
1189 | align) | ||
1190 | |||
1191 | { | ||
1192 | |||
1193 | return ((sizeof(struct vring_desc)*num + | ||
1194 | sizeof(uint16_t)*(2+num) | ||
1195 | |||
1196 | + align - 1) & ~(align - 1)) | ||
1197 | |||
1198 | + sizeof(uint16_t)*3 + sizeof(struct | ||
1199 | vring_used_elem)*num; | ||
1200 | |||
1201 | } | ||
1202 | |||
1203 | |||
1204 | |||
1205 | static inline int vring_need_event(uint16_t event_idx, uint16_t | ||
1206 | new_idx, uint16_t old_idx) | ||
1207 | |||
1208 | { | ||
1209 | |||
1210 | return (uint16_t)(new_idx - event_idx - 1) < | ||
1211 | (uint16_t)(new_idx - old_idx); | ||
1212 | |||
1213 | } | ||
1214 | |||
1215 | #endif /* VIRTIO_RING_H */ | ||
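
As a worked example of the vring_size() arithmetic, the sketch below restates the formula with the structure sizes written out as constants (16-byte descriptors, 2-byte ring entries, 8-byte used elements). The function name ring_bytes is an assumption for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Restates the vring_size() formula above with explicit sizes.
 * align must be a power of 2. */
static size_t ring_bytes(size_t num, size_t align)
{
    size_t first = 16 * num + 2 * (2 + num);  /* descriptor table + available ring */
    size_t second = 2 * 3 + 8 * num;          /* used ring */

    assert(align != 0 && (align & (align - 1)) == 0);
    return ((first + align - 1) & ~(align - 1)) + second;
}
```

For num = 256 and align = 4096, the first part is 4096 + 516 = 4612 bytes, which rounds up to 8192; the used ring adds 6 + 2048 = 2054 bytes, giving 10246 bytes in total.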
1216 | |||
1217 | <cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits | ||
1218 | |||
1219 | Currently there are three device-independent feature bits defined: | ||
1220 | |||
1221 | VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature | ||
1222 | indicates that the driver wants an interrupt if the device runs | ||
1223 | out of available descriptors on a virtqueue, even though | ||
1224 | interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT | ||
1225 | flag or the used_event field. An example of this is the | ||
1226 | networking driver: it doesn't need to know every time a packet | ||
1227 | is transmitted, but it does need to free the transmitted | ||
1228 | packets a finite time after they are transmitted. It can avoid | ||
1229 | using a timer if the device interrupts it when all the packets | ||
1230 | are transmitted. | ||
1231 | |||
1232 | VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature | ||
1233 | indicates that the driver can use descriptors with the | ||
1234 | VRING_DESC_F_INDIRECT flag set, as described in [sub:Indirect-Descriptors]. | ||
1236 | |||
1237 | VIRTIO_F_RING_EVENT_IDX (29) This feature enables the used_event | ||
1238 | and the avail_event fields. If set, it indicates that the | ||
1239 | device should ignore the flags field in the available ring | ||
1240 | structure. Instead, the used_event field in this structure is | ||
1241 | used by the guest to suppress device interrupts. Further, the | ||
1242 | driver should ignore the flags field in the used ring | ||
1243 | structure. Instead, the avail_event field in this structure is | ||
1244 | used by the device to suppress notifications. If unset, the | ||
1245 | driver should ignore the used_event field; the device should | ||
1246 | ignore the avail_event field; the flags fields are used instead. | ||
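
Testing for these bits is a matter of simple masking. The sketch below assumes the features are held in a single 32-bit word, and that the negotiated set is the intersection of what the device offers and what the driver accepts; has_feature and negotiate are hypothetical helper names.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_F_NOTIFY_ON_EMPTY    24
#define VIRTIO_F_RING_INDIRECT_DESC 28
#define VIRTIO_F_RING_EVENT_IDX     29

/* Returns 1 if the given bit number is set in the feature word. */
static int has_feature(uint32_t features, unsigned bit)
{
    assert(bit < 32);
    return (features >> bit) & 1u;
}

/* A feature is in effect only if both sides agree on it. */
static uint32_t negotiate(uint32_t device_offers, uint32_t driver_accepts)
{
    return device_offers & driver_accepts;
}
```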
1247 | |||
1248 | Appendix C: Network Device | ||
1249 | |||
1250 | The virtio network device is a virtual ethernet card, and is the | ||
1251 | most complex of the devices supported so far by virtio. It has been | ||
1252 | enhanced rapidly and demonstrates clearly how support for new | ||
1253 | features should be added to an existing device. Empty buffers are | ||
1254 | placed in one virtqueue for receiving packets, and outgoing | ||
1255 | packets are enqueued into another for transmission in that order. | ||
1256 | A third command queue is used to control advanced filtering | ||
1257 | features. | ||
1258 | |||
1259 | Configuration | ||
1260 | |||
1261 | Subsystem Device ID 1 | ||
1262 | |||
1263 | Virtqueues 0:receiveq. 1:transmitq. 2:controlq[footnote: | ||
1264 | Only if VIRTIO_NET_F_CTRL_VQ set | ||
1265 | ] | ||
1266 | |||
1267 | Feature bits | ||
1268 | |||
1269 | VIRTIO_NET_F_CSUM (0) Device handles packets with partial | ||
1270 | checksum | ||
1271 | |||
1272 | VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial | ||
1273 | checksum | ||
1274 | |||
1275 | VIRTIO_NET_F_MAC (5) Device has given MAC address. | ||
1276 | |||
1277 | VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with | ||
1278 | any GSO type.[footnote: | ||
1279 | It was supposed to indicate segmentation offload support, but | ||
1280 | upon further investigation it became clear that multiple bits | ||
1281 | were required. | ||
1282 | ] | ||
1283 | |||
1284 | VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4. | ||
1285 | |||
1286 | VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6. | ||
1287 | |||
1288 | VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN. | ||
1289 | |||
1290 | VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO. | ||
1291 | |||
1292 | VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4. | ||
1293 | |||
1294 | VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6. | ||
1295 | |||
1296 | VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN. | ||
1297 | |||
1298 | VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO. | ||
1299 | |||
1300 | VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers. | ||
1301 | |||
1302 | VIRTIO_NET_F_STATUS (16) Configuration status field is | ||
1303 | available. | ||
1304 | |||
1305 | VIRTIO_NET_F_CTRL_VQ (17) Control channel is available. | ||
1306 | |||
1307 | VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support. | ||
1308 | |||
1309 | VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering. | ||
1310 | |||
1311 | VIRTIO_NET_F_GUEST_ANNOUNCE (21) Guest can send gratuitous | ||
1312 | packets. | ||
1313 | |||
1314 | Device configuration layout Two configuration fields are | ||
1315 | currently defined. The mac address field always exists (though | ||
1316 | is only valid if VIRTIO_NET_F_MAC is set), and the status field | ||
1317 | only exists if VIRTIO_NET_F_STATUS is set. Two read-only bits | ||
1318 | are currently defined for the status field: | ||
1319 | VIRTIO_NET_S_LINK_UP and VIRTIO_NET_S_ANNOUNCE. | ||
1320 | #define VIRTIO_NET_S_LINK_UP 1 | ||
1321 | #define VIRTIO_NET_S_ANNOUNCE 2 | ||
1322 | |||
1323 | |||
1324 | |||
1325 | struct virtio_net_config { | ||
1326 | |||
1327 | u8 mac[6]; | ||
1328 | |||
1329 | u16 status; | ||
1330 | |||
1331 | }; | ||
1332 | |||
1333 | Device Initialization | ||
1334 | |||
1335 | The initialization routine should identify the receive and | ||
1336 | transmission virtqueues. | ||
1337 | |||
1338 | If the VIRTIO_NET_F_MAC feature bit is set, the configuration | ||
1339 | space “mac” entry indicates the “physical” address of the | ||
1340 | network card, otherwise a private MAC address should be | ||
1341 | assigned. All guests are expected to negotiate this feature if | ||
1342 | it is set. | ||
1343 | |||
1344 | If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify | ||
1345 | the control virtqueue. | ||
1346 | |||
1347 | If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link | ||
1348 | status can be read from the bottom bit of the “status” config | ||
1349 | field. Otherwise, the link should be assumed active. | ||
1350 | |||
1351 | The receive virtqueue should be filled with receive buffers. | ||
1352 | This is described in detail below in “Setting Up Receive | ||
1353 | Buffers”. | ||
1354 | |||
1355 | A driver can indicate that it will generate checksumless | ||
1356 | packets by negotiating the VIRTIO_NET_F_CSUM feature. This | ||
1357 | “checksum offload” is a common feature on modern network cards. | ||
1358 | |||
1359 | If that feature is negotiated[footnote: | ||
1360 | ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are | ||
1361 | dependent on VIRTIO_NET_F_CSUM; a device which offers the offload | ||
1362 | features must offer the checksum feature, and a driver which | ||
1363 | accepts the offload features must accept the checksum feature. | ||
1364 | Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features | ||
1365 | depending on VIRTIO_NET_F_GUEST_CSUM. | ||
1366 | ], a driver can use TCP or UDP segmentation offload by | ||
1367 | negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP), | ||
1368 | VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO | ||
1369 | (UDP fragmentation) features. It should not send TCP packets | ||
1370 | requiring segmentation offload which have the Explicit | ||
1371 | Congestion Notification bit set, unless the | ||
1372 | VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote: | ||
1373 | This is a common restriction in real, older network cards. | ||
1374 | ] | ||
1375 | |||
1376 | The converse features are also available: a driver can save the | ||
1377 | virtual device some work by negotiating these features.[footnote: | ||
1378 | For example, a network packet transported between two guests on | ||
1379 | the same system may not require checksumming at all, nor | ||
1380 | segmentation, if both guests are amenable. | ||
1381 | ] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially | ||
1382 | checksummed packets can be received, and if it can do that then | ||
1383 | the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, | ||
1384 | VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input | ||
1385 | equivalents of the features described above. See “Receiving | ||
1386 | Packets” below. | ||
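
Two of the initialization rules above lend themselves to a short sketch: the dependency of the offload features on the checksum features, and the link status fallback. The function names are assumptions for this example, and only the dependencies actually stated in the text are checked.

```c
#include <assert.h>
#include <stdint.h>

#define BIT(n) (1u << (n))
#define VIRTIO_NET_F_CSUM        0
#define VIRTIO_NET_F_GUEST_CSUM  1
#define VIRTIO_NET_F_GUEST_TSO4  7
#define VIRTIO_NET_F_GUEST_TSO6  8
#define VIRTIO_NET_F_GUEST_UFO   10
#define VIRTIO_NET_F_HOST_TSO4   11
#define VIRTIO_NET_F_HOST_TSO6   12
#define VIRTIO_NET_F_HOST_UFO    14
#define VIRTIO_NET_F_STATUS      16
#define VIRTIO_NET_S_LINK_UP     1

/* Returns 1 if the offload bits in f have their checksum prerequisite. */
static int net_features_consistent(uint32_t f)
{
    uint32_t host = BIT(VIRTIO_NET_F_HOST_TSO4) | BIT(VIRTIO_NET_F_HOST_TSO6) |
                    BIT(VIRTIO_NET_F_HOST_UFO);
    uint32_t guest = BIT(VIRTIO_NET_F_GUEST_TSO4) | BIT(VIRTIO_NET_F_GUEST_TSO6) |
                     BIT(VIRTIO_NET_F_GUEST_UFO);

    if ((f & host) && !(f & BIT(VIRTIO_NET_F_CSUM)))
        return 0;
    if ((f & guest) && !(f & BIT(VIRTIO_NET_F_GUEST_CSUM)))
        return 0;
    return 1;
}

/* Without VIRTIO_NET_F_STATUS the link is assumed active. */
static int link_is_up(uint32_t negotiated, uint16_t status)
{
    if (!(negotiated & BIT(VIRTIO_NET_F_STATUS)))
        return 1;
    return (status & VIRTIO_NET_S_LINK_UP) != 0;
}
```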
1387 | |||
1388 | Device Operation | ||
1389 | |||
1390 | Packets are transmitted by placing them in the transmitq, and | ||
1391 | buffers for incoming packets are placed in the receiveq. In each | ||
1392 | case, the packet itself is preceded by a header: | ||
1393 | |||
1394 | struct virtio_net_hdr { | ||
1395 | |||
1396 | #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 | ||
1397 | |||
1398 | u8 flags; | ||
1399 | |||
1400 | #define VIRTIO_NET_HDR_GSO_NONE 0 | ||
1401 | |||
1402 | #define VIRTIO_NET_HDR_GSO_TCPV4 1 | ||
1403 | |||
1404 | #define VIRTIO_NET_HDR_GSO_UDP 3 | ||
1405 | |||
1406 | #define VIRTIO_NET_HDR_GSO_TCPV6 4 | ||
1407 | |||
1408 | #define VIRTIO_NET_HDR_GSO_ECN 0x80 | ||
1409 | |||
1410 | u8 gso_type; | ||
1411 | |||
1412 | u16 hdr_len; | ||
1413 | |||
1414 | u16 gso_size; | ||
1415 | |||
1416 | u16 csum_start; | ||
1417 | |||
1418 | u16 csum_offset; | ||
1419 | |||
1420 | /* Only if VIRTIO_NET_F_MRG_RXBUF: */ | ||
1421 | |||
1422 | u16 num_buffers; | ||
1423 | |||
1424 | }; | ||
1425 | |||
1426 | The controlq is used to control device features such as | ||
1427 | filtering. | ||
1428 | |||
1429 | Packet Transmission | ||
1430 | |||
1431 | Transmitting a single packet is simple, but varies depending on | ||
1432 | the different features the driver negotiated. | ||
1433 | |||
1434 | If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has | ||
1435 | not been fully checksummed, then the virtio_net_hdr's fields | ||
1436 | are set as follows. Otherwise, the packet must be fully | ||
1437 | checksummed, and flags is zero. | ||
1438 | |||
1439 | flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set, | ||
1440 | |||
1441 | <ite:csum_start-is-set>csum_start is set to the offset within | ||
1442 | the packet to begin checksumming, and | ||
1443 | |||
1444 | csum_offset indicates how many bytes after the csum_start the | ||
1445 | new (16 bit ones' complement) checksum should be placed.[footnote: | ||
1446 | For example, consider a partially checksummed TCP (IPv4) packet. | ||
1447 | It will have a 14 byte ethernet header and 20 byte IP header | ||
1448 | followed by the TCP header (with the TCP checksum field 16 bytes | ||
1449 | into that header). csum_start will be 14+20 = 34 (the TCP | ||
1450 | checksum includes the header), and csum_offset will be 16. The | ||
1451 | value in the TCP checksum field should be initialized to the sum | ||
1452 | of the TCP pseudo header, so that replacing it by the ones' | ||
1453 | complement checksum of the TCP header and body will give the | ||
1454 | correct result. | ||
1455 | ] | ||
1456 | |||
1457 | <enu:If-the-driver>If the driver negotiated | ||
1458 | VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires | ||
1459 | TCP segmentation or UDP fragmentation, then the “gso_type” | ||
1460 | field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP. | ||
1461 | (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this | ||
1462 | case, packets larger than 1514 bytes can be transmitted: the | ||
1463 | metadata indicates how to replicate the packet header to cut it | ||
1464 | into smaller packets. The other gso fields are set: | ||
1465 | |||
1466 | hdr_len is a hint to the device as to how much of the header | ||
1467 | needs to be kept to copy into each packet, usually set to the | ||
1468 | length of the headers, including the transport header.[footnote: | ||
1469 | Due to various bugs in implementations, this field is not useful | ||
1470 | as a guarantee of the transport header size. | ||
1471 | ] | ||
1472 | |||
1473 | gso_size is the maximum size of each packet beyond that header | ||
1474 | (ie. MSS). | ||
1475 | |||
1476 | If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the | ||
1477 | VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well, | ||
1478 | indicating that the TCP packet has the ECN bit set.[footnote: | ||
1479 | This case is not handled by some older hardware, so is called out | ||
1480 | specifically in the protocol. | ||
1481 | ] | ||
1482 | |||
1483 | If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, | ||
1484 | the num_buffers field is set to zero. | ||
1485 | |||
1486 | The header and packet are added as one output buffer to the | ||
1487 | transmitq, and the device is notified of the new entry (see [sub:Notifying-The-Device] | ||
1488 | ).[footnote: | ||
1489 | Note that the header will be two bytes longer for the | ||
1490 | VIRTIO_NET_F_MRG_RXBUF case. | ||
1491 | ] | ||
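
The transmission steps above can be illustrated for the TCP over IPv4 case used in the checksum footnote. This sketch assumes VIRTIO_NET_F_CSUM and VIRTIO_NET_F_HOST_TSO4 were negotiated, VIRTIO_NET_F_MRG_RXBUF was not, and the packet carries a 14 byte ethernet header and a 20 byte IP header. The function name fill_tcp4_hdr and its parameters are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
#define VIRTIO_NET_HDR_GSO_NONE     0
#define VIRTIO_NET_HDR_GSO_TCPV4    1

/* Header layout without the num_buffers field (10 bytes). */
struct virtio_net_hdr {
    uint8_t flags;
    uint8_t gso_type;
    uint16_t hdr_len;
    uint16_t gso_size;
    uint16_t csum_start;
    uint16_t csum_offset;
};

/* Fill the header for a partially checksummed TCP/IPv4 packet.
 * mss == 0 means the packet needs no segmentation. */
static void fill_tcp4_hdr(struct virtio_net_hdr *h, uint16_t tcp_hdr_len,
                          uint16_t mss)
{
    memset(h, 0, sizeof(*h));
    h->flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
    h->csum_start = 14 + 20;  /* checksum covers the TCP header onwards */
    h->csum_offset = 16;      /* TCP checksum field within that header */
    h->gso_type = VIRTIO_NET_HDR_GSO_NONE;
    if (mss) {
        h->gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
        h->hdr_len = (uint16_t)(14 + 20 + tcp_hdr_len);  /* a hint only */
        h->gso_size = mss;
    }
}
```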
1492 | |||
1493 | Packet Transmission Interrupt | ||
1494 | |||
1495 | Often a driver will suppress transmission interrupts using the | ||
1496 | VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers] | ||
1497 | ) and check for used packets in the transmit path of following | ||
1498 | packets. However, it will still receive interrupts if the | ||
1499 | VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that | ||
1500 | the transmission queue is completely emptied. | ||
1501 | |||
1502 | The normal behavior in this interrupt handler is to retrieve any | ||
1503 | new descriptors from the used ring and free the corresponding | ||
1504 | headers and packets. | ||
1505 | |||
1506 | Setting Up Receive Buffers | ||
1507 | |||
1508 | It is generally a good idea to keep the receive virtqueue as | ||
1509 | fully populated as possible: if it runs out, network performance | ||
1510 | will suffer. | ||
1511 | |||
1512 | If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or | ||
1513 | VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to | ||
1514 | accept packets of up to 65550 bytes long (the maximum size of a | ||
1515 | TCP or UDP packet, plus the 14 byte ethernet header), otherwise | ||
1516 | 1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every | ||
1517 | buffer in the receive queue needs to be at least this length [footnote: | ||
1518 | Obviously each one can be split across multiple descriptor | ||
1519 | elements. | ||
1520 | ]. | ||
1521 | |||
1522 | If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at | ||
1523 | least the size of the struct virtio_net_hdr. | ||
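The sizing rules above can be sketched as a small helper. This is only an illustration: the function and constant names are hypothetical, the feature bit numbers are the ones given in the network device feature list, and the 12-byte header size assumes the 10-byte struct virtio_net_hdr plus the 2-byte num_buffers field. The sketch returns the figures from the text unchanged and does not add the header length on top of the packet sizes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Feature bit numbers from the network device feature list. */
#define VIRTIO_NET_F_GUEST_TSO4 7
#define VIRTIO_NET_F_GUEST_TSO6 8
#define VIRTIO_NET_F_GUEST_UFO  10
#define VIRTIO_NET_F_MRG_RXBUF  15

/* Hypothetical names for the sizes quoted in the text. */
#define RX_MAX_PLAIN_PACKET 1514u  /* no guest offloads negotiated */
#define RX_MAX_GSO_PACKET   65550u /* GUEST_TSO4, GUEST_TSO6 or GUEST_UFO */
#define RX_HDR_LEN_MRG      12u    /* assumed: 10-byte header + num_buffers */

static int has_feature(uint32_t features, unsigned bit)
{
    return (int)((features >> bit) & 1u);
}

/* Minimum length of each receive buffer for the negotiated features. */
size_t min_rx_buffer_len(uint32_t features)
{
    if (has_feature(features, VIRTIO_NET_F_MRG_RXBUF))
        return RX_HDR_LEN_MRG;
    if (has_feature(features, VIRTIO_NET_F_GUEST_TSO4) ||
        has_feature(features, VIRTIO_NET_F_GUEST_TSO6) ||
        has_feature(features, VIRTIO_NET_F_GUEST_UFO))
        return RX_MAX_GSO_PACKET;
    return RX_MAX_PLAIN_PACKET;
}
```

A driver would typically allocate buffers larger than the mergeable minimum, since that minimum only guarantees room for the header.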
1524 | |||
1525 | Packet Receive Interrupt | ||
1526 | |||
1527 | When a packet is copied into a buffer in the receiveq, the | ||
1528 | optimal path is to disable further interrupts for the receiveq | ||
1529 | (see [sub:Receiving-Used-Buffers]) and process packets until no | ||
1530 | more are found, then re-enable them. | ||
1531 | |||
1532 | Processing a packet involves: | ||
1533 | |||
1534 | If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature, | ||
1535 | then the “num_buffers” field indicates how many descriptors | ||
1536 | this packet is spread over (including this one). This allows | ||
1537 | receipt of large packets without having to allocate large | ||
1538 | buffers. In this case, there will be at least “num_buffers” in | ||
1539 | the used ring, and they should be chained together to form a | ||
1540 | single packet. The other buffers will not begin with a struct | ||
1541 | virtio_net_hdr. | ||
1542 | |||
1543 | If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or | ||
1544 | the “num_buffers” field is one, then the entire packet will be | ||
1545 | contained within this buffer, immediately following the struct | ||
1546 | virtio_net_hdr. | ||
1547 | |||
1548 | If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the | ||
1549 | VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be | ||
1550 | set: if so, the checksum on the packet is incomplete and the “ | ||
1551 | csum_start” and “csum_offset” fields indicate how to calculate | ||
1552 | it (see [ite:csum_start-is-set]). | ||
1553 | |||
1554 | If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were | ||
1555 | negotiated, then the “gso_type” may be something other than | ||
1556 | VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the | ||
1557 | desired MSS (see [enu:If-the-driver]). | ||
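As a rough sketch of the merged-buffer case described above, the following helper gathers a packet that is spread over num_buffers used buffers. The struct rx_buf type and the function name are hypothetical; the only facts taken from the text are that the first buffer begins with the header and the remaining buffers do not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical view of one used receive buffer. */
struct rx_buf {
    const uint8_t *data; /* start of the buffer */
    size_t len;          /* bytes written by the device */
};

/*
 * Copy the packet carried by num_buffers used buffers into out.
 * Only bufs[0] starts with a header of hdr_len bytes.
 * Returns the packet length, or 0 if the input is malformed or
 * the packet does not fit in out_cap bytes.
 */
size_t rx_merge(const struct rx_buf *bufs, size_t num_buffers,
                size_t hdr_len, uint8_t *out, size_t out_cap)
{
    size_t total = 0;

    if (num_buffers == 0 || bufs[0].len < hdr_len)
        return 0;

    for (size_t i = 0; i < num_buffers; i++) {
        size_t skip = (i == 0) ? hdr_len : 0;
        size_t n = bufs[i].len - skip;

        if (n > out_cap - total)
            return 0;
        memcpy(out + total, bufs[i].data + skip, n);
        total += n;
    }
    return total;
}
```

A real driver would more likely chain the buffers than copy them; the copy is used here only to keep the example short.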
1558 | |||
1559 | Control Virtqueue | ||
1560 | |||
1561 | The driver uses the control virtqueue (if VIRTIO_NET_F_CTRL_VQ is | ||
1562 | negotiated) to send commands to manipulate various features of | ||
1563 | the device which would not easily map into the configuration | ||
1564 | space. | ||
1565 | |||
1566 | All commands are of the following form: | ||
1567 | |||
1568 | struct virtio_net_ctrl { | ||
1569 | |||
1570 | u8 class; | ||
1571 | |||
1572 | u8 command; | ||
1573 | |||
1574 | u8 command-specific-data[]; | ||
1575 | |||
1576 | u8 ack; | ||
1577 | |||
1578 | }; | ||
1579 | |||
1580 | |||
1581 | |||
1582 | /* ack values */ | ||
1583 | |||
1584 | #define VIRTIO_NET_OK 0 | ||
1585 | |||
1586 | #define VIRTIO_NET_ERR 1 | ||
1587 | |||
1588 | The class, command and command-specific-data are set by the | ||
1589 | driver, and the device sets the ack byte. There is little the | ||
1590 | driver can do except issue a diagnostic if the ack byte is not | ||
1591 | VIRTIO_NET_OK. | ||
1592 | |||
1593 | Packet Receive Filtering | ||
1594 | |||
1595 | If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can | ||
1596 | send control commands for promiscuous mode, multicast receiving, | ||
1597 | and filtering of MAC addresses. | ||
1598 | |||
1599 | Note that in general, these commands are best-effort: unwanted | ||
1600 | packets may still arrive. | ||
1601 | |||
1602 | Setting Promiscuous Mode | ||
1603 | |||
1604 | #define VIRTIO_NET_CTRL_RX 0 | ||
1605 | |||
1606 | #define VIRTIO_NET_CTRL_RX_PROMISC 0 | ||
1607 | |||
1608 | #define VIRTIO_NET_CTRL_RX_ALLMULTI 1 | ||
1609 | |||
1610 | The class VIRTIO_NET_CTRL_RX has two commands: | ||
1611 | VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and | ||
1612 | VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and | ||
1613 | off. The command-specific-data is one byte containing 0 (off) or | ||
1614 | 1 (on). | ||
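A minimal sketch of how a driver might lay out the driver-written part of such a command, assuming the three bytes (class, command, one data byte) are sent in one read-only buffer and the ack byte is returned in a separate write-only buffer. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_NET_CTRL_RX          0
#define VIRTIO_NET_CTRL_RX_PROMISC  0
#define VIRTIO_NET_CTRL_RX_ALLMULTI 1

#define VIRTIO_NET_OK  0
#define VIRTIO_NET_ERR 1

/* Driver-written bytes of a receive-mode command (hypothetical type). */
struct ctrl_rx_cmd {
    uint8_t cls;     /* the "class" field of struct virtio_net_ctrl */
    uint8_t command; /* VIRTIO_NET_CTRL_RX_PROMISC or _ALLMULTI */
    uint8_t on;      /* command-specific-data: 0 (off) or 1 (on) */
};

/* Build a command that switches the given receive mode on or off. */
struct ctrl_rx_cmd ctrl_rx_make(uint8_t command, int on)
{
    struct ctrl_rx_cmd c = { VIRTIO_NET_CTRL_RX, command, on ? 1 : 0 };
    return c;
}

/* Interpret the ack byte written by the device. */
int ctrl_ack_ok(uint8_t ack)
{
    return ack == VIRTIO_NET_OK;
}
```

For example, ctrl_rx_make(VIRTIO_NET_CTRL_RX_ALLMULTI, 1) yields the bytes 0, 1, 1.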
1615 | |||
1616 | Setting MAC Address Filtering | ||
1617 | |||
1618 | struct virtio_net_ctrl_mac { | ||
1619 | |||
1620 | u32 entries; | ||
1621 | |||
1622 | u8 macs[entries][ETH_ALEN]; | ||
1623 | |||
1624 | }; | ||
1625 | |||
1626 | |||
1627 | |||
1628 | #define VIRTIO_NET_CTRL_MAC 1 | ||
1629 | |||
1630 | #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0 | ||
1631 | |||
1632 | The device can filter incoming packets by any number of | ||
1633 | destination MAC addresses.[footnote: | ||
1634 | Since there are no guarantees, it can use a hash filter | ||
1635 | or silently switch to allmulti or promiscuous mode if it is given | ||
1636 | too many addresses. | ||
1637 | ] This table is set using the class VIRTIO_NET_CTRL_MAC and the | ||
1638 | command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data | ||
1639 | is two variable length tables of 6-byte MAC addresses. The first | ||
1640 | table contains unicast addresses, and the second contains | ||
1641 | multicast addresses. | ||
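The command-specific-data for VIRTIO_NET_CTRL_MAC_TABLE_SET might be serialized as in the sketch below: the unicast table followed by the multicast table, each a u32 entry count followed by that many 6-byte addresses. The function names are hypothetical, and writing the count in the guest's native byte order is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/*
 * Append one table at buf + off: a u32 count, then entries * ETH_ALEN
 * address bytes taken from macs. Returns the new offset, or 0 if the
 * table does not fit.
 */
static size_t put_table(uint8_t *buf, size_t cap, size_t off,
                        const uint8_t *macs, uint32_t entries)
{
    size_t need = sizeof entries + (size_t)entries * ETH_ALEN;

    if (off > cap || need > cap - off)
        return 0;
    memcpy(buf + off, &entries, sizeof entries); /* native byte order (assumed) */
    if (entries)
        memcpy(buf + off + sizeof entries, macs, (size_t)entries * ETH_ALEN);
    return off + need;
}

/* Pack the unicast table followed by the multicast table. */
size_t mac_table_pack(uint8_t *buf, size_t cap,
                      const uint8_t *uni, uint32_t n_uni,
                      const uint8_t *multi, uint32_t n_multi)
{
    size_t off = put_table(buf, cap, 0, uni, n_uni);

    if (off == 0)
        return 0;
    return put_table(buf, cap, off, multi, n_multi);
}
```

With one unicast and two multicast addresses the packed data is 4 + 6 + 4 + 12 = 26 bytes long.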
1642 | |||
1643 | VLAN Filtering | ||
1644 | |||
1645 | If the driver negotiates the VIRTIO_NET_F_CTRL_VLAN feature, it | ||
1646 | can control a VLAN filter table in the device. | ||
1647 | |||
1648 | #define VIRTIO_NET_CTRL_VLAN 2 | ||
1649 | |||
1650 | #define VIRTIO_NET_CTRL_VLAN_ADD 0 | ||
1651 | |||
1652 | #define VIRTIO_NET_CTRL_VLAN_DEL 1 | ||
1653 | |||
1654 | Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL | ||
1655 | commands take a 16-bit VLAN id as the command-specific-data. | ||
1656 | |||
1657 | Gratuitous Packet Sending | ||
1658 | |||
1659 | If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE feature | ||
1660 | (which depends on VIRTIO_NET_F_CTRL_VQ), the device can ask the | ||
1661 | guest to send gratuitous packets; this is usually done after the | ||
1662 | guest has been physically migrated, and needs to announce its | ||
1663 | presence on the new network links. (As the hypervisor does not | ||
1664 | know the guest network configuration (eg. tagged vlan), it is | ||
1665 | simplest to prod the guest in this way.) | ||
1666 | |||
1667 | #define VIRTIO_NET_CTRL_ANNOUNCE 3 | ||
1668 | |||
1669 | #define VIRTIO_NET_CTRL_ANNOUNCE_ACK 0 | ||
1670 | |||
1671 | The Guest needs to check the VIRTIO_NET_S_ANNOUNCE bit in the status | ||
1672 | field when it notices a change of device configuration. The | ||
1673 | command VIRTIO_NET_CTRL_ANNOUNCE_ACK is used to indicate that the | ||
1674 | driver has received the notification; the device clears the | ||
1675 | VIRTIO_NET_S_ANNOUNCE bit in the status field after it receives | ||
1676 | this command. | ||
1677 | |||
1678 | Processing this notification involves: | ||
1679 | |||
1680 | Sending the gratuitous packets, or marking that there are pending | ||
1681 | gratuitous packets and letting a deferred routine send | ||
1682 | them. | ||
1683 | |||
1684 | Sending the VIRTIO_NET_CTRL_ANNOUNCE_ACK command through the control | ||
1685 | vq. | ||
1686 | |||
1688 | |||
1689 | Appendix D: Block Device | ||
1690 | |||
1691 | The virtio block device is a simple virtual block device (ie. | ||
1692 | disk). Read and write requests (and other exotic requests) are | ||
1693 | placed in the queue, and serviced (probably out of order) by the | ||
1694 | device except where noted. | ||
1695 | |||
1696 | Configuration | ||
1697 | |||
1698 | Subsystem Device ID 2 | ||
1699 | |||
1700 | Virtqueues 0:requestq. | ||
1701 | |||
1702 | Feature bits | ||
1703 | |||
1704 | VIRTIO_BLK_F_BARRIER (0) Host supports request barriers. | ||
1705 | |||
1706 | VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is | ||
1707 | in “size_max”. | ||
1708 | |||
1709 | VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a | ||
1710 | request is in “seg_max”. | ||
1711 | |||
1712 | VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in “ | ||
1713 | geometry”. | ||
1714 | |||
1715 | VIRTIO_BLK_F_RO (5) Device is read-only. | ||
1716 | |||
1717 | VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”. | ||
1718 | |||
1719 | VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands. | ||
1720 | |||
1721 | VIRTIO_BLK_F_FLUSH (9) Cache flush command support. | ||
1722 | |||
1723 | Device configuration layout The capacity of the device | ||
1724 | (expressed in 512-byte sectors) is always present. The | ||
1725 | availability of the others depends on various feature bits | ||
1726 | as indicated above. | ||
1727 | struct virtio_blk_config { | ||
1728 | u64 capacity; | ||
1729 | |||
1730 | u32 size_max; | ||
1731 | |||
1732 | u32 seg_max; | ||
1733 | |||
1734 | struct virtio_blk_geometry { | ||
1735 | |||
1736 | u16 cylinders; | ||
1737 | |||
1738 | u8 heads; | ||
1739 | |||
1740 | u8 sectors; | ||
1741 | |||
1742 | } geometry; | ||
1743 | |||
1744 | u32 blk_size; | ||
1745 | |||
1746 | |||
1747 | |||
1748 | }; | ||
1749 | |||
1750 | Device Initialization | ||
1751 | |||
1752 | The device size should be read from the “capacity” | ||
1753 | configuration field. No requests should be submitted which go | ||
1754 | beyond this limit. | ||
1755 | |||
1756 | If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the | ||
1757 | blk_size field can be read to determine the optimal sector size | ||
1758 | for the driver to use. This does not affect the units used in | ||
1759 | the protocol (always 512 bytes), but awareness of the correct | ||
1760 | value can affect performance. | ||
1761 | |||
1762 | If the VIRTIO_BLK_F_RO feature is set by the device, any write | ||
1763 | requests will fail. | ||
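The initialization checks above could be expressed roughly as follows. The helper names are hypothetical; capacity and sector are both in 512-byte units, as stated in the text, and the range check is written so that it cannot overflow.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_BLK_F_RO 5

/*
 * Return nonzero if a request covering nsectors sectors starting at
 * sector stays within a device of the given capacity (all in
 * 512-byte sectors).
 */
int blk_request_in_range(uint64_t capacity, uint64_t sector, uint64_t nsectors)
{
    return sector <= capacity && nsectors <= capacity - sector;
}

/* Return nonzero if writes can succeed, i.e. the device is not read-only. */
int blk_write_allowed(uint32_t features)
{
    return !((features >> VIRTIO_BLK_F_RO) & 1u);
}
```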
1764 | |||
1765 | Device Operation | ||
1766 | |||
1767 | The driver queues requests to the virtqueue, and they are used by | ||
1768 | the device (not necessarily in order). Each request is of form: | ||
1769 | |||
1770 | struct virtio_blk_req { | ||
1771 | |||
1772 | |||
1773 | |||
1774 | u32 type; | ||
1775 | |||
1776 | u32 ioprio; | ||
1777 | |||
1778 | u64 sector; | ||
1779 | |||
1780 | char data[][512]; | ||
1781 | |||
1782 | u8 status; | ||
1783 | |||
1784 | }; | ||
1785 | |||
1786 | If the device has the VIRTIO_BLK_F_SCSI feature, it can also support | ||
1787 | scsi packet command requests; each of these requests is of form: | ||
1788 | struct virtio_scsi_pc_req { | ||
1789 | u32 type; | ||
1790 | |||
1791 | u32 ioprio; | ||
1792 | |||
1793 | u64 sector; | ||
1794 | |||
1795 | char cmd[]; | ||
1796 | |||
1797 | char data[][512]; | ||
1798 | |||
1799 | #define SCSI_SENSE_BUFFERSIZE 96 | ||
1800 | |||
1801 | u8 sense[SCSI_SENSE_BUFFERSIZE]; | ||
1802 | |||
1803 | u32 errors; | ||
1804 | |||
1805 | u32 data_len; | ||
1806 | |||
1807 | u32 sense_len; | ||
1808 | |||
1809 | u32 residual; | ||
1810 | |||
1811 | u8 status; | ||
1812 | |||
1813 | }; | ||
1814 | |||
1815 | The type of the request is either a read (VIRTIO_BLK_T_IN), a | ||
1816 | write (VIRTIO_BLK_T_OUT), a scsi packet command | ||
1817 | (VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote: | ||
1818 | the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device | ||
1819 | does not distinguish between them | ||
1820 | ]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote: | ||
1821 | the FLUSH and FLUSH_OUT types are equivalent, the device does not | ||
1822 | distinguish between them | ||
1823 | ]). If the device has the VIRTIO_BLK_F_BARRIER feature, the high bit | ||
1824 | (VIRTIO_BLK_T_BARRIER) indicates that this request acts as a | ||
1825 | barrier and that all preceding requests must be complete before | ||
1826 | this one, and all following requests must not be started until | ||
1827 | this is complete. Note that a barrier does not flush caches in | ||
1828 | the underlying backend device in the host, and thus does not serve as | ||
1829 | a data consistency guarantee. The driver must use a FLUSH request to | ||
1830 | flush the host cache. | ||
1831 | |||
1832 | #define VIRTIO_BLK_T_IN 0 | ||
1833 | |||
1834 | #define VIRTIO_BLK_T_OUT 1 | ||
1835 | |||
1836 | #define VIRTIO_BLK_T_SCSI_CMD 2 | ||
1837 | |||
1838 | #define VIRTIO_BLK_T_SCSI_CMD_OUT 3 | ||
1839 | |||
1840 | #define VIRTIO_BLK_T_FLUSH 4 | ||
1841 | |||
1842 | #define VIRTIO_BLK_T_FLUSH_OUT 5 | ||
1843 | |||
1844 | #define VIRTIO_BLK_T_BARRIER 0x80000000 | ||
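Because the barrier flag shares the type field with the request kind, a device-side decoder has to mask it off before comparing. A small sketch, with hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define VIRTIO_BLK_T_IN           0
#define VIRTIO_BLK_T_OUT          1
#define VIRTIO_BLK_T_SCSI_CMD     2
#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
#define VIRTIO_BLK_T_FLUSH        4
#define VIRTIO_BLK_T_FLUSH_OUT    5
#define VIRTIO_BLK_T_BARRIER      0x80000000u

/* Nonzero if the request must act as a barrier. */
int blk_req_is_barrier(uint32_t type)
{
    return (type & VIRTIO_BLK_T_BARRIER) != 0;
}

/* The request kind with the barrier bit removed. */
uint32_t blk_req_base_type(uint32_t type)
{
    return type & ~VIRTIO_BLK_T_BARRIER;
}

/* Human-readable kind; the paired types are treated as equivalent. */
const char *blk_req_kind(uint32_t type)
{
    switch (blk_req_base_type(type)) {
    case VIRTIO_BLK_T_IN:
        return "read";
    case VIRTIO_BLK_T_OUT:
        return "write";
    case VIRTIO_BLK_T_SCSI_CMD:
    case VIRTIO_BLK_T_SCSI_CMD_OUT:
        return "scsi";
    case VIRTIO_BLK_T_FLUSH:
    case VIRTIO_BLK_T_FLUSH_OUT:
        return "flush";
    default:
        return "unknown";
    }
}
```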
1845 | |||
1846 | The ioprio field is a hint about the relative priorities of | ||
1847 | requests to the device: higher numbers indicate more important | ||
1848 | requests. | ||
1849 | |||
1850 | The sector number indicates the offset (multiplied by 512) where | ||
1851 | the read or write is to occur. This field is unused and set to 0 | ||
1852 | for scsi packet commands and for flush commands. | ||
1853 | |||
1854 | The cmd field is only present for scsi packet command requests, | ||
1855 | and indicates the command to perform. This field must reside in a | ||
1856 | single, separate read-only buffer; command length can be derived | ||
1857 | from the length of this buffer. | ||
1858 | |||
1859 | Note that these first three (four for scsi packet commands) | ||
1860 | fields are always read-only: the data field is either read-only | ||
1861 | or write-only, depending on the request. The size of the read or | ||
1862 | write can be derived from the total size of the request buffers. | ||
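For a plain read or write request (struct virtio_blk_req), the size derivation mentioned above amounts to subtracting the 16 bytes of type, ioprio and sector and the 1-byte status from the total. The sketch below uses hypothetical names and assumes the data must be a whole number of 512-byte sectors, as the data[][512] declaration suggests.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLK_REQ_HDR_LEN    16u  /* u32 type + u32 ioprio + u64 sector */
#define BLK_REQ_STATUS_LEN 1u   /* u8 status */
#define BLK_SECTOR_LEN     512u

/*
 * Given the total length of all buffers of a read/write request,
 * return the number of data sectors, or -1 if the length is malformed.
 */
int64_t blk_req_data_sectors(size_t total_len)
{
    size_t overhead = BLK_REQ_HDR_LEN + BLK_REQ_STATUS_LEN;
    size_t data;

    if (total_len < overhead)
        return -1;
    data = total_len - overhead;
    if (data % BLK_SECTOR_LEN != 0)
        return -1;
    return (int64_t)(data / BLK_SECTOR_LEN);
}
```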
1863 | |||
1864 | The sense field is only present for scsi packet command requests, | ||
1865 | and indicates the buffer for scsi sense data. | ||
1866 | |||
1867 | The data_len field is only present for scsi packet command | ||
1868 | requests; this field is deprecated, and should be ignored by the | ||
1869 | driver. Historically, devices copied data length there. | ||
1870 | |||
1871 | The sense_len field is only present for scsi packet command | ||
1872 | requests and indicates the number of bytes actually written to | ||
1873 | the sense buffer. | ||
1874 | |||
1875 | The residual field is only present for scsi packet command | ||
1876 | requests and indicates the residual size, calculated as data | ||
1877 | length - number of bytes actually transferred. | ||
1878 | |||
1879 | The final status byte is written by the device: either | ||
1880 | VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest | ||
1881 | error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host: | ||
1882 | #define VIRTIO_BLK_S_OK 0 | ||
1883 | #define VIRTIO_BLK_S_IOERR 1 | ||
1884 | |||
1885 | #define VIRTIO_BLK_S_UNSUPP 2 | ||
1886 | |||
1887 | Historically, devices assumed that the fields type, ioprio and | ||
1888 | sector reside in a single, separate read-only buffer; the sense | ||
1889 | field in a separate write-only buffer of size 96 bytes, by | ||
1890 | itself; the fields errors, data_len, sense_len and residual in a | ||
1891 | single write-only buffer; and the status field in a separate | ||
1892 | write-only buffer of size 1 byte, by itself. | ||
1895 | |||
1896 | Appendix E: Console Device | ||
1897 | |||
1898 | The virtio console device is a simple device for data input and | ||
1899 | output. A device may have one or more ports. Each port has a pair | ||
1900 | of input and output virtqueues. Moreover, a device has a pair of | ||
1901 | control IO virtqueues. The control virtqueues are used to | ||
1902 | communicate information between the device and the driver about | ||
1903 | ports being opened and closed on either side of the connection, | ||
1904 | indication from the host about whether a particular port is a | ||
1905 | console port, adding new ports, port hot-plug/unplug, etc., and | ||
1906 | indication from the guest about whether a port or a device was | ||
1907 | successfully added, port open/close, etc. For data IO, one or | ||
1908 | more empty buffers are placed in the receive queue for incoming | ||
1909 | data and outgoing characters are placed in the transmit queue. | ||
1910 | |||
1911 | Configuration | ||
1912 | |||
1913 | Subsystem Device ID 3 | ||
1914 | |||
1915 | Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control | ||
1916 | receiveq[footnote: | ||
1917 | Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set | ||
1918 | ], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1), | ||
1919 | ... | ||
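The queue numbering above can be captured in two small helpers. The function names are hypothetical, and the formula for ports beyond port1 assumes the listed pattern simply continues (port2 on queues 6 and 7, and so on).

```c
#include <assert.h>
#include <stdint.h>

#define CONSOLE_CTRL_RX_QUEUE 2u /* control receiveq */
#define CONSOLE_CTRL_TX_QUEUE 3u /* control transmitq */

/* Virtqueue index of the receiveq for a port. */
uint32_t console_rx_queue(uint32_t port)
{
    /* Port 0 uses queues 0/1; later ports follow the control queues. */
    return port == 0 ? 0 : 2 * (port + 1);
}

/* Virtqueue index of the transmitq for a port. */
uint32_t console_tx_queue(uint32_t port)
{
    return console_rx_queue(port) + 1;
}
```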
1920 | |||
1921 | Feature bits | ||
1922 | |||
1923 | VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields | ||
1924 | are valid. | ||
1925 | |||
1926 | VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple | ||
1927 | ports; configuration fields nr_ports and max_nr_ports are | ||
1928 | valid and control virtqueues will be used. | ||
1929 | |||
1930 | Device configuration layout The size of the console is supplied | ||
1931 | in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature | ||
1932 | is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature | ||
1933 | is set, the maximum number of ports supported by the device can | ||
1934 | be fetched. | ||
1935 | struct virtio_console_config { | ||
1936 | u16 cols; | ||
1937 | |||
1938 | u16 rows; | ||
1939 | |||
1940 | |||
1941 | |||
1942 | u32 max_nr_ports; | ||
1943 | |||
1944 | }; | ||
1945 | |||
1946 | Device Initialization | ||
1947 | |||
1948 | If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver | ||
1949 | can read the console dimensions from the configuration fields. | ||
1950 | |||
1951 | If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the | ||
1952 | driver can spawn multiple ports, not all of which may be | ||
1953 | attached to a console. Some could be generic ports. In this | ||
1954 | case, the control virtqueues are enabled and according to the | ||
1955 | max_nr_ports configuration-space value, the appropriate number | ||
1956 | of virtqueues are created. A control message indicating the | ||
1957 | driver is ready is sent to the host. The host can then send | ||
1958 | control messages for adding new ports to the device. After | ||
1959 | creating and initializing each port, a | ||
1960 | VIRTIO_CONSOLE_PORT_READY control message is sent to the host | ||
1961 | for that port so the host can let us know of any additional | ||
1962 | configuration options set for that port. | ||
1963 | |||
1964 | The receiveq for each port is populated with one or more | ||
1965 | receive buffers. | ||
1966 | |||
1967 | Device Operation | ||
1968 | |||
1969 | For output, a buffer containing the characters is placed in the | ||
1970 | port's transmitq.[footnote: | ||
1971 | Because this is high importance and low bandwidth, the current | ||
1972 | Linux implementation polls for the buffer to be used, rather than | ||
1973 | waiting for an interrupt, simplifying the implementation | ||
1974 | significantly. However, for generic serial ports with the | ||
1975 | O_NONBLOCK flag set, the polling limitation is relaxed and the | ||
1976 | consumed buffers are freed upon the next write or poll call or | ||
1977 | when a port is closed or hot-unplugged. | ||
1978 | ] | ||
1979 | |||
1980 | When a buffer is used in the receiveq (signalled by an | ||
1981 | interrupt), its contents are the input to the port associated | ||
1982 | with the virtqueue for which the notification was received. | ||
1983 | |||
1984 | If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a | ||
1985 | configuration change interrupt may occur. The updated size can | ||
1986 | be read from the configuration fields. | ||
1987 | |||
1988 | If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT | ||
1989 | feature, active ports are announced by the host using the | ||
1990 | VIRTIO_CONSOLE_PORT_ADD control message. The same message is | ||
1991 | used for port hot-plug as well. | ||
1992 | |||
1993 | If the host specified a port `name', a sysfs attribute is | ||
1994 | created with the name filled in, so that udev rules can be | ||
1995 | written that can create a symlink from the port's name to the | ||
1996 | char device for port discovery by applications in the guest. | ||
1997 | |||
1998 | Changes to ports' state are effected by control messages. | ||
1999 | Appropriate action is taken on the port indicated in the | ||
2000 | control message. The layout of the structure of the control | ||
2001 | buffer and the events associated are: | ||
2002 | struct virtio_console_control { | ||
2003 | uint32_t id; /* Port number */ | ||
2004 | |||
2005 | uint16_t event; /* The kind of control event */ | ||
2006 | |||
2007 | uint16_t value; /* Extra information for the event */ | ||
2008 | |||
2009 | }; | ||
2010 | |||
2011 | |||
2012 | |||
2013 | /* Some events for the internal messages (control packets) */ | ||
2014 | |||
2015 | |||
2016 | |||
2017 | #define VIRTIO_CONSOLE_DEVICE_READY 0 | ||
2018 | |||
2019 | #define VIRTIO_CONSOLE_PORT_ADD 1 | ||
2020 | |||
2021 | #define VIRTIO_CONSOLE_PORT_REMOVE 2 | ||
2022 | |||
2023 | #define VIRTIO_CONSOLE_PORT_READY 3 | ||
2024 | |||
2025 | #define VIRTIO_CONSOLE_CONSOLE_PORT 4 | ||
2026 | |||
2027 | #define VIRTIO_CONSOLE_RESIZE 5 | ||
2028 | |||
2029 | #define VIRTIO_CONSOLE_PORT_OPEN 6 | ||
2030 | |||
2031 | #define VIRTIO_CONSOLE_PORT_NAME 7 | ||
2032 | |||
2033 | Appendix F: Entropy Device | ||
2034 | |||
2035 | The virtio entropy device supplies high-quality randomness for | ||
2036 | guest use. | ||
2037 | |||
2038 | Configuration | ||
2039 | |||
2040 | Subsystem Device ID 4 | ||
2041 | |||
2042 | Virtqueues 0:requestq. | ||
2043 | |||
2044 | Feature bits None currently defined | ||
2045 | |||
2046 | Device configuration layout None currently defined. | ||
2047 | |||
2048 | Device Initialization | ||
2049 | |||
2050 | The virtqueue is initialized. | ||
2051 | |||
2052 | Device Operation | ||
2053 | |||
2054 | When the driver requires random bytes, it places the descriptor | ||
2055 | of one or more buffers in the queue. The buffers will be completely | ||
2056 | filled with random data by the device. | ||
2057 | |||
2058 | Appendix G: Memory Balloon Device | ||
2059 | |||
2060 | The virtio memory balloon device is a primitive device for | ||
2061 | managing guest memory: the device asks for a certain amount of | ||
2062 | memory, and the guest supplies it (or withdraws it, if the device | ||
2063 | has more than it asks for). This allows the guest to adapt to | ||
2064 | changes in allowance of underlying physical memory. If the | ||
2065 | feature is negotiated, the device can also be used to communicate | ||
2066 | guest memory statistics to the host. | ||
2067 | |||
2068 | Configuration | ||
2069 | |||
2070 | Subsystem Device ID 5 | ||
2071 | |||
2072 | Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[footnote: | ||
2073 | Only if VIRTIO_BALLOON_F_STATS_VQ set | ||
2074 | ] | ||
2075 | |||
2076 | Feature bits | ||
2077 | |||
2078 | VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before | ||
2079 | pages from the balloon are used. | ||
2080 | |||
2081 | VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest | ||
2082 | memory statistics is present. | ||
2083 | |||
2084 | Device configuration layout Both fields of this configuration | ||
2085 | are always available. Note that they are little endian, despite | ||
2086 | convention that device fields are guest endian: | ||
2087 | struct virtio_balloon_config { | ||
2088 | u32 num_pages; | ||
2089 | |||
2090 | u32 actual; | ||
2091 | |||
2092 | }; | ||
2093 | |||
2094 | Device Initialization | ||
2095 | |||
2096 | The inflate and deflate virtqueues are identified. | ||
2097 | |||
2098 | If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated: | ||
2099 | |||
2100 | Identify the stats virtqueue. | ||
2101 | |||
2102 | Add one empty buffer to the stats virtqueue and notify the | ||
2103 | host. | ||
2104 | |||
2105 | Device operation begins immediately. | ||
2106 | |||
2107 | Device Operation | ||
2108 | |||
2109 | Memory Ballooning The device is driven by the receipt of a | ||
2110 | configuration change interrupt. | ||
2111 | |||
2112 | The “num_pages” configuration field is examined. If this is | ||
2113 | greater than the “actual” number of pages, memory must be given | ||
2114 | to the balloon. If it is less than the “actual” number of | ||
2115 | pages, memory may be taken back from the balloon for general | ||
2116 | use. | ||
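The comparison above can be sketched as follows. Both helper names are hypothetical; the explicit little-endian decode reflects the note in the configuration layout that these two fields are little endian.

```c
#include <assert.h>
#include <stdint.h>

/* Decode a little-endian u32 as read byte-wise from config space. */
uint32_t le32_decode(const uint8_t b[4])
{
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/*
 * Positive result: that many pages must be given to the balloon.
 * Negative result: that many pages may be taken back.
 * Zero: the balloon is already at its target size.
 */
int64_t balloon_page_delta(uint32_t num_pages, uint32_t actual)
{
    return (int64_t)num_pages - (int64_t)actual;
}
```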
2117 | |||
2118 | To supply memory to the balloon (aka. inflate): | ||
2119 | |||
2120 | The driver constructs an array of addresses of unused memory | ||
2121 | pages. These addresses are divided by 4096[footnote: | ||
2122 | This is historical, and independent of the guest page size | ||
2123 | ] and the descriptor describing the resulting 32-bit array is | ||
2124 | added to the inflateq. | ||
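A sketch of building the 32-bit array for the inflateq (the same layout is used for the deflateq). The function name is hypothetical; skipping addresses that are not 4096-byte aligned, or whose page number does not fit in 32 bits, is a choice made for this example rather than something the text specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BALLOON_PFN_SHIFT 12 /* addresses are divided by 4096 */

/*
 * Convert n page addresses into 32-bit page frame numbers.
 * Returns the number of entries written to pfns (at most n).
 */
size_t balloon_fill_pfns(const uint64_t *addrs, size_t n, uint32_t *pfns)
{
    size_t out = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t pfn = addrs[i] >> BALLOON_PFN_SHIFT;

        if ((addrs[i] & 4095u) != 0 || pfn > UINT32_MAX)
            continue; /* not representable: skipped in this sketch */
        pfns[out++] = (uint32_t)pfn;
    }
    return out;
}
```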
2125 | |||
2126 | To remove memory from the balloon (aka. deflate): | ||
2127 | |||
2128 | The driver constructs an array of addresses of memory pages it | ||
2129 | has previously given to the balloon, as described above. This | ||
2130 | descriptor is added to the deflateq. | ||
2131 | |||
2132 | If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the | ||
2133 | guest may not use these requested pages until that descriptor | ||
2134 | in the deflateq has been used by the device. | ||
2135 | |||
2136 | Otherwise, the guest may begin to re-use pages previously given | ||
2137 | to the balloon before the device has acknowledged their | ||
2138 | withdrawal. [footnote: | ||
2139 | In this case, deflation advice is merely a courtesy | ||
2140 | ] | ||
2141 | |||
2142 | In either case, once the device has completed the inflation or | ||
2143 | deflation, the “actual” field of the configuration should be | ||
2144 | updated to reflect the new number of pages in the balloon.[footnote: | ||
2145 | As updates to configuration space are not atomic, this field | ||
2146 | isn't particularly reliable, but can be used to diagnose buggy | ||
2147 | guests. | ||
2148 | ] | ||
2149 | |||
2150 | Memory Statistics | ||
2151 | |||
2152 | The stats virtqueue is atypical because communication is driven | ||
2153 | by the device (not the driver). The channel becomes active at | ||
2154 | driver initialization time when the driver adds an empty buffer | ||
2155 | and notifies the device. A request for memory statistics proceeds | ||
2156 | as follows: | ||
2157 | |||
2158 | The device pushes the buffer onto the used ring and sends an | ||
2159 | interrupt. | ||
2160 | |||
2161 | The driver pops the used buffer and discards it. | ||
2162 | |||
2163 | The driver collects memory statistics and writes them into a | ||
2164 | new buffer. | ||
2165 | |||
2166 | The driver adds the buffer to the virtqueue and notifies the | ||
2167 | device. | ||
2168 | |||
2169 | The device pops the buffer (retaining it to initiate a | ||
2170 | subsequent request) and consumes the statistics. | ||
2171 | |||
2172 | Memory Statistics Format Each statistic consists of a 16 bit | ||
2173 | tag and a 64 bit value. Both quantities are represented in the | ||
2174 | native endian of the guest. All statistics are optional and the | ||
2175 | driver may choose which ones to supply. To guarantee backwards | ||
2176 | compatibility, unsupported statistics should be omitted. | ||
2177 | |||
2178 | struct virtio_balloon_stat { | ||
2179 | |||
2180 | #define VIRTIO_BALLOON_S_SWAP_IN 0 | ||
2181 | |||
2182 | #define VIRTIO_BALLOON_S_SWAP_OUT 1 | ||
2183 | |||
2184 | #define VIRTIO_BALLOON_S_MAJFLT 2 | ||
2185 | |||
2186 | #define VIRTIO_BALLOON_S_MINFLT 3 | ||
2187 | |||
2188 | #define VIRTIO_BALLOON_S_MEMFREE 4 | ||
2189 | |||
2190 | #define VIRTIO_BALLOON_S_MEMTOT 5 | ||
2191 | |||
2192 | u16 tag; | ||
2193 | |||
2194 | u64 val; | ||
2195 | |||
2196 | } __attribute__((packed)); | ||
2197 | |||
2198 | Tags | ||
2199 | |||
2200 | VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been | ||
2201 | swapped in (in bytes). | ||
2202 | |||
2203 | VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been | ||
2204 | swapped out to disk (in bytes). | ||
2205 | |||
2206 | VIRTIO_BALLOON_S_MAJFLT The number of major page faults that | ||
2207 | have occurred. | ||
2208 | |||
2209 | VIRTIO_BALLOON_S_MINFLT The number of minor page faults that | ||
2210 | have occurred. | ||
2211 | |||
2212 | VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used | ||
2213 | for any purpose (in bytes). | ||
2214 | |||
2215 | VIRTIO_BALLOON_S_MEMTOT The total amount of memory available | ||
2216 | (in bytes). | ||
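Filling a statistics buffer might look like the sketch below. The struct matches the packed layout given above (10 bytes per entry); the helper name is hypothetical, and the packed attribute assumes a GCC-compatible compiler. Values are stored in the guest's native byte order, as the text requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VIRTIO_BALLOON_S_SWAP_IN  0
#define VIRTIO_BALLOON_S_SWAP_OUT 1
#define VIRTIO_BALLOON_S_MAJFLT   2
#define VIRTIO_BALLOON_S_MINFLT   3
#define VIRTIO_BALLOON_S_MEMFREE  4
#define VIRTIO_BALLOON_S_MEMTOT   5

struct virtio_balloon_stat {
    uint16_t tag;
    uint64_t val;
} __attribute__((packed));

/*
 * Append one statistic to buf, which holds n entries out of cap.
 * Returns the new entry count; the entry is dropped if buf is full.
 * Statistics the guest cannot supply are simply never added.
 */
size_t balloon_stat_add(struct virtio_balloon_stat *buf, size_t cap,
                        size_t n, uint16_t tag, uint64_t val)
{
    if (n >= cap)
        return n;
    buf[n].tag = tag;
    buf[n].val = val;
    return n + 1;
}
```

The length of the buffer handed to the device would then be the entry count multiplied by sizeof(struct virtio_balloon_stat).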
2217 | |||
2218 | Appendix H: Rpmsg: Remote Processor Messaging | ||
2219 | |||
2220 | Virtio rpmsg devices represent remote processors on the system | ||
2221 | which run in asymmetric multi-processing (AMP) configuration, and | ||
2222 | which are usually used to offload cpu-intensive tasks from the | ||
2223 | main application processor (a typical SoC methodology). | ||
2224 | |||
2225 | Virtio is being used to communicate with those remote processors; | ||
2226 | empty buffers are placed in one virtqueue for receiving messages, | ||
2227 | and non-empty buffers, containing outbound messages, are enqueued | ||
2228 | in a second virtqueue for transmission. | ||
2229 | |||
2230 | Numerous communication channels can be multiplexed over those two | ||
2231 | virtqueues, so different entities, running on the application and | ||
2232 | remote processor, can directly communicate in a point-to-point | ||
2233 | fashion. | ||
2234 | |||
2235 | Configuration | ||
2236 | |||
2237 | Subsystem Device ID 7 | ||
2238 | |||
2239 | Virtqueues 0:receiveq. 1:transmitq. | ||
2240 | |||
2241 | Feature bits | ||
2242 | |||
2243 | VIRTIO_RPMSG_F_NS (0) Device sends (and is capable of receiving) | ||
2244 | name service messages announcing the creation (or | ||
2245 | destruction) of a channel: | ||
2246 | /** | ||
2247 | * struct rpmsg_ns_msg - dynamic name service announcement | ||
2248 | message | ||
2249 | |||
2250 | * @name: name of remote service that is published | ||
2251 | |||
2252 | * @addr: address of remote service that is published | ||
2253 | |||
2254 | * @flags: indicates whether service is created or destroyed | ||
2255 | |||
2256 | * | ||
2257 | |||
2258 | * This message is sent across to publish a new service (or | ||
2259 | announce | ||
2260 | |||
2261 | * about its removal). When we receive these messages, an | ||
2262 | appropriate | ||
2263 | |||
2264 | * rpmsg channel (i.e device) is created/destroyed. | ||
2265 | |||
2266 | */ | ||
2267 | |||
2268 | struct rpmsg_ns_msg { | ||
2269 | |||
2270 | char name[RPMSG_NAME_SIZE]; | ||
2271 | |||
2272 | u32 addr; | ||
2273 | |||
2274 | u32 flags; | ||
2275 | |||
2276 | } __packed; | ||
2277 | |||
2278 | |||
2279 | |||
2280 | /** | ||
2281 | |||
2282 | * enum rpmsg_ns_flags - dynamic name service announcement flags | ||
2283 | |||
2284 | * | ||
2285 | |||
2286 | * @RPMSG_NS_CREATE: a new remote service was just created | ||
2287 | |||
2288 | * @RPMSG_NS_DESTROY: a remote service was just destroyed | ||
2289 | |||
2290 | */ | ||
2291 | |||
2292 | enum rpmsg_ns_flags { | ||
2293 | |||
2294 | RPMSG_NS_CREATE = 0, | ||
2295 | |||
2296 | RPMSG_NS_DESTROY = 1, | ||
2297 | |||
2298 | }; | ||
2299 | |||
2300 | Device configuration layout | ||
2301 | |||
2302 | At this point none currently defined. | ||
2303 | |||
2304 | Device Initialization | ||
2305 | |||
2306 | The initialization routine should identify the receive and | ||
2307 | transmission virtqueues. | ||
2308 | |||
2309 | The receive virtqueue should be filled with receive buffers. | ||
2310 | |||
2311 | Device Operation | ||
2312 | |||
2313 | Messages are transmitted by placing them in the transmitq, and | ||
2314 | buffers for inbound messages are placed in the receiveq. In any | ||
2315 | case, messages are always preceded by the following header: | ||
2316 | /** | ||
2317 | * struct rpmsg_hdr - common header for all rpmsg messages | ||
2318 | |||
2319 | * @src: source address | ||
2320 | |||
2321 | * @dst: destination address | ||
2322 | |||
2323 | * @reserved: reserved for future use | ||
2324 | |||
2325 | * @len: length of payload (in bytes) | ||
2326 | |||
2327 | * @flags: message flags | ||
2328 | |||
2329 | * @data: @len bytes of message payload data | ||
2330 | |||
2331 | * | ||
2332 | |||
2333 | * Every message sent(/received) on the rpmsg bus begins with | ||
2334 | this header. | ||
2335 | |||
2336 | */ | ||
2337 | |||
2338 | struct rpmsg_hdr { | ||
2339 | |||
2340 | u32 src; | ||
2341 | |||
2342 | u32 dst; | ||
2343 | |||
2344 | u32 reserved; | ||
2345 | |||
2346 | u16 len; | ||
2347 | |||
2348 | u16 flags; | ||
2349 | |||
2350 | u8 data[0]; | ||
2351 | |||
2352 | } __packed; | ||
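As an illustration of the header layout above, the following is a minimal sketch of building one message in a buffer destined for the transmitq. The helper name rpmsg_build, the use of native byte order and the GCC/Clang packed attribute are assumptions of this sketch, not part of the specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same layout as struct rpmsg_hdr above, spelled with fixed-width types. */
struct rpmsg_hdr {
	uint32_t src;
	uint32_t dst;
	uint32_t reserved;
	uint16_t len;
	uint16_t flags;
	uint8_t data[];
} __attribute__((packed));

/*
 * rpmsg_build() is a hypothetical helper: it writes a header followed by
 * len payload bytes into buf. It returns the number of bytes used, or 0
 * if the message does not fit. Fields are stored in native byte order.
 */
static size_t rpmsg_build(void *buf, size_t buf_size, uint32_t src,
			  uint32_t dst, const void *payload, uint16_t len)
{
	struct rpmsg_hdr *hdr = buf;

	assert(buf != NULL && payload != NULL);
	if (buf_size < sizeof(*hdr) + len)
		return 0;

	hdr->src = src;
	hdr->dst = dst;
	hdr->reserved = 0;
	hdr->len = len;
	hdr->flags = 0;
	memcpy(hdr->data, payload, len);
	return sizeof(*hdr) + len;
}
```

The header occupies 16 bytes, so a two-byte payload yields an 18-byte message. A receive buffer placed in the receiveq would need room for the same header plus the largest expected payload.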
2353 | |||
2354 | Appendix I: SCSI Host Device | ||
2355 | |||
2356 | The virtio SCSI host device groups together one or more virtual | ||
2357 | logical units (such as disks), and allows communicating to them | ||
2358 | using the SCSI protocol. An instance of the device represents a | ||
2359 | SCSI host to which many targets and LUNs are attached. | ||
2360 | |||
2361 | The virtio SCSI device services two kinds of requests: | ||
2362 | |||
2363 | command requests for a logical unit; | ||
2364 | |||
2365 | task management functions related to a logical unit, target or | ||
2366 | command. | ||
2367 | |||
2368 | The device is also able to send out notifications about added and | ||
2369 | removed logical units. Together, these capabilities provide a | ||
2370 | SCSI transport protocol that uses virtqueues as the transfer | ||
2371 | medium. In the transport protocol, the virtio driver acts as the | ||
2372 | initiator, while the virtio SCSI host provides one or more | ||
2373 | targets that receive and process the requests. | ||
2374 | |||
2375 | Configuration | ||
2376 | |||
2377 | Subsystem Device ID 8 | ||
2378 | |||
2379 | Virtqueues 0:controlq; 1:eventq; 2..n:request queues. | ||
2380 | |||
2381 | Feature bits | ||
2382 | |||
2383 | VIRTIO_SCSI_F_INOUT (0) A single request can include both | ||
2384 | read-only and write-only data buffers. | ||
2385 | |||
2386 | VIRTIO_SCSI_F_HOTPLUG (1) The host should enable | ||
2387 | hot-plug/hot-unplug of new LUNs and targets on the SCSI bus. | ||
2388 | |||
2389 | Device configuration layout All fields of this configuration | ||
2390 | are always available. sense_size and cdb_size are writable by | ||
2391 | the guest. | ||
2392 | struct virtio_scsi_config { | ||
2393 | u32 num_queues; | ||
2394 | |||
2395 | u32 seg_max; | ||
2396 | |||
2397 | u32 max_sectors; | ||
2398 | |||
2399 | u32 cmd_per_lun; | ||
2400 | |||
2401 | u32 event_info_size; | ||
2402 | |||
2403 | u32 sense_size; | ||
2404 | |||
2405 | u32 cdb_size; | ||
2406 | |||
2407 | u16 max_channel; | ||
2408 | |||
2409 | u16 max_target; | ||
2410 | |||
2411 | u32 max_lun; | ||
2412 | |||
2413 | }; | ||
2414 | |||
2415 | num_queues is the total number of request virtqueues exposed by | ||
2416 | the device. The driver is free to use only one request queue, | ||
2417 | or it can use more to achieve better performance. | ||
2418 | |||
2419 | seg_max is the maximum number of segments that can be in a | ||
2420 | command. A bidirectional command can include seg_max input | ||
2421 | segments and seg_max output segments. | ||
2422 | |||
2423 | max_sectors is a hint to the guest about the maximum transfer | ||
2424 | size it should use. | ||
2425 | |||
2426 | cmd_per_lun is a hint to the guest about the maximum number of | ||
2427 | linked commands it should send to one LUN. The actual value | ||
2428 | to be used is the minimum of cmd_per_lun and the virtqueue | ||
2429 | size. | ||
2430 | |||
2431 | event_info_size is the maximum size that the device will fill | ||
2432 | for buffers that the driver places in the eventq. The driver | ||
2433 | should always put buffers of at least this size. It is | ||
2434 | written by the device depending on the set of negotiated | ||
2435 | features. | ||
2436 | |||
2437 | sense_size is the maximum size of the sense data that the | ||
2438 | device will write. The default value is written by the device | ||
2439 | and will always be 96, but the driver can modify it. It is | ||
2440 | restored to the default when the device is reset. | ||
2441 | |||
2442 | cdb_size is the maximum size of the CDB that the driver will | ||
2443 | write. The default value is written by the device and will | ||
2444 | always be 32, but the driver can likewise modify it. It is | ||
2445 | restored to the default when the device is reset. | ||
2446 | |||
2447 | max_channel, max_target and max_lun can be used by the driver | ||
2448 | as hints to constrain scanning the logical units on the | ||
2449 | host. | ||
2450 | |||
2451 | Device Initialization | ||
2452 | |||
2453 | The initialization routine should first of all discover the | ||
2454 | device's virtqueues. | ||
2455 | |||
2456 | If the driver uses the eventq, it should then place at least one | ||
2457 | buffer in the eventq. | ||
2458 | |||
2459 | The driver can immediately issue requests (for example, INQUIRY | ||
2460 | or REPORT LUNS) or task management functions (for example, I_T | ||
2461 | RESET). | ||
2462 | |||
2463 | Device Operation: request queues | ||
2464 | |||
2465 | The driver queues requests to an arbitrary request queue, and | ||
2466 | they are used by the device on that same queue. It is the | ||
2467 | responsibility of the driver to ensure strict request ordering | ||
2468 | for commands placed on different queues, because they will be | ||
2469 | consumed with no order constraints. | ||
2470 | |||
2471 | Requests have the following format: | ||
2472 | |||
2473 | struct virtio_scsi_req_cmd { | ||
2474 | |||
2475 | // Read-only part | ||
2476 | |||
2477 | u8 lun[8]; | ||
2478 | |||
2479 | u64 id; | ||
2480 | |||
2481 | u8 task_attr; | ||
2482 | |||
2483 | u8 prio; | ||
2484 | |||
2485 | u8 crn; | ||
2486 | |||
2487 | char cdb[cdb_size]; | ||
2488 | |||
2489 | char dataout[]; | ||
2490 | |||
2491 | // Write-only part | ||
2492 | |||
2493 | u32 sense_len; | ||
2494 | |||
2495 | u32 residual; | ||
2496 | |||
2497 | u16 status_qualifier; | ||
2498 | |||
2499 | u8 status; | ||
2500 | |||
2501 | u8 response; | ||
2502 | |||
2503 | u8 sense[sense_size]; | ||
2504 | |||
2505 | char datain[]; | ||
2506 | |||
2507 | }; | ||
2508 | |||
2509 | |||
2510 | |||
2511 | /* command-specific response values */ | ||
2512 | |||
2513 | #define VIRTIO_SCSI_S_OK 0 | ||
2514 | |||
2515 | #define VIRTIO_SCSI_S_OVERRUN 1 | ||
2516 | |||
2517 | #define VIRTIO_SCSI_S_ABORTED 2 | ||
2518 | |||
2519 | #define VIRTIO_SCSI_S_BAD_TARGET 3 | ||
2520 | |||
2521 | #define VIRTIO_SCSI_S_RESET 4 | ||
2522 | |||
2523 | #define VIRTIO_SCSI_S_BUSY 5 | ||
2524 | |||
2525 | #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6 | ||
2526 | |||
2527 | #define VIRTIO_SCSI_S_TARGET_FAILURE 7 | ||
2528 | |||
2529 | #define VIRTIO_SCSI_S_NEXUS_FAILURE 8 | ||
2530 | |||
2531 | #define VIRTIO_SCSI_S_FAILURE 9 | ||
2532 | |||
2533 | |||
2534 | |||
2535 | /* task_attr */ | ||
2536 | |||
2537 | #define VIRTIO_SCSI_S_SIMPLE 0 | ||
2538 | |||
2539 | #define VIRTIO_SCSI_S_ORDERED 1 | ||
2540 | |||
2541 | #define VIRTIO_SCSI_S_HEAD 2 | ||
2542 | |||
2543 | #define VIRTIO_SCSI_S_ACA 3 | ||
2544 | |||
2545 | The lun field addresses a target and logical unit in the | ||
2546 | virtio-scsi device's SCSI domain. The only supported format for | ||
2547 | the LUN field is: first byte set to 1, second byte set to target, | ||
2548 | third and fourth byte representing a single level LUN structure, | ||
2549 | followed by four zero bytes. With this representation, a | ||
2550 | virtio-scsi device can serve up to 256 targets and 16384 LUNs per | ||
2551 | target. | ||
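The LUN format described above can be sketched as a small encoder. The function name virtio_scsi_encode_lun is hypothetical, and the 0x40 marker placed in the third byte (the SAM flat-space addressing method, which leaves 14 bits and so matches the 16384 LUN limit) is an assumption of this sketch; the text above does not spell out how the single level LUN structure is encoded.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill the 8-byte lun field for a given target and single level LUN.
 * Byte 0 is 1, byte 1 is the target, bytes 2-3 hold the LUN and the
 * remaining four bytes are zero.
 */
static void virtio_scsi_encode_lun(uint8_t out[8], uint8_t target,
				   uint16_t lun)
{
	assert(lun < 16384);	/* only 14 bits are available */

	memset(out, 0, 8);
	out[0] = 1;
	out[1] = target;
	out[2] = (uint8_t)((lun >> 8) | 0x40);	/* assumed flat-space marker */
	out[3] = (uint8_t)(lun & 0xff);
}
```

The same encoding would be used for the lun field of task management requests on the controlq.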
2552 | |||
2553 | The id field is the command identifier (“tag”). | ||
2554 | |||
2555 | task_attr, prio and crn should be left at zero. task_attr defines | ||
2556 | the task attribute as in the table above, but all task attributes | ||
2557 | may be mapped to SIMPLE by the device; crn may also be provided | ||
2558 | by clients, but is generally expected to be 0. The maximum CRN | ||
2559 | value defined by the protocol is 255, since CRN is stored in an | ||
2560 | 8-bit integer. | ||
2561 | |||
2562 | All of these fields are defined in SAM. They are always | ||
2563 | read-only, as are the cdb and dataout fields. The cdb_size is | ||
2564 | taken from the configuration space. | ||
2565 | |||
2566 | sense and subsequent fields are always write-only. The sense_len | ||
2567 | field indicates the number of bytes actually written to the sense | ||
2568 | buffer. The residual field indicates the residual size, | ||
2569 | calculated as “data_length - number_of_transferred_bytes”, for | ||
2570 | read or write operations. For bidirectional commands, the | ||
2571 | number_of_transferred_bytes includes both read and written bytes. | ||
2572 | A residual field that is less than the size of datain means that | ||
2573 | the dataout field was processed entirely. A residual field that | ||
2574 | exceeds the size of datain means that the dataout field was | ||
2575 | processed partially and the datain field was not processed at | ||
2576 | all. | ||
2577 | |||
2578 | The status byte is written by the device to be the status code as | ||
2579 | defined in SAM. | ||
2580 | |||
2581 | The response byte is written by the device to be one of the | ||
2582 | following: | ||
2583 | |||
2584 | VIRTIO_SCSI_S_OK when the request was completed and the status | ||
2585 | byte is filled with a SCSI status code (not necessarily | ||
2586 | "GOOD"). | ||
2587 | |||
2588 | VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires | ||
2589 | transferring more data than is available in the data buffers. | ||
2590 | |||
2591 | VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an | ||
2592 | ABORT TASK or ABORT TASK SET task management function. | ||
2593 | |||
2594 | VIRTIO_SCSI_S_BAD_TARGET if the request was never processed | ||
2595 | because the target indicated by the lun field does not exist. | ||
2596 | |||
2597 | VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus | ||
2598 | or device reset (including a task management function). | ||
2599 | |||
2600 | VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed due to a | ||
2601 | problem in the connection between the host and the target | ||
2602 | (severed link). | ||
2603 | |||
2604 | VIRTIO_SCSI_S_TARGET_FAILURE if the target is suffering a | ||
2605 | failure and the guest should not retry on other paths. | ||
2606 | |||
2607 | VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is suffering a failure | ||
2608 | but retrying on other paths might yield a different result. | ||
2609 | |||
2610 | VIRTIO_SCSI_S_BUSY if the request failed but retrying on the | ||
2611 | same path should work. | ||
2612 | |||
2613 | VIRTIO_SCSI_S_FAILURE for other host or guest errors. In | ||
2614 | particular, if neither dataout nor datain is empty, and the | ||
2615 | VIRTIO_SCSI_F_INOUT feature has not been negotiated, the | ||
2616 | request will be immediately returned with a response equal to | ||
2617 | VIRTIO_SCSI_S_FAILURE. | ||
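One way a driver might act on the response byte is sketched below. The enum, the function name classify_response and the retry policy itself are hypothetical; only the response values and the retry hints for BUSY, NEXUS_FAILURE and TARGET_FAILURE come from the list above. All other non-OK responses are simply treated as failures here, which is a simplification.

```c
#include <stdint.h>

#define VIRTIO_SCSI_S_OK		0
#define VIRTIO_SCSI_S_BUSY		5
#define VIRTIO_SCSI_S_TARGET_FAILURE	7
#define VIRTIO_SCSI_S_NEXUS_FAILURE	8

/* Hypothetical driver-side outcomes, not defined by the specification. */
enum req_action {
	REQ_CHECK_STATUS,	/* completed: inspect the SCSI status byte */
	REQ_RETRY_SAME_PATH,
	REQ_RETRY_OTHER_PATH,
	REQ_FAIL,
};

static enum req_action classify_response(uint8_t response)
{
	switch (response) {
	case VIRTIO_SCSI_S_OK:
		return REQ_CHECK_STATUS;
	case VIRTIO_SCSI_S_BUSY:
		return REQ_RETRY_SAME_PATH;
	case VIRTIO_SCSI_S_NEXUS_FAILURE:
		return REQ_RETRY_OTHER_PATH;
	case VIRTIO_SCSI_S_TARGET_FAILURE:
	default:
		/* includes overrun, aborted, reset and generic failure */
		return REQ_FAIL;
	}
}
```

Note that REQ_CHECK_STATUS does not mean success: the status byte may still carry a SCSI status other than "GOOD".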
2618 | |||
2619 | Device Operation: controlq | ||
2620 | |||
2621 | The controlq is used for other SCSI transport operations. | ||
2622 | Requests have the following format: | ||
2623 | |||
2624 | struct virtio_scsi_ctrl { | ||
2625 | |||
2626 | u32 type; | ||
2627 | |||
2628 | ... | ||
2629 | |||
2630 | u8 response; | ||
2631 | |||
2632 | }; | ||
2633 | |||
2634 | |||
2635 | |||
2636 | /* response values valid for all commands */ | ||
2637 | |||
2638 | #define VIRTIO_SCSI_S_OK 0 | ||
2639 | |||
2640 | #define VIRTIO_SCSI_S_BAD_TARGET 3 | ||
2641 | |||
2642 | #define VIRTIO_SCSI_S_BUSY 5 | ||
2643 | |||
2644 | #define VIRTIO_SCSI_S_TRANSPORT_FAILURE 6 | ||
2645 | |||
2646 | #define VIRTIO_SCSI_S_TARGET_FAILURE 7 | ||
2647 | |||
2648 | #define VIRTIO_SCSI_S_NEXUS_FAILURE 8 | ||
2649 | |||
2650 | #define VIRTIO_SCSI_S_FAILURE 9 | ||
2651 | |||
2652 | #define VIRTIO_SCSI_S_INCORRECT_LUN 12 | ||
2653 | |||
2654 | The type identifies the remaining fields. | ||
2655 | |||
2656 | The following commands are defined: | ||
2657 | |||
2658 | Task management function | ||
2659 | #define VIRTIO_SCSI_T_TMF 0 | ||
2660 | |||
2661 | |||
2662 | |||
2663 | #define VIRTIO_SCSI_T_TMF_ABORT_TASK 0 | ||
2664 | |||
2665 | #define VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1 | ||
2666 | |||
2667 | #define VIRTIO_SCSI_T_TMF_CLEAR_ACA 2 | ||
2668 | |||
2669 | #define VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3 | ||
2670 | |||
2671 | #define VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4 | ||
2672 | |||
2673 | #define VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5 | ||
2674 | |||
2675 | #define VIRTIO_SCSI_T_TMF_QUERY_TASK 6 | ||
2676 | |||
2677 | #define VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7 | ||
2678 | |||
2679 | |||
2680 | |||
2681 | struct virtio_scsi_ctrl_tmf | ||
2682 | |||
2683 | { | ||
2684 | |||
2685 | // Read-only part | ||
2686 | |||
2687 | u32 type; | ||
2688 | |||
2689 | u32 subtype; | ||
2690 | |||
2691 | u8 lun[8]; | ||
2692 | |||
2693 | u64 id; | ||
2694 | |||
2695 | // Write-only part | ||
2696 | |||
2697 | u8 response; | ||
2698 | |||
2699 | }; | ||
2700 | |||
2701 | |||
2702 | |||
2703 | /* command-specific response values */ | ||
2704 | |||
2705 | #define VIRTIO_SCSI_S_FUNCTION_COMPLETE 0 | ||
2706 | |||
2707 | #define VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10 | ||
2708 | |||
2709 | #define VIRTIO_SCSI_S_FUNCTION_REJECTED 11 | ||
2710 | |||
2711 | The type is VIRTIO_SCSI_T_TMF. All fields except response are | ||
2712 | filled by the driver. The subtype field must always be | ||
2713 | specified and identifies the requested task management | ||
2714 | function. | ||
2715 | |||
2716 | Other fields may be irrelevant for the requested TMF; if so, | ||
2717 | they are ignored but they should still be present. The lun | ||
2718 | field is in the same format specified for request queues; the | ||
2719 | single level LUN is ignored when the task management function | ||
2720 | addresses a whole I_T nexus. When relevant, the value of the id | ||
2721 | field is matched against the id values passed on the requestq. | ||
2722 | |||
2723 | The outcome of the task management function is written by the | ||
2724 | device in the response field. The command-specific response | ||
2725 | values map 1-to-1 with those defined in SAM. | ||
2726 | |||
2727 | Asynchronous notification query | ||
2728 | #define VIRTIO_SCSI_T_AN_QUERY 1 | ||
2729 | |||
2730 | |||
2731 | |||
2732 | struct virtio_scsi_ctrl_an { | ||
2733 | |||
2734 | // Read-only part | ||
2735 | |||
2736 | u32 type; | ||
2737 | |||
2738 | u8 lun[8]; | ||
2739 | |||
2740 | u32 event_requested; | ||
2741 | |||
2742 | // Write-only part | ||
2743 | |||
2744 | u32 event_actual; | ||
2745 | |||
2746 | u8 response; | ||
2747 | |||
2748 | }; | ||
2749 | |||
2750 | |||
2751 | |||
2752 | #define VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2 | ||
2753 | |||
2754 | #define VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4 | ||
2755 | |||
2756 | #define VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8 | ||
2757 | |||
2758 | #define VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16 | ||
2759 | |||
2760 | #define VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32 | ||
2761 | |||
2762 | #define VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64 | ||
2763 | |||
2764 | By sending this command, the driver asks the device which | ||
2765 | events the given LUN can report, as described in paragraphs 6.6 | ||
2766 | and A.6 of the SCSI MMC specification. The driver writes the | ||
2767 | events it is interested in into event_requested; the device | ||
2768 | responds by writing the events that it supports into | ||
2769 | event_actual. | ||
2770 | |||
2771 | The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested | ||
2772 | fields are written by the driver. The event_actual and response | ||
2773 | fields are written by the device. | ||
2774 | |||
2775 | No command-specific values are defined for the response byte. | ||
2776 | |||
2777 | Asynchronous notification subscription | ||
2778 | #define VIRTIO_SCSI_T_AN_SUBSCRIBE 2 | ||
2779 | |||
2780 | |||
2781 | |||
2782 | struct virtio_scsi_ctrl_an { | ||
2783 | |||
2784 | // Read-only part | ||
2785 | |||
2786 | u32 type; | ||
2787 | |||
2788 | u8 lun[8]; | ||
2789 | |||
2790 | u32 event_requested; | ||
2791 | |||
2792 | // Write-only part | ||
2793 | |||
2794 | u32 event_actual; | ||
2795 | |||
2796 | u8 response; | ||
2797 | |||
2798 | }; | ||
2799 | |||
2800 | By sending this command, the driver asks the specified LUN to | ||
2801 | report events for its physical interface, again as described in | ||
2802 | the SCSI MMC specification. The driver writes the events it is | ||
2803 | interested in into event_requested; the device responds by | ||
2804 | writing the events that it supports into event_actual. | ||
2805 | |||
2806 | Event types are the same as for the asynchronous notification | ||
2807 | query message. | ||
2808 | |||
2809 | The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and | ||
2810 | event_requested fields are written by the driver. The | ||
2811 | event_actual and response fields are written by the device. | ||
2812 | |||
2813 | No command-specific values are defined for the response byte. | ||
2814 | |||
2815 | Device Operation: eventq | ||
2816 | |||
2817 | The eventq is used by the device to report information on logical | ||
2818 | units that are attached to it. The driver should always leave a | ||
2819 | few buffers ready in the eventq. In general, the device will not | ||
2820 | queue events to cope with an empty eventq, and will end up | ||
2821 | dropping events if it finds no buffer ready. However, when | ||
2822 | reporting events for many LUNs (e.g. when a whole target | ||
2823 | disappears), the device can throttle events to avoid dropping | ||
2824 | them. For this reason, placing 10-15 buffers on the event queue | ||
2825 | should be enough. | ||
2826 | |||
2827 | Buffers are placed in the eventq and filled by the device when | ||
2828 | interesting events occur. The buffers should be strictly | ||
2829 | write-only (device-filled) and the size of the buffers should be | ||
2830 | at least the value given in the device's configuration | ||
2831 | information. | ||
2832 | |||
2833 | Buffers returned by the device on the eventq will be referred to | ||
2834 | as "events" in the rest of this section. Events have the | ||
2835 | following format: | ||
2836 | |||
2837 | #define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000 | ||
2838 | |||
2839 | |||
2840 | |||
2841 | struct virtio_scsi_event { | ||
2842 | |||
2843 | // Write-only part | ||
2844 | |||
2845 | u32 event; | ||
2846 | |||
2847 | ... | ||
2848 | |||
2849 | }; | ||
2850 | |||
2851 | If bit 31 is set in the event field, the device failed to report | ||
2852 | an event due to missing buffers. In this case, the driver should | ||
2853 | poll the logical units for unit attention conditions, and/or do | ||
2854 | whatever form of bus scan is appropriate for the guest operating | ||
2855 | system. | ||
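Splitting the event field into its type and the "events missed" flag can be sketched as follows. The struct and function names are hypothetical; only the flag value comes from the definition above.

```c
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000u

/* Hypothetical decoded form of the 32-bit event field. */
struct decoded_event {
	uint32_t type;	/* event type with bit 31 masked off */
	bool missed;	/* true if the device dropped at least one event */
};

static struct decoded_event decode_event(uint32_t event)
{
	struct decoded_event d;

	d.missed = (event & VIRTIO_SCSI_T_EVENTS_MISSED) != 0;
	d.type = event & ~VIRTIO_SCSI_T_EVENTS_MISSED;
	return d;
}
```

A driver that sees missed set would handle the event type as usual and then additionally poll or rescan as described above.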
2856 | |||
2857 | Other data that the device writes to the buffer depends on the | ||
2858 | contents of the event field. The following events are defined: | ||
2859 | |||
2860 | No event | ||
2861 | #define VIRTIO_SCSI_T_NO_EVENT 0 | ||
2862 | |||
2863 | This event is fired in the following cases: | ||
2864 | |||
2865 | When the device detects in the eventq a buffer that is shorter | ||
2866 | than what is indicated in the configuration field, it might | ||
2867 | use it immediately and put this dummy value in the event | ||
2868 | field. A well-written driver will never observe this | ||
2869 | situation. | ||
2870 | |||
2871 | When events are dropped, the device may signal this event as | ||
2872 | soon as the driver makes a buffer available, in order to | ||
2873 | request action from the driver. In this case, of course, this | ||
2874 | event will be reported with the VIRTIO_SCSI_T_EVENTS_MISSED | ||
2875 | flag. | ||
2876 | |||
2877 | Transport reset | ||
2878 | #define VIRTIO_SCSI_T_TRANSPORT_RESET 1 | ||
2879 | |||
2880 | |||
2881 | |||
2882 | struct virtio_scsi_event_reset { | ||
2883 | |||
2884 | // Write-only part | ||
2885 | |||
2886 | u32 event; | ||
2887 | |||
2888 | u8 lun[8]; | ||
2889 | |||
2890 | u32 reason; | ||
2891 | |||
2892 | }; | ||
2893 | |||
2894 | |||
2895 | |||
2896 | #define VIRTIO_SCSI_EVT_RESET_HARD 0 | ||
2897 | |||
2898 | #define VIRTIO_SCSI_EVT_RESET_RESCAN 1 | ||
2899 | |||
2900 | #define VIRTIO_SCSI_EVT_RESET_REMOVED 2 | ||
2901 | |||
2902 | By sending this event, the device signals that a logical unit | ||
2903 | on a target has been reset, including the case of a new device | ||
2904 | appearing or disappearing on the bus. The device fills in all | ||
2905 | fields. The event field is set to | ||
2906 | VIRTIO_SCSI_T_TRANSPORT_RESET. The lun field addresses a | ||
2907 | logical unit in the SCSI host. | ||
2908 | |||
2909 | The reason value is one of the three #define values appearing | ||
2910 | above: | ||
2911 | |||
2912 | VIRTIO_SCSI_EVT_RESET_REMOVED (“LUN/target removed”) is used if | ||
2913 | the target or logical unit is no longer able to receive | ||
2914 | commands. | ||
2915 | |||
2916 | VIRTIO_SCSI_EVT_RESET_HARD (“LUN hard reset”) is used if the | ||
2917 | logical unit has been reset, but is still present. | ||
2918 | |||
2919 | VIRTIO_SCSI_EVT_RESET_RESCAN (“rescan LUN/target”) is used if a | ||
2920 | target or logical unit has just appeared on the device. | ||
2921 | |||
2922 | The “removed” and “rescan” events, when sent for LUN 0, may | ||
2923 | apply to the entire target. After receiving them the driver | ||
2924 | should ask the initiator to rescan the target, in order to | ||
2925 | detect the case when an entire target has appeared or | ||
2926 | disappeared. These two events will never be reported unless the | ||
2927 | VIRTIO_SCSI_F_HOTPLUG feature was negotiated between the host | ||
2928 | and the guest. | ||
2929 | |||
2930 | Events will also be reported via sense codes (this obviously | ||
2931 | does not apply to newly appeared buses or targets, since the | ||
2932 | application has never discovered them): | ||
2933 | |||
2934 | “LUN/target removed” maps to sense key ILLEGAL REQUEST, asc | ||
2935 | 0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED) | ||
2936 | |||
2937 | “LUN hard reset” maps to sense key UNIT ATTENTION, asc 0x29 | ||
2938 | (POWER ON, RESET OR BUS DEVICE RESET OCCURRED) | ||
2939 | |||
2940 | “rescan LUN/target” maps to sense key UNIT ATTENTION, asc 0x3f, | ||
2941 | ascq 0x0e (REPORTED LUNS DATA HAS CHANGED) | ||
2942 | |||
2943 | The preferred way to detect transport reset is always to use | ||
2944 | events, because sense codes are only seen by the driver when it | ||
2945 | sends a SCSI command to the logical unit or target. However, in | ||
2946 | case events are dropped, the initiator will still be able to | ||
2947 | synchronize with the actual state of the controller if the | ||
2948 | driver asks the initiator to rescan the SCSI bus. During the | ||
2949 | rescan, the initiator will be able to observe the above sense | ||
2950 | codes, and it will process them as if the driver had | ||
2951 | received the equivalent event. | ||
2952 | |||
2953 | Asynchronous notification | ||
2954 | #define VIRTIO_SCSI_T_ASYNC_NOTIFY 2 | ||
2955 | |||
2956 | |||
2957 | |||
2958 | struct virtio_scsi_event_an { | ||
2959 | |||
2960 | // Write-only part | ||
2961 | |||
2962 | u32 event; | ||
2963 | |||
2964 | u8 lun[8]; | ||
2965 | |||
2966 | u32 reason; | ||
2967 | |||
2968 | }; | ||
2969 | |||
2970 | By sending this event, the device signals that an asynchronous | ||
2971 | event was fired from a physical interface. | ||
2972 | |||
2973 | All fields are written by the device. The event field is set to | ||
2974 | VIRTIO_SCSI_T_ASYNC_NOTIFY. The lun field addresses a logical | ||
2975 | unit in the SCSI host. The reason field is a subset of the | ||
2976 | events that the driver has subscribed to via the "Asynchronous | ||
2977 | notification subscription" command. | ||
2978 | |||
2979 | When dropped events are reported, the driver should poll for | ||
2980 | asynchronous events manually using SCSI commands. | ||
2981 | |||
2982 | Appendix X: virtio-mmio | ||
2983 | |||
2984 | Virtual environments without PCI support (a common situation in | ||
2985 | embedded device models) might use a simple memory mapped device | ||
2986 | (“virtio-mmio”) instead of the PCI device. | ||
2987 | |||
2988 | The memory mapped virtio device behaviour is based on the PCI | ||
2989 | device specification. Therefore most operations, such as device | ||
2990 | initialization, queue configuration and buffer transfers, are | ||
2991 | nearly identical. Existing differences are described in the | ||
2992 | following sections. | ||
2993 | |||
2994 | Device Initialization | ||
2995 | |||
2996 | Instead of using the PCI IO space for the virtio header, the | ||
2997 | “virtio-mmio” device provides a set of memory mapped control | ||
2998 | registers, all 32 bits wide, followed by device-specific | ||
2999 | configuration space. The following list presents their layout: | ||
3000 | |||
3001 | Offset from the device base address | Direction | Name | ||
3002 | Description | ||
3003 | |||
3004 | 0x000 | R | MagicValue | ||
3005 | “virt” string. | ||
3006 | |||
3007 | 0x004 | R | Version | ||
3008 | Device version number. Currently must be 1. | ||
3009 | |||
3010 | 0x008 | R | DeviceID | ||
3011 | Virtio Subsystem Device ID (e.g. 1 for network card). | ||
3012 | |||
3013 | 0x00c | R | VendorID | ||
3014 | Virtio Subsystem Vendor ID. | ||
3015 | |||
3016 | 0x010 | R | HostFeatures | ||
3017 | Flags representing features the device supports. | ||
3018 | Reading from this register returns 32 consecutive flag bits, | ||
3019 | first bit depending on the last value written to the | ||
3020 | HostFeaturesSel register. Access to this register returns bits | ||
3021 | HostFeaturesSel*32 to (HostFeaturesSel*32)+31, e.g. feature | ||
3022 | bits 0 to 31 if HostFeaturesSel is set to 0 and feature bits | ||
3023 | 32 to 63 if HostFeaturesSel is set to 1. Also see | ||
3024 | [sub:Feature-Bits] | ||
3026 | |||
3027 | 0x014 | W | HostFeaturesSel | ||
3028 | Device (Host) features word selection. | ||
3029 | Writing to this register selects a set of 32 device feature bits | ||
3030 | accessible by reading from HostFeatures register. Device driver | ||
3031 | must write a value to the HostFeaturesSel register before | ||
3032 | reading from the HostFeatures register. | ||
3033 | |||
3034 | 0x020 | W | GuestFeatures | ||
3035 | Flags representing device features understood and activated by | ||
3036 | the driver. | ||
3037 | Writing to this register sets 32 consecutive flag bits, first | ||
3038 | bit depending on the last value written to the GuestFeaturesSel | ||
3039 | register. Access to this register sets bits | ||
3040 | GuestFeaturesSel*32 to (GuestFeaturesSel*32)+31, e.g. feature | ||
3041 | bits 0 to 31 if GuestFeaturesSel is set to 0 and feature bits | ||
3042 | 32 to 63 if GuestFeaturesSel is set to 1. Also see | ||
3043 | [sub:Feature-Bits] | ||
3045 | |||
3046 | 0x024 | W | GuestFeaturesSel | ||
3047 | Activated (Guest) features word selection. | ||
3048 | Writing to this register selects a set of 32 activated feature | ||
3049 | bits accessible by writing to the GuestFeatures register. | ||
3050 | Device driver must write a value to the GuestFeaturesSel | ||
3051 | register before writing to the GuestFeatures register. | ||
3052 | |||
3053 | 0x028 | W | GuestPageSize | ||
3054 | Guest page size. | ||
3055 | Device driver must write the guest page size in bytes to the | ||
3056 | register during initialization, before any queues are used. | ||
3057 | This value must be a power of 2 and is used by the Host to | ||
3058 | calculate Guest address of the first queue page (see QueuePFN). | ||
3059 | |||
3060 | 0x030 | W | QueueSel | ||
3061 | Virtual queue index (first queue is 0). | ||
3062 | Writing to this register selects the virtual queue that the | ||
3063 | following operations on QueueNum, QueueAlign and QueuePFN apply | ||
3064 | to. | ||
3065 | |||
3066 | 0x034 | R | QueueNumMax | ||
3067 | Maximum virtual queue size. | ||
3068 | Reading from the register returns the maximum size of the queue | ||
3069 | the Host is ready to process or zero (0x0) if the queue is not | ||
3070 | available. This applies to the queue selected by writing to | ||
3071 | QueueSel and is allowed only when QueuePFN is set to zero | ||
3072 | (0x0), i.e. when the queue is not actively used. | ||
3073 | |||
3074 | 0x038 | W | QueueNum | ||
3075 | Virtual queue size. | ||
3076 | Queue size is the number of elements in the queue, and thus the | ||
3077 | size of the descriptor table and both available and used rings. | ||
3078 | Writing to this register notifies the Host what size of the | ||
3079 | queue the Guest will use. This applies to the queue selected by | ||
3080 | writing to QueueSel. | ||
3081 | |||
3082 | 0x03c | W | QueueAlign | ||
3083 | Used Ring alignment in the virtual queue. | ||
3084 | Writing to this register notifies the Host about alignment | ||
3085 | boundary of the Used Ring in bytes. This value must be a power | ||
3086 | of 2 and applies to the queue selected by writing to QueueSel. | ||
3087 | |||
3088 | 0x040 | RW | QueuePFN | ||
3089 | Guest physical page number of the virtual queue. | ||
3090 | Writing to this register notifies the host about location of the | ||
3091 | virtual queue in the Guest's physical address space. This value | ||
3092 | is the index number of a page starting with the queue | ||
3093 | Descriptor Table. Value zero (0x0) means physical address zero | ||
3094 | (0x00000000) and is illegal. When the Guest stops using the | ||
3095 | queue it must write zero (0x0) to this register. | ||
3096 | Reading from this register returns the currently used page | ||
3097 | number of the queue, therefore a value other than zero (0x0) | ||
3098 | means that the queue is in use. | ||
3099 | Both read and write accesses apply to the queue selected by | ||
3100 | writing to QueueSel. | ||
3101 | |||
3102 | 0x050 | W | QueueNotify | ||
3103 | Queue notifier. | ||
3104 | Writing a queue index to this register notifies the Host that | ||
3105 | there are new buffers to process in the queue. | ||
3106 | |||
3107 | 0x060 | R | InterruptStatus | ||
3108 | Interrupt status. | ||
3109 | Reading from this register returns a bit mask of interrupts | ||
3110 | asserted by the device. An interrupt is asserted if the | ||
3111 | corresponding bit is set, i.e. equals one (1). | ||
3112 | |||
3113 | Bit 0 | Used Ring Update | ||
3114 | This interrupt is asserted when the Host has updated the Used | ||
3115 | Ring in at least one of the active virtual queues. | ||
3116 | |||
3117 | Bit 1 | Configuration change | ||
3118 | This interrupt is asserted when configuration of the device has | ||
3119 | changed. | ||
3120 | |||
3121 | 0x064 | W | InterruptACK | ||
3122 | Interrupt acknowledge. | ||
3123 | Writing to this register notifies the Host that the Guest | ||
3124 | finished handling interrupts. Set bits in the value clear the | ||
3125 | corresponding bits of the InterruptStatus register. | ||
3126 | |||
3127 | 0x070 | RW | Status | ||
3128 | Device status. | ||
3129 | Reading from this register returns the current device status | ||
3130 | flags. | ||
3131 | Writing non-zero values to this register sets the status flags, | ||
3132 | indicating the Guest's progress. Writing zero (0x0) to this | ||
3133 | register triggers a device reset. | ||
3134 | Also see [sub:Device-Initialization-Sequence] | ||
3135 | |||
3136 | 0x100+ | RW | Config | ||
3137 | Device-specific configuration space starts at offset 0x100 | ||
3138 | and is accessed with byte alignment. Its meaning and size | ||
3139 | depend on the device and the driver. | ||
3140 | |||
3141 | Virtual queue size is the number of elements in the queue, and | ||
3142 | therefore determines the size of the descriptor table and of | ||
3143 | both the available and used rings. | ||
3144 | |||
3145 | The endianness of the registers follows the native endianness of | ||
3146 | the Guest. Writing to registers described as “R” and reading from | ||
3147 | registers described as “W” is not permitted and can cause | ||
3148 | undefined behavior. | ||
3149 | |||
3150 | The device initialization is performed as described in [sub:Device-Initialization-Sequence] | ||
3151 | with one exception: the Guest must notify the Host about its | ||
3152 | page size by writing the size in bytes to the GuestPageSize register | ||
3153 | before the initialization is finished. | ||
3154 | |||
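As an illustration of the exception above, the following is a
minimal sketch in C, not a definitive implementation. It assumes
that the register window is reachable through a pointer to 32-bit
words, that GuestPageSize sits at offset 0x028 (taken from the
register table earlier in this section), and that the status bits
ACKNOWLEDGE (1), DRIVER (2) and DRIVER_OK (4) are as given in the
device status description. The names mmio_init, REG_* and STATUS_*
are hypothetical; feature negotiation and queue setup are omitted.

```c
#include <stdint.h>

/* Byte offsets into the register window (assumed, see text above). */
#define REG_GUEST_PAGE_SIZE 0x028
#define REG_STATUS          0x070

/* Device status bits (assumed, from the device status description). */
#define STATUS_ACKNOWLEDGE 1u
#define STATUS_DRIVER      2u
#define STATUS_DRIVER_OK   4u

/* Hypothetical helper: regs points at the start of the register window. */
static void mmio_init(volatile uint32_t *regs, uint32_t page_size)
{
    regs[REG_STATUS / 4] = 0;                    /* zero triggers a reset */
    regs[REG_STATUS / 4] = STATUS_ACKNOWLEDGE;
    regs[REG_STATUS / 4] = STATUS_ACKNOWLEDGE | STATUS_DRIVER;

    /* The memory mapped exception: report the Guest page size in
     * bytes before the initialization is finished. */
    regs[REG_GUEST_PAGE_SIZE / 4] = page_size;

    /* ... feature negotiation and virtqueue setup would go here ... */

    regs[REG_STATUS / 4] = STATUS_ACKNOWLEDGE | STATUS_DRIVER |
                           STATUS_DRIVER_OK;
}
```

A driver would typically call mmio_init(regs, 4096) on a Guest
with 4 KiB pages.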
3155 | Memory mapped virtio devices generate only a single interrupt, | ||
3156 | therefore no special interrupt configuration is required. | ||
3157 | |||
3158 | Virtqueue Configuration | ||
3159 | |||
3160 | The virtual queue configuration is performed in a similar way to | ||
3161 | the one described in [sec:Virtqueue-Configuration] with a few | ||
3162 | additional operations: | ||
3163 | |||
3164 | Select the queue by writing its index (first queue is 0) to the | ||
3165 | QueueSel register. | ||
3166 | |||
3167 | Check that the queue is not already in use: read the QueuePFN | ||
3168 | register; the returned value should be zero (0x0). | ||
3169 | |||
3170 | Read maximum queue size (number of elements) from the | ||
3171 | QueueNumMax register. If the returned value is zero (0x0) the | ||
3172 | queue is not available. | ||
3173 | |||
3174 | Allocate and zero the queue pages in contiguous virtual memory, | ||
3175 | aligning the Used Ring to an optimal boundary (usually page | ||
3176 | size). Size of the allocated queue may be smaller than or equal | ||
3177 | to the maximum size returned by the Host. | ||
3178 | |||
3179 | Notify the Host about the queue size by writing the size to | ||
3180 | the QueueNum register. | ||
3181 | |||
3182 | Notify the Host about the alignment used by writing its value | ||
3183 | in bytes to the QueueAlign register. | ||
3184 | |||
3185 | Write the physical page number of the first page of the queue | ||
3186 | to the QueuePFN register. | ||
3187 | |||
3188 | The queue and the device are ready to begin normal operations | ||
3189 | now. | ||
3190 | |||
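The register sequence above can be sketched in C as follows. This
is a hedged illustration under stated assumptions, not a definitive
implementation: the offsets for QueueSel (0x030), QueueNumMax
(0x034), QueueNum (0x038), QueueAlign (0x03c) and QueuePFN (0x040)
are taken from the register table earlier in this section, the
function and macro names are hypothetical, and allocation is left
to the caller, who passes the Guest physical address of zeroed,
page-aligned queue memory sized for the requested number of
elements.

```c
#include <stdint.h>

/* Byte offsets into the register window (assumed, see text above). */
#define REG_QUEUE_SEL     0x030
#define REG_QUEUE_NUM_MAX 0x034
#define REG_QUEUE_NUM     0x038
#define REG_QUEUE_ALIGN   0x03c
#define REG_QUEUE_PFN     0x040

/*
 * Hypothetical helper. Returns the queue size written to QueueNum,
 * or 0 if the queue is already in use or not available.
 * page_size must be non-zero.
 */
static uint32_t setup_queue(volatile uint32_t *regs, uint32_t index,
                            uint64_t queue_phys, uint32_t page_size,
                            uint32_t wanted)
{
    uint32_t max, num;

    regs[REG_QUEUE_SEL / 4] = index;          /* select the queue */

    if (regs[REG_QUEUE_PFN / 4] != 0)         /* already in use */
        return 0;

    max = regs[REG_QUEUE_NUM_MAX / 4];
    if (max == 0)                             /* queue not available */
        return 0;

    /* The queue may be smaller than or equal to the Host maximum. */
    num = wanted < max ? wanted : max;

    regs[REG_QUEUE_NUM / 4] = num;            /* queue size */
    regs[REG_QUEUE_ALIGN / 4] = page_size;    /* Used Ring alignment */

    /* Writing a non-zero page number marks the queue as in use. */
    regs[REG_QUEUE_PFN / 4] = (uint32_t)(queue_phys / page_size);
    return num;
}
```

The caller is expected to lay out the descriptor table and rings
for the size actually returned, which may be smaller than the size
requested.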
3191 | Device Operation | ||
3192 | |||
3193 | The memory mapped virtio device behaves in the same way as | ||
3194 | described in [sec:Device-Operation], with the following | ||
3195 | exceptions: | ||
3196 | |||
3197 | The device is notified about new buffers available in a queue | ||
3198 | by writing the queue index to register QueueNotify instead of the | ||
3199 | virtio header in PCI I/O space ([sub:Notifying-The-Device]). | ||
3200 | |||
3201 | The memory mapped virtio device uses a single, dedicated | ||
3202 | interrupt signal, which is raised when at least one of the | ||
3203 | interrupts described in the InterruptStatus register | ||
3204 | description is asserted. After receiving an interrupt, the | ||
3205 | driver must read the InterruptStatus register to check what | ||
3206 | caused the interrupt (see the register description). After the | ||
3207 | interrupt is handled, the driver must acknowledge it by writing | ||
3208 | a bit mask corresponding to the serviced interrupt to the | ||
3209 | InterruptACK register. | ||
3210 | |||
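The two exceptions above can be sketched in C as follows. This is
a minimal illustration, not a definitive driver: it uses the
register offsets listed in the register descriptions (QueueNotify
at 0x050, InterruptStatus at 0x060, InterruptACK at 0x064), while
the names notify_queue, handle_irq and struct irq_counts are
hypothetical. The counters stand in for the real work of
processing used rings and re-reading the configuration space, and
a real driver would also need a memory barrier before notifying.

```c
#include <stdint.h>

/* Byte offsets into the register window. */
#define REG_QUEUE_NOTIFY     0x050
#define REG_INTERRUPT_STATUS 0x060
#define REG_INTERRUPT_ACK    0x064

/* InterruptStatus bits. */
#define INT_USED_RING (1u << 0)   /* Used Ring Update */
#define INT_CONFIG    (1u << 1)   /* Configuration change */

/* Hypothetical bookkeeping standing in for real handlers. */
struct irq_counts {
    unsigned used_ring;
    unsigned config;
};

/* Tell the Host that the given queue has new buffers to process. */
static void notify_queue(volatile uint32_t *regs, uint32_t index)
{
    regs[REG_QUEUE_NOTIFY / 4] = index;
}

/* Read the cause, service it, then acknowledge exactly those bits. */
static uint32_t handle_irq(volatile uint32_t *regs, struct irq_counts *c)
{
    uint32_t status = regs[REG_INTERRUPT_STATUS / 4];

    if (status & INT_USED_RING)
        c->used_ring++;           /* would walk the used rings here */
    if (status & INT_CONFIG)
        c->config++;              /* would re-read the config space */

    regs[REG_INTERRUPT_ACK / 4] = status;
    return status;
}
```

Acknowledging only the bits that were read avoids clearing an
interrupt that was asserted after the status register was sampled.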