author Mauro Carvalho Chehab <mchehab@s-opensource.com> 2017-05-17 08:38:00 -0400
committer Jonathan Corbet <corbet@lwn.net> 2017-07-14 15:58:10 -0400
commit c6f4d41338a78bcc3ddcc4e00f5de63c8ee2ad20
tree 6f0e730ae79628ee10ff7dd363b317eb1d87e5e8
parent 2a26ed8e4afff2bb48c044dc3ad69da19d66debf
vfio.txt: standardize document format

Each text file under Documentation follows a different
format. Some doesn't even have titles!

Change its representation to follow the adopted standard,
using ReST markups for it to be parseable by Sphinx:

- adjust title marks;
- use footnote marks;
- mark literal blocks;
- adjust identation.

Signed-off-by: Mauro Carvalho Chehab <mchehab@s-opensource.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
-rw-r--r-- Documentation/vfio.txt 281
1 files changed, 144 insertions, 137 deletions
diff --git a/Documentation/vfio.txt b/Documentation/vfio.txt
index 1dd3fddfd3a1..ef6a5111eaa1 100644
--- a/Documentation/vfio.txt
+++ b/Documentation/vfio.txt
@@ -1,5 +1,7 @@
1VFIO - "Virtual Function I/O"[1] 1==================================
2------------------------------------------------------------------------------- 2VFIO - "Virtual Function I/O" [1]_
3==================================
4
3Many modern system now provide DMA and interrupt remapping facilities 5Many modern system now provide DMA and interrupt remapping facilities
4to help ensure I/O devices behave within the boundaries they've been 6to help ensure I/O devices behave within the boundaries they've been
5allotted. This includes x86 hardware with AMD-Vi and Intel VT-d, 7allotted. This includes x86 hardware with AMD-Vi and Intel VT-d,
@@ -7,14 +9,14 @@ POWER systems with Partitionable Endpoints (PEs) and embedded PowerPC
7systems such as Freescale PAMU. The VFIO driver is an IOMMU/device 9systems such as Freescale PAMU. The VFIO driver is an IOMMU/device
8agnostic framework for exposing direct device access to userspace, in 10agnostic framework for exposing direct device access to userspace, in
9a secure, IOMMU protected environment. In other words, this allows 11a secure, IOMMU protected environment. In other words, this allows
10safe[2], non-privileged, userspace drivers. 12safe [2]_, non-privileged, userspace drivers.
11 13
12Why do we want that? Virtual machines often make use of direct device 14Why do we want that? Virtual machines often make use of direct device
13access ("device assignment") when configured for the highest possible 15access ("device assignment") when configured for the highest possible
14I/O performance. From a device and host perspective, this simply 16I/O performance. From a device and host perspective, this simply
15turns the VM into a userspace driver, with the benefits of 17turns the VM into a userspace driver, with the benefits of
16significantly reduced latency, higher bandwidth, and direct use of 18significantly reduced latency, higher bandwidth, and direct use of
17bare-metal device drivers[3]. 19bare-metal device drivers [3]_.
18 20
19Some applications, particularly in the high performance computing 21Some applications, particularly in the high performance computing
20field, also benefit from low-overhead, direct device access from 22field, also benefit from low-overhead, direct device access from
@@ -31,7 +33,7 @@ KVM PCI specific device assignment code as well as provide a more
31secure, more featureful userspace driver environment than UIO. 33secure, more featureful userspace driver environment than UIO.
32 34
33Groups, Devices, and IOMMUs 35Groups, Devices, and IOMMUs
34------------------------------------------------------------------------------- 36---------------------------
35 37
36Devices are the main target of any I/O driver. Devices typically 38Devices are the main target of any I/O driver. Devices typically
37create a programming interface made up of I/O access, interrupts, 39create a programming interface made up of I/O access, interrupts,
@@ -114,40 +116,40 @@ well as mechanisms for describing and registering interrupt
114notifications. 116notifications.
115 117
116VFIO Usage Example 118VFIO Usage Example
117------------------------------------------------------------------------------- 119------------------
118 120
119Assume user wants to access PCI device 0000:06:0d.0 121Assume user wants to access PCI device 0000:06:0d.0::
120 122
121$ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group 123 $ readlink /sys/bus/pci/devices/0000:06:0d.0/iommu_group
122../../../../kernel/iommu_groups/26 124 ../../../../kernel/iommu_groups/26
123 125
124This device is therefore in IOMMU group 26. This device is on the 126This device is therefore in IOMMU group 26. This device is on the
125pci bus, therefore the user will make use of vfio-pci to manage the 127pci bus, therefore the user will make use of vfio-pci to manage the
126group: 128group::
127 129
128# modprobe vfio-pci 130 # modprobe vfio-pci
129 131
130Binding this device to the vfio-pci driver creates the VFIO group 132Binding this device to the vfio-pci driver creates the VFIO group
131character devices for this group: 133character devices for this group::
132 134
133$ lspci -n -s 0000:06:0d.0 135 $ lspci -n -s 0000:06:0d.0
13406:0d.0 0401: 1102:0002 (rev 08) 136 06:0d.0 0401: 1102:0002 (rev 08)
135# echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind 137 # echo 0000:06:0d.0 > /sys/bus/pci/devices/0000:06:0d.0/driver/unbind
136# echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id 138 # echo 1102 0002 > /sys/bus/pci/drivers/vfio-pci/new_id
137 139
138Now we need to look at what other devices are in the group to free 140Now we need to look at what other devices are in the group to free
139it for use by VFIO: 141it for use by VFIO::
140 142
141$ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices 143 $ ls -l /sys/bus/pci/devices/0000:06:0d.0/iommu_group/devices
142total 0 144 total 0
143lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 -> 145 lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:00:1e.0 ->
144 ../../../../devices/pci0000:00/0000:00:1e.0 146 ../../../../devices/pci0000:00/0000:00:1e.0
145lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 -> 147 lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.0 ->
146 ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0 148 ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.0
147lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 -> 149 lrwxrwxrwx. 1 root root 0 Apr 23 16:13 0000:06:0d.1 ->
148 ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1 150 ../../../../devices/pci0000:00/0000:00:1e.0/0000:06:0d.1
149 151
150This device is behind a PCIe-to-PCI bridge[4], therefore we also 152This device is behind a PCIe-to-PCI bridge [4]_, therefore we also
151need to add device 0000:06:0d.1 to the group following the same 153need to add device 0000:06:0d.1 to the group following the same
152procedure as above. Device 0000:00:1e.0 is a bridge that does 154procedure as above. Device 0000:00:1e.0 is a bridge that does
153not currently have a host driver, therefore it's not required to 155not currently have a host driver, therefore it's not required to
@@ -157,12 +159,12 @@ support PCI bridges).
157The final step is to provide the user with access to the group if 159The final step is to provide the user with access to the group if
158unprivileged operation is desired (note that /dev/vfio/vfio provides 160unprivileged operation is desired (note that /dev/vfio/vfio provides
159no capabilities on its own and is therefore expected to be set to 161no capabilities on its own and is therefore expected to be set to
160mode 0666 by the system). 162mode 0666 by the system)::
161 163
162# chown user:user /dev/vfio/26 164 # chown user:user /dev/vfio/26
163 165
164The user now has full access to all the devices and the iommu for this 166The user now has full access to all the devices and the iommu for this
165group and can access them as follows: 167group and can access them as follows::
166 168
167 int container, group, device, i; 169 int container, group, device, i;
168 struct vfio_group_status group_status = 170 struct vfio_group_status group_status =
@@ -248,31 +250,31 @@ VFIO bus driver API
248VFIO bus drivers, such as vfio-pci make use of only a few interfaces 250VFIO bus drivers, such as vfio-pci make use of only a few interfaces
249into VFIO core. When devices are bound and unbound to the driver, 251into VFIO core. When devices are bound and unbound to the driver,
250the driver should call vfio_add_group_dev() and vfio_del_group_dev() 252the driver should call vfio_add_group_dev() and vfio_del_group_dev()
251respectively: 253respectively::
252 254
253extern int vfio_add_group_dev(struct iommu_group *iommu_group, 255 extern int vfio_add_group_dev(struct iommu_group *iommu_group,
254 struct device *dev, 256 struct device *dev,
255 const struct vfio_device_ops *ops, 257 const struct vfio_device_ops *ops,
256 void *device_data); 258 void *device_data);
257 259
258extern void *vfio_del_group_dev(struct device *dev); 260 extern void *vfio_del_group_dev(struct device *dev);
259 261
260vfio_add_group_dev() indicates to the core to begin tracking the 262vfio_add_group_dev() indicates to the core to begin tracking the
261specified iommu_group and register the specified dev as owned by 263specified iommu_group and register the specified dev as owned by
262a VFIO bus driver. The driver provides an ops structure for callbacks 264a VFIO bus driver. The driver provides an ops structure for callbacks
263similar to a file operations structure: 265similar to a file operations structure::
264 266
265struct vfio_device_ops { 267 struct vfio_device_ops {
266 int (*open)(void *device_data); 268 int (*open)(void *device_data);
267 void (*release)(void *device_data); 269 void (*release)(void *device_data);
268 ssize_t (*read)(void *device_data, char __user *buf, 270 ssize_t (*read)(void *device_data, char __user *buf,
269 size_t count, loff_t *ppos); 271 size_t count, loff_t *ppos);
270 ssize_t (*write)(void *device_data, const char __user *buf, 272 ssize_t (*write)(void *device_data, const char __user *buf,
271 size_t size, loff_t *ppos); 273 size_t size, loff_t *ppos);
272 long (*ioctl)(void *device_data, unsigned int cmd, 274 long (*ioctl)(void *device_data, unsigned int cmd,
273 unsigned long arg); 275 unsigned long arg);
274 int (*mmap)(void *device_data, struct vm_area_struct *vma); 276 int (*mmap)(void *device_data, struct vm_area_struct *vma);
275}; 277 };
276 278
277Each function is passed the device_data that was originally registered 279Each function is passed the device_data that was originally registered
278in the vfio_add_group_dev() call above. This allows the bus driver 280in the vfio_add_group_dev() call above. This allows the bus driver
@@ -285,50 +287,55 @@ own VFIO_DEVICE_GET_REGION_INFO ioctl.
285 287
286 288
287PPC64 sPAPR implementation note 289PPC64 sPAPR implementation note
288------------------------------------------------------------------------------- 290-------------------------------
289 291
290This implementation has some specifics: 292This implementation has some specifics:
291 293
2921) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per 2941) On older systems (POWER7 with P5IOC2/IODA1) only one IOMMU group per
293container is supported as an IOMMU table is allocated at the boot time, 295 container is supported as an IOMMU table is allocated at the boot time,
294one table per a IOMMU group which is a Partitionable Endpoint (PE) 296 one table per a IOMMU group which is a Partitionable Endpoint (PE)
295(PE is often a PCI domain but not always). 297 (PE is often a PCI domain but not always).
296Newer systems (POWER8 with IODA2) have improved hardware design which allows 298
297to remove this limitation and have multiple IOMMU groups per a VFIO container. 299 Newer systems (POWER8 with IODA2) have improved hardware design which allows
300 to remove this limitation and have multiple IOMMU groups per a VFIO
301 container.
298 302
2992) The hardware supports so called DMA windows - the PCI address range 3032) The hardware supports so called DMA windows - the PCI address range
300within which DMA transfer is allowed, any attempt to access address space 304 within which DMA transfer is allowed, any attempt to access address space
301out of the window leads to the whole PE isolation. 305 out of the window leads to the whole PE isolation.
302 306
3033) PPC64 guests are paravirtualized but not fully emulated. There is an API 3073) PPC64 guests are paravirtualized but not fully emulated. There is an API
304to map/unmap pages for DMA, and it normally maps 1..32 pages per call and 308 to map/unmap pages for DMA, and it normally maps 1..32 pages per call and
305currently there is no way to reduce the number of calls. In order to make things 309 currently there is no way to reduce the number of calls. In order to make
306faster, the map/unmap handling has been implemented in real mode which provides 310 things faster, the map/unmap handling has been implemented in real mode
307an excellent performance which has limitations such as inability to do 311 which provides an excellent performance which has limitations such as
308locked pages accounting in real time. 312 inability to do locked pages accounting in real time.
309 313
3104) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O 3144) According to sPAPR specification, A Partitionable Endpoint (PE) is an I/O
311subtree that can be treated as a unit for the purposes of partitioning and 315 subtree that can be treated as a unit for the purposes of partitioning and
312error recovery. A PE may be a single or multi-function IOA (IO Adapter), a 316 error recovery. A PE may be a single or multi-function IOA (IO Adapter), a
313function of a multi-function IOA, or multiple IOAs (possibly including switch 317 function of a multi-function IOA, or multiple IOAs (possibly including
314and bridge structures above the multiple IOAs). PPC64 guests detect PCI errors 318 switch and bridge structures above the multiple IOAs). PPC64 guests detect
315and recover from them via EEH RTAS services, which works on the basis of 319 PCI errors and recover from them via EEH RTAS services, which works on the
316additional ioctl commands. 320 basis of additional ioctl commands.
317 321
318So 4 additional ioctls have been added: 322 So 4 additional ioctls have been added:
319 323
320 VFIO_IOMMU_SPAPR_TCE_GET_INFO - returns the size and the start 324 VFIO_IOMMU_SPAPR_TCE_GET_INFO
321 of the DMA window on the PCI bus. 325 returns the size and the start of the DMA window on the PCI bus.
322 326
323 VFIO_IOMMU_ENABLE - enables the container. The locked pages accounting 327 VFIO_IOMMU_ENABLE
328 enables the container. The locked pages accounting
324 is done at this point. This lets user first to know what 329 is done at this point. This lets user first to know what
325 the DMA window is and adjust rlimit before doing any real job. 330 the DMA window is and adjust rlimit before doing any real job.
326 331
327 VFIO_IOMMU_DISABLE - disables the container. 332 VFIO_IOMMU_DISABLE
333 disables the container.
328 334
329 VFIO_EEH_PE_OP - provides an API for EEH setup, error detection and recovery. 335 VFIO_EEH_PE_OP
336 provides an API for EEH setup, error detection and recovery.
330 337
331The code flow from the example above should be slightly changed: 338 The code flow from the example above should be slightly changed::
332 339
333 struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 }; 340 struct vfio_eeh_pe_op pe_op = { .argsz = sizeof(pe_op), .flags = 0 };
334 341
@@ -442,73 +449,73 @@ The code flow from the example above should be slightly changed:
442 .... 449 ....
443 450
4445) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/ 4515) There is v2 of SPAPR TCE IOMMU. It deprecates VFIO_IOMMU_ENABLE/
445VFIO_IOMMU_DISABLE and implements 2 new ioctls: 452 VFIO_IOMMU_DISABLE and implements 2 new ioctls:
446VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY 453 VFIO_IOMMU_SPAPR_REGISTER_MEMORY and VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY
447(which are unsupported in v1 IOMMU). 454 (which are unsupported in v1 IOMMU).
448 455
449PPC64 paravirtualized guests generate a lot of map/unmap requests, 456 PPC64 paravirtualized guests generate a lot of map/unmap requests,
450and the handling of those includes pinning/unpinning pages and updating 457 and the handling of those includes pinning/unpinning pages and updating
451mm::locked_vm counter to make sure we do not exceed the rlimit. 458 mm::locked_vm counter to make sure we do not exceed the rlimit.
452The v2 IOMMU splits accounting and pinning into separate operations: 459 The v2 IOMMU splits accounting and pinning into separate operations:
453 460
454- VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls 461 - VFIO_IOMMU_SPAPR_REGISTER_MEMORY/VFIO_IOMMU_SPAPR_UNREGISTER_MEMORY ioctls
455receive a user space address and size of the block to be pinned. 462 receive a user space address and size of the block to be pinned.
456Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to 463 Bisecting is not supported and VFIO_IOMMU_UNREGISTER_MEMORY is expected to
457be called with the exact address and size used for registering 464 be called with the exact address and size used for registering
458the memory block. The userspace is not expected to call these often. 465 the memory block. The userspace is not expected to call these often.
459The ranges are stored in a linked list in a VFIO container. 466 The ranges are stored in a linked list in a VFIO container.
460 467
461- VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual 468 - VFIO_IOMMU_MAP_DMA/VFIO_IOMMU_UNMAP_DMA ioctls only update the actual
462IOMMU table and do not do pinning; instead these check that the userspace 469 IOMMU table and do not do pinning; instead these check that the userspace
463address is from pre-registered range. 470 address is from pre-registered range.
464 471
465This separation helps in optimizing DMA for guests. 472 This separation helps in optimizing DMA for guests.
466 473
4676) sPAPR specification allows guests to have an additional DMA window(s) on 4746) sPAPR specification allows guests to have an additional DMA window(s) on
468a PCI bus with a variable page size. Two ioctls have been added to support 475 a PCI bus with a variable page size. Two ioctls have been added to support
469this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE. 476 this: VFIO_IOMMU_SPAPR_TCE_CREATE and VFIO_IOMMU_SPAPR_TCE_REMOVE.
470The platform has to support the functionality or error will be returned to 477 The platform has to support the functionality or error will be returned to
471the userspace. The existing hardware supports up to 2 DMA windows, one is 478 the userspace. The existing hardware supports up to 2 DMA windows, one is
4722GB long, uses 4K pages and called "default 32bit window"; the other can 479 2GB long, uses 4K pages and called "default 32bit window"; the other can
473be as big as entire RAM, use different page size, it is optional - guests 480 be as big as entire RAM, use different page size, it is optional - guests
474create those in run-time if the guest driver supports 64bit DMA. 481 create those in run-time if the guest driver supports 64bit DMA.
475 482
476VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and 483 VFIO_IOMMU_SPAPR_TCE_CREATE receives a page shift, a DMA window size and
477a number of TCE table levels (if a TCE table is going to be big enough and 484 a number of TCE table levels (if a TCE table is going to be big enough and
478the kernel may not be able to allocate enough of physically contiguous memory). 485 the kernel may not be able to allocate enough of physically contiguous
479It creates a new window in the available slot and returns the bus address where 486 memory). It creates a new window in the available slot and returns the bus
480the new window starts. Due to hardware limitation, the user space cannot choose 487 address where the new window starts. Due to hardware limitation, the user
481the location of DMA windows. 488 space cannot choose the location of DMA windows.
482 489
483VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window 490 VFIO_IOMMU_SPAPR_TCE_REMOVE receives the bus start address of the window
484and removes it. 491 and removes it.
485 492
486------------------------------------------------------------------------------- 493-------------------------------------------------------------------------------
487 494
488[1] VFIO was originally an acronym for "Virtual Function I/O" in its 495.. [1] VFIO was originally an acronym for "Virtual Function I/O" in its
489initial implementation by Tom Lyon while as Cisco. We've since 496 initial implementation by Tom Lyon while as Cisco. We've since
490outgrown the acronym, but it's catchy. 497 outgrown the acronym, but it's catchy.
491 498
492[2] "safe" also depends upon a device being "well behaved". It's 499.. [2] "safe" also depends upon a device being "well behaved". It's
493possible for multi-function devices to have backdoors between 500 possible for multi-function devices to have backdoors between
494functions and even for single function devices to have alternative 501 functions and even for single function devices to have alternative
495access to things like PCI config space through MMIO registers. To 502 access to things like PCI config space through MMIO registers. To
496guard against the former we can include additional precautions in the 503 guard against the former we can include additional precautions in the
497IOMMU driver to group multi-function PCI devices together 504 IOMMU driver to group multi-function PCI devices together
498(iommu=group_mf). The latter we can't prevent, but the IOMMU should 505 (iommu=group_mf). The latter we can't prevent, but the IOMMU should
499still provide isolation. For PCI, SR-IOV Virtual Functions are the 506 still provide isolation. For PCI, SR-IOV Virtual Functions are the
500best indicator of "well behaved", as these are designed for 507 best indicator of "well behaved", as these are designed for
501virtualization usage models. 508 virtualization usage models.
502 509
503[3] As always there are trade-offs to virtual machine device 510.. [3] As always there are trade-offs to virtual machine device
504assignment that are beyond the scope of VFIO. It's expected that 511 assignment that are beyond the scope of VFIO. It's expected that
505future IOMMU technologies will reduce some, but maybe not all, of 512 future IOMMU technologies will reduce some, but maybe not all, of
506these trade-offs. 513 these trade-offs.
507 514
508[4] In this case the device is below a PCI bridge, so transactions 515.. [4] In this case the device is below a PCI bridge, so transactions
509from either function of the device are indistinguishable to the iommu: 516 from either function of the device are indistinguishable to the iommu::
510 517
511-[0000:00]-+-1e.0-[06]--+-0d.0 518 -[0000:00]-+-1e.0-[06]--+-0d.0
512 \-0d.1 519 \-0d.1
513 520
51400:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90) 521 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev 90)