Documentation/powerpc/pci_iov_resource_on_powernv.txt

Wei Yang <weiyang@linux.vnet.ibm.com>
Benjamin Herrenschmidt <benh@au1.ibm.com>
Bjorn Helgaas <bhelgaas@google.com>
26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
these requirements. The first two sections describe the concepts of
Partitionable Endpoints and the implementation on P8 (IODA2). The next
two sections talk about considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered. There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound. That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space. There is one M32
    window and sixteen M64 windows. They have different characteristics.
    First, what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus, and they must be naturally
    aligned and a power of two in size. The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value. This is typically used to generate
        32-bit PCIe accesses. We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there. The M32 logic
        ignores that reservation, however, and will forward accesses in
        that space if we try.)

      * It is divided into 256 segments of equal size. A table in the chip
        maps each segment to a PE#. That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity. For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV). We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus). There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented. When not segmented, we can
        specify the PE# for the entire window. When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#. The segment number *is* the PE#.

      * Support overlaps. If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB, ignoring the
    space for the M32, which comes out of a different "reserve"). We
    configure it as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#. We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#. There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children). So we do it in
      SW. We lose a bit of effectiveness of EEH in that case, but that's
      the best we found. So when any of the PEs freezes, we freeze the
      other ones for that "domain". We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs. It would make
    sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs). Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual. For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them. For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses. The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs. For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs. Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.

    There are several strategies for isolating VFs in PEs:

    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments. The finest granularity possible is a 256MB
      window with 1MB segments. VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window. Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size.
      If they are different sizes, the entire window has to be small enough
      that the segment size matches the smallest VF BAR, which means larger
      VF BARs span several segments.

    - Non-segmented M64 window: A non-segmented M64 window is mapped
      entirely to a single PE, so it could only isolate one VF.

    - Single segmented M64 window: A segmented M64 window could be used
      just like the M32 window, but the segments can't be individually
      mapped to PEs (the segment number is the PE#), so there isn't as
      much flexibility. A VF with multiple BARs would have to be in a
      "domain" of multiple PEs, which is not as well isolated as a single
      PE.

    - Multiple segmented M64 windows: As usual, each window is split into
      256 equally-sized segments, and the segment number is the PE#. But
      if we use several M64 windows, they can be set to different base
      addresses and different segment sizes. If we have VFs that each have
      a 1MB BAR and a 32MB BAR, we could use one M64 window to assign 1MB
      segments and another M64 window to assign 32MB segments.

    Finally, the plan is to use M64 windows for SR-IOV, which will be
    described in more detail in the next two sections. For a given VF BAR,
    we need to effectively reserve the entire 256 segments (256 * VF BAR
    size) and position the VF BAR to start at the beginning of a free
    range of segments/PEs inside that M64 window.

    The goal is of course to be able to give a separate PE to each VF.

    The IODA2 platform has 16 M64 windows, which are used to map MMIO
    ranges to PE#s. Each M64 window defines one MMIO range, and this range
    is divided into 256 segments, with each segment corresponding to one
    PE.

    We decided to leverage this M64 window to map VFs to individual PEs,
    since SR-IOV VF BARs are all the same size.

    But doing so introduces another problem: total_VFs is usually smaller
    than the number of M64 window segments, so if we map one VF BAR
    directly to one M64 window, some part of the M64 window will map to
    another device's MMIO range.

    IODA supports 256 PEs, so a segmented window contains 256 segments, so
    if total_VFs is less than 256, we have the situation in Figure 1.0,
    where segments [total_VFs, 255] of the M64 window may map to some MMIO
    range of other devices:

     0      1                     total_VFs - 1
     +------+------+-   -+------+------+
     |      |      | ... |      |      |
     +------+------+-   -+------+------+

                   VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-   -+------+------+-   -+------+------+
     |      |      | ... |      |      | ... |      |      |
     +------+------+-   -+------+------+-   -+------+------+

                   M64 window

     Figure 1.0 Direct map VF(n) BAR space

    Our current solution is to allocate 256 segments even if the VF(n) BAR
    space doesn't need that much, as shown in Figure 1.1:

     0      1                     total_VFs - 1                255
     +------+------+-   -+------+------+-   -+------+------+
     |      |      | ... |      |      | ... |      |      |
     +------+------+-   -+------+------+-   -+------+------+

                   VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-   -+------+------+-   -+------+------+
     |      |      | ... |      |      | ... |      |      |
     +------+------+-   -+------+------+-   -+------+------+

                   M64 window

     Figure 1.1 Map VF(n) BAR space + extra

    Allocating the extra space ensures that the entire M64 window will be
    assigned to this one SR-IOV device and none of the space will be
    available for other devices. Note that this only expands the space
    reserved in software; there are still only total_VFs VFs, and they
    only respond to segments [0, total_VFs - 1]. There's nothing in
    hardware that responds to segments [total_VFs, 255].

4. Implications for the Generic PCI Code

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.