Documentation/powerpc/pci_iov_resource_on_powernv.txt

Wei Yang <weiyang@linux.vnet.ibm.com>
Benjamin Herrenschmidt <benh@au1.ibm.com>
Bjorn Helgaas <bhelgaas@google.com>
26 Aug 2014

This document describes the hardware requirements for PCI MMIO resource
sizing and assignment on PowerKVM and how the generic PCI code handles
these requirements. The first two sections describe the concepts of
Partitionable Endpoints and the implementation on P8 (IODA2). The next
two sections talk about considerations for enabling SR-IOV on IODA2.

1. Introduction to Partitionable Endpoints

A Partitionable Endpoint (PE) is a way to group the various resources
associated with a device or a set of devices to provide isolation between
partitions (i.e., filtering of DMA, MSIs, etc.) and to provide a mechanism
to freeze a device that is causing errors in order to limit the possibility
of propagation of bad data.

There is thus, in HW, a table of PE states that contains a pair of "frozen"
state bits (one for MMIO and one for DMA; they get set together but can be
cleared independently) for each PE.

When a PE is frozen, all stores in any direction are dropped and all loads
return all 1's. MSIs are also blocked. There's a bit more state that
captures things like the details of the error that caused the freeze, but
that's not critical.

The interesting part is how the various PCIe transactions (MMIO, DMA, ...)
are matched to their corresponding PEs.

The following section provides a rough description of what we have on P8
(IODA2). Keep in mind that this is all per PHB (PCI host bridge). Each PHB
is a completely separate HW entity that replicates the entire logic, so has
its own set of PEs, etc.

2. Implementation of Partitionable Endpoints on P8 (IODA2)

P8 supports up to 256 Partitionable Endpoints per PHB.

  * Inbound

    For DMA, MSIs and inbound PCIe error messages, we have a table (in
    memory but accessed in HW by the chip) that provides a direct
    correspondence between a PCIe RID (bus/dev/fn) and a PE number.
    We call this the RTT.

    - For DMA we then provide an entire address space for each PE that can
      contain two "windows", depending on the value of PCI address bit 59.
      Each window can be configured to be remapped via a "TCE table" (IOMMU
      translation table), which has various configurable characteristics
      not described here.

    - For MSIs, we have two windows in the address space (one at the top of
      the 32-bit space and one much higher) which, via a combination of the
      address and MSI value, will result in one of the 2048 interrupts per
      bridge being triggered. There's a PE# in the interrupt controller
      descriptor table as well which is compared with the PE# obtained from
      the RTT to "authorize" the device to emit that specific interrupt.

    - Error messages just use the RTT.

  * Outbound. That's where the tricky part is.

    Like other PCI host bridges, the Power8 IODA2 PHB supports "windows"
    from the CPU address space to the PCI address space. There is one M32
    window and sixteen M64 windows. They have different characteristics.
    First, what they have in common: they forward a configurable portion of
    the CPU address space to the PCIe bus, and they must be naturally
    aligned and a power of two in size. The rest is different:

    - The M32 window:

      * Is limited to 4GB in size.

      * Drops the top bits of the address (above the size) and replaces
        them with a configurable value. This is typically used to generate
        32-bit PCIe accesses. We configure that window at boot from FW and
        don't touch it from Linux; it's usually set to forward a 2GB
        portion of address space from the CPU to PCIe
        0x8000_0000..0xffff_ffff. (Note: The top 64KB are actually
        reserved for MSIs, but this is not a problem at this point; we just
        need to ensure Linux doesn't assign anything there. The M32 logic
        ignores that reservation, however, and will forward accesses in
        that space if we try.)

      * It is divided into 256 segments of equal size. A table in the chip
        maps each segment to a PE#. That allows portions of the MMIO space
        to be assigned to PEs on a segment granularity. For a 2GB window,
        the segment granularity is 2GB/256 = 8MB.

    Now, this is the "main" window we use in Linux today (excluding
    SR-IOV). We basically use the trick of forcing the bridge MMIO windows
    onto a segment alignment/granularity so that the space behind a bridge
    can be assigned to a PE.

    Ideally we would like to be able to have individual functions in PEs
    but that would mean using a completely different address allocation
    scheme where individual function BARs can be "grouped" to fit in one or
    more segments.

    - The M64 windows:

      * Must be at least 256MB in size.

      * Do not translate addresses (the address on PCIe is the same as the
        address on the PowerBus). There is a way to also set the top 14
        bits which are not conveyed by PowerBus but we don't use this.

      * Can be configured to be segmented. When not segmented, we can
        specify the PE# for the entire window. When segmented, a window
        has 256 segments; however, there is no table for mapping a segment
        to a PE#. The segment number *is* the PE#.

      * Support overlaps. If an address is covered by multiple windows,
        there's a defined ordering for which window applies.

    We have code (fairly new compared to the M32 stuff) that exploits that
    for large BARs in 64-bit space:

    We configure an M64 window to cover the entire region of address space
    that has been assigned by FW for the PHB (about 64GB, ignoring the
    space for the M32, which comes out of a different "reserve"). We
    configure it as segmented.

    Then we do the same thing as with M32, using the bridge alignment
    trick, to match to those giant segments.

    Since we cannot remap, we have two additional constraints:

    - We do the PE# allocation *after* the 64-bit space has been assigned
      because the addresses we use directly determine the PE#. We then
      update the M32 PE# for the devices that use both 32-bit and 64-bit
      spaces or assign the remaining PE#s to 32-bit-only devices.

    - We cannot "group" segments in HW, so if a device ends up using more
      than one segment, we end up with more than one PE#. There is a HW
      mechanism to make the freeze state cascade to "companion" PEs, but
      that only works for PCIe error messages (typically used so that if
      you freeze a switch, it freezes all its children). So we do it in
      SW. We lose a bit of effectiveness of EEH in that case, but that's
      the best we found. So when any of the PEs freezes, we freeze the
      other ones for that "domain". We thus introduce the concept of a
      "master PE", which is the one used for DMA, MSIs, etc., and
      "secondary PEs" that are used for the remaining M64 segments.

    We would like to investigate using additional M64 windows in "single
    PE" mode to overlay specific BARs to work around some of that, for
    example for devices with very large BARs, e.g., GPUs. It would make
    sense, but we haven't done it yet.

3. Considerations for SR-IOV on PowerKVM

  * SR-IOV Background

    The PCIe SR-IOV feature allows a single Physical Function (PF) to
    support several Virtual Functions (VFs). Registers in the PF's SR-IOV
    Capability control the number of VFs and whether they are enabled.

    When VFs are enabled, they appear in Configuration Space like normal
    PCI devices, but the BARs in VF config space headers are unusual. For
    a non-VF device, software uses BARs in the config space header to
    discover the BAR sizes and assign addresses for them. For VF devices,
    software uses VF BAR registers in the *PF* SR-IOV Capability to
    discover sizes and assign addresses. The BARs in the VF's config space
    header are read-only zeros.

    When a VF BAR in the PF SR-IOV Capability is programmed, it sets the
    base address for all the corresponding VF(n) BARs. For example, if the
    PF SR-IOV Capability is programmed to enable eight VFs, and it has a
    1MB VF BAR0, the address in that VF BAR sets the base of an 8MB region.
    This region is divided into eight contiguous 1MB regions, each of which
    is a BAR0 for one of the VFs. Note that even though the VF BAR
    describes an 8MB region, the alignment requirement is for a single VF,
    i.e., 1MB in this example.

    There are several strategies for isolating VFs in PEs:

    - M32 window: There's one M32 window, and it is split into 256
      equally-sized segments. The finest granularity possible is a 256MB
      window with 1MB segments. VF BARs that are 1MB or larger could be
      mapped to separate PEs in this window. Each segment can be
      individually mapped to a PE via the lookup table, so this is quite
      flexible, but it works best when all the VF BARs are the same size.
      If they are different sizes, the entire window has to be small enough
      that the segment size matches the smallest VF BAR, which means larger
      VF BARs span several segments.

    - Non-segmented M64 window: A non-segmented M64 window is mapped
      entirely to a single PE, so it could only isolate one VF.

    - Single segmented M64 window: A segmented M64 window could be used
      just like the M32 window, but the segments can't be individually
      mapped to PEs (the segment number is the PE#), so there isn't as
      much flexibility. A VF with multiple BARs would have to be in a
      "domain" of multiple PEs, which is not as well isolated as a single
      PE.

    - Multiple segmented M64 windows: As usual, each window is split into
      256 equally-sized segments, and the segment number is the PE#. But
      if we use several M64 windows, they can be set to different base
      addresses and different segment sizes. If we have VFs that each have
      a 1MB BAR and a 32MB BAR, we could use one M64 window to assign 1MB
      segments and another M64 window to assign 32MB segments.

    Finally, the plan is to use M64 windows for SR-IOV, which will be
    described in more detail in the next two sections. For a given VF BAR,
    we need to effectively reserve the entire 256 segments (256 * VF BAR
    size) and position the VF BAR to start at the beginning of a free
    range of segments/PEs inside that M64 window.

    The goal is of course to be able to give a separate PE to each VF.

    The IODA2 platform has 16 M64 windows, which are used to map MMIO
    ranges to PE#s. Each M64 window defines one MMIO range, and this range
    is divided into 256 segments, with each segment corresponding to one
    PE.

    We decided to leverage this M64 window to map VFs to individual PEs,
    since SR-IOV VF BARs are all the same size.

    But doing so introduces another problem: total_VFs is usually smaller
    than the number of M64 window segments, so if we map one VF BAR
    directly to one M64 window, some part of the M64 window will map to
    another device's MMIO range.

    IODA supports 256 PEs, so a segmented window contains 256 segments, so
    if total_VFs is less than 256, we have the situation in Figure 1.0,
    where segments [total_VFs, 255] of the M64 window may map to some MMIO
    range of other devices:

     0      1                     total_VFs - 1
     +------+------+-   -+------+------+
     |      |      | ... |      |      |
     +------+------+-   -+------+------+

                   VF(n) BAR space

     0      1                     total_VFs - 1                255
     +------+------+-   -+------+------+-   -+------+------+
     |      |      | ... |      |      | ... |      |      |
     +------+------+-   -+------+------+-   -+------+------+

                   M64 window

     Figure 1.0 Direct map VF(n) BAR space

    Our current solution is to allocate 256 segments even if the VF(n) BAR
    space doesn't need that much, as shown in Figure 1.1:

     0      1                     total_VFs - 1                255
     +------+------+-   -+------+------+-   -+------+------+
     |      |      | ... |      |      | ... |      |      |
     +------+------+-   -+------+------+-   -+------+------+

                   VF(n) BAR space + extra

     0      1                     total_VFs - 1                255
     +------+------+-   -+------+------+-   -+------+------+
     |      |      | ... |      |      | ... |      |      |
     +------+------+-   -+------+------+-   -+------+------+

                   M64 window

     Figure 1.1 Map VF(n) BAR space + extra

    Allocating the extra space ensures that the entire M64 window will be
    assigned to this one SR-IOV device and none of the space will be
    available for other devices. Note that this only expands the space
    reserved in software; there are still only total_VFs VFs, and they
    only respond to segments [0, total_VFs - 1]. There's nothing in
    hardware that responds to segments [total_VFs, 255].

4. Implications for the Generic PCI Code

The PCIe SR-IOV spec requires that the base of the VF(n) BAR space be
aligned to the size of an individual VF BAR.

In IODA2, the MMIO address determines the PE#. If the address is in an M32
window, we can set the PE# by updating the table that translates segments
to PE#s. Similarly, if the address is in an unsegmented M64 window, we can
set the PE# for the window. But if it's in a segmented M64 window, the
segment number is the PE#.

Therefore, the only way to control the PE# for a VF is to change the base
of the VF(n) BAR space in the VF BAR. If the PCI core allocates the exact
amount of space required for the VF(n) BAR space, the VF BAR value is fixed
and cannot be changed.

On the other hand, if the PCI core allocates additional space, the VF BAR
value can be changed as long as the entire VF(n) BAR space remains inside
the space allocated by the core.

Ideally the segment size will be the same as an individual VF BAR size.
Then each VF will be in its own PE. The VF BARs (and therefore the PE#s)
are contiguous. If VF0 is in PE(x), then VF(n) is in PE(x+n). If we
allocate 256 segments, there are (256 - numVFs) choices for the PE# of VF0.

If the segment size is smaller than the VF BAR size, it will take several
segments to cover a VF BAR, and a VF will be in several PEs. This is
possible, but the isolation isn't as good, and it reduces the number of PE#
choices because instead of consuming only numVFs segments, the VF(n) BAR
space will consume (numVFs * n) segments, where n is the number of segments
needed to cover one VF BAR. That means there aren't as many available
segments for adjusting the base of the VF(n) BAR space.