Diffstat (limited to 'Documentation/powerpc/cxl.rst')
-rw-r--r-- | Documentation/powerpc/cxl.rst | 467 |
1 files changed, 467 insertions, 0 deletions
diff --git a/Documentation/powerpc/cxl.rst b/Documentation/powerpc/cxl.rst
new file mode 100644
index 000000000000..920546d81326
--- /dev/null
+++ b/Documentation/powerpc/cxl.rst
@@ -0,0 +1,467 @@ | |||
1 | ==================================== | ||
2 | Coherent Accelerator Interface (CXL) | ||
3 | ==================================== | ||
4 | |||
5 | Introduction | ||
6 | ============ | ||
7 | |||
8 | The coherent accelerator interface is designed to allow the | ||
9 | coherent connection of accelerators (FPGAs and other devices) to a | ||
10 | POWER system. These devices need to adhere to the Coherent | ||
11 | Accelerator Interface Architecture (CAIA). | ||
12 | |||
13 | IBM refers to this as the Coherent Accelerator Processor Interface | ||
14 | or CAPI. In the kernel it's referred to by the name CXL to avoid | ||
15 | confusion with the ISDN CAPI subsystem. | ||
16 | |||
17 | Coherent in this context means that the accelerator and CPUs can | ||
18 | both access system memory directly and with the same effective | ||
19 | addresses. | ||
20 | |||
21 | |||
22 | Hardware overview | ||
23 | ================= | ||
24 | |||
25 | :: | ||
26 | |||
27 | POWER8/9 FPGA | ||
28 | +----------+ +---------+ | ||
29 | | | | | | ||
30 | | CPU | | AFU | | ||
31 | | | | | | ||
32 | | | | | | ||
33 | | | | | | ||
34 | +----------+ +---------+ | ||
35 | | PHB | | | | ||
36 | | +------+ | PSL | | ||
37 | | | CAPP |<------>| | | ||
38 | +---+------+ PCIE +---------+ | ||
39 | |||
40 | The POWER8/9 chip has a Coherently Attached Processor Proxy (CAPP) | ||
41 | unit which is part of the PCIe Host Bridge (PHB). This is managed | ||
42 | by Linux by calls into OPAL. Linux doesn't directly program the | ||
43 | CAPP. | ||
44 | |||
45 | The FPGA (or coherently attached device) consists of two parts: | ||
46 | the POWER Service Layer (PSL) and the Accelerator Function Unit | ||
47 | (AFU). The AFU is used to implement specific functionality behind | ||
48 | the PSL. The PSL, among other things, provides memory address | ||
49 | translation services to allow each AFU direct access to userspace | ||
50 | memory. | ||
51 | |||
52 | The AFU is the core part of the accelerator (e.g. the compression | ||
53 | or crypto function). The kernel has no knowledge of the function | ||
54 | of the AFU. Only userspace interacts directly with the AFU. | ||
55 | |||
56 | The PSL provides the translation and interrupt services that the | ||
57 | AFU needs. This is what the kernel interacts with. For example, if | ||
58 | the AFU needs to read a particular effective address, it sends | ||
59 | that address to the PSL, the PSL then translates it, fetches the | ||
60 | data from memory and returns it to the AFU. If the PSL has a | ||
61 | translation miss, it interrupts the kernel and the kernel services | ||
62 | the fault. The context in which this fault is serviced depends on | ||
63 | who owns that accelerator function. | ||
64 | |||
65 | - POWER8 and PSL Version 8 are compliant with CAIA Version 1.0. | ||
66 | - POWER9 and PSL Version 9 are compliant with CAIA Version 2.0. | ||
67 | |||
68 | PSL Version 9 provides new features such as: | ||
69 | |||
70 | * Interaction with the nest MMU on the P9 chip. | ||
71 | * Native DMA support. | ||
72 | * Supports sending ASB_Notify messages for host thread wakeup. | ||
73 | * Supports Atomic operations. | ||
74 | * etc. | ||
75 | |||
76 | Cards with a PSL9 won't work on a POWER8 system and cards with a | ||
77 | PSL8 won't work on a POWER9 system. | ||
78 | |||
79 | AFU Modes | ||
80 | ========= | ||
81 | |||
82 | The AFU supports two programming modes: dedicated and AFU | ||
83 | directed. An AFU may support one or both modes. | ||
84 | |||
85 | When using dedicated mode, only one MMU context is supported. In | ||
86 | this mode, only one userspace process can use the accelerator at | ||
87 | a time. | ||
88 | |||
89 | When using AFU directed mode, up to 16K simultaneous contexts can | ||
90 | be supported. This means up to 16K simultaneous userspace | ||
91 | applications may use the accelerator (although specific AFUs may | ||
92 | support fewer). In this mode, the AFU sends a 16 bit context ID | ||
93 | with each of its requests. This tells the PSL which context is | ||
94 | associated with each operation. If the PSL can't translate an | ||
95 | operation, the ID can also be accessed by the kernel so it can | ||
96 | determine the userspace context associated with an operation. | ||
97 | |||
98 | |||
99 | MMIO space | ||
100 | ========== | ||
101 | |||
102 | A portion of the accelerator MMIO space can be directly mapped | ||
103 | from the AFU to userspace. Either the whole space can be mapped or | ||
104 | just a per context portion. The hardware is self-describing, hence | ||
105 | the kernel can determine the offset and size of the per context | ||
106 | portion. | ||
107 | |||
108 | |||
109 | Interrupts | ||
110 | ========== | ||
111 | |||
112 | AFUs may generate interrupts that are destined for userspace. These | ||
113 | are received by the kernel as hardware interrupts and passed on to | ||
114 | userspace by the read syscall documented below. | ||
115 | |||
116 | Data storage faults and error interrupts are handled by the kernel | ||
117 | driver. | ||
118 | |||
119 | |||
120 | Work Element Descriptor (WED) | ||
121 | ============================= | ||
122 | |||
123 | The WED is a 64-bit parameter passed to the AFU when a context is | ||
124 | started. Its format is up to the AFU, hence the kernel has no | ||
125 | knowledge of what it represents. Typically it will be the | ||
126 | effective address of a work queue or status block where the AFU | ||
127 | and userspace can share control and status information. | ||
128 | |||
129 | |||
130 | |||
131 | |||
132 | User API | ||
133 | ======== | ||
134 | |||
135 | 1. AFU character devices | ||
136 | |||
137 | For AFUs operating in AFU directed mode, two character device | ||
138 | files will be created. /dev/cxl/afu0.0m will correspond to a | ||
139 | master context and /dev/cxl/afu0.0s will correspond to a slave | ||
140 | context. Master contexts have access to the full MMIO space an | ||
141 | AFU provides. Slave contexts have access to only the per process | ||
142 | MMIO space an AFU provides. | ||
143 | |||
144 | For AFUs operating in dedicated process mode, the driver will | ||
145 | only create a single character device per AFU called | ||
146 | /dev/cxl/afu0.0d. This will have access to the entire MMIO space | ||
147 | that the AFU provides (like master contexts in AFU directed). | ||
148 | |||
149 | The types described below are defined in include/uapi/misc/cxl.h | ||
150 | |||
151 | The following file operations are supported on both slave and | ||
152 | master devices. | ||
153 | |||
154 | A userspace library libcxl is available here: | ||
155 | |||
156 | https://github.com/ibm-capi/libcxl | ||
157 | |||
158 | This provides a C interface to this kernel API. | ||
159 | |||
160 | open | ||
161 | ---- | ||
162 | |||
163 | Opens the device and allocates a file descriptor to be used with | ||
164 | the rest of the API. | ||
165 | |||
166 | A dedicated mode AFU only has one context and only allows the | ||
167 | device to be opened once. | ||
168 | |||
169 | An AFU directed mode AFU can have many contexts; the device can be | ||
170 | opened once for each context that is available. | ||
171 | |||
172 | When all available contexts are allocated the open call will fail | ||
173 | and return -ENOSPC. | ||
174 | |||
175 | Note: | ||
176 | IRQs need to be allocated for each context, which may limit | ||
177 | the number of contexts that can be created, and therefore | ||
178 | how many times the device can be opened. The POWER8 CAPP | ||
179 | supports 2040 IRQs and 3 are used by the kernel, so 2037 are | ||
180 | left. If 1 IRQ is needed per context, then only 2037 | ||
181 | contexts can be allocated. If 4 IRQs are needed per context, | ||
182 | then only 2037/4 = 509 contexts can be allocated. | ||
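The arithmetic in the note above can be sketched as a small helper. This is only an illustration: the constants are the POWER8 CAPP figures quoted in the note, and the names `CAPP_TOTAL_IRQS`, `CAPP_KERNEL_IRQS` and `max_contexts` are hypothetical, not part of the driver.

```c
#include <assert.h>

/* Figures quoted in the note above for the POWER8 CAPP.
 * The identifiers are hypothetical, chosen for this sketch. */
enum { CAPP_TOTAL_IRQS = 2040, CAPP_KERNEL_IRQS = 3 };

/* Upper bound on the number of contexts when every context needs
 * irqs_per_context IRQs. Integer division rounds down, because a
 * context that cannot get all of its IRQs cannot be created. */
static int max_contexts(int irqs_per_context)
{
    assert(irqs_per_context > 0);
    return (CAPP_TOTAL_IRQS - CAPP_KERNEL_IRQS) / irqs_per_context;
}
```

The AFU itself may support fewer contexts than this bound, so the real limit is the smaller of the two.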
183 | |||
184 | |||
185 | ioctl | ||
186 | ----- | ||
187 | |||
188 | CXL_IOCTL_START_WORK: | ||
189 | Starts the AFU context and associates it with the current | ||
190 | process. Once this ioctl is successfully executed, all memory | ||
191 | mapped into this process is accessible to this AFU context | ||
192 | using the same effective addresses. No additional calls are | ||
193 | required to map/unmap memory. The AFU memory context will be | ||
194 | updated as userspace allocates and frees memory. This ioctl | ||
195 | returns once the AFU context is started. | ||
196 | |||
197 | Takes a pointer to a struct cxl_ioctl_start_work | ||
198 | |||
199 | :: | ||
200 | |||
201 | struct cxl_ioctl_start_work { | ||
202 | __u64 flags; | ||
203 | __u64 work_element_descriptor; | ||
204 | __u64 amr; | ||
205 | __s16 num_interrupts; | ||
206 | __s16 reserved1; | ||
207 | __s32 reserved2; | ||
208 | __u64 reserved3; | ||
209 | __u64 reserved4; | ||
210 | __u64 reserved5; | ||
211 | __u64 reserved6; | ||
212 | }; | ||
213 | |||
214 | flags: | ||
215 | Indicates which optional fields in the structure are | ||
216 | valid. | ||
217 | |||
218 | work_element_descriptor: | ||
219 | The Work Element Descriptor (WED) is a 64-bit argument | ||
220 | defined by the AFU. Typically this is an effective | ||
221 | address pointing to an AFU specific structure | ||
222 | describing what work to perform. | ||
223 | |||
224 | amr: | ||
225 | Authority Mask Register (AMR), same as the powerpc | ||
226 | AMR. This field is only used by the kernel when the | ||
227 | corresponding CXL_START_WORK_AMR value is specified in | ||
228 | flags. If not specified the kernel will use a default | ||
229 | value of 0. | ||
230 | |||
231 | num_interrupts: | ||
232 | Number of userspace interrupts to request. This field | ||
233 | is only used by the kernel when the corresponding | ||
234 | CXL_START_WORK_NUM_IRQS value is specified in flags. | ||
235 | If not specified the minimum number required by the | ||
236 | AFU will be allocated. The min and max number can be | ||
237 | obtained from sysfs. | ||
238 | |||
239 | reserved fields: | ||
240 | For ABI padding and future extensions | ||
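As a rough sketch of how userspace might fill in this structure: the struct below is a local mirror of the layout shown above so that the example compiles on its own (real code would include the uapi header instead), and the two flag values are assumptions that should be checked against that header. The helper `make_start_work` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_ioctl_start_work as listed above. */
struct start_work_sketch {
    uint64_t flags;
    uint64_t work_element_descriptor;
    uint64_t amr;
    int16_t  num_interrupts;
    int16_t  reserved1;
    int32_t  reserved2;
    uint64_t reserved3;
    uint64_t reserved4;
    uint64_t reserved5;
    uint64_t reserved6;
};

/* ASSUMED flag values; verify against include/uapi/misc/cxl.h. */
#define SKETCH_START_WORK_AMR      0x1ULL
#define SKETCH_START_WORK_NUM_IRQS 0x2ULL

/* The fields listed above add up to 64 bytes with no padding. */
static_assert(sizeof(struct start_work_sketch) == 64, "unexpected layout");

/* Build a request whose WED is the effective address of `wed`.
 * A negative num_irqs leaves the flag clear, so the kernel would
 * fall back to the AFU's minimum as described above. */
static struct start_work_sketch make_start_work(const void *wed, int16_t num_irqs)
{
    struct start_work_sketch w;

    memset(&w, 0, sizeof(w));   /* reserved fields stay zero */
    w.work_element_descriptor = (uint64_t)(uintptr_t)wed;
    if (num_irqs >= 0) {
        w.flags |= SKETCH_START_WORK_NUM_IRQS;
        w.num_interrupts = num_irqs;
    }
    return w;
}
```

With the real header, the resulting structure would be passed by pointer to `ioctl(fd, CXL_IOCTL_START_WORK, &w)`.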
241 | |||
242 | CXL_IOCTL_GET_PROCESS_ELEMENT: | ||
243 | Get the current context id, also known as the process element. | ||
244 | The value is returned from the kernel as a __u32. | ||
245 | |||
246 | |||
247 | mmap | ||
248 | ---- | ||
249 | |||
250 | An AFU may have an MMIO space to facilitate communication with the | ||
251 | AFU. If it does, the MMIO space can be accessed via mmap. The size | ||
252 | and contents of this area are specific to the particular AFU. The | ||
253 | size can be discovered via sysfs. | ||
254 | |||
255 | In AFU directed mode, master contexts are allowed to map all of | ||
256 | the MMIO space and slave contexts are allowed to only map the per | ||
257 | process MMIO space associated with the context. In dedicated | ||
258 | process mode the entire MMIO space can always be mapped. | ||
259 | |||
260 | This mmap call must be done after the START_WORK ioctl. | ||
261 | |||
262 | Care should be taken when accessing MMIO space. Only 32-bit and | ||
263 | 64-bit accesses are supported by POWER8. Also, the AFU will be | ||
264 | designed with a specific endianness, so all MMIO accesses should | ||
265 | account for it (the endian(3) variants such as le64toh() and | ||
266 | be64toh() are recommended). These endian issues equally apply to | ||
267 | shared memory queues the WED may describe. | ||
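A minimal sketch of an endian-aware 64-bit register read, assuming a hypothetical AFU that defines its registers as little-endian. The helper name `afu_read64_le` is made up for this example, and the base pointer would normally be the address returned by mmap() on the AFU device.

```c
#define _DEFAULT_SOURCE   /* for le64toh() on glibc */
#include <assert.h>
#include <endian.h>
#include <stddef.h>
#include <stdint.h>

/* Read one 64-bit register at byte_offset from an MMIO area.
 * ASSUMPTION: the AFU is little-endian; a big-endian AFU would use
 * be64toh() instead. The access is a single aligned 64-bit load,
 * in line with the access sizes listed above. */
static uint64_t afu_read64_le(const volatile void *mmio_base, size_t byte_offset)
{
    const volatile uint64_t *reg;

    assert(byte_offset % sizeof(uint64_t) == 0);
    reg = (const volatile uint64_t *)
          ((const volatile uint8_t *)mmio_base + byte_offset);
    return le64toh(*reg);
}
```

The same conversion applies when reading fields from a shared memory queue that the WED points to.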
268 | |||
269 | |||
270 | read | ||
271 | ---- | ||
272 | |||
273 | Reads events from the AFU. Blocks if no events are pending | ||
274 | (unless O_NONBLOCK is supplied). Returns -EIO in the case of an | ||
275 | unrecoverable error or if the card is removed. | ||
276 | |||
277 | read() will always return an integral number of events. | ||
278 | |||
279 | The buffer passed to read() must be at least 4K bytes. | ||
280 | |||
281 | The result of the read will be a buffer of one or more events. | ||
282 | Each event is of type struct cxl_event and varies in size:: | ||
283 | |||
284 | struct cxl_event { | ||
285 | struct cxl_event_header header; | ||
286 | union { | ||
287 | struct cxl_event_afu_interrupt irq; | ||
288 | struct cxl_event_data_storage fault; | ||
289 | struct cxl_event_afu_error afu_error; | ||
290 | }; | ||
291 | }; | ||
292 | |||
293 | The struct cxl_event_header is defined as | ||
294 | |||
295 | :: | ||
296 | |||
297 | struct cxl_event_header { | ||
298 | __u16 type; | ||
299 | __u16 size; | ||
300 | __u16 process_element; | ||
301 | __u16 reserved1; | ||
302 | }; | ||
303 | |||
304 | type: | ||
305 | This defines the type of event. The type determines how | ||
306 | the rest of the event is structured. These types are | ||
307 | described below and defined by enum cxl_event_type. | ||
308 | |||
309 | size: | ||
310 | This is the size of the event in bytes including the | ||
311 | struct cxl_event_header. The start of the next event can | ||
312 | be found at this offset from the start of the current | ||
313 | event. | ||
314 | |||
315 | process_element: | ||
316 | Context ID of the event. | ||
317 | |||
318 | reserved field: | ||
319 | For future extensions and padding. | ||
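Because each header carries the size of its event, a buffer returned by read() can be walked by hopping from header to header. The following is a sketch under stated assumptions: the struct is a local mirror of struct cxl_event_header as listed above, and `count_events` is a hypothetical helper, not something the driver or libcxl provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_event_header as listed above. */
struct event_header_sketch {
    uint16_t type;
    uint16_t size;
    uint16_t process_element;
    uint16_t reserved1;
};

static_assert(sizeof(struct event_header_sketch) == 8, "unexpected layout");

/* Count the events in buf[0..len) by advancing header.size bytes at
 * a time. Returns -1 if a header is truncated or its size field is
 * smaller than a header or runs past the end of the buffer. */
static int count_events(const unsigned char *buf, size_t len)
{
    size_t off = 0;
    int n = 0;

    while (off < len) {
        struct event_header_sketch h;

        if (len - off < sizeof(h))
            return -1;
        memcpy(&h, buf + off, sizeof(h));   /* avoids unaligned access */
        if (h.size < sizeof(h) || h.size > len - off)
            return -1;
        off += h.size;                      /* start of the next event */
        n++;
    }
    return n;
}
```

Since read() returns an integral number of events, the error paths should not trigger on a well-formed buffer; they only guard against corrupted input.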
320 | |||
321 | If the event type is CXL_EVENT_AFU_INTERRUPT then the event | ||
322 | structure is defined as | ||
323 | |||
324 | :: | ||
325 | |||
326 | struct cxl_event_afu_interrupt { | ||
327 | __u16 flags; | ||
328 | __u16 irq; /* Raised AFU interrupt number */ | ||
329 | __u32 reserved1; | ||
330 | }; | ||
331 | |||
332 | flags: | ||
333 | These flags indicate which optional fields are present | ||
334 | in this struct. Currently all fields are mandatory. | ||
335 | |||
336 | irq: | ||
337 | The IRQ number sent by the AFU. | ||
338 | |||
339 | reserved field: | ||
340 | For future extensions and padding. | ||
341 | |||
342 | If the event type is CXL_EVENT_DATA_STORAGE then the event | ||
343 | structure is defined as | ||
344 | |||
345 | :: | ||
346 | |||
347 | struct cxl_event_data_storage { | ||
348 | __u16 flags; | ||
349 | __u16 reserved1; | ||
350 | __u32 reserved2; | ||
351 | __u64 addr; | ||
352 | __u64 dsisr; | ||
353 | __u64 reserved3; | ||
354 | }; | ||
355 | |||
356 | flags: | ||
357 | These flags indicate which optional fields are present in | ||
358 | this struct. Currently all fields are mandatory. | ||
359 | |||
360 | addr: | ||
361 | The address that the AFU unsuccessfully attempted to | ||
362 | access. Valid accesses will be handled transparently by the | ||
363 | kernel but invalid accesses will generate this event. | ||
364 | |||
365 | dsisr: | ||
366 | This field gives information on the type of fault. It is a | ||
367 | copy of the DSISR from the PSL hardware when the address | ||
368 | fault occurred. The form of the DSISR is as defined in the | ||
369 | CAIA. | ||
370 | |||
371 | reserved fields: | ||
372 | For future extensions | ||
373 | |||
374 | If the event type is CXL_EVENT_AFU_ERROR then the event structure | ||
375 | is defined as | ||
376 | |||
377 | :: | ||
378 | |||
379 | struct cxl_event_afu_error { | ||
380 | __u16 flags; | ||
381 | __u16 reserved1; | ||
382 | __u32 reserved2; | ||
383 | __u64 error; | ||
384 | }; | ||
385 | |||
386 | flags: | ||
387 | These flags indicate which optional fields are present in | ||
388 | this struct. Currently all fields are mandatory. | ||
389 | |||
390 | error: | ||
391 | Error status from the AFU. Defined by the AFU. | ||
392 | |||
393 | reserved fields: | ||
394 | For future extensions and padding | ||
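The three payload layouts above suggest a simple consistency check: for each event type, the header's size field should equal the header plus the matching payload. The sketch below uses local mirrors of the structures and ASSUMES the enum cxl_event_type values shown in the comments; both the values and the size relationship should be verified against the uapi header before relying on them. `expected_event_size` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMED values of enum cxl_event_type; verify against the header. */
enum event_type_sketch {
    SKETCH_EVENT_RESERVED      = 0,
    SKETCH_EVENT_AFU_INTERRUPT = 1,
    SKETCH_EVENT_DATA_STORAGE  = 2,
    SKETCH_EVENT_AFU_ERROR     = 3,
};

/* Local mirrors of the structures listed above. */
struct hdr_sketch   { uint16_t type, size, process_element, reserved1; };
struct irq_sketch   { uint16_t flags, irq; uint32_t reserved1; };
struct fault_sketch { uint16_t flags, reserved1; uint32_t reserved2;
                      uint64_t addr, dsisr, reserved3; };
struct error_sketch { uint16_t flags, reserved1; uint32_t reserved2;
                      uint64_t error; };

static_assert(sizeof(struct fault_sketch) == 32, "unexpected layout");

/* Size an event of the given type should report in header.size,
 * or 0 for a type this sketch does not know about. */
static size_t expected_event_size(uint16_t type)
{
    switch (type) {
    case SKETCH_EVENT_AFU_INTERRUPT:
        return sizeof(struct hdr_sketch) + sizeof(struct irq_sketch);
    case SKETCH_EVENT_DATA_STORAGE:
        return sizeof(struct hdr_sketch) + sizeof(struct fault_sketch);
    case SKETCH_EVENT_AFU_ERROR:
        return sizeof(struct hdr_sketch) + sizeof(struct error_sketch);
    default:
        return 0;
    }
}
```

A reader loop could skip events of unknown type by advancing header.size bytes, which keeps it working if new event types are added later.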
395 | |||
396 | |||
397 | 2. Card character device (PowerVM guest only) | ||
398 | |||
399 | In a PowerVM guest, an extra character device is created for the | ||
400 | card. The device is only used to write (flash) a new image on the | ||
401 | FPGA accelerator. Once the image is written and verified, the | ||
402 | device tree is updated and the card is reset to reload the updated | ||
403 | image. | ||
404 | |||
405 | open | ||
406 | ---- | ||
407 | |||
408 | Opens the device and allocates a file descriptor to be used with | ||
409 | the rest of the API. The device can only be opened once. | ||
410 | |||
411 | ioctl | ||
412 | ----- | ||
413 | |||
414 | CXL_IOCTL_DOWNLOAD_IMAGE / CXL_IOCTL_VALIDATE_IMAGE: | ||
415 | Starts and controls flashing a new FPGA image. Partial | ||
416 | reconfiguration is not supported (yet), so the image must contain | ||
417 | a copy of the PSL and AFU(s). Since an image can be quite large, | ||
418 | the caller may have to iterate, splitting the image into smaller | ||
419 | chunks. | ||
420 | |||
421 | Takes a pointer to a struct cxl_adapter_image:: | ||
422 | |||
423 | struct cxl_adapter_image { | ||
424 | __u64 flags; | ||
425 | __u64 data; | ||
426 | __u64 len_data; | ||
427 | __u64 len_image; | ||
428 | __u64 reserved1; | ||
429 | __u64 reserved2; | ||
430 | __u64 reserved3; | ||
431 | __u64 reserved4; | ||
432 | }; | ||
433 | |||
434 | flags: | ||
435 | These flags indicate which optional fields are present in | ||
436 | this struct. Currently all fields are mandatory. | ||
437 | |||
438 | data: | ||
439 | Pointer to a buffer with part of the image to write to the | ||
440 | card. | ||
441 | |||
442 | len_data: | ||
443 | Size of the buffer pointed to by data. | ||
444 | |||
445 | len_image: | ||
446 | Full size of the image. | ||
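The chunked iteration mentioned above might look roughly like the following. This is a sketch only: the struct is a local mirror of struct cxl_adapter_image, the callback stands in for the actual ioctl call, and how CXL_IOCTL_DOWNLOAD_IMAGE and CXL_IOCTL_VALIDATE_IMAGE are sequenced across chunks is not specified here, so it is left to the callback. All helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct cxl_adapter_image as listed above. */
struct adapter_image_sketch {
    uint64_t flags, data, len_data, len_image;
    uint64_t reserved1, reserved2, reserved3, reserved4;
};

/* Stand-in for the ioctl; returns 0 on success. */
typedef int (*send_chunk_fn)(const struct adapter_image_sketch *req, void *ctx);

/* Split image[0..image_len) into requests of at most chunk_len bytes.
 * Every request carries the full image size in len_image and the
 * size of its own slice in len_data. Stops at the first failure. */
static int send_image_in_chunks(const void *image, uint64_t image_len,
                                uint64_t chunk_len, send_chunk_fn send, void *ctx)
{
    const unsigned char *p = image;
    uint64_t off;

    assert(chunk_len > 0);
    for (off = 0; off < image_len; off += chunk_len) {
        struct adapter_image_sketch req;
        uint64_t remaining = image_len - off;
        int rc;

        memset(&req, 0, sizeof(req));   /* flags and reserved stay zero */
        req.data = (uint64_t)(uintptr_t)(p + off);
        req.len_data = remaining < chunk_len ? remaining : chunk_len;
        req.len_image = image_len;
        rc = send(&req, ctx);
        if (rc != 0)
            return rc;
    }
    return 0;
}

/* Demonstration callback: tallies chunks and bytes instead of flashing. */
struct chunk_tally { int chunks; uint64_t bytes; };

static int tally_chunk(const struct adapter_image_sketch *req, void *ctx)
{
    struct chunk_tally *t = ctx;

    t->chunks++;
    t->bytes += req->len_data;
    return 0;
}
```

In real use the callback would issue the ioctl on the card device's file descriptor in place of `tally_chunk`.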
447 | |||
448 | |||
449 | Sysfs Class | ||
450 | =========== | ||
451 | |||
452 | A cxl sysfs class is added under /sys/class/cxl to facilitate | ||
453 | enumeration and tuning of the accelerators. Its layout is | ||
454 | described in Documentation/ABI/testing/sysfs-class-cxl | ||
455 | |||
456 | |||
457 | Udev rules | ||
458 | ========== | ||
459 | |||
460 | The following udev rules could be used to create a symlink to the | ||
461 | most logical chardev to use in any programming mode (afuX.Yd for | ||
462 | dedicated, afuX.Ys for AFU directed), since the API is virtually | ||
463 | identical for each:: | ||
464 | |||
465 | SUBSYSTEM=="cxl", ATTRS{mode}=="dedicated_process", SYMLINK="cxl/%b" | ||
466 | SUBSYSTEM=="cxl", ATTRS{mode}=="afu_directed", \ | ||
467 | KERNEL=="afu[0-9]*.[0-9]*s", SYMLINK="cxl/%b" | ||