author     Rob Gardner <rob.gardner@oracle.com>     2017-12-05 21:40:43 -0500
committer  David S. Miller <davem@davemloft.net>    2018-01-22 11:17:16 -0500
commit     dd0273284c7474100bcd331887443f0e4b1dcce8
tree       2fe7bb4e2925efebad2ac1550f38f5dd88bdb7eb
parent     c2b5934ff505dc71247b2c7f5927c1e9b6b13c68
sparc64: Oracle DAX driver
DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8 (DAX2)
processor chips, and has direct access to the CPU's L3 caches as well as
physical memory. It can perform several operations on data streams with
various input and output formats. This driver provides a transport
mechanism and has limited knowledge of the various opcodes and data
formats. A user space library provides high level services and translates
these into low level commands which are then passed into the driver and
subsequently the hypervisor and the coprocessor. The library is the
recommended way for applications to use the coprocessor, and the driver
interface is not intended for general use.

Signed-off-by: Rob Gardner <rob.gardner@oracle.com>
Signed-off-by: Jonathan Helman <jonathan.helman@oracle.com>
Signed-off-by: Sanath Kumar <sanath099@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
-rw-r--r--  Documentation/sparc/oradax/dax-hv-api.txt  1433
-rw-r--r--  Documentation/sparc/oradax/oracle-dax.txt   429
-rw-r--r--  arch/sparc/include/uapi/asm/oradax.h         91
-rw-r--r--  drivers/sbus/char/Kconfig                     8
-rw-r--r--  drivers/sbus/char/Makefile                    1
-rw-r--r--  drivers/sbus/char/oradax.c                 1005
6 files changed, 2967 insertions, 0 deletions
diff --git a/Documentation/sparc/oradax/dax-hv-api.txt b/Documentation/sparc/oradax/dax-hv-api.txt
new file mode 100644
index 000000000000..73e8d506cf64
--- /dev/null
+++ b/Documentation/sparc/oradax/dax-hv-api.txt
@@ -0,0 +1,1433 @@
Excerpt from UltraSPARC Virtual Machine Specification
Compiled from version 3.0.20+15
Publication date 2017-09-25 08:21
Copyright © 2008, 2015 Oracle and/or its affiliates. All rights reserved.
Extracted via "pdftotext -f 547 -l 572 -layout sun4v_20170925.pdf"
Authors:
    Charles Kunzman
    Sam Glidden
    Mark Cianchetti


Chapter 36. Coprocessor services
    The following APIs provide access via the Hypervisor to hardware assisted data processing functionality.
    These APIs may only be provided by certain platforms, and may not be available to all virtual machines
    even on supported platforms. Restrictions on the use of these APIs may be imposed in order to support
    live-migration and other system management activities.

36.1. Data Analytics Accelerator
    The Data Analytics Accelerator (DAX) functionality is a collection of hardware coprocessors that provide
    high speed processing of database-centric operations. The coprocessors may support one or more of
    the following data query operations: search, extraction, compression, decompression, and translation. The
    functionality offered may vary by virtual machine implementation.

    The DAX is a virtual device to sun4v guests, with supported data operations indicated by the virtual device
    compatibility property. Functionality is accessed through the submission of Command Control Blocks
    (CCBs) via the ccb_submit API function. The operations are processed asynchronously, with the status
    of the submitted operations reported through a Completion Area linked to each CCB. Each CCB has a
    separate Completion Area and, unless execution order is specifically restricted through the use of
    serial-conditional flags, the execution order of submitted CCBs is arbitrary. Likewise, the time to
    completion for a given CCB is never guaranteed.

    Guest software may implement a software timeout on CCB operations, and if the timeout is exceeded, the
    operation may be cancelled or killed via the ccb_kill API function. It is recommended for guest software
    to implement a software timeout to account for certain RAS errors which may result in lost CCBs. It is
    recommended that such an implementation use the ccb_info API function to check the status of a CCB prior
    to killing it in order to determine if the CCB is still in queue, or may have been lost due to a RAS error.

    There is no fixed limit on the number of outstanding CCBs guest software may have queued in the virtual
    machine; however, internal resource limitations within the virtual machine can cause CCB submissions
    to be temporarily rejected with EWOULDBLOCK. In such cases, guests should continue to attempt
    submissions until they succeed; waiting for an outstanding CCB to complete is not necessary, and would
    not be a guarantee that a future submission would succeed.

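    The retry rule above can be sketched in C as follows. This is only an illustration, not the
    hypervisor interface: submit_fn, submit_with_retry and fake_submit are hypothetical names, and the
    real ccb_submit arguments are not modelled. The max_tries bound is an assumption added so that a
    caller can yield or back off; the text itself only says to keep retrying until a submission succeeds.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical wrapper type: returns 0 on success or an errno-style code. */
typedef int (*submit_fn)(void *ctx);

/*
 * Call submit() until it stops reporting EWOULDBLOCK or max_tries is reached.
 * Returns 0 on success, EWOULDBLOCK if the bound was hit, or the first other
 * error code. The number of attempts made is stored in *tries_used if non-NULL.
 */
static int submit_with_retry(submit_fn submit, void *ctx, unsigned max_tries,
                             unsigned *tries_used)
{
    int rc = EWOULDBLOCK;
    unsigned n = 0;

    assert(submit != NULL && max_tries > 0);
    while (n < max_tries) {
        n++;
        rc = submit(ctx);
        if (rc != EWOULDBLOCK)
            break;              /* success or a non-retryable error */
    }
    if (tries_used != NULL)
        *tries_used = n;
    return rc;
}

/* Stand-in for a real submission: rejects the first *ctx calls, then succeeds. */
static int fake_submit(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return EWOULDBLOCK;
    }
    return 0;
}
```

    Note that the loop never waits for an outstanding CCB to complete, in line with the paragraph above.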
    The availability of DAX coprocessor command service is indicated by the presence of the DAX virtual
    device node in the guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device
    node”).

36.1.1. DAX Compatibility Property
    The query functionality may vary based on the compatibility property of the virtual device:

36.1.1.1. "ORCL,sun4v-dax" Device Compatibility
    Available CCB commands:

    • No-op/Sync

    • Extract

    • Scan Value

    • Inverted Scan Value

    • Scan Range

    509
    Coprocessor services

    • Inverted Scan Range

    • Translate

    • Inverted Translate

    • Select

    See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.

    Only version 0 CCBs are available.

36.1.1.2. "ORCL,sun4v-dax-fc" Device Compatibility
    "ORCL,sun4v-dax-fc" is compatible with the "ORCL,sun4v-dax" interface, and includes additional CCB
    bit fields and controls.

36.1.1.3. "ORCL,sun4v-dax2" Device Compatibility
    Available CCB commands:

    • No-op/Sync

    • Extract

    • Scan Value

    • Inverted Scan Value

    • Scan Range

    • Inverted Scan Range

    • Translate

    • Inverted Translate

    • Select

    See Section 36.2.1, “Query CCB Command Formats” for the corresponding CCB input and output formats.

    Version 0 and 1 CCBs are available. Only version 0 CCBs may use Huffman encoded data, whereas only
    version 1 CCBs may use OZIP.

36.1.2. DAX Virtual Device Interrupts
    The DAX virtual device has multiple interrupts associated with it which may be used by the guest if
    desired. The number of device interrupts available to the guest is indicated in the virtual device node of the
    guest MD (Section 8.24.17, “Database Analytics Accelerators (DAX) virtual-device node”). If the device
    node indicates N interrupts available, the guest may use any value from 0 to N - 1 (inclusive) in a CCB
    interrupt number field. Using values outside this range will result in the CCB being rejected for an invalid
    field value.

    The interrupts may be bound and managed using the standard sun4v device interrupts API (Chapter 16,
    Device interrupt services). Sysino interrupts are not available for DAX devices.

36.2. Coprocessor Control Block (CCB)
    CCBs are either 64 or 128 bytes long, depending on the operation type. The exact contents of the CCB
    are command specific, but all CCBs contain at least one memory buffer address. All memory locations
    referenced by a CCB must be pinned in memory until the CCB either completes execution or is killed
    via the ccb_kill API call. Changes in virtual address mappings occurring after CCB submission are not
    guaranteed to be visible, and as such all virtual address updates need to be synchronized with CCB
    execution.

    All CCBs begin with a common 32-bit header.

    Table 36.1. CCB Header Format
    Bits       Field Description
    [31:28]    CCB version. For API version 2.0: set to 1 if CCB uses OZIP encoding; set to 0 if the CCB
               uses Huffman encoding; otherwise either 0 or 1. For API version 1.0: always set to 0.
    [27]       When API version 2.0 is negotiated, this is the Pipeline Flag [512]. It is reserved in
               API version 1.0
    [26]       Long CCB flag [512]
    [25]       Conditional synchronization flag [512]
    [24]       Serial synchronization flag
    [23:16]    CCB operation code:
                0x00     No Operation (No-op) or Sync
                0x01     Extract
                0x02     Scan Value
                0x12     Inverted Scan Value
                0x03     Scan Range
                0x13     Inverted Scan Range
                0x04     Translate
                0x14     Inverted Translate
                0x05     Select
    [15:13]    Reserved
    [12:11]    Table address type
                0b'00    No address
                0b'01    Alternate context virtual address
                0b'10    Real address
                0b'11    Primary context virtual address
    [10:8]     Output/Destination address type
                0b'000   No address
                0b'001   Alternate context virtual address
                0b'010   Real address
                0b'011   Primary context virtual address
                0b'100   Reserved
                0b'101   Reserved
                0b'110   Reserved
                0b'111   Reserved
    [7:5]      Secondary source address type
                0b'000   No address
                0b'001   Alternate context virtual address
                0b'010   Real address
                0b'011   Primary context virtual address
                0b'100   Reserved
                0b'101   Reserved
                0b'110   Reserved
                0b'111   Reserved
    [4:2]      Primary source address type
                0b'000   No address
                0b'001   Alternate context virtual address
                0b'010   Real address
                0b'011   Primary context virtual address
                0b'100   Reserved
                0b'101   Reserved
                0b'110   Reserved
                0b'111   Reserved
    [1:0]      Completion area address type
                0b'00    No address
                0b'01    Alternate context virtual address
                0b'10    Real address
                0b'11    Primary context virtual address

    The Long CCB flag indicates whether the submitted CCB is 64 or 128 bytes long; value is 0 for 64 bytes
    and 1 for 128 bytes.

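    As an illustration of Table 36.1, the header word could be assembled as sketched below. The bit
    positions come from the table; the struct ccb_header_fields container, the enum names and the
    helper functions are hypothetical and are not part of the specification or of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Address type codes shared by the header's address type fields. */
enum ccb_addr_type {
    CCB_AT_NONE   = 0,  /* No address */
    CCB_AT_ALT_VA = 1,  /* Alternate context virtual address */
    CCB_AT_REAL   = 2,  /* Real address */
    CCB_AT_PRI_VA = 3   /* Primary context virtual address */
};

/* Hypothetical container for the header fields (names are illustrative). */
struct ccb_header_fields {
    unsigned version;      /* [31:28] CCB version */
    unsigned pipeline;     /* [27]    Pipeline flag (API 2.0) */
    unsigned long_ccb;     /* [26]    Long CCB flag */
    unsigned conditional;  /* [25]    Conditional synchronization flag */
    unsigned serial;       /* [24]    Serial synchronization flag */
    unsigned opcode;       /* [23:16] CCB operation code */
    unsigned table_at;     /* [12:11] Table address type */
    unsigned out_at;       /* [10:8]  Output/Destination address type */
    unsigned sec_at;       /* [7:5]   Secondary source address type */
    unsigned pri_at;       /* [4:2]   Primary source address type */
    unsigned compl_at;     /* [1:0]   Completion area address type */
};

/* Pack the fields into the 32-bit header; reserved bits [15:13] stay zero. */
static uint32_t ccb_header_pack(const struct ccb_header_fields *f)
{
    assert(f->version <= 0xF && f->opcode <= 0xFF);
    /* Codes 4..7 of the 3-bit address type fields are reserved. */
    assert(f->table_at <= 3 && f->out_at <= 3 && f->sec_at <= 3 &&
           f->pri_at <= 3 && f->compl_at <= 3);

    return ((uint32_t)f->version << 28) |
           ((uint32_t)(f->pipeline & 1) << 27) |
           ((uint32_t)(f->long_ccb & 1) << 26) |
           ((uint32_t)(f->conditional & 1) << 25) |
           ((uint32_t)(f->serial & 1) << 24) |
           ((uint32_t)f->opcode << 16) |
           ((uint32_t)f->table_at << 11) |
           ((uint32_t)f->out_at << 8) |
           ((uint32_t)f->sec_at << 5) |
           ((uint32_t)f->pri_at << 2) |
           (uint32_t)f->compl_at;
}

/* Operation code stored in bits [23:16]. */
static unsigned ccb_header_opcode(uint32_t header)
{
    return (header >> 16) & 0xFF;
}

/* CCB length implied by the Long CCB flag: 64 or 128 bytes. */
static unsigned ccb_header_size_bytes(uint32_t header)
{
    return (header & (UINT32_C(1) << 26)) ? 128 : 64;
}
```

    For example, a Scan Value CCB (0x02) with the Long CCB flag set and real addresses for the primary
    input, output and completion area packs to 0x0402020A under these definitions.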
    The Serial and Conditional flags allow simple relative ordering between CCBs. Any CCB with the Serial
    flag set will execute sequentially relative to any previous CCB that is also marked as Serial in the same
    CCB submission. CCBs without the Serial flag set execute independently, even if they are between CCBs
    with the Serial flag set. CCBs marked solely with the Serial flag will execute upon the completion of the
    previous Serial CCB, regardless of the completion status of that CCB. The Conditional flag allows CCBs
    to conditionally execute based on the successful execution of the closest CCB marked with the Serial flag.
    A CCB may only be conditional on exactly one CCB; however, a CCB may be marked both Conditional
    and Serial to allow execution chaining. The flags do NOT allow fan-out chaining, where multiple CCBs
    execute in parallel based on the completion of another CCB.

    The Pipeline flag is an optimization that directs the output of one CCB (the "source" CCB) directly to
    the input of the next CCB (the "target" CCB). The target CCB thus does not need to read the input from
    memory. The Pipeline flag is advisory and may be dropped.

    Both the Pipeline and Serial bits must be set in the source CCB. The Conditional bit must be set in the
    target CCB. Exactly one CCB must be made conditional on the source CCB; having either 0 or 2 target
    CCBs is invalid. However, Pipelines can be extended beyond two CCBs: the sequence would start with a CCB
    with both the Pipeline and Serial bits set, proceed through CCBs with the Pipeline, Serial, and Conditional
    bits set, and terminate at a CCB that has the Conditional bit set, but not the Pipeline bit.

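    The flag rules for pipelines can be checked before submission with a small helper such as the one
    below. It is a sketch under a simplifying assumption that is not stated in the text: the target of
    each source CCB is taken to be the next header in the array for the same submission. The macro and
    function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CCB_PIPELINE    (UINT32_C(1) << 27)  /* header bit [27] */
#define CCB_CONDITIONAL (UINT32_C(1) << 25)  /* header bit [25] */
#define CCB_SERIAL      (UINT32_C(1) << 24)  /* header bit [24] */

/*
 * Returns 1 if every CCB header with the Pipeline bit also has the Serial bit
 * and is followed by a header with the Conditional bit; returns 0 otherwise.
 * Assumes (for this sketch only) that source and target are adjacent in hdr[].
 */
static int pipeline_flags_ok(const uint32_t *hdr, size_t n)
{
    assert(hdr != NULL || n == 0);

    for (size_t i = 0; i < n; i++) {
        if (!(hdr[i] & CCB_PIPELINE))
            continue;                       /* not a pipeline source */
        if (!(hdr[i] & CCB_SERIAL))
            return 0;                       /* source must also be Serial */
        if (i + 1 == n)
            return 0;                       /* no target in this submission */
        if (!(hdr[i + 1] & CCB_CONDITIONAL))
            return 0;                       /* target must be Conditional */
    }
    return 1;
}
```

    Because the Pipeline flag is advisory, a failed check here does not by itself say how the virtual
    machine would treat the submission; the helper only mirrors the stated bit requirements.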
    512
    Coprocessor services

    The input of the target CCB must start within 64 bytes of the output of the source CCB or the pipeline flag
    will be ignored. All CCBs in a pipeline must be submitted in the same call to ccb_submit.

    The various address type fields indicate how the various address values used in the CCB should be
    interpreted by the virtual machine. Not all of the types specified are used by every CCB format. Types
    which are not applicable to the given CCB command should be indicated as type 0 (No address). Virtual
    addresses used in the CCB must have translation entries present in either the TLB or a configured TSB
    for the submitting virtual processor. Virtual addresses which cannot be translated by the virtual machine
    will result in the CCB submission being rejected, with the causal virtual address indicated. The CCB
    may be resubmitted after inserting the translation, or the address may be translated by guest software and
    resubmitted using the real address translation.

36.2.1. Query CCB Command Formats
36.2.1.1. Supported Data Formats, Elements Sizes and Offsets
    Data for query commands may be encoded in multiple possible formats. The data query commands use a
    common set of values to indicate the encoding formats of the data being processed. Some encoding formats
    require multiple data streams for processing, requiring the specification of both primary data formats (the
    encoded data) and secondary data streams (meta-data for the encoded data).

36.2.1.1.1. Primary Input Format

    The primary input format code is a 4-bit field when it is used. There are 10 primary input formats available.
    The packed formats are not endian neutral. Code values not listed below are reserved.

    Code   Format                            Description
    0x0    Fixed width byte packed           Up to 16 bytes
    0x1    Fixed width bit packed            Up to 15 bits (CCB version 0) or 23 bits (CCB version
                                             1); bits are read most significant bit to least significant bit
                                             within a byte
    0x2    Variable width byte packed        Data stream of lengths must be provided as a secondary
                                             input
    0x4    Fixed width byte packed with run  Up to 16 bytes; data stream of run lengths must be
           length encoding                   provided as a secondary input
    0x5    Fixed width bit packed with run   Up to 15 bits (CCB version 0) or 23 bits (CCB version
           length encoding                   1); bits are read most significant bit to least significant bit
                                             within a byte; data stream of run lengths must be provided
                                             as a secondary input
    0x8    Fixed width byte packed with      Up to 16 bytes before the encoding; compressed stream
           Huffman (CCB version 0) or        bits are read most significant bit to least significant bit
           OZIP (CCB version 1) encoding     within a byte; pointer to the encoding table must be
                                             provided
    0x9    Fixed width bit packed with       Up to 15 bits (CCB version 0) or 23 bits (CCB version
           Huffman (CCB version 0) or        1); compressed stream bits are read most significant bit to
           OZIP (CCB version 1) encoding     least significant bit within a byte; pointer to the encoding
                                             table must be provided
    0xA    Variable width byte packed with   Up to 16 bytes before the encoding; compressed stream
           Huffman (CCB version 0) or        bits are read most significant bit to least significant bit
           OZIP (CCB version 1) encoding     within a byte; data stream of lengths must be provided as
                                             a secondary input; pointer to the encoding table must be
                                             provided
    0xC    Fixed width byte packed with      Up to 16 bytes before the encoding; compressed stream
           run length encoding, followed by  bits are read most significant bit to least significant bit
           Huffman (CCB version 0) or        within a byte; data stream of run lengths must be provided
           OZIP (CCB version 1) encoding     as a secondary input; pointer to the encoding table must
                                             be provided
    0xD    Fixed width bit packed with       Up to 15 bits (CCB version 0) or 23 bits (CCB version 1)
           run length encoding, followed by  before the encoding; compressed stream bits are read most
           Huffman (CCB version 0) or        significant bit to least significant bit within a byte; data
           OZIP (CCB version 1) encoding     stream of run lengths must be provided as a secondary
                                             input; pointer to the encoding table must be provided

    If OZIP encoding is used, there must be no reserved bytes in the table.

36.2.1.1.2. Primary Input Element Size

    For primary input data streams with fixed size elements, the element size must be indicated in the CCB
    command. The size is encoded as the number of bits or bytes, minus one. The valid value range for this
    field depends on the input format selected, as listed in the table above.

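    The "minus one" encoding and the per-format limits can be sketched as below. The function name is
    hypothetical, and the sketch deliberately covers only the fixed width format codes; the variable
    width codes (0x2, 0xA) and reserved codes are rejected because the text above only defines this
    field for fixed size elements.

```c
#include <assert.h>

/*
 * Returns the value to store in the Primary Input Element Size field
 * (size minus one), or -1 if the size is out of range for the format
 * or the format is not a fixed width one.
 *
 * fmt         primary input format code
 * ccb_version 0 or 1 (bit packed limits differ between versions)
 * size        element size in bytes (byte packed) or bits (bit packed)
 */
static int primary_elem_size_field(unsigned fmt, unsigned ccb_version,
                                   unsigned size)
{
    unsigned max;

    assert(ccb_version <= 1);

    switch (fmt) {
    case 0x0: case 0x4: case 0x8: case 0xC:   /* fixed width byte packed */
        max = 16;                             /* up to 16 bytes */
        break;
    case 0x1: case 0x5: case 0x9: case 0xD:   /* fixed width bit packed */
        max = (ccb_version == 0) ? 15 : 23;   /* up to 15 or 23 bits */
        break;
    default:                                  /* variable width or reserved */
        return -1;
    }

    if (size < 1 || size > max)
        return -1;
    return (int)(size - 1);
}
```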
36.2.1.1.3. Secondary Input Format

    For primary input data streams which require a secondary input stream, the secondary input stream is
    always encoded in a fixed width, bit-packed format. The bits are read from most significant bit to least
    significant bit within a byte. There are two encoding options for the secondary input stream data elements,
    depending on whether the value of 0 is needed:

    Secondary Input    Description
    Format Code
    0                  Element is stored as value minus 1 (0 evaluates to 1, 1 evaluates
                       to 2, etc)
    1                  Element is stored as value

36.2.1.1.4. Secondary Input Element Size

    Secondary input element size is encoded as a two bit field:

    Secondary Input Size    Description
    Code
    0x0                     1 bit
    0x1                     2 bits
    0x2                     4 bits
    0x3                     8 bits

36.2.1.1.5. Input Element Offsets

    Bit-wise input data streams may have any alignment within the base addressed byte. The offset, specified
    from most significant bit to least significant bit, is provided as a fixed 3 bit field for each input type. A
    value of 0 indicates that the first input element begins at the most significant bit in the first byte, and a
    value of 7 indicates it begins with the least significant bit.

    This field should be zero for any byte-wise primary input data streams.

36.2.1.1.6. Output Format

    Query commands support multiple sizes and encodings for output data streams. There are four possible
    output encodings, and up to four supported element sizes per encoding. Not all output encodings are
    supported for every command. The format is indicated by a 4-bit field in the CCB:

    Output Format Code    Description
    0x0                   Byte aligned, 1 byte elements
    0x1                   Byte aligned, 2 byte elements
    0x2                   Byte aligned, 4 byte elements
    0x3                   Byte aligned, 8 byte elements
    0x4                   16 byte aligned, 16 byte elements
    0x5                   Reserved
    0x6                   Reserved
    0x7                   Reserved
    0x8                   Packed vector of single bit elements
    0x9                   Reserved
    0xA                   Reserved
    0xB                   Reserved
    0xC                   Reserved
    0xD                   2 byte elements where each element is the index value of a bit,
                          from a bit vector, which was 1.
    0xE                   4 byte elements where each element is the index value of a bit,
                          from a bit vector, which was 1.
    0xF                   Reserved

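    A guest that sizes its own output buffers might map the format codes to element sizes as sketched
    below. The helper names are hypothetical. For the index formats (0xD, 0xE) the count passed in is
    assumed to be the number of index entries the caller wants room for, which is a choice of this
    sketch rather than something the table specifies.

```c
#include <assert.h>
#include <limits.h>

/*
 * Bytes per output element for a given Output Format code.
 * Returns 0 for the packed bit vector (0x8), whose elements are single bits,
 * and -1 for reserved codes.
 */
static long output_elem_bytes(unsigned code)
{
    switch (code) {
    case 0x0: return 1;
    case 0x1: return 2;
    case 0x2: return 4;
    case 0x3: return 8;
    case 0x4: return 16;
    case 0x8: return 0;    /* packed vector of single bit elements */
    case 0xD: return 2;    /* 2 byte index entries */
    case 0xE: return 4;    /* 4 byte index entries */
    default:  return -1;   /* reserved */
    }
}

/* Bytes needed to hold n output elements, or -1 for a reserved code. */
static long output_bytes_needed(unsigned code, unsigned long n)
{
    long elem = output_elem_bytes(code);

    assert(n <= (unsigned long)(LONG_MAX / 16));   /* guard the multiply */

    if (elem < 0)
        return -1;
    if (code == 0x8)
        return (long)((n + 7) / 8);                /* round up to whole bytes */
    return elem * (long)n;
}
```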
36.2.1.1.7. Application Data Integrity (ADI)

    On platforms which support ADI, the ADI version number may be specified for each separate memory
    access type used in the CCB command. ADI checking only occurs when reading data. When writing data,
    the specified ADI version number overwrites any existing ADI value in memory.

    An ADI version value of 0 or 0xF indicates that ADI checking is disabled for that data access, even if it is
    enabled in memory. By setting the appropriate flag in CCB_SUBMIT (Section 36.3.1, “ccb_submit”) it is
    also an option to disable ADI checking for all inputs accessed via virtual address for all CCBs submitted
    during that hypercall invocation.

    The ADI value is only guaranteed to be checked on the first 64 bytes of each data access. Mismatches on
    subsequent data chunks may not be detected, so guest software should be careful to use page size checking
    to protect against buffer overruns.

36.2.1.1.8. Page size checking

    All data accesses used in CCB commands must be bounded within a single memory page. When addresses
    are provided using a virtual address, the page size for checking is extracted from the TTE for that virtual
    address. When using real addresses, the guest must supply the page size in the same field as the address
    value. The page size must be one of the sizes supported by the underlying virtual machine. Using a value
    that is not supported may result in the CCB submission being rejected or the generation of a CCB parsing
    error in the completion area.

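    Guest software can apply the single-page rule itself before building a CCB. The check below is a
    minimal sketch: within_one_page is a hypothetical helper, and the caller is assumed to know the
    page size in bytes (a power of two) that applies to the buffer.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if the byte range [addr, addr + len) lies within one page of
 * page_size bytes, and 0 if it crosses a page boundary or wraps around.
 * page_size must be a non-zero power of two.
 */
static int within_one_page(uint64_t addr, uint64_t len, uint64_t page_size)
{
    uint64_t mask;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    if (len == 0)
        return 1;                        /* empty range touches no page */
    if (len - 1 > UINT64_MAX - addr)
        return 0;                        /* range wraps the address space */

    mask = ~(page_size - 1);
    return (addr & mask) == ((addr + len - 1) & mask);
}
```

    With an illustrative 8 KB page, a 64-byte buffer at 0x3FC0 passes the check, while a 128-byte buffer
    at the same address does not.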
36.2.1.2. Extract command

    Converts an input vector in one format to an output vector in another format. All input format types are
    supported.

    The only supported output format is a padded, byte-aligned output stream, using output codes 0x0 - 0x4.
    When the specified output element size is larger than the extracted input element size, zeros are padded to
    the extracted input element. First, if the decompressed input size is not a whole number of bytes, 0 bits are
    padded to the most significant bit side until the next byte boundary. Next, if the output element size is larger
    than the byte padded input element, bytes of value 0 are added based on the Padding Direction bit in the
    CCB. If the output element size is smaller than the byte-padded input element size, the input element is
    truncated by dropping bytes from the least significant byte side until the selected output size is reached.

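    The byte-level padding and truncation step can be modelled in software as below. This is a sketch of
    the rule as worded above, not of the hardware: extract_resize is a hypothetical name, the input is
    assumed to be already padded to whole bytes, and elements are assumed to be stored most significant
    byte first, so that the "left side" is the start of the buffer and the least significant byte is at
    the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Resize one byte-padded element from in_len bytes to out_len bytes.
 *
 * pad_left != 0: zero bytes are added on the left (start) of the output.
 * pad_left == 0: zero bytes are added on the right (end) of the output.
 * If out_len < in_len, bytes are dropped from the least significant (end) side.
 */
static void extract_resize(const uint8_t *in, size_t in_len,
                           uint8_t *out, size_t out_len, int pad_left)
{
    size_t pad;

    assert(in != NULL && out != NULL);

    if (out_len <= in_len) {
        memcpy(out, in, out_len);        /* keep the most significant bytes */
        return;
    }

    pad = out_len - in_len;
    if (pad_left) {
        memset(out, 0, pad);
        memcpy(out + pad, in, in_len);
    } else {
        memcpy(out, in, in_len);
        memset(out + in_len, 0, pad);
    }
}
```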
    The return value of the CCB completion area is invalid. The “number of elements processed” field in the
    CCB completion area will be valid.

    The extract CCB is a 64-byte “short format” CCB.

    The extract CCB command format can be specified by the following packed C structure for a big-endian
    machine:


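            struct extract_ccb {
                    uint32_t header;               /* offset 0:  CCB header */
                    uint32_t control;              /* offset 4:  command control */
                    uint64_t completion;           /* offset 8:  completion */
                    uint64_t primary_input;        /* offset 16: primary input */
                    uint64_t data_access_control;  /* offset 24: data access control */
                    uint64_t secondary_input;      /* offset 32: secondary input */
                    uint64_t reserved;             /* offset 40: reserved */
                    uint64_t output;               /* offset 48: output */
                    uint64_t table;                /* offset 56: symbol table */
            };
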
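    The layout can be cross-checked at compile time against the offsets listed for this command. The
    sketch below repeats the structure so that it is self-contained and uses C11 static_assert; it
    relies on the observation that every member is naturally aligned, so a typical compiler inserts no
    padding even without a packed attribute. That is an assumption about the compiler, not a statement
    from the specification.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

struct extract_ccb {
    uint32_t header;
    uint32_t control;
    uint64_t completion;
    uint64_t primary_input;
    uint64_t data_access_control;
    uint64_t secondary_input;
    uint64_t reserved;
    uint64_t output;
    uint64_t table;
};

/* The extract CCB is a 64-byte "short format" CCB. */
static_assert(sizeof(struct extract_ccb) == 64, "extract CCB must be 64 bytes");

/* Field offsets as listed in the field table for this command. */
static_assert(offsetof(struct extract_ccb, control) == 4, "control at 4");
static_assert(offsetof(struct extract_ccb, completion) == 8, "completion at 8");
static_assert(offsetof(struct extract_ccb, primary_input) == 16, "primary input at 16");
static_assert(offsetof(struct extract_ccb, data_access_control) == 24, "DAC at 24");
static_assert(offsetof(struct extract_ccb, secondary_input) == 32, "secondary input at 32");
static_assert(offsetof(struct extract_ccb, output) == 48, "output at 48");
static_assert(offsetof(struct extract_ccb, table) == 56, "table at 56");
```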
    The exact field offsets, sizes, and composition are as follows:

    Offset  Size  Field Description
    0       4     CCB header (Table 36.1, “CCB Header Format”)
    4       4     Command control
                  Bits       Field Description
                  [31:28]    Primary Input Format (see Section 36.2.1.1.1, “Primary Input
                             Format”)
                  [27:23]    Primary Input Element Size (see Section 36.2.1.1.2, “Primary
                             Input Element Size”)
                  [22:20]    Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
                             Element Offsets”)
                  [19]       Secondary Input Format (see Section 36.2.1.1.3, “Secondary
                             Input Format”)
                  [18:16]    Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
                             Element Offsets”)
                  [15:14]    Secondary Input Element Size (see Section 36.2.1.1.4,
                             “Secondary Input Element Size”)
                  [13:10]    Output Format (see Section 36.2.1.1.6, “Output Format”)
                  [9]        Padding Direction selector: A value of 1 causes padding bytes
                             to be added to the left side of output elements. A value of 0
                             causes padding bytes to be added to the right side of output
                             elements.
                  [8:0]      Reserved
    8       8     Completion
                  Bits       Field Description
                  [63:60]    ADI version (see Section 36.2.1.1.7, “Application Data
                             Integrity (ADI)”)
                  [59]       If set to 1, a virtual device interrupt will be generated using
                             the device interrupt number specified in the lower bits of this
                             completion word. If 0, the lower bits of this completion word
                             are ignored.
                  [58:6]     Completion area address bits [58:6]. Address type is
                             determined by CCB header.
                  [5:0]      Virtual device interrupt number for completion interrupt, if
                             enabled.
    16      8     Primary Input
                  Bits       Field Description
                  [63:60]    ADI version (see Section 36.2.1.1.7, “Application Data
                             Integrity (ADI)”)
                  [59:56]    If using real address, these bits should be filled in with the
                             page size code for the page boundary checking the guest wants
                             the virtual machine to use when accessing this data stream
                             (checking is only guaranteed to be performed when using API
                             version 1.1 and later). If using a virtual address, this field will
                             be used as primary input address bits [59:56].
                  [55:0]     Primary input address bits [55:0]. Address type is determined
                             by CCB header.
    24      8     Data Access Control
                  Bits       Field Description
                  [63:62]    Flow Control
                             Value    Description
                             0b'00    Disable flow control
                             0b'01    Enable flow control (only valid with "ORCL,sun4v-dax-fc"
                                      compatible virtual device variants)
                             0b'10    Reserved
                             0b'11    Reserved
                  [61:60]    Reserved (API 1.0)
                             Pipeline target (API 2.0)
                             Value    Description
                             0b'00    Connect to primary input
                             0b'01    Connect to secondary input
                             0b'10    Reserved
                             0b'11    Reserved
                  [59:40]    Output buffer size given in units of 64 bytes, minus 1. Value of
                             0 means 64 bytes, value of 1 means 128 bytes, etc. Buffer size is
                             only enforced if flow control is enabled in Flow Control field.
                  [39:32]    Reserved
                  [31:30]    Output Data Cache Allocation
                             Value    Description
                             0b'00    Do not allocate cache lines for output data stream.
                             0b'01    Force cache lines for output data stream to be
                                      allocated in the cache that is local to the submitting
                                      virtual cpu.
                             0b'10    Allocate cache lines for output data stream, but allow
                                      existing cache lines associated with the data to remain
                                      in their current cache instance. Any memory not
                                      already in cache will be allocated in the cache local
                                      to the submitting virtual cpu.
                             0b'11    Reserved
                  [29:26]    Reserved
                  [25:24]    Primary Input Length Format
                             Value    Description
                             0b'00    Number of primary symbols
                             0b'01    Number of primary bytes
                             0b'10    Number of primary bits
                             0b'11    Reserved
                  [23:0]     Primary Input Length
                             Format                  Field Value
                             # of primary symbols    Number of input elements to process,
                                                     minus 1. Command execution stops
                                                     once count is reached.
                             # of primary bytes      Number of input bytes to process,
                                                     minus 1. Command execution stops
                                                     once count is reached. The count is
                                                     done before any decompression or
                                                     decoding.
                             # of primary bits       Number of input bits to process,
                                                     minus 1. Command execution stops
                                                     once count is reached. The count is
                                                     done before any decompression or
                                                     decoding, and does not include any
                                                     bits skipped by the Primary Input
                                                     Offset field value of the command
                                                     control word.
    32      8     Secondary Input, if used by Primary Input Format. Same fields as Primary
                  Input.
    40      8     Reserved
    48      8     Output (same fields as Primary Input)
    56      8     Symbol Table (if used by Primary Input)
                  Bits       Field Description
                  [63:60]    ADI version (see Section 36.2.1.1.7, “Application Data
                             Integrity (ADI)”)
                  [59:56]    If using real address, these bits should be filled in with the
                             page size code for the page boundary checking the guest wants
                             the virtual machine to use when accessing this data stream
                             (checking is only guaranteed to be performed when using API
                             version 1.1 and later). If using a virtual address, this field will
                             be used as symbol table address bits [59:56].
                  [55:4]     Symbol table address bits [55:4]. Address type is determined
                             by CCB header.
                  [3:0]      Symbol table version
                             Value    Description
                             0        Huffman encoding. Must use 64 byte aligned table
                                      address. (Only available when using version 0 CCBs)
                             1        OZIP encoding. Must use 16 byte aligned table
                                      address. (Only available when using version 1 CCBs)

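    The bit layouts in the table above translate into shift-and-mask code along the following lines. All
    function and parameter names are hypothetical. The sketch leaves the Pipeline target bits [61:60] of
    the Data Access Control word at zero, and it takes sizes and lengths as natural counts, applying the
    "minus 1" encodings internally.

```c
#include <assert.h>
#include <stdint.h>

/* Command control word of an extract CCB (offset 4). */
static uint32_t extract_control_pack(unsigned pri_fmt, unsigned pri_elem_size,
                                     unsigned pri_off, unsigned sec_fmt,
                                     unsigned sec_off, unsigned sec_size_code,
                                     unsigned out_fmt, unsigned pad_left)
{
    assert(pri_fmt <= 0xF && pri_elem_size >= 1 && pri_elem_size <= 32);
    assert(pri_off <= 7 && sec_fmt <= 1 && sec_off <= 7);
    assert(sec_size_code <= 3 && out_fmt <= 0xF && pad_left <= 1);

    return ((uint32_t)pri_fmt << 28) |              /* [31:28] */
           ((uint32_t)(pri_elem_size - 1) << 23) |  /* [27:23], size minus one */
           ((uint32_t)pri_off << 20) |              /* [22:20] */
           ((uint32_t)sec_fmt << 19) |              /* [19]    */
           ((uint32_t)sec_off << 16) |              /* [18:16] */
           ((uint32_t)sec_size_code << 14) |        /* [15:14] */
           ((uint32_t)out_fmt << 10) |              /* [13:10] */
           ((uint32_t)pad_left << 9);               /* [9]; [8:0] reserved */
}

/* Completion word (offset 8). addr must be 64-byte aligned and fit in 59 bits. */
static uint64_t completion_pack(unsigned adi, uint64_t addr,
                                unsigned irq_enable, unsigned irq_num)
{
    assert(adi <= 0xF && irq_enable <= 1 && irq_num <= 0x3F);
    assert((addr & 0x3F) == 0 && (addr >> 59) == 0);

    return ((uint64_t)adi << 60) |                  /* [63:60] */
           ((uint64_t)irq_enable << 59) |           /* [59]    */
           addr |                                   /* [58:6]  */
           (irq_enable ? irq_num : 0u);             /* [5:0]   */
}

/* Data Access Control word (offset 24), pipeline target left at 0. */
static uint64_t dac_pack(unsigned flow_ctl, uint64_t out_buf_bytes,
                         unsigned cache_alloc, unsigned len_fmt,
                         uint32_t length)
{
    uint64_t units;

    assert(flow_ctl <= 1 && cache_alloc <= 2 && len_fmt <= 2);
    assert(out_buf_bytes != 0 && out_buf_bytes % 64 == 0);
    assert(length >= 1 && length - 1 <= 0xFFFFFF);

    units = out_buf_bytes / 64 - 1;                 /* 64-byte units, minus 1 */
    assert(units <= 0xFFFFF);

    return ((uint64_t)flow_ctl << 62) |             /* [63:62] */
           (units << 40) |                          /* [59:40] */
           ((uint64_t)cache_alloc << 30) |          /* [31:30] */
           ((uint64_t)len_fmt << 24) |              /* [25:24] */
           (uint64_t)(length - 1);                  /* [23:0], count minus 1 */
}
```

    For example, a 7-bit fixed width bit packed input (format 0x1) extracted to 1 byte elements with left
    padding gives a control word of 0x13000200 under these definitions.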
36.2.1.3. Scan commands

    The scan commands search a stream of input data elements for values which match the selection criteria.
    All the input format types are supported. There are multiple formats for the scan commands, allowing the
    scan to search for exact matches to one value, exact matches to either of two values, or any value within
    a specified range. The specific type of scan is indicated by the command code in the CCB header. For the
    scan range commands, the boundary conditions can be specified as greater-than-or-equal-to a value,
    less-than-or-equal-to a value, or both by using two boundary values.

    There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
    0xD, and 0xE). For the standard scan command using the bit vector output, for each input element there
    exists one bit in the vector that is set if the input element matched the scan criteria, or clear if not. The
    inverted scan command inverts the polarity of the bits in the output. The most significant bit of the first
    byte of the output stream corresponds to the first element in the input stream. The standard index array
    output format contains one array entry for each input element that matched the scan criteria. Each array
    entry is the index of an input element that matched the scan criteria. An inverted scan command produces
    a similar array, but of all the input elements which did NOT match the scan criteria.

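    The relationship between the two output formats can be illustrated in software. The helpers below are
    hypothetical and operate on host memory only: bitvec_test reads the bit for element i using the
    most-significant-bit-first ordering described above, and bitvec_to_indices lists the indices of all
    set bits, which is what an index array output would contain for the same scan. Index entries are
    held as uint32_t here regardless of whether format 0xD or 0xE would be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the bit for input element i; element 0 is the MSB of byte 0. */
static int bitvec_test(const uint8_t *vec, size_t i)
{
    assert(vec != NULL);
    return (vec[i / 8] >> (7 - (i % 8))) & 1;
}

/*
 * Writes the indices of set bits among the first n_elems elements into out[]
 * (at most out_cap entries) and returns the total number of set bits found.
 */
static size_t bitvec_to_indices(const uint8_t *vec, size_t n_elems,
                                uint32_t *out, size_t out_cap)
{
    size_t count = 0;

    for (size_t i = 0; i < n_elems; i++) {
        if (!bitvec_test(vec, i))
            continue;
        if (count < out_cap)
            out[count] = (uint32_t)i;
        count++;
    }
    return count;
}
```

    The returned count plays the same role as the completion area return value described below for the
    scan commands, under the assumption that the bit vector came from the corresponding scan.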
    The return value of the CCB completion area contains the number of input elements found which match
    the scan criteria (or number that did not match for the inverted scans). The “number of elements processed”
    field in the CCB completion area will be valid, indicating the number of input elements processed.

    These commands are 128-byte “long format” CCBs.

    The scan CCB command format can be specified by the following packed C structure for a big-endian
    machine:


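            struct scan_ccb {
                    uint32_t header;               /* offset 0:  CCB header */
                    uint32_t control;              /* offset 4:  command control */
                    uint64_t completion;           /* offset 8:  completion */
                    uint64_t primary_input;        /* offset 16: primary input */
                    uint64_t data_access_control;  /* offset 24: data access control */
                    uint64_t secondary_input;      /* offset 32: secondary input */
                    uint64_t match_criteria0;      /* offset 40: operand bytes at 40 and 44 */
                    uint64_t output;               /* offset 48: output */
                    uint64_t table;                /* offset 56: symbol table */
                    uint64_t match_criteria1;      /* offset 64: operand bytes at 64 and 68 */
                    uint64_t match_criteria2;      /* offset 72: operand bytes at 72 and 76 */
                    uint64_t match_criteria3;      /* offset 80: operand bytes at 80 and 84 */
                    uint64_t reserved[5];          /* offset 88: pads the CCB to 128 bytes */
            };
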
    The exact field offsets, sizes, and composition are as follows:

644Offset Size Field Description
6450 4 CCB header (Table 36.1, “CCB Header Format”)
6464 4 Command control
647 Bits Field Description
648 [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
649 Format”)
650 [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
651 Input Element Size”)
652 [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
653 Element Offsets”)
654 [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
655 Input Format”)
656 [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
657 Element Offsets”)
658 [15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
659 “Secondary Input Element Size”
660 [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
661 [9:5] Operand size for first scan criteria value. In a scan value
662 operation, this is one of two potential extact match values.
663 In a scan range operation, this is the size of the upper range
664
665
666 520
667 Coprocessor services
668
669
670Offset Size Field Description
671 Bits Field Description
672 boundary. The value of this field is the number of bytes in the
673 operand, minus 1. Values 0xF-0x1E are reserved. A value of
674 0x1F indicates this operand is not in use for this scan operation.
675 [4:0] Operand size for second scan criteria value. In a scan value
676 operation, this is one of two potential extact match values.
677 In a scan range operation, this is the size of the lower range
678 boundary. The value of this field is the number of bytes in the
679 operand, minus 1. Values 0xF-0x1E are reserved. A value of
680 0x1F indicates this operand is not in use for this scan operation.
6818 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
68216 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”)
68324 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
68432 8 Secondary Input, if used by Primary Input Format. Same fields as Primary
685 Input.
68640 4 Most significant 4 bytes of first scan criteria operand. If first operand is less
687 than 4 bytes, the value is left-aligned to the lowest address bytes.
68844 4 Most significant 4 bytes of second scan criteria operand. If second operand
689 is less than 4 bytes, the value is left-aligned to the lowest address bytes.
69048 8 Output (same fields as Primary Input)
69156 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
692 “Extract command”
69364 4 Next 4 most significant bytes of first scan criteria operand occurring after the
694 bytes specified at offset 40, if needed by the operand size. If first operand
695 is less than 8 bytes, the valid bytes are left-aligned to the lowest address.
69668 4 Next 4 most significant bytes of second scan criteria operand occurring after
697 the bytes specified at offset 44, if needed by the operand size. If second
698 operand is less than 8 bytes, the valid bytes are left-aligned to the lowest
699 address.
70072 4 Next 4 most significant bytes of first scan criteria operand occurring after the
701 bytes specified at offset 64, if needed by the operand size. If first operand
702 is less than 12 bytes, the valid bytes are left-aligned to the lowest address.
70376 4 Next 4 most significant bytes of second scan criteria operand occurring after
704 the bytes specified at offset 68, if needed by the operand size. If second
705 operand is less than 12 bytes, the valid bytes are left-aligned to the lowest
706 address.
70780 4 Next 4 most significant bytes of first scan criteria operand occurring after the
708 bytes specified at offset 72, if needed by the operand size. If first operand
709 is less than 16 bytes, the valid bytes are left-aligned to the lowest address.
71084 4 Next 4 most significant bytes of second scan criteria operand occurring after
711 the bytes specified at offset 76, if needed by the operand size. If second
712 operand is less than 16 bytes, the valid bytes are left-aligned to the lowest
713 address.
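The operand size encoding in control bits [9:5] and [4:0] can be sketched in C as follows. This is an illustrative sketch only: the helper names are hypothetical, and treating a length of 0 as "operand not in use" is a convention of this example, not part of the specification.

```c
#include <stdint.h>

#define SCAN_OPERAND_UNUSED 0x1Fu   /* field value: operand not in use */

/*
 * Encode an operand length in bytes into a 5-bit size field
 * (number of bytes minus 1). A length of 0 is mapped to "not in use"
 * (assumption of this sketch). Returns -1 if the encoding would fall
 * in the reserved range 0xF-0x1E.
 */
static int scan_operand_size_field(unsigned nbytes)
{
    if (nbytes == 0)
        return (int)SCAN_OPERAND_UNUSED;
    if (nbytes - 1 >= 0xF)
        return -1;
    return (int)(nbytes - 1);
}

/*
 * Build bits [9:0] of the scan command control word from the lengths
 * of the first and second criteria operands. Returns 0 on success.
 */
static int scan_criteria_bits(unsigned first_bytes, unsigned second_bytes,
                              uint32_t *bits)
{
    int first = scan_operand_size_field(first_bytes);
    int second = scan_operand_size_field(second_bytes);

    if (first < 0 || second < 0)
        return -1;
    *bits = ((uint32_t)first << 5) | (uint32_t)second;
    return 0;
}
```

For example, a scan value operation with a single 4-byte match value would use 3 in bits [9:5] and 0x1F in bits [4:0].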
714
72236.2.1.4. Translate commands
723
724 The translate commands take an input array of indices, and a table of single-bit values indexed by those
725 indices, and output a bit vector or index array created by reading the table's bit value at each index in
726 the input array. The output should therefore contain exactly one bit per index in the input data stream,
727 when outputting as a bit vector. When outputting as an index array, the number of elements depends on the
728 values read in the bit table, but will always be less than, or equal to, the number of input elements. Only
729 a restricted subset of the possible input format types is supported. No variable width or Huffman/OZIP
730 encoded input streams are allowed. The primary input data element size must be 3 bytes or less.
731
732 The maximum table index size allowed is 15 bits, however, larger input elements may be used to provide
733 additional processing of the output values. If 2 or 3 byte values are used, the least significant 15 bits are
734 used as an index into the bit table. The most significant 9 bits (when using 3-byte input elements) or single
735 bit (when using 2-byte input elements) are compared against a fixed 9-bit test value provided in the CCB.
736 If the values match, the value from the bit table is used as the output element value. If the values do not
737 match, the output data element value is forced to 0.
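The index and test-value handling described above can be modelled in software. The following is a sketch under stated assumptions, not a description of the hardware: the function name is hypothetical, the bit order within each table byte (most significant bit first) is assumed, and for 2-byte elements the single extra bit is assumed to be compared against the least significant bit of the test value.

```c
#include <stdint.h>

/*
 * Hypothetical software model of one translate step.
 *   elem       input element value (1, 2 or 3 bytes wide)
 *   elem_bytes width of the element in bytes
 *   test       9-bit test value from control bits [8:0]
 *   table      bit table (bit order within a byte is an assumption)
 *   invert     nonzero models the inverted translate operation
 * Returns the output bit for this element.
 */
static int translate_element(uint32_t elem, unsigned elem_bytes,
                             uint16_t test, const uint8_t *table,
                             int invert)
{
    uint32_t index = elem & 0x7FFF;   /* low 15 bits index the table */
    int bit = (table[index >> 3] >> (7 - (index & 7))) & 1;

    if (invert)
        bit = !bit;

    if (elem_bytes == 3) {
        /* upper 9 bits must match the test value */
        if (((elem >> 15) & 0x1FF) != (uint32_t)(test & 0x1FF))
            return 0;
    } else if (elem_bytes == 2) {
        /* single upper bit; comparison against test bit 0 is assumed */
        if (((elem >> 15) & 0x1) != (uint32_t)(test & 0x1))
            return 0;
    }
    return bit;
}
```

Note that the mismatch check is applied after the inversion, so a mismatch forces the output to 0 in both the normal and inverted variants.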
738
739 In the inverted translate operation, the bit value read from the bit table is inverted prior to its use. The
740 additional processing based on any non-index bits remains unchanged, and still forces the output
741 element value to 0 on a mismatch. The specific type of translate command is indicated by the command
742 code in the CCB header.
743
744 There are two supported formats for the output stream: the bit vector and index array formats (codes 0x8,
745 0xD, and 0xE). The index array format is an array of indices of bits which would have been set if the
746 output format were a bit vector.
747
748 The return value of the CCB completion area contains the number of bits set in the output bit vector,
749 or number of elements in the output index array. The “number of elements processed” field in the CCB
750 completion area will be valid, indicating the number of input elements processed.
751
752 These commands are 64-byte “short format” CCBs.
753
754 The translate CCB command format can be specified by the following packed C structure for a big-endian
755 machine:
756
757
758 struct translate_ccb {
759 uint32_t header;
760 uint32_t control;
761 uint64_t completion;
762 uint64_t primary_input;
763 uint64_t data_access_control;
764 uint64_t secondary_input;
765 uint64_t reserved;
766 uint64_t output;
767 uint64_t table;
768 };
769
770
771 The exact field offsets, sizes, and composition are as follows:
772
773
774 Offset Size Field Description
775 0 4 CCB header (Table 36.1, “CCB Header Format”)
7834 4 Command control
784 Bits Field Description
785 [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
786 Format”)
787 [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
788 Input Element Size”)
789 [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
790 Element Offsets”)
791 [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
792 Input Format”)
793 [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
794 Element Offsets”)
795 [15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
796 “Secondary Input Element Size”)
797 [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
798 [9] Reserved
799 [8:0] Test value used for comparison against the most significant bits
800 in the input values, when using 2 or 3 byte input elements.
8018 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
80216 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”)
80324 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”,
804 except Primary Input Length Format may not use the 0x0 value)
80532 8 Secondary Input, if used by Primary Input Format. Same fields as Primary
806 Input.
80740 8 Reserved
80848 8 Output (same fields as Primary Input)
80956 8 Bit Table
810 Bits Field Description
811 [63:60] ADI version (see Section 36.2.1.1.7, “Application Data
812 Integrity (ADI)”)
813 [59:56] If using real address, these bits should be filled in with the
814 page size code for the page boundary checking the guest wants
815 the virtual machine to use when accessing this data stream
816 (checking is only guaranteed to be performed when using API
817 version 1.1 and later). If using a virtual address, this field will
818 be used as bit table address bits [59:56]
819 [55:4] Bit table address bits [55:4]. Address type is determined by
820 CCB header. Address must be 64-byte aligned (CCB version
821 0) or 16-byte aligned (CCB version 1).
822 [3:0] Bit table version
823 Value Description
824 0 4KB table size
825 1 8KB table size
826
83336.2.1.5. Select command
834 The select command filters the primary input data stream by using a secondary input bit vector to determine
835 which input elements to include in the output. For each bit set at a given index N within the bit vector,
836 the Nth input element is included in the output. If the bit is not set, the element is not included. Only a
837 restricted subset of the possible input format types are supported. No variable width or run length encoded
838 input streams are allowed, since the secondary input stream is used for the filtering bit vector.
839
840 The only supported output format is a padded, byte-aligned output stream. The stream follows the same
841 rules and restrictions as padded output stream described in Section 36.2.1.2, “Extract command”.
842
843 The return value of the CCB completion area contains the number of bits set in the input bit vector. The
844 "number of elements processed" field in the CCB completion area will be valid, indicating the number
845 of input elements processed.
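As an illustration of the filtering rule, the sketch below models select in software for 1-byte input elements. It is a simplified sketch: the function name is hypothetical, the bit order within the bit vector (most significant bit of each byte first) is an assumption, and output padding is not modelled.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of select for 1-byte elements: element N of the
 * input is copied to the output if bit N of the bit vector is set.
 * Returns the number of elements written, which equals the number of
 * bits set among the first n bits of the bit vector.
 */
static size_t select_bytes(const uint8_t *in, size_t n,
                           const uint8_t *bitvec, uint8_t *out)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        /* bit order within a byte is an assumption of this sketch */
        if ((bitvec[i >> 3] >> (7 - (i & 7))) & 1)
            out[count++] = in[i];
    }
    return count;
}
```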
846
847 The select CCB is a 64-byte “short format” CCB.
848
849 The select CCB command format can be specified by the following packed C structure for a big-endian
850 machine:
851
852
853 struct select_ccb {
854 uint32_t header;
855 uint32_t control;
856 uint64_t completion;
857 uint64_t primary_input;
858 uint64_t data_access_control;
859 uint64_t secondary_input;
860 uint64_t reserved;
861 uint64_t output;
862 uint64_t table;
863 };
864
865
866 The exact field offsets, sizes, and composition are as follows:
867
868 Offset Size Field Description
869 0 4 CCB header (Table 36.1, “CCB Header Format”)
870 4 4 Command control
871 Bits Field Description
872 [31:28] Primary Input Format (see Section 36.2.1.1.1, “Primary Input
873 Format”)
874 [27:23] Primary Input Element Size (see Section 36.2.1.1.2, “Primary
875 Input Element Size”)
876 [22:20] Primary Input Starting Offset (see Section 36.2.1.1.5, “Input
877 Element Offsets”)
878 [19] Secondary Input Format (see Section 36.2.1.1.3, “Secondary
879 Input Format”)
880 [18:16] Secondary Input Starting Offset (see Section 36.2.1.1.5, “Input
881 Element Offsets”)
882 [15:14] Secondary Input Element Size (see Section 36.2.1.1.4,
883 “Secondary Input Element Size”)
892 [13:10] Output Format (see Section 36.2.1.1.6, “Output Format”)
893 [9] Padding Direction selector: A value of 1 causes padding bytes
894 to be added to the left side of output elements. A value of 0
895 causes padding bytes to be added to the right side of output
896 elements.
897 [8:0] Reserved
898 8 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
899 16 8 Primary Input (same fields as Section 36.2.1.2, “Extract command”)
900 24 8 Data Access Control (same fields as Section 36.2.1.2, “Extract command”)
901 32 8 Secondary Bit Vector Input. Same fields as Primary Input.
902 40 8 Reserved
903 48 8 Output (same fields as Primary Input)
904 56 8 Symbol Table (if used by Primary Input). Same fields as Section 36.2.1.2,
905 “Extract command”
906
90736.2.1.6. No-op and Sync commands
908 The no-op (no operation) command is a CCB which has no processing effect. The CCB, when processed
909 by the virtual machine, simply updates the completion area with its execution status. The CCB may have
910 the serial-conditional flags set in order to restrict when it executes.
911
912 The sync command is a variant of the no-op command with restricted execution timing. A sync
913 command CCB will only execute when all previous commands submitted in the same request have
914 completed. This is stronger than the conditional flag sequencing, which is only dependent on a single
915 previous serial CCB. While the relative ordering is guaranteed, virtual machine implementations with
916 shared hardware resources may cause the sync command to wait for longer than the minimum required
917 time.
918
919 The return value of the CCB completion area is invalid for these CCBs. The “number of elements
920 processed” field is also invalid for these CCBs.
921
922 These commands are 64-byte “short format” CCBs.
923
924 The no-op CCB command format can be specified by the following packed C structure for a big-endian
925 machine:
926
927
928 struct nop_ccb {
929 uint32_t header;
930 uint32_t control;
931 uint64_t completion;
932 uint64_t reserved[6];
933 };
934
935
936 The exact field offsets, sizes, and composition are as follows:
937
938 Offset Size Field Description
939 0 4 CCB header (Table 36.1, “CCB Header Format”)
947 4 4 Command control
948 Bits Field Description
949 [31] If set, this CCB functions as a Sync command. If clear, this
950 CCB functions as a No-op command.
951 [30:0] Reserved
952 8 8 Completion (same fields as Section 36.2.1.2, “Extract command”)
953 16 48 Reserved
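The no-op layout can be checked at compile time. In the structure given above, the reserved area starts at offset 16 and spans 6 * 8 = 48 bytes, which gives the 64-byte short format. The sketch below repeats that structure; the macro and helper names are hypothetical, and the control word is assumed to be stored in host byte order on a big-endian machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct nop_ccb {
    uint32_t header;
    uint32_t control;
    uint64_t completion;
    uint64_t reserved[6];
};

/* Control bit [31]: set for Sync, clear for No-op (name is hypothetical) */
#define NOP_CONTROL_SYNC (UINT32_C(1) << 31)

static_assert(sizeof(struct nop_ccb) == 64,
              "short format CCB is 64 bytes");
static_assert(offsetof(struct nop_ccb, reserved) == 16,
              "reserved area starts at offset 16");
static_assert(sizeof(((struct nop_ccb *)0)->reserved) == 48,
              "reserved area is 48 bytes");

/* Select between the No-op and Sync variants of the command. */
static void nop_ccb_set_sync(struct nop_ccb *ccb, int sync)
{
    if (sync)
        ccb->control |= NOP_CONTROL_SYNC;
    else
        ccb->control &= ~NOP_CONTROL_SYNC;
}
```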
954
95536.2.2. CCB Completion Area
956 All CCB commands use a common 128-byte Completion Area format, which can be specified by the
957 following packed C structure for a big-endian machine:
958
959
960 struct completion_area {
961 uint8_t status_flag;
962 uint8_t error_note;
963 uint8_t rsvd0[2];
964 uint32_t error_values;
965 uint32_t output_size;
966 uint32_t rsvd1;
967 uint64_t run_time;
968 uint64_t run_stats;
969 uint32_t elements;
970 uint8_t rsvd2[20];
971 uint64_t return_value;
972 uint64_t extra_return_value[8];
973 };
974
975
976 The Completion Area must be a 128-byte aligned memory location. The exact layout can be described
977 using byte offsets and sizes relative to the memory base:
978
979 Offset Size Field Description
980 0 1 CCB execution status
981 0x0 Command not yet completed
982 0x1 Command ran and succeeded
983 0x2 Command ran and failed (partial results may have been
984 produced)
985 0x3 Command ran and was killed (partial execution may
986 have occurred)
987 0x4 Command was not run
988 0x5-0xF Reserved
989 1 1 Error reason code
990 0x0 Reserved
991 0x1 Buffer overflow
999 0x2 CCB decoding error
1000 0x3 Page overflow
1001 0x4-0x6 Reserved
1002 0x7 Command was killed
1003 0x8 Command execution timeout
1004 0x9 ADI miscompare error
1005 0xA Data format error
1006 0xB-0xD Reserved
1007 0xE Unexpected hardware error (Do not retry)
1008 0xF Unexpected hardware error (Retry is ok)
1009 0x10-0x7F Reserved
1010 0x80 Partial Symbol Warning
1011 0x81-0xFF Reserved
10122 2 Reserved
10134 4 If a partial symbol warning was generated, this field contains the number
1014 of remaining bits which were not decoded.
10158 4 Number of bytes of output produced
101612 4 Reserved
101716 8 Runtime of command (unspecified time units)
101824 8 Reserved
101932 4 Number of elements processed
102036 20 Reserved
102156 8 Return value
102264 64 Extended return value
1023
1024The CCB completion area should be treated as read-only by guest software. The CCB execution status
1025byte will be cleared by the Hypervisor to reflect the pending execution status when the CCB is submitted
1026successfully. All other fields are considered invalid upon CCB submission until the CCB execution status
1027byte becomes non-zero.
1028
1029CCBs which complete with status 0x2 or 0x3 may produce partial results and/or side effects due to partial
1030execution of the CCB command. Some valid data may be accessible depending on the fault type, however,
1031it is recommended that guest software treat the destination buffer as being in an unknown state. If a CCB
1032completes with a status byte of 0x2, the error reason code byte can be read to determine what corrective
1033action should be taken.
1034
1035A buffer overflow indicates that the results of the operation exceeded the size of the output buffer indicated
1036in the CCB. The operation can be retried by resubmitting the CCB with a larger output buffer.
1037
1038A CCB decoding error indicates that the CCB contained some invalid field values. It may also be
1039triggered if the CCB output is directed at a non-existent secondary input and the pipelining hint is followed.
1040
1041A page overflow error indicates that the operation required accessing a memory location beyond the page
1042size associated with a given address. No data will have been read or written past the page boundary, but
1043partial results may have been written to the destination buffer. The CCB can be resubmitted with a larger
1044page size memory allocation to complete the operation.
1045
1051 In the case of pipelined CCBs, a page overflow error will be triggered if the output from the pipeline source
1052 CCB ends before the input of the pipeline target CCB. Page boundaries are ignored when the pipeline
1053 hint is followed.
1054
1055 Command kill indicates that the CCB execution was halted or prevented by use of the ccb_kill API call.
1056
1057 Command timeout indicates that the CCB execution began, but did not complete within a pre-determined
1058 limit set by the virtual machine. The command may have produced some or no output. The CCB may be
1059 resubmitted with no alterations.
1060
1061 ADI miscompare indicates that the memory buffer version specified in the CCB did not match the value
1062 in memory when accessed by the virtual machine. Guest software should not attempt to resubmit the CCB
1063 without determining the cause of the version mismatch.
1064
1065 A data format error indicates that the input data stream did not follow the specified data input formatting
1066 selected in the CCB.
1067
1068 Some CCBs which encounter hardware errors may be resubmitted without change. Persistent hardware
1069 errors may result in multiple failures until RAS software can identify and isolate the faulty component.
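The corrective actions described above can be summarised as a small decision function. This is a sketch of one possible guest policy, not a required behaviour: the enum and function names are hypothetical, and the handling of statuses 0x3 and 0x4 and of reason codes not discussed above (treated as "give up" here) is an assumption left to the caller.

```c
#include <stdint.h>

enum ccb_next_step {
    CCB_PENDING,                 /* status 0x0: not yet completed */
    CCB_SUCCESS,                 /* status 0x1 */
    CCB_RESUBMIT_UNCHANGED,      /* timeout, retryable hardware error */
    CCB_RESUBMIT_LARGER_OUTPUT,  /* buffer overflow */
    CCB_RESUBMIT_LARGER_PAGE,    /* page overflow */
    CCB_GIVE_UP                  /* anything else: caller decides */
};

/* Map the status byte (offset 0) and error reason code (offset 1). */
static enum ccb_next_step ccb_classify(uint8_t status, uint8_t err)
{
    switch (status) {
    case 0x0:
        return CCB_PENDING;
    case 0x1:
        return CCB_SUCCESS;
    case 0x2:
        break;                   /* failed: inspect the reason code */
    default:
        return CCB_GIVE_UP;      /* killed, not run, or reserved */
    }

    switch (err) {
    case 0x1:
        return CCB_RESUBMIT_LARGER_OUTPUT;
    case 0x3:
        return CCB_RESUBMIT_LARGER_PAGE;
    case 0x8:                    /* command execution timeout */
    case 0xF:                    /* hardware error, retry is ok */
        return CCB_RESUBMIT_UNCHANGED;
    default:
        return CCB_GIVE_UP;      /* e.g. 0x9 ADI miscompare, 0xE */
    }
}
```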
1070
1071 The output size field indicates the number of bytes of valid output in the destination buffer. This field is
1072 not valid for all possible CCB commands.
1073
1074 The runtime field indicates the execution time of the CCB command once it leaves the internal virtual
1075 machine queue. The time units are fixed, but unspecified, allowing only relative timing comparisons
1076 by guest software. The time units may also vary by hardware platform, and should not be construed to
1077 represent any absolute time value.
1078
1079 Some data query commands process data in units of elements. If applicable to the command, the number of
1080 elements processed is indicated in the listed field. This field is not valid for all possible CCB commands.
1081
1082 The return value and extended return value fields are output locations for commands which do not use
1083 a destination output buffer, or have secondary return results. The field is not valid for all possible CCB
1084 commands.
1085
108636.3. Hypervisor API Functions
108736.3.1. ccb_submit
1088 trap# FAST_TRAP
1089 function# CCB_SUBMIT
1090 arg0 address
1091 arg1 length
1092 arg2 flags
1093 arg3 reserved
1094 ret0 status
1095 ret1 length
1096 ret2 status data
1097 ret3 reserved
1098
1099 Submit one or more coprocessor control blocks (CCBs) for evaluation and processing by the virtual
1100 machine. The CCBs are passed in a linear array indicated by address. length indicates the size of
1101 the array in bytes.
1102
1108The address should be aligned to the size indicated by length, rounded up to the nearest power of
1109two. Virtual machine implementations may reject submissions which do not adhere to that alignment.
1110length must be a multiple of 64 bytes. If length is zero, the maximum supported array length will be
1111returned as length in ret1. In all other cases, the length value in ret1 will reflect the number of bytes
1112successfully consumed from the input CCB array.
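The alignment rule can be expressed as a short check. The helper names below are hypothetical, and the sketch assumes length has already been validated as a multiple of 64 bytes.

```c
#include <stdint.h>

/*
 * Alignment expected for a CCB array of the given length: the length
 * rounded up to the nearest power of two. Since length is a multiple
 * of 64, the search starts at 64.
 */
static uint64_t ccb_array_alignment(uint64_t length)
{
    uint64_t align = 64;

    /* the second condition only guards against shift overflow */
    while (align < length && align < (UINT64_C(1) << 63))
        align <<= 1;
    return align;
}

/* Nonzero if addr satisfies the alignment for an array of this length. */
static int ccb_array_aligned(uint64_t addr, uint64_t length)
{
    return (addr & (ccb_array_alignment(length) - 1)) == 0;
}
```

For example, an array of three 64-byte CCBs (192 bytes) would be expected on a 256-byte boundary.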
1113
1114 Implementation note
1115 Virtual machines should never reject submissions based on the alignment of address if the
1116 entire array is contained within a single memory page of the smallest page size supported by the
1117 virtual machine.
1118
1119A guest may choose to submit addresses used in this API function, including the CCB array address,
1120as either real or virtual addresses, with the type of each address indicated in flags. Virtual addresses
1121must be present in either the TLB or an active TSB to be processed. The translation context for virtual
1122addresses is determined by a combination of CCB contents and the flags argument.
1123
1124The flags argument is divided into multiple fields defined as follows:
1125
1126
1127Bits Field Description
1128[63:16] Reserved
1129[15] Disable ADI for VA reads (in API 2.0)
1130 Reserved (in API 1.0)
1131[14] Virtual addresses within CCBs are translated in privileged context
1132[13:12] Alternate translation context for virtual addresses within CCBs:
1133 0b'00 CCBs requesting alternate context are rejected
1134 0b'01 Reserved
1135 0b'10 CCBs requesting alternate context use secondary context
1136 0b'11 CCBs requesting alternate context use nucleus context
1137[11:9] Reserved
1138[8] Queue info flag
1139[7] All-or-nothing flag
1140[6] If address is a virtual address, treat its translation context as privileged
1141[5:4] Address type of address:
1142 0b'00 Real address
1143 0b'01 Virtual address in primary context
1144 0b'10 Virtual address in secondary context
1145 0b'11 Virtual address in nucleus context
1146[3:2] Reserved
1147[1:0] CCB command type:
1148 0b'00 Reserved
1149 0b'01 Reserved
1150 0b'10 Query command
1151 0b'11 Reserved
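The flags layout above can be written as a set of bit masks. All macro and function names in this sketch are hypothetical and chosen only for illustration; the bit positions are taken from the table.

```c
#include <stdint.h>

#define CCB_F_TYPE_QUERY        (UINT64_C(2) << 0)   /* bits [1:0] */
#define CCB_F_ADDR_REAL         (UINT64_C(0) << 4)   /* bits [5:4] */
#define CCB_F_ADDR_VA_PRIMARY   (UINT64_C(1) << 4)
#define CCB_F_ADDR_VA_SECONDARY (UINT64_C(2) << 4)
#define CCB_F_ADDR_VA_NUCLEUS   (UINT64_C(3) << 4)
#define CCB_F_ADDR_PRIVILEGED   (UINT64_C(1) << 6)
#define CCB_F_ALL_OR_NOTHING    (UINT64_C(1) << 7)
#define CCB_F_QUEUE_INFO        (UINT64_C(1) << 8)
#define CCB_F_ALT_SECONDARY     (UINT64_C(2) << 12)  /* bits [13:12] */
#define CCB_F_ALT_NUCLEUS       (UINT64_C(3) << 12)
#define CCB_F_CCB_VA_PRIVILEGED (UINT64_C(1) << 14)
#define CCB_F_DISABLE_ADI_VA    (UINT64_C(1) << 15)  /* API 2.0 only */

/*
 * Compose a flags argument for a query submission. addr_type is one
 * of the CCB_F_ADDR_* values describing the CCB array address.
 */
static uint64_t ccb_submit_flags(uint64_t addr_type, int all_or_nothing,
                                 int queue_info)
{
    uint64_t flags = CCB_F_TYPE_QUERY | addr_type;

    if (all_or_nothing)
        flags |= CCB_F_ALL_OR_NOTHING;
    if (queue_info)
        flags |= CCB_F_QUEUE_INFO;
    return flags;
}
```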
1152
1159 The CCB submission type and address type for the CCB array must be provided in the flags argument.
1160 All other fields are optional values which change the default behavior of the CCB processing.
1161
1162 When set to one, the "Disable ADI for VA reads" bit will turn off ADI checking when using a virtual
1163 address to load data. ADI checking will still be done when loading real-addressed memory. This bit is only
1164 available when using major version 2 of the coprocessor API group; at major version 1 it is reserved. For
1165 more information about using ADI and DAX, see Section 36.2.1.1.7, “Application Data Integrity (ADI)”.
1166
1167 By default, all virtual addresses are treated as user addresses. If the virtual address translations are
1168 privileged, they must be marked as such in the appropriate flags field. The virtual addresses used within
1169 the submitted CCBs must all be translated with the same privilege level.
1170
1171 By default, all virtual addresses used within the submitted CCBs are translated using the primary context
1172 active at the time of the submission. The address type field within a CCB allows each address to request
1173 translation in an alternate address context. The address context used when the alternate address context is
1174 requested is selected in the flags argument.
1175
1176 The all-or-nothing flag specifies whether the virtual machine should allow partial submissions of the
1177 input CCB array. When using CCBs with serial-conditional flags, it is strongly recommended to use
1178 the all-or-nothing flag to avoid broken conditional chains. Using long CCB chains on a machine under
1179 high coprocessor load may make this impractical, however, and require submitting without the flag.
1180 When submitting serial-conditional CCBs without the all-or-nothing flag, guest software must manually
1181 implement the serial-conditional behavior at any point where the chain was not submitted in a single API
1182 call, and resubmission of the remaining CCBs should clear any conditional flag that might be set in the
1183 first remaining CCB. Failure to do so will produce indeterminate CCB execution status and ordering.
1184
1185 When the all-or-nothing flag is not specified, callers should check the value of length in ret1 to determine
1186 how many CCBs from the array were successfully submitted. Any remaining CCBs can be resubmitted
1187 without modifications.
1188
1189 The value of length in ret1 is also valid when the API call returns an error, and callers should always
1190 check its value to determine which CCBs in the array were already processed. This will additionally
1191 identify which CCB encountered the processing error, and was not submitted successfully.
1192
1193 If the queue info flag is used during submission, and at least one CCB was successfully submitted, the
1194 length value in ret1 will be a multi-field value defined as follows:
1195 Bits Field Description
1196 [63:48] DAX unit instance identifier
1197 [47:32] DAX queue instance identifier
1198 [31:16] Reserved
1199 [15:0] Number of CCB bytes successfully submitted
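Decoding the multi-field ret1 value is a matter of shifting and masking. The structure and function names in this sketch are hypothetical.

```c
#include <stdint.h>

struct ccb_queue_info {
    uint16_t dax_unit;         /* bits [63:48] */
    uint16_t dax_queue;        /* bits [47:32] */
    uint16_t bytes_submitted;  /* bits [15:0]  */
};

/* Split ret1 as returned when the queue info flag was set. */
static struct ccb_queue_info ccb_decode_queue_info(uint64_t ret1)
{
    struct ccb_queue_info qi;

    qi.dax_unit = (uint16_t)(ret1 >> 48);
    qi.dax_queue = (uint16_t)((ret1 >> 32) & 0xFFFF);
    qi.bytes_submitted = (uint16_t)(ret1 & 0xFFFF);
    return qi;
}
```

Dividing bytes_submitted by 64 gives the number of short format CCBs accepted.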
1200
1201 The value of status data depends on the status value. See error status code descriptions for details.
1202 The value is undefined for status values that do not specifically list a value for the status data.
1203
1204 The API has a reserved input and output register which will be used in subsequent minor versions of this
1205 API function. Guest software implementations should treat that register as volatile across the function call
1206 in order to maintain forward compatibility.
1207
120836.3.1.1. Errors
1209 EOK One or more CCBs have been accepted and enqueued in the virtual machine
1210 and no errors were encountered during submission. Some submitted
1211 CCBs may not have been enqueued due to internal virtual machine limitations,
1212 and may be resubmitted without changes.
1219EWOULDBLOCK An internal resource conflict within the virtual machine has prevented it from
1220 being able to complete the CCB submissions sufficiently quickly, requiring
1221 it to abandon processing before it was complete. Some CCBs may have been
1222 successfully enqueued prior to the block, and all remaining CCBs may be
1223 resubmitted without changes.
1224EBADALIGN CCB array is not on a 64-byte boundary, or the array length is not a multiple
1225 of 64 bytes.
1226ENORADDR A real address used either for the CCB array, or within one of the submitted
1227 CCBs, is not valid for the guest. Some CCBs may have been enqueued prior
1228 to the error being detected.
1229ENOMAP A virtual address used either for the CCB array, or within one of the submitted
1230 CCBs, could not be translated by the virtual machine using either the TLB
1231 or TSB contents. The submission may be retried after adding the required
1232 mapping, or by converting the virtual address into a real address. Due to the
1233 shared nature of address translation resources, there is no theoretical limit on
1234 the number of times the translation may fail, and it is recommended all guests
1235 implement some real address based backup. The virtual address which failed
1236 translation is returned as status data in ret2. Some CCBs may have been
1237 enqueued prior to the error being detected.
1238EINVAL The virtual machine detected an invalid CCB during submission, or invalid
1239 input arguments, such as bad flag values. Note that not all invalid CCB values
1240 will be detected during submission, and some may be reported as errors in the
1241 completion area instead. Some CCBs may have been enqueued prior to the
1242 error being detected. This error may be returned if the CCB version is invalid.
1243ETOOMANY The request was submitted with the all-or-nothing flag set, and the array size is
1244 greater than the virtual machine can support in a single request. The maximum
1245 supported size for the current virtual machine can be queried by submitting a
1246 request with a zero length array, as described above.
1247ENOACCESS The guest does not have permission to submit CCBs, or an address used in a
1248 CCB lacks sufficient permissions to perform the required operation (no write
1249 permission on the destination buffer address, for example). A virtual address
1250 which fails permission checking is returned as status data in ret2. Some
1251 CCBs may have been enqueued prior to the error being detected.
1252EUNAVAILABLE The requested CCB operation could not be performed at this time. The
1253 restricted operation availability may apply only to the first unsuccessfully
1254 submitted CCB, or may apply to a larger scope. The status should not be
1255 interpreted as permanent, and the guest should attempt to submit CCBs in
1256 the future which had previously been unable to be performed. The status
1257 data provides additional information about scope of the restricted availability
1258 as follows:
1259 Value Description
1260 0 Processing for the exact CCB instance submitted was unavailable,
1261 and it is recommended the guest emulate the operation. The
1262 guest should continue to submit all other CCBs, and assume no
1263 restrictions beyond this exact CCB instance.
1264 1 Processing is unavailable for all CCBs using the requested opcode,
1265 and it is recommended the guest emulate the operation. The
1266 guest should continue to submit all other CCBs that use different
1267 opcodes, but can expect continued rejections of CCBs using the
1268 same opcode in the near future.
1276 2 Processing is unavailable for all CCBs using the requested CCB
1277 version, and it is recommended the guest emulate the operation.
1278 The guest should continue to submit all other CCBs that use
1279 different CCB versions, but can expect continued rejections of
1280 CCBs using the same CCB version in the near future.
1281 3 Processing is unavailable for all CCBs on the submitting vcpu,
1282 and it is recommended the guest emulate the operation or resubmit
1283 the CCB on a different vcpu. The guest should continue to submit
1284 CCBs on all other vcpus but can expect continued rejections of all
1285 CCBs on this vcpu in the near future.
1286 4 Processing is unavailable for all CCBs, and it is recommended
1287 the guest emulate the operation. The guest should expect all CCB
1288 submissions to be similarly rejected in the near future.
1289
1290
129136.3.2. ccb_info
1292
1293 trap# FAST_TRAP
1294 function# CCB_INFO
1295 arg0 address
1296 ret0 status
1297 ret1 CCB state
1298 ret2 position
1299 ret3 dax
1300 ret4 queue
1301
1302 Requests status information on a previously submitted CCB. The previously submitted CCB is identified
1303 by the 64-byte aligned real address of the CCB's completion area.
1304
1305 A CCB can be in one of 4 states:
1306
1307
1308 State Value Description
1309 COMPLETED 0 The CCB has been fetched and executed, and is no longer active in
1310 the virtual machine.
1311 ENQUEUED 1 The requested CCB is currently in a queue awaiting execution.
1312 INPROGRESS 2 The CCB has been fetched and is currently being executed. It may still
1313 be possible to stop the execution using the ccb_kill hypercall.
1314 NOTFOUND 3 The CCB could not be located in the virtual machine, and does not
1315 appear to have been executed. This may occur if the CCB was lost
1316 due to a hardware error, or the CCB may not have been successfully
1317 submitted to the virtual machine in the first place.
1318
1319 Implementation note
1320 Some platforms may not be able to report CCBs that are currently being processed, and therefore
1321 guest software should invoke the ccb_kill hypercall prior to assuming the requested CCB will never
1322 be executed because it was in the NOTFOUND state.
1323
1329 The position return value is only valid when the state is ENQUEUED. The value returned is the number
1330 of other CCBs ahead of the requested CCB, to provide a relative estimate of when the CCB may execute.
1331
1332 The dax return value is only valid when the state is ENQUEUED. The value returned is the DAX unit
1333 instance identifier for the DAX unit processing the queue where the requested CCB is located. The value
1334 matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
1335
1336 The queue return value is only valid when the state is ENQUEUED. The value returned is the DAX
1337 queue instance identifier for the DAX unit processing the queue where the requested CCB is located. The
1338 value matches the value that would have been, or was, returned by ccb_submit using the queue info flag.
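The state values and the validity rule for the additional return values can be captured as follows. The enum and helper names are hypothetical; the numeric values come from the state table above.

```c
#include <stdint.h>

enum ccb_info_state {
    CCB_INFO_COMPLETED  = 0,
    CCB_INFO_ENQUEUED   = 1,
    CCB_INFO_INPROGRESS = 2,
    CCB_INFO_NOTFOUND   = 3
};

/* position, dax and queue are only meaningful for ENQUEUED CCBs. */
static int ccb_info_queue_fields_valid(uint64_t state)
{
    return state == CCB_INFO_ENQUEUED;
}

/*
 * Per the implementation note, a NOTFOUND result should be confirmed
 * with ccb_kill before assuming the CCB will never execute.
 */
static int ccb_info_confirm_with_kill(uint64_t state)
{
    return state == CCB_INFO_NOTFOUND;
}
```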
1339
134036.3.2.1. Errors
1341
1342 EOK The request was processed and the CCB state is valid.
1343 EBADALIGN address is not 64-byte aligned.
1344 ENORADDR The real address provided for address is not valid.
1345 EINVAL The CCB completion area contents are not valid.
1346 EWOULDBLOCK Internal resource constraints prevented the CCB state from being queried at this
1347 time. The guest should retry the request.
1348 ENOACCESS The guest does not have permission to access the coprocessor virtual device
1349 functionality.
1350
135136.3.3. ccb_kill
1352
1353 trap# FAST_TRAP
1354 function# CCB_KILL
1355 arg0 address
1356 ret0 status
1357 ret1 result
1358
1359 Request to stop execution of a previously submitted CCB. The previously submitted CCB is identified by
1360 the 64-byte aligned real address of the CCB's completion area.
1361
1362 The kill attempt can produce one of several values in the result return value, reflecting the CCB state
1363 and actions taken by the Hypervisor:
1364
1365 Result Value Description
1366 COMPLETED 0 The CCB has been fetched and executed, and is no longer active in
1367 the virtual machine. It could not be killed and no action was taken.
1368 DEQUEUED 1 The requested CCB was still enqueued when the kill request was
1369 submitted, and has been removed from the queue. Since the CCB
1370 never began execution, no memory modifications were produced by
1371 it, and the completion area will never be updated. The same CCB may
1372 be submitted again, if desired, with no modifications required.
1373 KILLED 2 The CCB had been fetched and was being executed when the kill
1374 request was submitted. The CCB execution was stopped, and the CCB
1375 is no longer active in the virtual machine. The CCB completion area
1376 will reflect the killed status, with the subsequent implications that
1377 partial results may have been produced. Partial results may include full
1385 command execution if the command was stopped just prior to writing
1386 to the completion area.
1387 NOTFOUND 3 The CCB could not be located in the virtual machine, and does not
1388 appear to have been executed. This may occur if the CCB was lost
1389 due to a hardware error, or the CCB may not have been successfully
1390 submitted to the virtual machine in the first place. CCBs in this state
1391 are guaranteed to never execute in the future unless resubmitted.
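
As an illustration of how guest software might act on the result values in the table above, here is a minimal sketch. The enum and helper names are hypothetical and do not come from the specification; only the numeric values and their meanings are taken from the table.

```c
#include <stdbool.h>

/* Result values as listed in the ccb_kill table (names are illustrative). */
enum ccb_kill_value {
	KILL_COMPLETED = 0,	/* CCB had already run; nothing was killed */
	KILL_DEQUEUED  = 1,	/* removed from the queue before running */
	KILL_KILLED    = 2,	/* stopped during execution */
	KILL_NOTFOUND  = 3,	/* not located; will not run unless resubmitted */
};

/*
 * True if the CCB ran at least partially, so its output buffers may have
 * been written and its completion area carries a status.
 */
static inline bool kill_output_may_be_modified(int result)
{
	return result == KILL_COMPLETED || result == KILL_KILLED;
}

/*
 * True only for DEQUEUED, the one result for which the table states that
 * the same CCB may be submitted again with no modifications.
 */
static inline bool kill_resubmit_unmodified(int result)
{
	return result == KILL_DEQUEUED;
}
```

NOTFOUND is deliberately excluded from the resubmit helper: the table guarantees such a CCB will not execute, but it does not say the CCB may be resubmitted unchanged.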
1392
139336.3.3.1. Interactions with Pipelined CCBs
1394
1395 If the pipeline target CCB is killed but the pipeline source CCB was skipped, the completion area of the
1396 target CCB may contain status (4,0) "Command was skipped" instead of (3,7) "Command was killed".
1397
1398 If the pipeline source CCB is killed, the pipeline target CCB's completion status may read (1,0) "Success".
1399 This does not mean the target CCB was processed; since the source CCB was killed, there was no
1400 meaningful output on which the target CCB could operate.
1401
140236.3.3.2. Errors
1403
1404 EOK The request was processed and the result is valid.
1405 EBADALIGN address is not 64-byte aligned.
1406 ENORADDR The real address provided for address is not valid.
1407 EINVAL The CCB completion area contents are not valid.
1408 EWOULDBLOCK Internal resource constraints prevented the CCB from being killed at this time.
1409 The guest should retry the request.
1410 ENOACCESS The guest does not have permission to access the coprocessor virtual device
1411 functionality.
1412
141336.3.4. dax_info
1414 trap# FAST_TRAP
1415 function# DAX_INFO
1416 ret0 status
1417 ret1 Number of enabled DAX units
1418 ret2 Number of disabled DAX units
1419
1420 Returns the number of DAX units that are enabled for the calling guest to submit CCBs. The number of
1421 DAX units that are disabled for the calling guest is also returned. A disabled DAX unit would have been
1422 available for CCB submission to the calling guest had it not been offlined.
1423
142436.3.4.1. Errors
1425
1426 EOK The request was processed and the numbers of enabled/disabled DAX units
1427 are valid.
1428
1433
diff --git a/Documentation/sparc/oradax/oracle-dax.txt b/Documentation/sparc/oradax/oracle-dax.txt
new file mode 100644
index 000000000000..9d53ac93286f
--- /dev/null
+++ b/Documentation/sparc/oradax/oracle-dax.txt
@@ -0,0 +1,429 @@
1Oracle Data Analytics Accelerator (DAX)
2---------------------------------------
3
4DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
5(DAX2) processor chips, and has direct access to the CPU's L3 caches
6as well as physical memory. It can perform several operations on data
7streams with various input and output formats. A driver provides a
8transport mechanism and has limited knowledge of the various opcodes
9and data formats. A user space library provides high level services
10and translates these into low level commands which are then passed
11into the driver and subsequently the Hypervisor and the coprocessor.
12The library is the recommended way for applications to use the
13coprocessor, and the driver interface is not intended for general use.
14This document describes the general flow of the driver, its
15structures, and its programmatic interface. It also provides example
16code sufficient to write user or kernel applications that use DAX
17functionality.
18
19The user library is open source and available at:
20 https://oss.oracle.com/git/gitweb.cgi?p=libdax.git
21
22The Hypervisor interface to the coprocessor is described in detail in
23the accompanying document, dax-hv-api.txt, which is a plain text
24excerpt of the (Oracle internal) "UltraSPARC Virtual Machine
25Specification" version 3.0.20+15, dated 2017-09-25.
26
27
28High Level Overview
29-------------------
30
31A coprocessor request is described by a Command Control Block
32(CCB). The CCB contains an opcode and various parameters. The opcode
33specifies what operation is to be done, and the parameters specify
34options, flags, sizes, and addresses. The CCB (or an array of CCBs)
35is passed to the Hypervisor, which handles queueing and scheduling of
36requests to the available coprocessor execution units. A status code
37returned indicates if the request was submitted successfully or if
38there was an error. One of the addresses given in each CCB is a
39pointer to a "completion area", which is a 128 byte memory block that
40is written by the coprocessor to provide execution status. No
41interrupt is generated upon completion; the completion area must be
42polled by software to find out when a transaction has finished, but
43the M7 and later processors provide a mechanism to pause the virtual
44processor until the completion status has been updated by the
45coprocessor. This is done using the monitored load and mwait
46instructions, which are described in more detail later. The DAX
47coprocessor was designed so that after a request is submitted, the
48kernel is no longer involved in the processing of it. The polling is
49done at the user level, which results in almost zero latency between
50completion of a request and resumption of execution of the requesting
51thread.
52
53
54Addressing Memory
55-----------------
56
57The kernel does not have access to physical memory in the Sun4v
58architecture, as there is an additional level of memory virtualization
59present. This intermediate level is called "real" memory, and the
60kernel treats this as if it were physical. The Hypervisor handles the
61translations between real memory and physical so that each logical
62domain (LDOM) can have a partition of physical memory that is isolated
63from that of other LDOMs. When the kernel sets up a virtual mapping,
64it specifies a virtual address and the real address to which it should
65be mapped.
66
67The DAX coprocessor can only operate on physical memory, so before a
68request can be fed to the coprocessor, all the addresses in a CCB must
69be converted into physical addresses. The kernel cannot do this since
70it has no visibility into physical addresses. So a CCB may contain
71either the virtual or real addresses of the buffers or a combination
72of them. An "address type" field is available for each address that
73may be given in the CCB. In all cases, the Hypervisor will translate
74all the addresses to physical before dispatching to hardware. Address
75translations are performed using the context of the process initiating
76the request.
77
78
79The Driver API
80--------------
81
82An application makes requests to the driver via the write() system
83call, and gets results (if any) via read(). The completion areas are
84made accessible via mmap(), and are read-only for the application.
85
86The request may either be an immediate command or an array of CCBs to
87be submitted to the hardware.
88
89Each open instance of the device is exclusive to the thread that
90opened it, and must be used by that thread for all subsequent
91operations. The driver open function creates a new context for the
92thread and initializes it for use. This context contains pointers and
93values used internally by the driver to keep track of submitted
94requests. The completion area buffer is also allocated, and this is
95large enough to contain the completion areas for many concurrent
96requests. When the device is closed, any outstanding transactions are
97flushed and the context is cleaned up.
98
99On a DAX1 system (M7), the device will be called "oradax1", while on a
100DAX2 system (M8) it will be "oradax2". If an application requires one
101or the other, it should simply attempt to open the appropriate
102device. Only one of the devices will exist on any given system, so the
103name can be used to determine what the platform supports.
104
105The immediate commands are CCB_DEQUEUE, CCB_KILL, and CCB_INFO. For
106all of these, success is indicated by a return value from write()
107equal to the number of bytes given in the call. Otherwise -1 is
108returned and errno is set.
109
110CCB_DEQUEUE
111
112Tells the driver to clean up resources associated with past
113requests. Since no interrupt is generated upon the completion of a
114request, the driver must be told when it may reclaim resources. No
115further status information is returned, so the user should not
116subsequently call read().
117
118CCB_KILL
119
120Kills a CCB during execution. The CCB is guaranteed to not continue
121executing once this call returns successfully. On success, read() must
122be called to retrieve the result of the action.
123
124CCB_INFO
125
126Retrieves information about a currently executing CCB. Note that some
127Hypervisors might return 'notfound' when the CCB is in 'inprogress'
128state. To ensure a CCB in the 'notfound' state will never be executed,
129CCB_KILL must be invoked on that CCB. Upon success, read() must be
130called to retrieve the details of the action.
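
The immediate-command flow described above (write() a struct dax_command, then read() the result) could be sketched as follows. This is an illustration only: the structures are local copies of those in oradax.h so the block compiles on its own, the helper names dax_immediate() and dax_ccb_settled() are hypothetical, and reading the whole union ccb_result is an assumption about what the driver accepts.

```c
#include <stdint.h>
#include <unistd.h>

/* Local copies of definitions from arch/sparc/include/uapi/asm/oradax.h,
 * repeated only so this sketch is self-contained. */
#define CCB_KILL		0
#define CCB_INFO		1
#define DAX_CCB_COMPLETED	0
#define DAX_CCB_NOTFOUND	3

struct dax_command {
	uint16_t command;	/* CCB_KILL/INFO/DEQUEUE */
	uint16_t ca_offset;	/* offset into mmapped completion area */
};

struct ccb_kill_result { uint16_t action; };
struct ccb_info_result { uint16_t state, inst_num, q_num, q_pos; };
struct ccb_exec_result { uint64_t status_data; uint32_t status; };

union ccb_result {
	struct ccb_exec_result exec;
	struct ccb_info_result info;
	struct ccb_kill_result kill;
};

/* Issue one immediate command and fetch its result; 0 on success, -1 on error. */
static int dax_immediate(int fd, uint16_t command, uint16_t ca_offset,
			 union ccb_result *res)
{
	struct dax_command cmd = { .command = command, .ca_offset = ca_offset };

	if (write(fd, &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
		return -1;
	if (read(fd, res, sizeof(*res)) != (ssize_t)sizeof(*res))
		return -1;
	return 0;
}

/* Returns 1 once the CCB can no longer run, 0 if still pending, -1 on error. */
static int dax_ccb_settled(int fd, uint16_t ca_offset)
{
	union ccb_result res;

	if (dax_immediate(fd, CCB_INFO, ca_offset, &res) != 0)
		return -1;
	if (res.info.state == DAX_CCB_COMPLETED)
		return 1;
	if (res.info.state != DAX_CCB_NOTFOUND)
		return 0;	/* enqueued or in progress */

	/* 'notfound' may hide an in-progress CCB, so follow up with a kill. */
	if (dax_immediate(fd, CCB_KILL, ca_offset, &res) != 0)
		return -1;
	return 1;
}
```

The follow-up CCB_KILL mirrors the note above: on some Hypervisors 'notfound' alone does not prove the CCB will never execute.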
131
132Submission of an array of CCBs for execution
133
134A write() whose length is a multiple of the CCB size is treated as a
135submit operation. The file offset is treated as the index of the
136completion area to use, and may be set via lseek() or using the
137pwrite() system call. If -1 is returned then errno is set to indicate
138the error. Otherwise, the return value is the length of the array that
139was actually accepted by the coprocessor. If the accepted length is
140equal to the requested length, then the submission was completely
141successful and there is no further status needed; hence, the user
142should not subsequently call read(). Partial acceptance of the CCB
143array is indicated by a return value less than the requested length,
144and read() must be called to retrieve further status information. The
145status will reflect the error caused by the first CCB that was not
146accepted, and status_data will provide additional data in some cases.
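
The three possible outcomes of a submit described above can be told apart from the write()/pwrite() return value alone. The sketch below shows one way to do so; the enum and function names are hypothetical, and the 64-byte CCB size is taken from the CCB layout (eight 64-bit words).

```c
#include <stddef.h>
#include <sys/types.h>

#define CCB_SIZE 64	/* one CCB is eight 64-bit words */

enum submit_outcome {
	SUBMIT_ERRNO,		/* -1 returned, consult errno */
	SUBMIT_PARTIAL,		/* some CCBs rejected, call read() for status */
	SUBMIT_COMPLETE,	/* everything accepted, do not call read() */
};

/* Classify the return value of a submit write() of 'requested' bytes. */
static inline enum submit_outcome classify_submit(ssize_t ret, size_t requested)
{
	if (ret < 0)
		return SUBMIT_ERRNO;
	if ((size_t)ret < requested)
		return SUBMIT_PARTIAL;
	return SUBMIT_COMPLETE;
}

/* Number of whole CCBs accepted, given the byte count returned by write(). */
static inline size_t ccbs_accepted(ssize_t ret)
{
	return ret > 0 ? (size_t)ret / CCB_SIZE : 0;
}
```

A caller would typically switch on classify_submit(pwrite(fd, ccbs, len, ca_index), len) and call read() only in the SUBMIT_PARTIAL case.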
147
148MMAP
149
150The mmap() function provides access to the completion area allocated
151in the driver. Note that the completion area is not writeable by the
152user process, and the mmap call must not specify PROT_WRITE.
153
154
155Completion of a Request
156-----------------------
157
158The first byte in each completion area is the command status which is
159updated by the coprocessor hardware. Software may take advantage of
160new M7/M8 processor capabilities to efficiently poll this status byte.
161First, a "monitored load" is achieved via a Load from Alternate Space
162(ldxa, lduba, etc.) with ASI 0x84 (ASI_MONITOR_PRIMARY). Second, a
163"monitored wait" is achieved via the mwait instruction (a write to
164%asr28). This instruction is like pause in that it suspends execution
165of the virtual processor for the given number of nanoseconds, but in
166addition will terminate early when one of several events occurs. If the
167block of data containing the monitored location is modified, then the
168mwait terminates. This causes software to resume execution immediately
169(without a context switch or kernel to user transition) after a
170transaction completes. Thus the latency between transaction completion
171and resumption of execution may be just a few nanoseconds.
172
173
174Application Life Cycle of a DAX Submission
175------------------------------------------
176
177 - open dax device
178 - call mmap() to get the completion area address
179 - allocate a CCB and fill in the opcode, flags, parameters, addresses, etc.
180 - submit CCB via write() or pwrite()
181 - go into a loop executing monitored load + monitored wait and
182 terminate when the command status indicates the request is complete
183 (CCB_KILL or CCB_INFO may be used any time as necessary)
184 - perform a CCB_DEQUEUE
185 - call munmap() for completion area
186 - close the dax device
187
188
189Memory Constraints
190------------------
191
192The DAX hardware operates only on physical addresses. Therefore, it is
193not aware of virtual memory mappings and the discontiguities that may
194exist in the physical memory that a virtual buffer maps to. There is
195no I/O TLB or any scatter/gather mechanism. All buffers, whether input
196or output, must reside in a physically contiguous region of memory.
197
198The Hypervisor translates all addresses within a CCB to physical
199before handing off the CCB to DAX. The Hypervisor determines the
200virtual page size for each virtual address given, and uses this to
201program a size limit for each address. This prevents the coprocessor
202from reading or writing beyond the bound of the virtual page, even
203though it is accessing physical memory directly. A simpler way of
204saying this is that a DAX operation will never "cross" a virtual page
205boundary. If an 8k virtual page is used, then the data is strictly
206limited to 8k. If a user's buffer is larger than 8k, then a larger
207page size must be used, or the transaction size will be truncated to
2088k.
209
210Huge pages. A user may allocate huge pages using standard interfaces.
211Memory buffers residing on huge pages may be used to achieve much
212larger DAX transaction sizes, but the rules must still be followed,
213and no transaction will cross a page boundary, even a huge page. A
214major caveat is that Linux on SPARC presents 8MB as one of the huge
215page sizes. SPARC does not actually provide an 8MB hardware page size,
216and this size is synthesized by pasting together two 4MB pages. The
217reasons for this are historical, and it creates an issue because only
218half of this 8MB page can actually be used for any given buffer in a
219DAX request, and it must be either the first half or the second half;
220it cannot be a 4MB chunk in the middle, since that crosses a
221(hardware) page boundary. Note that this entire issue may be hidden by
222higher level libraries.
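
The page-boundary rule can be checked in software before a request is built. The following is a minimal sketch with hypothetical helper names; it assumes the caller knows the hardware page size backing the buffer (for example 8k for ordinary pages, or 4M for each half of a synthesized 8M huge page) and that the size is a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes available from addr up to the end of its page (page_size is 2^n). */
static inline size_t bytes_to_page_end(uintptr_t addr, size_t page_size)
{
	return page_size - (size_t)(addr & (page_size - 1));
}

/* True if [addr, addr + len) lies entirely within one page of page_size. */
static inline bool fits_in_one_page(uintptr_t addr, size_t len, size_t page_size)
{
	return len != 0 && len <= bytes_to_page_end(addr, page_size);
}
```

With a 4M page size, a 4M buffer starting at a 4M boundary passes the check, while a 4M buffer starting 2M into a page fails it, which matches the "chunk in the middle" case described above.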
223
224
225CCB Structure
226-------------
227A CCB is an array of 8 64-bit words. Several of these words provide
228command opcodes, parameters, flags, etc., and the rest are addresses
229for the completion area, output buffer, and various inputs:
230
231 struct ccb {
232 u64 control;
233 u64 completion;
234 u64 input0;
235 u64 access;
236 u64 input1;
237 u64 op_data;
238 u64 output;
239 u64 table;
240 };
241
242See libdax/common/sys/dax1/dax1_ccb.h for a detailed description of
243each of these fields, and see dax-hv-api.txt for a complete description
244of the Hypervisor API available to the guest OS (i.e., the Linux kernel).
245
246The first word (control) is examined by the driver for the following:
247 - CCB version, which must be consistent with hardware version
248 - Opcode, which must be one of the documented allowable commands
249 - Address types, which must be set to "virtual" for all the addresses
250 given by the user, thereby ensuring that the application can
251 only access memory that it owns
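
To make the layout of the control word concrete, the fields the driver examines could be extracted as in the sketch below. The accessor names are hypothetical. The bit positions assume that the CCB header occupies the upper 32 bits of the control word, which is consistent with the shifts used in the example CCB setup (command at bit 48, output address type at bit 40, primary input address type at bit 34, completion address type at bit 32).

```c
#include <assert.h>
#include <stdint.h>

struct ccb {
	uint64_t control;
	uint64_t completion;
	uint64_t input0;
	uint64_t access;
	uint64_t input1;
	uint64_t op_data;
	uint64_t output;
	uint64_t table;
};

static_assert(sizeof(struct ccb) == 64, "a CCB is eight 64-bit words");

/* Extract 'width' bits of 'word' starting at bit 'shift'. */
#define CCB_FIELD(word, shift, width) \
	((unsigned int)(((word) >> (shift)) & ((1ULL << (width)) - 1)))

static inline unsigned int ccb_version(uint64_t c)       { return CCB_FIELD(c, 60, 4); }
static inline unsigned int ccb_opcode(uint64_t c)        { return CCB_FIELD(c, 48, 8); }
static inline unsigned int ccb_out_addr_type(uint64_t c) { return CCB_FIELD(c, 40, 3); }
static inline unsigned int ccb_sec_addr_type(uint64_t c) { return CCB_FIELD(c, 37, 3); }
static inline unsigned int ccb_pri_addr_type(uint64_t c) { return CCB_FIELD(c, 34, 3); }
static inline unsigned int ccb_cca_addr_type(uint64_t c) { return CCB_FIELD(c, 32, 2); }
```

A user-space caller could use such accessors to sanity-check a CCB before submission, for instance confirming that every address type it set is 3 (primary virtual).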
252
253
254Example Code
255------------
256
257The DAX is accessible to both user and kernel code. The kernel code
258can make hypercalls directly while the user code must use wrappers
259provided by the driver. The setup of the CCB is nearly identical for
260both; the only difference is in preparation of the completion area. An
261example of user code is given now, with kernel code afterwards.
262
263In order to program using the driver API, the file
264arch/sparc/include/uapi/asm/oradax.h must be included.
265
266First, the proper device must be opened. For M7 it will be
267/dev/oradax1 and for M8 it will be /dev/oradax2. The simplest
268procedure is to attempt to open both, as only one will succeed:
269
270 fd = open("/dev/oradax1", O_RDWR);
271 if (fd < 0)
272 fd = open("/dev/oradax2", O_RDWR);
273 if (fd < 0)
274 /* No DAX found */
275
276Next, the completion area must be mapped:
277
278 completion_area = mmap(NULL, DAX_MMAP_LEN, PROT_READ, MAP_SHARED, fd, 0);
279
280All input and output buffers must be fully contained in one hardware
281page, since as explained above, the DAX is strictly constrained by
282virtual page boundaries. In addition, the output buffer must be
28364-byte aligned and its size must be a multiple of 64 bytes because
284the coprocessor writes in units of cache lines.
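
One simple way to satisfy these constraints for small buffers is sketched below. It is an illustration under stated assumptions, not the method used by libdax: the helper names are hypothetical, and 8k is assumed to be the smallest page size in use, so a buffer that starts on an 8k boundary and is at most 8k long cannot cross a page boundary.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ASSUMED_PAGE_SIZE 8192	/* assumption: smallest virtual page size */

/* Round a byte count up to a whole number of 64-byte cache lines. */
static inline size_t round_up_64(size_t n)
{
	return (n + 63) & ~(size_t)63;
}

/*
 * Allocate a zeroed buffer usable as a DAX output of at least 'nbytes'.
 * On success, *usable_len receives the length rounded up to 64 bytes.
 * Returns NULL if the request is empty, exceeds one assumed page, or
 * the allocation fails. Release the buffer with free().
 */
static void *alloc_dax_buffer(size_t nbytes, size_t *usable_len)
{
	size_t len = round_up_64(nbytes);
	void *p;

	if (len == 0 || len > ASSUMED_PAGE_SIZE)
		return NULL;

	/* Page-aligned start guarantees 64-byte alignment as well. */
	p = aligned_alloc(ASSUMED_PAGE_SIZE, ASSUMED_PAGE_SIZE);
	if (!p)
		return NULL;

	memset(p, 0, ASSUMED_PAGE_SIZE);
	*usable_len = len;
	return p;
}
```

Buffers larger than 8k need memory backed by a larger page size, for example huge pages, and are outside the scope of this sketch.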
285
286This example demonstrates the DAX Scan command, which takes as input a
287vector and a match value, and produces a bitmap as the output. For
288each input element that matches the value, the corresponding bit is
289set in the output.
290
291In this example, the input vector consists of a series of single bits,
292and the match value is 0. So each 0 bit in the input will produce a 1
293in the output, and vice versa, which produces an output bitmap which
294is the input bitmap inverted.
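
For testing, the expected output of this particular scan can be computed in software and compared with what the coprocessor wrote. The reference model below is a sketch only: the function name is hypothetical, and it assumes bits are numbered from the most significant bit of each byte, which should be checked against the DAX Hypervisor API document before relying on it for a bit count that is not a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Software model of "scan 1-bit elements for the value 0": output bit i
 * is set exactly when input bit i is 0. Bits are taken MSB-first within
 * each byte (an assumption); unused trailing output bits are left 0.
 */
static void scan_match_zero_ref(const uint8_t *in, uint8_t *out, size_t nbits)
{
	memset(out, 0, (nbits + 7) / 8);

	for (size_t i = 0; i < nbits; i++) {
		unsigned int shift = 7 - (unsigned int)(i % 8);
		unsigned int bit = (in[i / 8] >> shift) & 1u;

		if (bit == 0)
			out[i / 8] |= (uint8_t)(1u << shift);
	}
}
```

When nbits is a multiple of 8 the result is simply the bytewise complement of the input, independent of the bit-ordering assumption.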
295
296For details of all the parameters and bits used in this CCB, please
297refer to section 36.2.1.3 of the DAX Hypervisor API document, which
298describes the Scan command in detail.
299
300 ccb->control = /* Table 36.1, CCB Header Format */
301 (2L << 48) /* command = Scan Value */
302 | (3L << 40) /* output address type = primary virtual */
303 | (3L << 34) /* primary input address type = primary virtual */
304 /* Section 36.2.1, Query CCB Command Formats */
305 | (1 << 28) /* 36.2.1.1.1 primary input format = fixed width bit packed */
306 | (0 << 23) /* 36.2.1.1.2 primary input element size = 0 (1 bit) */
307 | (8 << 10) /* 36.2.1.1.6 output format = bit vector */
308 | (0 << 5) /* 36.2.1.3 First scan criteria size = 0 (1 byte) */
309 | (31 << 0); /* 36.2.1.3 Disable second scan criteria */
310
311 ccb->completion = 0; /* Completion area address, to be filled in by driver */
312
313 ccb->input0 = (unsigned long) input; /* primary input address */
314
315 ccb->access = /* Section 36.2.1.2, Data Access Control */
316 (2 << 24) /* Primary input length format = bits */
317 | (nbits - 1); /* number of bits in primary input stream, minus 1 */
318
319 ccb->input1 = 0; /* secondary input address, unused */
320
321 ccb->op_data = 0; /* scan criteria (value to be matched) */
322
323 ccb->output = (unsigned long) output; /* output address */
324
325 ccb->table = 0; /* table address, unused */
326
327The CCB submission is a write() or pwrite() system call to the
328driver. If the call fails, then a read() must be used to retrieve the
329status:
330
331 if (pwrite(fd, ccb, 64, 0) != 64) {
332 struct ccb_exec_result status;
333 read(fd, &status, sizeof(status));
334 /* bail out */
335 }
336
337After a successful submission of the CCB, the completion area may be
338polled to determine when the DAX is finished. Detailed information on
339the contents of the completion area can be found in section 36.2.2 of
340the DAX HV API document.
341
342 while (1) {
343 /* Monitored Load */
344 __asm__ __volatile__("lduba [%1] 0x84, %0\n"
345 : "=r" (status)
346 : "r" (completion_area));
347
348 if (status) /* 0 indicates command in progress */
349 break;
350
351 /* MWAIT */
352 __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
353 }
354
355A completion area status of 1 indicates successful completion of the
356CCB and validity of the output bitmap, which may be used immediately.
357All other non-zero values indicate error conditions which are
358described in section 36.2.2.
359
360 if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
361 /* completion_area[0] contains the completion status */
362 /* completion_area[1] contains an error code, see 36.2.2 */
363 }
364
365After the completion area has been processed, the driver must be
366notified that it can release any resources associated with the
367request. This is done via the dequeue operation:
368
369 struct dax_command cmd;
370 cmd.command = CCB_DEQUEUE;
371 if (write(fd, &cmd, sizeof(cmd)) != sizeof(cmd)) {
372 /* bail out */
373 }
374
375Finally, normal program cleanup should be done, i.e., unmapping
376completion area, closing the dax device, freeing memory etc.
377
378[Kernel example]
379
380The only difference in using the DAX in kernel code is the treatment
381of the completion area. Unlike user applications which mmap the
382completion area allocated by the driver, kernel code must allocate its
383own memory to use for the completion area, and this address and its
384type must be given in the CCB:
385
386 ccb->control |= /* Table 36.1, CCB Header Format */
387 (3L << 32); /* completion area address type = primary virtual */
388
389 ccb->completion = (unsigned long) completion_area; /* Completion area address */
390
391The dax submit hypercall is made directly. The flags used in the
392ccb_submit call are documented in the DAX HV API in section 36.3.1.
393
394#include <asm/hypervisor.h>
395
396 hv_rv = sun4v_ccb_submit((unsigned long)ccb, 64,
397 HV_CCB_QUERY_CMD |
398 HV_CCB_ARG0_PRIVILEGED | HV_CCB_ARG0_TYPE_PRIMARY |
399 HV_CCB_VA_PRIVILEGED,
400 0, &bytes_accepted, &status_data);
401
402 if (hv_rv != HV_EOK) {
403 /* hv_rv is an error code, status_data contains */
404 /* potential additional status, see 36.3.1.1 */
405 }
406
407After the submission, the completion area polling code is identical to
408that in user land:
409
410 while (1) {
411 /* Monitored Load */
412 __asm__ __volatile__("lduba [%1] 0x84, %0\n"
413 : "=r" (status)
414 : "r" (completion_area));
415
416 if (status) /* 0 indicates command in progress */
417 break;
418
419 /* MWAIT */
420 __asm__ __volatile__("wr %%g0, 1000, %%asr28\n" ::); /* 1000 ns */
421 }
422
423 if (completion_area[0] != 1) { /* section 36.2.2, 1 = command ran and succeeded */
424 /* completion_area[0] contains the completion status */
425 /* completion_area[1] contains an error code, see 36.2.2 */
426 }
427
428The output bitmap is ready for consumption immediately after the
429completion status indicates success.
diff --git a/arch/sparc/include/uapi/asm/oradax.h b/arch/sparc/include/uapi/asm/oradax.h
new file mode 100644
index 000000000000..722951908b0a
--- /dev/null
+++ b/arch/sparc/include/uapi/asm/oradax.h
@@ -0,0 +1,91 @@
1/*
2 * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
3 *
4 * This program is free software: you can redistribute it and/or modify
5 * it under the terms of the GNU General Public License as published by
6 * the Free Software Foundation, either version 3 of the License, or
7 * (at your option) any later version.
8 *
9 * This program is distributed in the hope that it will be useful,
10 * but WITHOUT ANY WARRANTY; without even the implied warranty of
11 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12 * GNU General Public License for more details.
13 *
14 * You should have received a copy of the GNU General Public License
15 * along with this program. If not, see <http://www.gnu.org/licenses/>.
16 */
17
18/*
19 * Oracle DAX driver API definitions
20 */
21
22#ifndef _ORADAX_H
23#define _ORADAX_H
24
25#include <linux/types.h>
26
27#define CCB_KILL 0
28#define CCB_INFO 1
29#define CCB_DEQUEUE 2
30
31struct dax_command {
32 __u16 command; /* CCB_KILL/INFO/DEQUEUE */
33 __u16 ca_offset; /* offset into mmapped completion area */
34};
35
36struct ccb_kill_result {
37 __u16 action; /* action taken to kill ccb */
38};
39
40struct ccb_info_result {
41 __u16 state; /* state of enqueued ccb */
42 __u16 inst_num; /* dax instance number of enqueued ccb */
43 __u16 q_num; /* queue number of enqueued ccb */
44 __u16 q_pos; /* ccb position in queue */
45};
46
47struct ccb_exec_result {
48 __u64 status_data; /* additional status data (e.g. bad VA) */
49 __u32 status; /* one of DAX_SUBMIT_* */
50};
51
52union ccb_result {
53 struct ccb_exec_result exec;
54 struct ccb_info_result info;
55 struct ccb_kill_result kill;
56};
57
58#define DAX_MMAP_LEN (16 * 1024)
59#define DAX_MAX_CCBS 15
60#define DAX_CCB_BUF_MAXLEN (DAX_MAX_CCBS * 64)
61#define DAX_NAME "oradax"
62
63/* CCB_EXEC status */
64#define DAX_SUBMIT_OK 0
65#define DAX_SUBMIT_ERR_RETRY 1
66#define DAX_SUBMIT_ERR_WOULDBLOCK 2
67#define DAX_SUBMIT_ERR_BUSY 3
68#define DAX_SUBMIT_ERR_THR_INIT 4
69#define DAX_SUBMIT_ERR_ARG_INVAL 5
70#define DAX_SUBMIT_ERR_CCB_INVAL 6
71#define DAX_SUBMIT_ERR_NO_CA_AVAIL 7
72#define DAX_SUBMIT_ERR_CCB_ARR_MMU_MISS 8
73#define DAX_SUBMIT_ERR_NOMAP 9
74#define DAX_SUBMIT_ERR_NOACCESS 10
75#define DAX_SUBMIT_ERR_TOOMANY 11
76#define DAX_SUBMIT_ERR_UNAVAIL 12
77#define DAX_SUBMIT_ERR_INTERNAL 13
78
79/* CCB_INFO states - must match HV_CCB_STATE_* definitions */
80#define DAX_CCB_COMPLETED 0
81#define DAX_CCB_ENQUEUED 1
82#define DAX_CCB_INPROGRESS 2
83#define DAX_CCB_NOTFOUND 3
84
85/* CCB_KILL actions - must match HV_CCB_KILL_* definitions */
86#define DAX_KILL_COMPLETED 0
87#define DAX_KILL_DEQUEUED 1
88#define DAX_KILL_KILLED 2
89#define DAX_KILL_NOTFOUND 3
90
91#endif /* _ORADAX_H */
diff --git a/drivers/sbus/char/Kconfig b/drivers/sbus/char/Kconfig
index 5ba684f73ab8..a785aa7660c3 100644
--- a/drivers/sbus/char/Kconfig
+++ b/drivers/sbus/char/Kconfig
@@ -70,5 +70,13 @@ config DISPLAY7SEG
70 another UltraSPARC-IIi-cEngine boardset with a 7-segment display,
71 you should say N to this option.
72
73config ORACLE_DAX
74 tristate "Oracle Data Analytics Accelerator"
75 default m if SPARC64
76 help
77 Driver for Oracle Data Analytics Accelerator, which is
78 a coprocessor that performs database operations in hardware.
79 It is available on M7 and M8 based systems only.
80
81endmenu
82
diff --git a/drivers/sbus/char/Makefile b/drivers/sbus/char/Makefile
index ae478144c551..8c48ed96683f 100644
--- a/drivers/sbus/char/Makefile
+++ b/drivers/sbus/char/Makefile
@@ -17,3 +17,4 @@ obj-$(CONFIG_SUN_OPENPROMIO) += openprom.o
17obj-$(CONFIG_TADPOLE_TS102_UCTRL) += uctrl.o
18obj-$(CONFIG_SUN_JSFLASH) += jsflash.o
19obj-$(CONFIG_BBC_I2C) += bbc.o
20obj-$(CONFIG_ORACLE_DAX) += oradax.o
diff --git a/drivers/sbus/char/oradax.c b/drivers/sbus/char/oradax.c
new file mode 100644
index 000000000000..10452ae18ef1
--- /dev/null
+++ b/drivers/sbus/char/oradax.c
@@ -0,0 +1,1005 @@
1/*
2 * Copyright (c) 2017, Oracle and/or its affiliates. All rights reserved.
3 *
4 * This program is free software: you can redistribute it and/or modify
5 * it under the terms of the GNU General Public License as published by
6 * the Free Software Foundation, either version 3 of the License, or
7 * (at your option) any later version.
8 *
9 * This program is distributed in the hope that it will be useful,
10 * but WITHOUT ANY WARRANTY; without even the implied warranty of
11 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
12 * GNU General Public License for more details.
13 *
14 * You should have received a copy of the GNU General Public License
15 * along with this program. If not, see <http://www.gnu.org/licenses/>.
16 */
17
18/*
19 * Oracle Data Analytics Accelerator (DAX)
20 *
21 * DAX is a coprocessor which resides on the SPARC M7 (DAX1) and M8
22 * (DAX2) processor chips, and has direct access to the CPU's L3
23 * caches as well as physical memory. It can perform several
24 * operations on data streams with various input and output formats.
25 * The driver provides a transport mechanism only and has limited
26 * knowledge of the various opcodes and data formats. A user space
27 * library provides high level services and translates these into low
28 * level commands which are then passed into the driver and
29 * subsequently the hypervisor and the coprocessor. The library is
30 * the recommended way for applications to use the coprocessor, and
31 * the driver interface is not intended for general use.
32 *
33 * See Documentation/sparc/oradax/oracle-dax.txt for more details.
34 */
35
36#include <linux/uaccess.h>
37#include <linux/module.h>
38#include <linux/delay.h>
39#include <linux/cdev.h>
40#include <linux/slab.h>
41#include <linux/mm.h>
42
43#include <asm/hypervisor.h>
44#include <asm/mdesc.h>
45#include <asm/oradax.h>
46
47MODULE_LICENSE("GPL");
48MODULE_DESCRIPTION("Driver for Oracle Data Analytics Accelerator");
49
50#define DAX_DBG_FLG_BASIC 0x01
51#define DAX_DBG_FLG_STAT 0x02
52#define DAX_DBG_FLG_INFO 0x04
53#define DAX_DBG_FLG_ALL 0xff
54
55#define dax_err(fmt, ...) pr_err("%s: " fmt "\n", __func__, ##__VA_ARGS__)
56#define dax_info(fmt, ...) pr_info("%s: " fmt "\n", __func__, ##__VA_ARGS__)
57
58#define dax_dbg(fmt, ...) do { \
59 if (dax_debug & DAX_DBG_FLG_BASIC)\
60 dax_info(fmt, ##__VA_ARGS__); \
61 } while (0)
62#define dax_stat_dbg(fmt, ...) do { \
63 if (dax_debug & DAX_DBG_FLG_STAT) \
64 dax_info(fmt, ##__VA_ARGS__); \
65 } while (0)
66#define dax_info_dbg(fmt, ...) do { \
67 if (dax_debug & DAX_DBG_FLG_INFO) \
68 dax_info(fmt, ##__VA_ARGS__); \
69 } while (0)
70
71#define DAX1_MINOR 1
72#define DAX1_MAJOR 1
73#define DAX2_MINOR 0
74#define DAX2_MAJOR 2
75
76#define DAX1_STR "ORCL,sun4v-dax"
77#define DAX2_STR "ORCL,sun4v-dax2"
78
79#define DAX_CA_ELEMS (DAX_MMAP_LEN / sizeof(struct dax_cca))
80
81#define DAX_CCB_USEC 100
82#define DAX_CCB_RETRIES 10000
83
84/* stream types */
85enum {
86 OUT,
87 PRI,
88 SEC,
89 TBL,
90 NUM_STREAM_TYPES
91};
92
93/* completion status */
94#define CCA_STAT_NOT_COMPLETED 0
95#define CCA_STAT_COMPLETED 1
96#define CCA_STAT_FAILED 2
97#define CCA_STAT_KILLED 3
98#define CCA_STAT_NOT_RUN 4
99#define CCA_STAT_PIPE_OUT 5
100#define CCA_STAT_PIPE_SRC 6
101#define CCA_STAT_PIPE_DST 7
102
103/* completion err */
104#define CCA_ERR_SUCCESS 0x0 /* no error */
105#define CCA_ERR_OVERFLOW 0x1 /* buffer overflow */
106#define CCA_ERR_DECODE 0x2 /* CCB decode error */
107#define CCA_ERR_PAGE_OVERFLOW 0x3 /* page overflow */
108#define CCA_ERR_KILLED 0x7 /* command was killed */
109#define CCA_ERR_TIMEOUT 0x8 /* Timeout */
110#define CCA_ERR_ADI 0x9 /* ADI error */
111#define CCA_ERR_DATA_FMT 0xA /* data format error */
112#define CCA_ERR_OTHER_NO_RETRY 0xE /* Other error, do not retry */
113#define CCA_ERR_OTHER_RETRY 0xF /* Other error, retry */
114#define CCA_ERR_PARTIAL_SYMBOL 0x80 /* QP partial symbol warning */
115
116/* CCB address types */
117#define DAX_ADDR_TYPE_NONE 0
118#define DAX_ADDR_TYPE_VA_ALT 1 /* secondary context */
119#define DAX_ADDR_TYPE_RA 2 /* real address */
120#define DAX_ADDR_TYPE_VA 3 /* virtual address */
121
122/* dax_header_t opcode */
123#define DAX_OP_SYNC_NOP 0x0
124#define DAX_OP_EXTRACT 0x1
125#define DAX_OP_SCAN_VALUE 0x2
126#define DAX_OP_SCAN_RANGE 0x3
127#define DAX_OP_TRANSLATE 0x4
128#define DAX_OP_SELECT 0x5
129#define DAX_OP_INVERT 0x10 /* OR with translate, scan opcodes */
130
131struct dax_header {
132 u32 ccb_version:4; /* 31:28 CCB Version */
133 /* 27:24 Sync Flags */
134 u32 pipe:1; /* Pipeline */
135 u32 longccb:1; /* Longccb. Set for scan with lu2, lu3, lu4. */
136 u32 cond:1; /* Conditional */
137 u32 serial:1; /* Serial */
138 u32 opcode:8; /* 23:16 Opcode */
139 /* 15:0 Address Type. */
140 u32 reserved:3; /* 15:13 reserved */
141 u32 table_addr_type:2; /* 12:11 Huffman Table Address Type */
142 u32 out_addr_type:3; /* 10:8 Destination Address Type */
143 u32 sec_addr_type:3; /* 7:5 Secondary Source Address Type */
144 u32 pri_addr_type:3; /* 4:2 Primary Source Address Type */
145 u32 cca_addr_type:2; /* 1:0 Completion Address Type */
146};
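
Note on the header layout: the bit ranges in the comments of struct dax_header can be cross-checked with plain shifts and masks. The following is a minimal user-space sketch, not driver code. `hdr_fields`, `put` and `hdr_pack` are hypothetical names, and the individual sync flag positions (27, 26, 25, 24) are an assumption based on most-significant-bit-first bit-field allocation, since the source only states that the four flags occupy bits 27:24.

```c
#include <assert.h>
#include <stdint.h>

/* Field values for one CCB header word; names mirror struct dax_header. */
struct hdr_fields {
	unsigned ccb_version;		/* bits 31:28 */
	unsigned pipe;			/* bit 27 (assumed position) */
	unsigned longccb;		/* bit 26 (assumed position) */
	unsigned cond;			/* bit 25 (assumed position) */
	unsigned serial;		/* bit 24 (assumed position) */
	unsigned opcode;		/* bits 23:16 */
	unsigned table_addr_type;	/* bits 12:11 */
	unsigned out_addr_type;		/* bits 10:8 */
	unsigned sec_addr_type;		/* bits 7:5 */
	unsigned pri_addr_type;		/* bits 4:2 */
	unsigned cca_addr_type;		/* bits 1:0 */
};

/* Truncate value to width bits and place it at bit offset shift. */
static uint32_t put(unsigned value, unsigned width, unsigned shift)
{
	return ((uint32_t)value & ((1u << width) - 1u)) << shift;
}

/* Hypothetical helper: build the 32-bit header word from its fields. */
uint32_t hdr_pack(const struct hdr_fields *f)
{
	return put(f->ccb_version, 4, 28) |
	       put(f->pipe, 1, 27) |
	       put(f->longccb, 1, 26) |
	       put(f->cond, 1, 25) |
	       put(f->serial, 1, 24) |
	       put(f->opcode, 8, 16) |
	       put(f->table_addr_type, 2, 11) |
	       put(f->out_addr_type, 3, 8) |
	       put(f->sec_addr_type, 3, 5) |
	       put(f->pri_addr_type, 3, 2) |
	       put(f->cca_addr_type, 2, 0);
}
```

For example, a version 1 header with opcode DAX_OP_SCAN_VALUE | DAX_OP_INVERT (0x12), virtual output and primary addresses (type 3) and a real completion address (type 2) packs to 0x1012030E under these assumptions.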
147
148struct dax_control {
149 u32 pri_fmt:4; /* 31:28 Primary Input Format */
150 u32 pri_elem_size:5; /* 27:23 Primary Input Element Size (less 1) */
151 u32 pri_offset:3; /* 22:20 Primary Input Starting Offset */
152 u32 sec_encoding:1; /* 19 Secondary Input Encoding */
153 /* (must be 0 for Select) */
154 u32 sec_offset:3; /* 18:16 Secondary Input Starting Offset */
155 u32 sec_elem_size:2; /* 15:14 Secondary Input Element Size */
156 /* (must be 0 for Select) */
157 u32 out_fmt:2; /* 13:12 Output Format */
158 u32 out_elem_size:2; /* 11:10 Output Element Size */
159 u32 misc:10; /* 9:0 Opcode specific info */
160};
161
162struct dax_data_access {
163 u64 flow_ctrl:2; /* 63:62 Flow Control Type */
164 u64 pipe_target:2; /* 61:60 Pipeline Target */
165 u64 out_buf_size:20; /* 59:40 Output Buffer Size */
166 /* (cachelines less 1) */
167 u64 unused1:8; /* 39:32 Reserved, Set to 0 */
168 u64 out_alloc:5; /* 31:27 Output Allocation */
169 u64 unused2:1; /* 26 Reserved */
170 u64 pri_len_fmt:2; /* 25:24 Input Length Format */
171 u64 pri_len:24; /* 23:0 Input Element/Byte/Bit Count */
172 /* (less 1) */
173};
174
175struct dax_ccb {
176 struct dax_header hdr; /* CCB Header */
177 struct dax_control ctrl;/* Control Word */
178 void *ca; /* Completion Address */
179 void *pri; /* Primary Input Address */
180 struct dax_data_access dac; /* Data Access Control */
181 void *sec; /* Secondary Input Address */
182 u64 dword5; /* depends on opcode */
183 void *out; /* Output Address */
184 void *tbl; /* Table Address or bitmap */
185};
186
187struct dax_cca {
188 u8 status; /* user may mwait on this address */
189 u8 err; /* user visible error notification */
190 u8 rsvd[2]; /* reserved */
191 u32 n_remaining; /* for QP partial symbol warning */
192 u32 output_sz; /* output in bytes */
193 u32 rsvd2; /* reserved */
194 u64 run_cycles; /* run time in OCND2 cycles */
195 u64 run_stats; /* nothing reported in version 1.0 */
196 u32 n_processed; /* number input elements */
197 u32 rsvd3[5]; /* reserved */
198 u64 retval; /* command return value */
199 u64 rsvd4[8]; /* reserved */
200};
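
Note on completion area indexing: the driver converts between a completion area index and a byte offset by multiplying or dividing by sizeof(struct dax_cca). The sketch below mirrors the structure with fixed-width user-space types to show that arithmetic. `cca_mirror`, `ca_elems`, `ca_offset` and `ca_index` are hypothetical names, and the 16384-byte mapping length used in the example is an assumption, because DAX_MMAP_LEN is defined in a header that is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space mirror of struct dax_cca using fixed-width types. */
struct cca_mirror {
	uint8_t  status;
	uint8_t  err;
	uint8_t  rsvd[2];
	uint32_t n_remaining;
	uint32_t output_sz;
	uint32_t rsvd2;
	uint64_t run_cycles;
	uint64_t run_stats;
	uint32_t n_processed;
	uint32_t rsvd3[5];
	uint64_t retval;
	uint64_t rsvd4[8];
};

/* Number of completion areas that fit in a mapping of mmap_len bytes. */
size_t ca_elems(size_t mmap_len)
{
	return mmap_len / sizeof(struct cca_mirror);
}

/* Byte offset of completion area idx within the mapping. */
size_t ca_offset(size_t idx)
{
	return idx * sizeof(struct cca_mirror);
}

/* Inverse mapping: completion area index for a byte offset. */
size_t ca_index(size_t offset)
{
	return offset / sizeof(struct cca_mirror);
}
```

With this layout each completion area is 128 bytes on common ABIs, so index 3 sits at byte offset 384, and an assumed 16384-byte mapping would hold 128 completion areas.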
201
202/* per thread CCB context */
203struct dax_ctx {
204 struct dax_ccb *ccb_buf;
205 u64 ccb_buf_ra; /* cached RA of ccb_buf */
206 struct dax_cca *ca_buf;
207 u64 ca_buf_ra; /* cached RA of ca_buf */
208 struct page *pages[DAX_CA_ELEMS][NUM_STREAM_TYPES];
209 /* array of locked pages */
210 struct task_struct *owner; /* thread that owns ctx */
211 struct task_struct *client; /* requesting thread */
212 union ccb_result result;
213 u32 ccb_count;
214 u32 fail_count;
215};
216
217/* driver public entry points */
218static int dax_open(struct inode *inode, struct file *file);
219static ssize_t dax_read(struct file *filp, char __user *buf,
220 size_t count, loff_t *ppos);
221static ssize_t dax_write(struct file *filp, const char __user *buf,
222 size_t count, loff_t *ppos);
223static int dax_devmap(struct file *f, struct vm_area_struct *vma);
224static int dax_close(struct inode *i, struct file *f);
225
226static const struct file_operations dax_fops = {
227 .owner = THIS_MODULE,
228 .open = dax_open,
229 .read = dax_read,
230 .write = dax_write,
231 .mmap = dax_devmap,
232 .release = dax_close,
233};
234
235static int dax_ccb_exec(struct dax_ctx *ctx, const char __user *buf,
236 size_t count, loff_t *ppos);
237static int dax_ccb_info(u64 ca, struct ccb_info_result *info);
238static int dax_ccb_kill(u64 ca, u16 *kill_res);
239
240static struct cdev c_dev;
241static struct class *cl;
242static dev_t first;
243
244static int max_ccb_version;
245static int dax_debug;
246module_param(dax_debug, int, 0644);
247MODULE_PARM_DESC(dax_debug, "Debug flags");
248
249static int __init dax_attach(void)
250{
251 unsigned long dummy, hv_rv, major, minor, minor_requested, max_ccbs;
252 struct mdesc_handle *hp = mdesc_grab();
253 char *prop, *dax_name;
254 bool found = false;
255 int len, ret = 0;
256 u64 pn;
257
258 if (hp == NULL) {
259 dax_err("Unable to grab mdesc");
260 return -ENODEV;
261 }
262
263 mdesc_for_each_node_by_name(hp, pn, "virtual-device") {
264 prop = (char *)mdesc_get_property(hp, pn, "name", &len);
265 if (prop == NULL)
266 continue;
267 if (strncmp(prop, "dax", strlen("dax")))
268 continue;
269 dax_dbg("Found node 0x%llx = %s", pn, prop);
270
271 prop = (char *)mdesc_get_property(hp, pn, "compatible", &len);
272 if (prop == NULL)
273 continue;
274 dax_dbg("Node 0x%llx compatible = %s", pn, prop);
275 found = true;
276 break;
277 }
278
279 if (!found) {
280 dax_err("No DAX device found");
281 ret = -ENODEV;
282 goto done;
283 }
284
285 if (strncmp(prop, DAX2_STR, strlen(DAX2_STR)) == 0) {
286 dax_name = DAX_NAME "2";
287 major = DAX2_MAJOR;
288 minor_requested = DAX2_MINOR;
289 max_ccb_version = 1;
290 dax_dbg("MD indicates DAX2 coprocessor");
291 } else if (strncmp(prop, DAX1_STR, strlen(DAX1_STR)) == 0) {
292 dax_name = DAX_NAME "1";
293 major = DAX1_MAJOR;
294 minor_requested = DAX1_MINOR;
295 max_ccb_version = 0;
296 dax_dbg("MD indicates DAX1 coprocessor");
297 } else {
298 dax_err("Unknown dax type: %s", prop);
299 ret = -ENODEV;
300 goto done;
301 }
302
303 minor = minor_requested;
304 dax_dbg("Registering DAX HV api with major %ld minor %ld", major,
305 minor);
306 if (sun4v_hvapi_register(HV_GRP_DAX, major, &minor)) {
307 dax_err("hvapi_register failed");
308 ret = -ENODEV;
309 goto done;
310 } else {
311 dax_dbg("Max minor supported by HV = %ld (major %ld)", minor,
312 major);
313 minor = min(minor, minor_requested);
314 dax_dbg("registered DAX major %ld minor %ld", major, minor);
315 }
316
317 /* submit a zero length ccb array to query coprocessor queue size */
318 hv_rv = sun4v_ccb_submit(0, 0, HV_CCB_QUERY_CMD, 0, &max_ccbs, &dummy);
319 if (hv_rv != 0) {
320 dax_err("get_hwqueue_size failed with status=%ld and max_ccbs=%ld",
321 hv_rv, max_ccbs);
322 ret = -ENODEV;
323 goto done;
324 }
325
326 if (max_ccbs != DAX_MAX_CCBS) {
327 dax_err("HV reports unsupported max_ccbs=%ld", max_ccbs);
328 ret = -ENODEV;
329 goto done;
330 }
331
332 if (alloc_chrdev_region(&first, 0, 1, DAX_NAME) < 0) {
333 dax_err("alloc_chrdev_region failed");
334 ret = -ENXIO;
335 goto done;
336 }
337
338 cl = class_create(THIS_MODULE, DAX_NAME);
339 if (IS_ERR(cl)) {
340 dax_err("class_create failed");
341 ret = -ENXIO;
342 goto class_error;
343 }
344
345 if (IS_ERR(device_create(cl, NULL, first, NULL, dax_name))) {
346 dax_err("device_create failed");
347 ret = -ENXIO;
348 goto device_error;
349 }
350
351 cdev_init(&c_dev, &dax_fops);
352 if (cdev_add(&c_dev, first, 1) < 0) {
353 dax_err("cdev_add failed");
354 ret = -ENXIO;
355 goto cdev_error;
356 }
357
358 pr_info("Attached DAX module\n");
359 goto done;
360
361cdev_error:
362 device_destroy(cl, first);
363device_error:
364 class_destroy(cl);
365class_error:
366 unregister_chrdev_region(first, 1);
367done:
368 mdesc_release(hp);
369 return ret;
370}
371module_init(dax_attach);
372
373static void __exit dax_detach(void)
374{
375 pr_info("Cleaning up DAX module\n");
376 cdev_del(&c_dev);
377 device_destroy(cl, first);
378 class_destroy(cl);
379 unregister_chrdev_region(first, 1);
380}
381module_exit(dax_detach);
382
383/* map completion area */
384static int dax_devmap(struct file *f, struct vm_area_struct *vma)
385{
386 struct dax_ctx *ctx = (struct dax_ctx *)f->private_data;
387 size_t len = vma->vm_end - vma->vm_start;
388
389 dax_dbg("len=0x%lx, flags=0x%lx", len, vma->vm_flags);
390
391 if (ctx->owner != current) {
392 dax_dbg("devmap called from wrong thread");
393 return -EINVAL;
394 }
395
396 if (len != DAX_MMAP_LEN) {
397 dax_dbg("len(%lu) != DAX_MMAP_LEN(%d)", len, DAX_MMAP_LEN);
398 return -EINVAL;
399 }
400
401 /* completion area is mapped read-only for user */
402 if (vma->vm_flags & VM_WRITE)
403 return -EPERM;
404 vma->vm_flags &= ~VM_MAYWRITE;
405
406 if (remap_pfn_range(vma, vma->vm_start, ctx->ca_buf_ra >> PAGE_SHIFT,
407 len, vma->vm_page_prot))
408 return -EAGAIN;
409
410 dax_dbg("mmapped completion area at uva 0x%lx", vma->vm_start);
411 return 0;
412}
413
414/* Unlock user pages. Called during dequeue or device close */
415static void dax_unlock_pages(struct dax_ctx *ctx, int ccb_index, int nelem)
416{
417 int i, j;
418
419 for (i = ccb_index; i < ccb_index + nelem; i++) {
420 for (j = 0; j < NUM_STREAM_TYPES; j++) {
421 struct page *p = ctx->pages[i][j];
422
423 if (p) {
424 dax_dbg("freeing page %p", p);
425 if (j == OUT)
426 set_page_dirty(p);
427 put_page(p);
428 ctx->pages[i][j] = NULL;
429 }
430 }
431 }
432}
433
434static int dax_lock_page(void *va, struct page **p)
435{
436 int ret;
437
438 dax_dbg("uva %p", va);
439
440 ret = get_user_pages_fast((unsigned long)va, 1, 1, p);
441 if (ret == 1) {
442 dax_dbg("locked page %p, for VA %p", *p, va);
443 return 0;
444 }
445
446 dax_dbg("get_user_pages_fast failed, va=%p, ret=%d", va, ret);
447 return -1;
448}
449
450static int dax_lock_pages(struct dax_ctx *ctx, int idx,
451 int nelem, u64 *err_va)
452{
453 int i;
454
455 for (i = 0; i < nelem; i++) {
456 struct dax_ccb *ccbp = &ctx->ccb_buf[i];
457
458 /*
459 * For each address in the CCB whose type is virtual,
460 * lock the page and change the type to virtual alternate
461 * context. On error, return the offending address in
462 * err_va.
463 */
464 if (ccbp->hdr.out_addr_type == DAX_ADDR_TYPE_VA) {
465 dax_dbg("output");
466 if (dax_lock_page(ccbp->out,
467 &ctx->pages[i + idx][OUT]) != 0) {
468 *err_va = (u64)ccbp->out;
469 goto error;
470 }
471 ccbp->hdr.out_addr_type = DAX_ADDR_TYPE_VA_ALT;
472 }
473
474 if (ccbp->hdr.pri_addr_type == DAX_ADDR_TYPE_VA) {
475 dax_dbg("input");
476 if (dax_lock_page(ccbp->pri,
477 &ctx->pages[i + idx][PRI]) != 0) {
478 *err_va = (u64)ccbp->pri;
479 goto error;
480 }
481 ccbp->hdr.pri_addr_type = DAX_ADDR_TYPE_VA_ALT;
482 }
483
484 if (ccbp->hdr.sec_addr_type == DAX_ADDR_TYPE_VA) {
485 dax_dbg("sec input");
486 if (dax_lock_page(ccbp->sec,
487 &ctx->pages[i + idx][SEC]) != 0) {
488 *err_va = (u64)ccbp->sec;
489 goto error;
490 }
491 ccbp->hdr.sec_addr_type = DAX_ADDR_TYPE_VA_ALT;
492 }
493
494 if (ccbp->hdr.table_addr_type == DAX_ADDR_TYPE_VA) {
495 dax_dbg("tbl");
496 if (dax_lock_page(ccbp->tbl,
497 &ctx->pages[i + idx][TBL]) != 0) {
498 *err_va = (u64)ccbp->tbl;
499 goto error;
500 }
501 ccbp->hdr.table_addr_type = DAX_ADDR_TYPE_VA_ALT;
502 }
503
504 /* skip over 2nd 64 bytes of long CCB */
505 if (ccbp->hdr.longccb)
506 i++;
507 }
508 return DAX_SUBMIT_OK;
509
510error:
511 dax_unlock_pages(ctx, idx, nelem);
512 return DAX_SUBMIT_ERR_NOACCESS;
513}
514
515static void dax_ccb_wait(struct dax_ctx *ctx, int idx)
516{
517 int ret, nretries;
518 u16 kill_res;
519
520 dax_dbg("idx=%d", idx);
521
522 for (nretries = 0; nretries < DAX_CCB_RETRIES; nretries++) {
523 if (ctx->ca_buf[idx].status == CCA_STAT_NOT_COMPLETED)
524 udelay(DAX_CCB_USEC);
525 else
526 return;
527 }
528 dax_dbg("ctx (%p): CCB[%d] timed out, wait usec=%d, retries=%d. Killing ccb",
529 (void *)ctx, idx, DAX_CCB_USEC, DAX_CCB_RETRIES);
530
531 ret = dax_ccb_kill(ctx->ca_buf_ra + idx * sizeof(struct dax_cca),
532 &kill_res);
533 dax_dbg("Kill CCB[%d] %s", idx, ret ? "failed" : "succeeded");
534}
535
536static int dax_close(struct inode *ino, struct file *f)
537{
538 struct dax_ctx *ctx = (struct dax_ctx *)f->private_data;
539 int i;
540
541 f->private_data = NULL;
542
543 for (i = 0; i < DAX_CA_ELEMS; i++) {
544 if (ctx->ca_buf[i].status == CCA_STAT_NOT_COMPLETED) {
545 dax_dbg("CCB[%d] not completed", i);
546 dax_ccb_wait(ctx, i);
547 }
548 dax_unlock_pages(ctx, i, 1);
549 }
550
551 kfree(ctx->ccb_buf);
552 kfree(ctx->ca_buf);
553 dax_stat_dbg("CCBs: %d good, %d bad", ctx->ccb_count, ctx->fail_count);
554 kfree(ctx);
555
556 return 0;
557}
558
559static ssize_t dax_read(struct file *f, char __user *buf,
560 size_t count, loff_t *ppos)
561{
562 struct dax_ctx *ctx = f->private_data;
563
564 if (ctx->client != current)
565 return -EUSERS;
566
567 ctx->client = NULL;
568
569 if (count != sizeof(union ccb_result))
570 return -EINVAL;
571 if (copy_to_user(buf, &ctx->result, sizeof(union ccb_result)))
572 return -EFAULT;
573 return count;
574}
575
576static ssize_t dax_write(struct file *f, const char __user *buf,
577 size_t count, loff_t *ppos)
578{
579 struct dax_ctx *ctx = f->private_data;
580 struct dax_command hdr;
581 unsigned long ca;
582 int i, idx, ret;
583
584 if (ctx->client != NULL)
585 return -EINVAL;
586
587 if (count == 0 || count > DAX_MAX_CCBS * sizeof(struct dax_ccb))
588 return -EINVAL;
589
590 if (count % sizeof(struct dax_ccb) == 0)
591 return dax_ccb_exec(ctx, buf, count, ppos); /* CCB EXEC */
592
593 if (count != sizeof(struct dax_command))
594 return -EINVAL;
595
596 /* immediate command */
597 if (ctx->owner != current)
598 return -EUSERS;
599
600 if (copy_from_user(&hdr, buf, sizeof(hdr)))
601 return -EFAULT;
602
603 ca = ctx->ca_buf_ra + hdr.ca_offset;
604
605 switch (hdr.command) {
606 case CCB_KILL:
607 if (hdr.ca_offset >= DAX_MMAP_LEN) {
608 dax_dbg("invalid ca_offset (%d) >= ca_buflen (%d)",
609 hdr.ca_offset, DAX_MMAP_LEN);
610 return -EINVAL;
611 }
612
613 ret = dax_ccb_kill(ca, &ctx->result.kill.action);
614 if (ret != 0) {
615 dax_dbg("dax_ccb_kill failed (ret=%d)", ret);
616 return ret;
617 }
618
619 dax_info_dbg("killed (ca_offset %d)", hdr.ca_offset);
620 idx = hdr.ca_offset / sizeof(struct dax_cca);
621 ctx->ca_buf[idx].status = CCA_STAT_KILLED;
622 ctx->ca_buf[idx].err = CCA_ERR_KILLED;
623 ctx->client = current;
624 return count;
625
626 case CCB_INFO:
627 if (hdr.ca_offset >= DAX_MMAP_LEN) {
628 dax_dbg("invalid ca_offset (%d) >= ca_buflen (%d)",
629 hdr.ca_offset, DAX_MMAP_LEN);
630 return -EINVAL;
631 }
632
633 ret = dax_ccb_info(ca, &ctx->result.info);
634 if (ret != 0) {
635 dax_dbg("dax_ccb_info failed (ret=%d)", ret);
636 return ret;
637 }
638
639 dax_info_dbg("info succeeded on ca_offset %d", hdr.ca_offset);
640 ctx->client = current;
641 return count;
642
643 case CCB_DEQUEUE:
644 for (i = 0; i < DAX_CA_ELEMS; i++) {
645 if (ctx->ca_buf[i].status !=
646 CCA_STAT_NOT_COMPLETED)
647 dax_unlock_pages(ctx, i, 1);
648 }
649 return count;
650
651 default:
652 return -EINVAL;
653 }
654}
655
656static int dax_open(struct inode *inode, struct file *f)
657{
658 struct dax_ctx *ctx = NULL;
659 int i;
660
661 ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);
662 if (ctx == NULL)
663 goto done;
664
665 ctx->ccb_buf = kcalloc(DAX_MAX_CCBS, sizeof(struct dax_ccb),
666 GFP_KERNEL);
667 if (ctx->ccb_buf == NULL)
668 goto done;
669
670 ctx->ccb_buf_ra = virt_to_phys(ctx->ccb_buf);
671 dax_dbg("ctx->ccb_buf=0x%p, ccb_buf_ra=0x%llx",
672 (void *)ctx->ccb_buf, ctx->ccb_buf_ra);
673
674 /* allocate CCB completion area buffer */
675 ctx->ca_buf = kzalloc(DAX_MMAP_LEN, GFP_KERNEL);
676 if (ctx->ca_buf == NULL)
677 goto alloc_error;
678 for (i = 0; i < DAX_CA_ELEMS; i++)
679 ctx->ca_buf[i].status = CCA_STAT_COMPLETED;
680
681 ctx->ca_buf_ra = virt_to_phys(ctx->ca_buf);
682 dax_dbg("ctx=0x%p, ctx->ca_buf=0x%p, ca_buf_ra=0x%llx",
683 (void *)ctx, (void *)ctx->ca_buf, ctx->ca_buf_ra);
684
685 ctx->owner = current;
686 f->private_data = ctx;
687 return 0;
688
689alloc_error:
690 kfree(ctx->ccb_buf);
691done:
692 if (ctx != NULL)
693 kfree(ctx);
694 return -ENOMEM;
695}
696
697static char *dax_hv_errno(unsigned long hv_ret, int *ret)
698{
699 switch (hv_ret) {
700 case HV_EBADALIGN:
701 *ret = -EFAULT;
702 return "HV_EBADALIGN";
703 case HV_ENORADDR:
704 *ret = -EFAULT;
705 return "HV_ENORADDR";
706 case HV_EINVAL:
707 *ret = -EINVAL;
708 return "HV_EINVAL";
709 case HV_EWOULDBLOCK:
710 *ret = -EAGAIN;
711 return "HV_EWOULDBLOCK";
712 case HV_ENOACCESS:
713 *ret = -EPERM;
714 return "HV_ENOACCESS";
715 default:
716 break;
717 }
718
719 *ret = -EIO;
720 return "UNKNOWN";
721}
722
723static int dax_ccb_kill(u64 ca, u16 *kill_res)
724{
725 unsigned long hv_ret;
726 int count, ret = 0;
727 char *err_str;
728
729 for (count = 0; count < DAX_CCB_RETRIES; count++) {
730 dax_dbg("attempting kill on ca_ra 0x%llx", ca);
731 hv_ret = sun4v_ccb_kill(ca, kill_res);
732
733 if (hv_ret == HV_EOK) {
734 ret = 0; /* clear a stale -EAGAIN from an earlier attempt */
735 dax_info_dbg("HV_EOK (ca_ra 0x%llx): %d", ca, *kill_res);
736 } else {
737 err_str = dax_hv_errno(hv_ret, &ret);
738 dax_dbg("%s (ca_ra 0x%llx)", err_str, ca);
739 }
740
741 if (ret != -EAGAIN)
742 return ret;
743 dax_info_dbg("ccb_kill count = %d", count);
744 udelay(DAX_CCB_USEC);
745 }
746
747 return -EAGAIN;
748}
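
Note on the retry loop: a bounded retry like the one in dax_ccb_kill() is easiest to reason about when the result of each attempt is derived only from that attempt's status, so that a transient "would block" followed by success ends the loop with 0. The sketch below simulates this in user space. The `hv_status` values, `fake_hv`, `fake_kill`, `kill_with_retry` and `demo_retry` are all hypothetical stand-ins rather than the hypervisor API, and the delay between attempts that the driver performs with udelay() is omitted.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-ins for hypervisor status codes; not the real HV_* values. */
enum hv_status { HV_OK, HV_WOULDBLOCK, HV_INVALID };

/* Fake hypervisor that reports "busy" for the first busy_left calls. */
struct fake_hv {
	int busy_left;
	int calls;
};

static enum hv_status fake_kill(struct fake_hv *hv)
{
	hv->calls++;
	if (hv->busy_left > 0) {
		hv->busy_left--;
		return HV_WOULDBLOCK;
	}
	return HV_OK;
}

/* Map a simulated hypervisor status to a negative errno value. */
static int hv_to_errno(enum hv_status s)
{
	switch (s) {
	case HV_OK:
		return 0;
	case HV_WOULDBLOCK:
		return -EAGAIN;
	default:
		return -EINVAL;
	}
}

/* Retry while the call is transiently busy, up to max_retries attempts. */
int kill_with_retry(struct fake_hv *hv, int max_retries)
{
	int attempt;

	for (attempt = 0; attempt < max_retries; attempt++) {
		/* ret is computed fresh for every attempt */
		int ret = hv_to_errno(fake_kill(hv));

		if (ret != -EAGAIN)
			return ret;
	}
	return -EAGAIN;
}

/* Run the loop against a fake that is busy for busy_attempts calls. */
int demo_retry(int busy_attempts, int max_retries)
{
	struct fake_hv hv = { .busy_left = busy_attempts, .calls = 0 };

	return kill_with_retry(&hv, max_retries);
}
```

In this model, demo_retry(3, 5) succeeds on the fourth attempt, while demo_retry(5, 5) exhausts its attempts and reports -EAGAIN.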
749
750static int dax_ccb_info(u64 ca, struct ccb_info_result *info)
751{
752 unsigned long hv_ret;
753 char *err_str;
754 int ret = 0;
755
756 dax_dbg("attempting info on ca_ra 0x%llx", ca);
757 hv_ret = sun4v_ccb_info(ca, info);
758
759 if (hv_ret == HV_EOK) {
760 dax_info_dbg("HV_EOK (ca_ra 0x%llx): %d", ca, info->state);
761 if (info->state == DAX_CCB_ENQUEUED) {
762 dax_info_dbg("dax_unit %d, queue_num %d, queue_pos %d",
763 info->inst_num, info->q_num, info->q_pos);
764 }
765 } else {
766 err_str = dax_hv_errno(hv_ret, &ret);
767 dax_dbg("%s (ca_ra 0x%llx)", err_str, ca);
768 }
769
770 return ret;
771}
772
773static void dax_prt_ccbs(struct dax_ccb *ccb, int nelem)
774{
775 int i, j;
776 u64 *ccbp;
777
778 dax_dbg("ccb buffer:");
779 for (i = 0; i < nelem; i++) {
780 ccbp = (u64 *)&ccb[i];
781 dax_dbg(" %sccb[%d]", ccb[i].hdr.longccb ? "long " : "", i);
782 for (j = 0; j < 8; j++)
783 dax_dbg("\tccb[%d].dwords[%d]=0x%llx",
784 i, j, *(ccbp + j));
785 }
786}
787
788/*
789 * Validates user CCB content. Also sets completion address and address types
790 * for all addresses contained in CCB.
791 */
792static int dax_preprocess_usr_ccbs(struct dax_ctx *ctx, int idx, int nelem)
793{
794 int i;
795
796 /*
797 * The user is not allowed to specify real address types in
798 * the CCB header. This must be enforced by the kernel before
799 * submitting the CCBs to HV. The only allowed values for all
800 * address fields are VA or IMM (encoded as DAX_ADDR_TYPE_NONE).
801 */
802 for (i = 0; i < nelem; i++) {
803 struct dax_ccb *ccbp = &ctx->ccb_buf[i];
804 unsigned long ca_offset;
805
806 if (ccbp->hdr.ccb_version > max_ccb_version)
807 return DAX_SUBMIT_ERR_CCB_INVAL;
808
809 switch (ccbp->hdr.opcode) {
810 case DAX_OP_SYNC_NOP:
811 case DAX_OP_EXTRACT:
812 case DAX_OP_SCAN_VALUE:
813 case DAX_OP_SCAN_RANGE:
814 case DAX_OP_TRANSLATE:
815 case DAX_OP_SCAN_VALUE | DAX_OP_INVERT:
816 case DAX_OP_SCAN_RANGE | DAX_OP_INVERT:
817 case DAX_OP_TRANSLATE | DAX_OP_INVERT:
818 case DAX_OP_SELECT:
819 break;
820 default:
821 return DAX_SUBMIT_ERR_CCB_INVAL;
822 }
823
824 if (ccbp->hdr.out_addr_type != DAX_ADDR_TYPE_VA &&
825 ccbp->hdr.out_addr_type != DAX_ADDR_TYPE_NONE) {
826 dax_dbg("invalid out_addr_type in user CCB[%d]", i);
827 return DAX_SUBMIT_ERR_CCB_INVAL;
828 }
829
830 if (ccbp->hdr.pri_addr_type != DAX_ADDR_TYPE_VA &&
831 ccbp->hdr.pri_addr_type != DAX_ADDR_TYPE_NONE) {
832 dax_dbg("invalid pri_addr_type in user CCB[%d]", i);
833 return DAX_SUBMIT_ERR_CCB_INVAL;
834 }
835
836 if (ccbp->hdr.sec_addr_type != DAX_ADDR_TYPE_VA &&
837 ccbp->hdr.sec_addr_type != DAX_ADDR_TYPE_NONE) {
838 dax_dbg("invalid sec_addr_type in user CCB[%d]", i);
839 return DAX_SUBMIT_ERR_CCB_INVAL;
840 }
841
842 if (ccbp->hdr.table_addr_type != DAX_ADDR_TYPE_VA &&
843 ccbp->hdr.table_addr_type != DAX_ADDR_TYPE_NONE) {
844 dax_dbg("invalid table_addr_type in user CCB[%d]", i);
845 return DAX_SUBMIT_ERR_CCB_INVAL;
846 }
847
848 /* set completion (real) address and address type */
849 ccbp->hdr.cca_addr_type = DAX_ADDR_TYPE_RA;
850 ca_offset = (idx + i) * sizeof(struct dax_cca);
851 ccbp->ca = (void *)ctx->ca_buf_ra + ca_offset;
852 memset(&ctx->ca_buf[idx + i], 0, sizeof(struct dax_cca));
853
854 dax_dbg("ccb[%d]=%p, ca_offset=0x%lx, compl RA=0x%llx",
855 i, ccbp, ca_offset, ctx->ca_buf_ra + ca_offset);
856
857 /* skip over 2nd 64 bytes of long CCB */
858 if (ccbp->hdr.longccb)
859 i++;
860 }
861
862 return DAX_SUBMIT_OK;
863}
864
865static int dax_ccb_exec(struct dax_ctx *ctx, const char __user *buf,
866 size_t count, loff_t *ppos)
867{
868 unsigned long accepted_len, hv_rv;
869 int i, idx, nccbs, naccepted;
870
871 ctx->client = current;
872 idx = *ppos;
873 nccbs = count / sizeof(struct dax_ccb);
874
875 if (ctx->owner != current) {
876 dax_dbg("wrong thread");
877 ctx->result.exec.status = DAX_SUBMIT_ERR_THR_INIT;
878 return 0;
879 }
880 dax_dbg("args: ccb_buf_len=%ld, idx=%d", count, idx);
881
882 /* for given index and length, verify ca_buf range exists */
883 if (idx < 0 || idx > (DAX_CA_ELEMS - nccbs)) {
884 ctx->result.exec.status = DAX_SUBMIT_ERR_NO_CA_AVAIL;
885 return 0;
886 }
887
888 /*
889 * Copy CCBs into kernel buffer to prevent modification by the
890 * user in between validation and submission.
891 */
892 if (copy_from_user(ctx->ccb_buf, buf, count)) {
893 dax_dbg("copyin of user CCB buffer failed");
894 ctx->result.exec.status = DAX_SUBMIT_ERR_CCB_ARR_MMU_MISS;
895 return 0;
896 }
897
898 /* check to see if ca_buf[idx] .. ca_buf[idx + nccbs - 1] are available */
899 for (i = idx; i < idx + nccbs; i++) {
900 if (ctx->ca_buf[i].status == CCA_STAT_NOT_COMPLETED) {
901 dax_dbg("CA range not available, dequeue needed");
902 ctx->result.exec.status = DAX_SUBMIT_ERR_NO_CA_AVAIL;
903 return 0;
904 }
905 }
906 dax_unlock_pages(ctx, idx, nccbs);
907
908 ctx->result.exec.status = dax_preprocess_usr_ccbs(ctx, idx, nccbs);
909 if (ctx->result.exec.status != DAX_SUBMIT_OK)
910 return 0;
911
912 ctx->result.exec.status = dax_lock_pages(ctx, idx, nccbs,
913 &ctx->result.exec.status_data);
914 if (ctx->result.exec.status != DAX_SUBMIT_OK)
915 return 0;
916
917 if (dax_debug & DAX_DBG_FLG_BASIC)
918 dax_prt_ccbs(ctx->ccb_buf, nccbs);
919
920 hv_rv = sun4v_ccb_submit(ctx->ccb_buf_ra, count,
921 HV_CCB_QUERY_CMD | HV_CCB_VA_SECONDARY, 0,
922 &accepted_len, &ctx->result.exec.status_data);
923
924 switch (hv_rv) {
925 case HV_EOK:
926 /*
927 * Hcall succeeded with no errors but the accepted
928 * length may be less than the requested length. The
929 * only way the driver can resubmit the remainder is
930 * to wait for completion of the submitted CCBs since
931 * there is no way to guarantee the ordering semantics
932 * required by the client applications. Therefore we
933 * let the user library deal with resubmissions.
934 */
935 ctx->result.exec.status = DAX_SUBMIT_OK;
936 break;
937 case HV_EWOULDBLOCK:
938 /*
939 * This is a transient HV API error. The user library
940 * can retry.
941 */
942 dax_dbg("hcall returned HV_EWOULDBLOCK");
943 ctx->result.exec.status = DAX_SUBMIT_ERR_WOULDBLOCK;
944 break;
945 case HV_ENOMAP:
946 /*
947 * HV was unable to translate a VA. The VA it could
948 * not translate is returned in the status_data param.
949 */
950 dax_dbg("hcall returned HV_ENOMAP");
951 ctx->result.exec.status = DAX_SUBMIT_ERR_NOMAP;
952 break;
953 case HV_EINVAL:
954 /*
955 * This is the result of an invalid user CCB as HV is
956 * validating some of the user CCB fields. Pass this
957 * error back to the user. There is no supporting info
958 * to isolate the invalid field.
959 */
960 dax_dbg("hcall returned HV_EINVAL");
961 ctx->result.exec.status = DAX_SUBMIT_ERR_CCB_INVAL;
962 break;
963 case HV_ENOACCESS:
964 /*
965 * HV found a VA that did not have the appropriate
966 * permissions (such as the w bit). The VA in question
967 * is returned in status_data param.
968 */
969 dax_dbg("hcall returned HV_ENOACCESS");
970 ctx->result.exec.status = DAX_SUBMIT_ERR_NOACCESS;
971 break;
972 case HV_EUNAVAILABLE:
973 /*
974 * The requested CCB operation could not be performed
975 * at this time. Return the specific unavailable code
976 * in the status_data field.
977 */
978 dax_dbg("hcall returned HV_EUNAVAILABLE");
979 ctx->result.exec.status = DAX_SUBMIT_ERR_UNAVAIL;
980 break;
981 default:
982 ctx->result.exec.status = DAX_SUBMIT_ERR_INTERNAL;
983 dax_dbg("unknown hcall return value (%ld)", hv_rv);
984 break;
985 }
986
987 /* unlock pages associated with the unaccepted CCBs */
988 naccepted = accepted_len / sizeof(struct dax_ccb);
989 dax_unlock_pages(ctx, idx + naccepted, nccbs - naccepted);
990
991 /* mark unaccepted CCBs as completed so their CA slots can be reused */
992 for (i = idx + naccepted; i < idx + nccbs; i++)
993 ctx->ca_buf[i].status = CCA_STAT_COMPLETED;
994
995 ctx->ccb_count += naccepted;
996 ctx->fail_count += nccbs - naccepted;
997
998 dax_dbg("hcall rv=%ld, accepted_len=%ld, status_data=0x%llx, ret status=%d",
999 hv_rv, accepted_len, ctx->result.exec.status_data,
1000 ctx->result.exec.status);
1001
1002 if (count == accepted_len)
1003 ctx->client = NULL; /* no read needed to complete protocol */
1004 return accepted_len;
1005}
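
Note on index arithmetic in dax_ccb_exec(): two small calculations carry most of the bookkeeping, namely whether completion areas idx .. idx + nccbs - 1 fit in the completion area array, and which completion area is the first one left unaccepted after a partial submission. The sketch below expresses both as pure functions. `ca_range_ok` and `first_unaccepted` are hypothetical names, the range check shown is one inclusive formulation that rejects negative indices and permits a range ending at the last element, and the array size of 128 in the example is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Nonzero when completion areas idx .. idx + nccbs - 1 all lie inside
 * an array of ca_elems entries.
 */
int ca_range_ok(long idx, long nccbs, long ca_elems)
{
	return nccbs > 0 && nccbs <= ca_elems &&
	       idx >= 0 && idx <= ca_elems - nccbs;
}

/*
 * Index of the first completion area whose CCB was not accepted, given
 * the accepted byte length and the size of one CCB (64 bytes in the
 * driver, per the "2nd 64 bytes of long CCB" comments).
 */
long first_unaccepted(long idx, size_t accepted_len, size_t ccb_size)
{
	return idx + (long)(accepted_len / ccb_size);
}
```

For an assumed array of 128 completion areas, a 15-CCB submission starting at index 113 fits exactly, one starting at 114 does not, and a submission at index 4 with 128 accepted bytes leaves index 6 as the first slot the user library would need to resubmit.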