Diffstat (limited to 'Documentation/driver-api')

 Documentation/driver-api/index.rst           |   1 +
 Documentation/driver-api/nvdimm/btt.rst      | 285 ++++++++
 Documentation/driver-api/nvdimm/index.rst    |  10 +
 Documentation/driver-api/nvdimm/nvdimm.rst   | 887 ++++++++++++++
 Documentation/driver-api/nvdimm/security.rst | 143 ++++
 5 files changed, 1326 insertions(+), 0 deletions(-)
diff --git a/Documentation/driver-api/index.rst b/Documentation/driver-api/index.rst
index d665cd9ab95f..410dd7110772 100644
--- a/Documentation/driver-api/index.rst
+++ b/Documentation/driver-api/index.rst
@@ -44,6 +44,7 @@ available subsections can be seen below.
    mtdnand
    miscellaneous
    mei/index
+   nvdimm/index
    w1
    rapidio/index
    s390-drivers
diff --git a/Documentation/driver-api/nvdimm/btt.rst b/Documentation/driver-api/nvdimm/btt.rst
new file mode 100644
index 000000000000..2d8269f834bd
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/btt.rst
@@ -0,0 +1,285 @@
=============================
BTT - Block Translation Table
=============================


1. Introduction
===============

Persistent memory based storage is able to perform IO at byte (or more
accurately, cache line) granularity. However, we often want to expose such
storage as traditional block devices. The block drivers for persistent memory
will do exactly this. However, they do not provide any atomicity guarantees.
Traditional SSDs typically provide protection against torn sectors in hardware,
using stored energy in capacitors to complete in-flight block writes, or perhaps
in firmware. We don't have this luxury with persistent memory - if a write is in
progress, and we experience a power failure, the block will contain a mix of old
and new data. Applications may not be prepared to handle such a scenario.

The Block Translation Table (BTT) provides atomic sector update semantics for
persistent memory devices, so that applications that rely on sector writes not
being torn can continue to do so. The BTT manifests itself as a stacked block
device, and reserves a portion of the underlying storage for its metadata. At
the heart of it is an indirection table that re-maps all the blocks on the
volume. It can be thought of as an extremely simple file system that only
provides atomic sector updates.


2. Static Layout
================

The underlying storage on which a BTT can be laid out is not limited in any way.
The BTT, however, splits the available space into chunks of up to 512 GiB,
called "Arenas".

Each arena follows the same layout for its metadata, and all references in an
arena are internal to it (with the exception of one field that points to the
next arena). The following depicts the "On-disk" metadata layout::


    Backing Store     +------->  Arena
  +---------------+   |   +------------------+
  |               |   |   | Arena info block |
  |    Arena 0    +---+   |       4K         |
  |     512G      |       +------------------+
  |               |       |                  |
  +---------------+       |                  |
  |               |       |                  |
  |    Arena 1    |       |   Data Blocks    |
  |     512G      |       |                  |
  |               |       |                  |
  +---------------+       |                  |
  |       .       |       |                  |
  |       .       |       |                  |
  |       .       |       |                  |
  |               |       |                  |
  |               |       |                  |
  +---------------+       +------------------+
                          |                  |
                          |     BTT Map      |
                          |                  |
                          |                  |
                          +------------------+
                          |                  |
                          |     BTT Flog     |
                          |                  |
                          +------------------+
                          | Info block copy  |
                          |       4K         |
                          +------------------+


3. Theory of Operation
======================


a. The BTT Map
--------------

The map is a simple lookup/indirection table that maps an LBA to an internal
block. Each map entry is 32 bits. The two most significant bits are special
flags, and the remaining form the internal block number.

======== =============================================================
Bit      Description
======== =============================================================
31 - 30  Error and Zero flags - Used in the following way:

         == ==  ====================================================
         31 30  Description
         == ==  ====================================================
         0  0   Initial state. Reads return zeroes; Premap = Postmap
         0  1   Zero state: Reads return zeroes
         1  0   Error state: Reads fail; Writes clear 'E' bit
         1  1   Normal Block – has valid postmap
         == ==  ====================================================

29 - 0   Mappings to internal 'postmap' blocks
======== =============================================================
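The flag encoding above can be sketched as a small C decoder. This is an illustrative sketch, not the kernel's implementation; the macro and function names are invented:

```c
#include <assert.h>
#include <stdint.h>

#define MAP_ERR_MASK	(1u << 31)		/* bit 31: 'E' flag */
#define MAP_ZERO_MASK	(1u << 30)		/* bit 30: 'Z' flag */
#define MAP_ABA_MASK	((1u << 30) - 1)	/* bits 29..0: postmap ABA */

enum map_state { MAP_INITIAL, MAP_ZERO, MAP_ERROR, MAP_NORMAL };

static enum map_state map_entry_state(uint32_t entry)
{
	int e = !!(entry & MAP_ERR_MASK);
	int z = !!(entry & MAP_ZERO_MASK);

	if (!e && !z)
		return MAP_INITIAL;	/* reads return zeroes; premap == postmap */
	if (!e && z)
		return MAP_ZERO;	/* reads return zeroes */
	if (e && !z)
		return MAP_ERROR;	/* reads fail; a write clears 'E' */
	return MAP_NORMAL;		/* bits 29..0 hold a valid postmap ABA */
}

static uint32_t map_entry_postmap(uint32_t entry)
{
	return entry & MAP_ABA_MASK;	/* strip the two flag bits */
}
```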


Some of the terminology that will be subsequently used:

============ ================================================================
External LBA LBA as made visible to upper layers.
ABA          Arena Block Address - Block offset/number within an arena
Premap ABA   The block offset into an arena, which was decided upon by range
             checking the External LBA
Postmap ABA  The block number in the "Data Blocks" area obtained after
             indirection from the map
nfree        The number of free blocks that are maintained at any given time.
             This is the number of concurrent writes that can happen to the
             arena.
============ ================================================================


For example, after adding a BTT, we surface a disk of 1024G. We get a read for
the external LBA at 768G. This falls into the second arena, and of the 512G
worth of blocks that this arena contributes, this block is at 256G. Thus, the
premap ABA is 256G. We now refer to the map, and find out the mapping for block
'X' (256G) points to block 'Y', say '64'. Thus the postmap ABA is 64.
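The arena/premap split in this example is plain integer arithmetic. A sketch, assuming 512 GiB arenas of 4K blocks; the names are illustrative, not the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* blocks per 512 GiB arena, with 4K data blocks (an assumed block size) */
#define ARENA_BLOCKS	((512ULL << 30) / 4096)

struct premap {
	uint32_t arena;		/* which arena the external block falls in */
	uint64_t premap_aba;	/* block offset within that arena */
};

static struct premap split_lba(uint64_t external_block)
{
	struct premap p;

	p.arena = external_block / ARENA_BLOCKS;	/* range check */
	p.premap_aba = external_block % ARENA_BLOCKS;	/* offset into arena */
	return p;
}
```

For the 1024G disk above, the external block at byte offset 768G lands in arena 1 at premap offset 256G, matching the worked example.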


b. The BTT Flog
---------------

The BTT provides sector atomicity by making every write an "allocating write",
i.e. every write goes to a "free" block. A running list of free blocks is
maintained in the form of the BTT flog. 'Flog' is a combination of the words
"free list" and "log". The flog contains 'nfree' entries, and an entry contains:

======== =====================================================================
lba      The premap ABA that is being written to
old_map  The old postmap ABA - after 'this' write completes, this will be a
         free block.
new_map  The new postmap ABA. The map will be updated to reflect this
         lba->postmap_aba mapping, but we log it here in case we have to
         recover.
seq      Sequence number to mark which of the 2 sections of this flog entry is
         valid/newest. It cycles between 01->10->11->01 (binary) under normal
         operation, with 00 indicating an uninitialized state.
lba'     alternate lba entry
old_map' alternate old postmap entry
new_map' alternate new postmap entry
seq'     alternate sequence number.
======== =====================================================================

Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
done such that for any entry being written, it:

a. overwrites the 'old' section in the entry based on sequence numbers
b. writes the 'new' section such that the sequence number is written last.
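The entry layout and the sequence-number cycle can be sketched as follows. The field names follow the table above; the exact padding layout and the helper name are illustrative:

```c
#include <assert.h>
#include <stdint.h>

struct log_entry {
	uint32_t lba;		/* premap ABA being written */
	uint32_t old_map;	/* postmap ABA freed once this write completes */
	uint32_t new_map;	/* postmap ABA the data went to */
	uint32_t seq;		/* 01 -> 10 -> 11 -> 01; 00 = uninitialized */
	/* the alternate ("primed") section */
	uint32_t lba_prime, old_map_prime, new_map_prime, seq_prime;
	uint32_t pad[8];	/* pad 32 -> 64 bytes: no cache line sharing */
};

/* next sequence number in the 01 -> 10 -> 11 -> 01 cycle */
static uint32_t nd_inc_seq(uint32_t seq)
{
	static const uint32_t next[] = { 0, 2, 3, 1 };

	return next[seq & 3];
}
```

The newer of the two sections is the one whose sequence number is "ahead" in this cycle, which is why writing the sequence number last makes the section update atomic.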


c. The concept of lanes
-----------------------

While 'nfree' describes the number of IOs an arena can process concurrently,
'nlanes' is the number of IOs the BTT device as a whole can process::

	nlanes = min(nfree, num_cpus)

A lane number is obtained at the start of any IO, and is used for indexing into
all the on-disk and in-memory data structures for the duration of the IO. If
there are more CPUs than the max number of available lanes, then lanes are
protected by spinlocks.


d. In-memory data structure: Read Tracking Table (RTT)
------------------------------------------------------

Consider a case where we have two threads, one doing reads and the other,
writes. We can hit a condition where the writer thread grabs a free block to do
a new IO, but the (slow) reader thread is still reading from it. In other words,
the reader consulted a map entry, and started reading the corresponding block. A
writer started writing to the same external LBA, and finished the write updating
the map for that external LBA to point to its new postmap ABA. At this point the
internal, postmap block that the reader is (still) reading has been inserted
into the list of free blocks. If another write comes in for the same LBA, it can
grab this free block, and start writing to it, causing the reader to read
incorrect data. To prevent this, we introduce the RTT.

The RTT is a simple, per arena table with 'nfree' entries. Every reader inserts
into rtt[lane_number], the postmap ABA it is reading, and clears it after the
read is complete. Every writer thread, after grabbing a free block, checks the
RTT for its presence. If the postmap free block is in the RTT, it waits till the
reader clears the RTT entry, and only then starts writing to it.
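The RTT protocol can be sketched as below. This is a single-threaded toy model with invented names; a real implementation needs atomic operations and memory barriers between the publish, the map read, and the data read:

```c
#include <assert.h>
#include <stdint.h>

#define RTT_VALID	(1u << 31)	/* assumed "entry in use" marker */
#define NFREE		4

static uint32_t rtt[NFREE];		/* one slot per lane */

/* reader: publish the postmap ABA before touching the data block */
static void rtt_enter(unsigned int lane, uint32_t postmap_aba)
{
	rtt[lane] = RTT_VALID | postmap_aba;
}

/* reader: done with the data block */
static void rtt_clear(unsigned int lane)
{
	rtt[lane] = 0;
}

/* writer: is some reader still using the free block we just grabbed? */
static int rtt_busy(uint32_t free_block)
{
	unsigned int i;

	for (i = 0; i < NFREE; i++)
		if (rtt[i] == (RTT_VALID | free_block))
			return 1;	/* spin/wait until the reader clears it */
	return 0;
}
```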


e. In-memory data structure: map locks
--------------------------------------

Consider a case where two writer threads are writing to the same LBA. There can
be a race in the following sequence of steps::

	free[lane] = map[premap_aba]
	map[premap_aba] = postmap_aba

Both threads can update their respective free[lane] with the same old, freed
postmap_aba. This has made the layout inconsistent by losing a free entry, and
at the same time, duplicating another free entry for two lanes.

To solve this, we could have a single map lock (per arena) that has to be taken
before performing the above sequence, but we feel that could be too contentious.
Instead we use an array of (nfree) map_locks that is indexed by
(premap_aba modulo nfree).
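The lock selection is a simple hash of the premap ABA; an illustrative sketch:

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t map_lock_index(uint64_t premap_aba, uint32_t nfree)
{
	/*
	 * Two writers racing on the same premap_aba hash to the same lock,
	 * so the read-modify-write of the map entry becomes one atomic step:
	 *
	 *	lock(map_locks[map_lock_index(premap_aba, nfree)]);
	 *	free[lane] = map[premap_aba];
	 *	map[premap_aba] = postmap_aba;
	 *	unlock(map_locks[map_lock_index(premap_aba, nfree)]);
	 */
	return premap_aba % nfree;
}
```

Unrelated premap ABAs usually hash to different locks, so this keeps contention close to the uncontended single-writer case.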


f. Reconstruction from the Flog
-------------------------------

On startup, we analyze the BTT flog to create our list of free blocks. We walk
through all the entries, and for each lane, of the set of two possible
'sections', we always look at the most recent one only (based on the sequence
number). The reconstruction rules/steps are simple:

- Read map[log_entry.lba].
- If log_entry.new matches the map entry, then log_entry.old is free.
- If log_entry.new does not match the map entry, then log_entry.new is free.
  (This case can only be caused by power-fails/unsafe shutdowns)
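The per-lane rule reduces to a single comparison; a sketch with an illustrative helper name:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Given the newest flog section for a lane and the current map entry for
 * its premap lba, decide which block goes on the free list.
 */
static uint32_t btt_freelist_block(uint32_t map_entry,
				   uint32_t log_old, uint32_t log_new)
{
	if (map_entry == log_new)
		return log_old;	/* write completed: the old block is free */
	/* the map update never landed (power fail): the new block is free */
	return log_new;
}
```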


g. Summarizing - Read and Write flows
-------------------------------------

Read:

1. Convert external LBA to arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Read map to get the entry for this pre-map ABA
4. Enter post-map ABA into RTT[lane]
5. If TRIM flag set in map, return zeroes, and end IO (go to step 8)
6. If ERROR flag set in map, end IO with EIO (go to step 8)
7. Read data from this block
8. Remove post-map ABA entry from RTT[lane]
9. Release lane (and lane_lock)

Write:

1. Convert external LBA to Arena number + pre-map ABA
2. Get a lane (and take lane_lock)
3. Use lane to index into in-memory free list and obtain a new block, next flog
   index, next sequence number
4. Scan the RTT to check if free block is present, and spin/wait if it is.
5. Write data to this free block
6. Read map to get the existing post-map ABA entry for this pre-map ABA
7. Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq_num]
8. Write new post-map ABA into map.
9. Write old post-map entry into the free list
10. Calculate next sequence number and write into the free list entry
11. Release lane (and lane_lock)
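Steps 3 through 10 of the write path above can be modeled in a few lines. This is a toy in-memory model with invented names; persistence ordering, locking, RTT handling, and the data copy are all elided:

```c
#include <assert.h>
#include <stdint.h>

#define NFREE	2
#define NBLOCKS	8

static uint32_t map[NBLOCKS];		/* premap ABA -> postmap ABA */
static uint32_t free_list[NFREE];	/* per-lane next free block */
static struct {
	uint32_t lba, old_map, new_map;	/* one flog section per lane */
} flog[NFREE];

static void btt_write(unsigned int lane, uint32_t premap, const void *data)
{
	uint32_t new_block = free_list[lane];	/* step 3: take free block */
	uint32_t old_block;

	/* step 4: scan RTT for new_block (elided) */
	/* step 5: write data to new_block (elided) */
	(void)data;
	old_block = map[premap];		/* step 6: read old mapping */
	flog[lane].lba = premap;		/* step 7: log the transition */
	flog[lane].old_map = old_block;
	flog[lane].new_map = new_block;
	map[premap] = new_block;		/* step 8: update the map */
	free_list[lane] = old_block;		/* steps 9-10: recycle old block */
}
```

Note that the flog entry is written before the map, so after a crash the reconstruction rule in section f can always tell whether the map update landed.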


4. Error Handling
=================

An arena would be in an error state if any of the metadata is corrupted
irrecoverably, either due to a bug or a media error. The following conditions
indicate an error:

- Info block checksum does not match (and recovering from the copy also fails)
- All internal available blocks are not uniquely and entirely addressed by the
  sum of mapped blocks and free blocks (from the BTT flog).
- Rebuilding free list from the flog reveals missing/duplicate/impossible
  entries
- A map entry is out of bounds

If any of these error conditions are encountered, the arena is put into a read
only state using a flag in the info block.


5. Usage
========

The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
(pmem, or blk mode). The easiest way to set up such a namespace is using the
'ndctl' utility [1]:

For example, the ndctl command line to setup a btt with a 4k sector size is::

	ndctl create-namespace -f -e namespace0.0 -m sector -l 4k

See ndctl create-namespace --help for more options.

[1]: https://github.com/pmem/ndctl
diff --git a/Documentation/driver-api/nvdimm/index.rst b/Documentation/driver-api/nvdimm/index.rst
new file mode 100644
index 000000000000..19dc8ee371dc
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/index.rst
@@ -0,0 +1,10 @@
===================================
Non-Volatile Memory Device (NVDIMM)
===================================

.. toctree::
   :maxdepth: 1

   nvdimm
   btt
   security
diff --git a/Documentation/driver-api/nvdimm/nvdimm.rst b/Documentation/driver-api/nvdimm/nvdimm.rst
new file mode 100644
index 000000000000..08f855cbb4e6
--- /dev/null
+++ b/Documentation/driver-api/nvdimm/nvdimm.rst
@@ -0,0 +1,887 @@
===============================
LIBNVDIMM: Non-Volatile Devices
===============================

libnvdimm - kernel / libndctl - userspace helper library

linux-nvdimm@lists.01.org

Version 13

.. contents:

	Glossary
	Overview
	Supporting Documents
	Git Trees
	LIBNVDIMM PMEM and BLK
	Why BLK?
	PMEM vs BLK
	BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
	Example NVDIMM Platform
	LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
	LIBNDCTL: Context
	libndctl: instantiate a new library context example
	LIBNVDIMM/LIBNDCTL: Bus
	libnvdimm: control class device in /sys/class
	libnvdimm: bus
	libndctl: bus enumeration example
	LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
	libnvdimm: DIMM (NMEM)
	libndctl: DIMM enumeration example
	LIBNVDIMM/LIBNDCTL: Region
	libnvdimm: region
	libndctl: region enumeration example
	Why Not Encode the Region Type into the Region Name?
	How Do I Determine the Major Type of a Region?
	LIBNVDIMM/LIBNDCTL: Namespace
	libnvdimm: namespace
	libndctl: namespace enumeration example
	libndctl: namespace creation example
	Why the Term "namespace"?
	LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
	libnvdimm: btt layout
	libndctl: btt creation example
	Summary LIBNDCTL Diagram


Glossary
========

PMEM:
  A system-physical-address range where writes are persistent. A
  block device composed of PMEM is capable of DAX. A PMEM address range
  may span an interleave of several DIMMs.

BLK:
  A set of one or more programmable memory mapped apertures provided
  by a DIMM to access its media. This indirection precludes the
  performance benefit of interleaving, but enables DIMM-bounded failure
  modes.

DPA:
  DIMM Physical Address, is a DIMM-relative offset. With one DIMM in
  the system there would be a 1:1 system-physical-address:DPA association.
  Once more DIMMs are added a memory controller interleave must be
  decoded to determine the DPA associated with a given
  system-physical-address. BLK capacity always has a 1:1 relationship
  with a single-DIMM's DPA range.

DAX:
  File system extensions to bypass the page cache and block layer to
  mmap persistent memory, from a PMEM block device, directly into a
  process address space.

DSM:
  Device Specific Method: ACPI method to control a specific
  device - in this case the firmware.

DCR:
  NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
  It defines a vendor-id, device-id, and interface format for a given DIMM.

BTT:
  Block Translation Table: Persistent memory is byte addressable.
  Existing software may have an expectation that the power-fail-atomicity
  of writes is at least one sector, 512 bytes. The BTT is an indirection
  table with atomic update semantics to front a PMEM/BLK block device
  driver and present arbitrary atomic sector sizes.

LABEL:
  Metadata stored on a DIMM device that partitions and identifies
  (persistently names) storage between PMEM and BLK. It also partitions
  BLK storage to host BTTs with different parameters per BLK-partition.
  Note that traditional partition tables, GPT/MBR, are layered on top of a
  BLK or PMEM device.


Overview
========

The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
and BLK mode access. These three modes of operation are described by
the "NVDIMM Firmware Interface Table" (NFIT) in ACPI 6. While the LIBNVDIMM
implementation is generic and supports pre-NFIT platforms, it was guided
by the superset of capabilities needed to support this ACPI 6 definition
for NVDIMM resources. The bulk of the kernel implementation is in place
to handle the case where DPA accessible via PMEM is aliased with DPA
accessible via BLK. When that occurs a LABEL is needed to reserve DPA
for exclusive access via one mode at a time.

Supporting Documents
--------------------

ACPI 6:
  http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
NVDIMM Namespace:
  http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
DSM Interface Example:
  http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
Driver Writer's Guide:
  http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf

Git Trees
---------

LIBNVDIMM:
  https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
LIBNDCTL:
  https://github.com/pmem/ndctl.git
PMEM:
  https://github.com/01org/prd


LIBNVDIMM PMEM and BLK
======================

Prior to the arrival of the NFIT, non-volatile memory was described to a
system in various ad-hoc ways. Usually only the bare minimum was
provided, namely, a single system-physical-address range where writes
are expected to be durable after a system power loss. Now, the NFIT
specification standardizes not only the description of PMEM, but also
BLK and platform message-passing entry points for control and
configuration.

For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
device driver:

1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
   range is contiguous in system memory and may be interleaved (hardware
   memory controller striped) across multiple DIMMs. When interleaved the
   platform may optionally provide details of which DIMMs are participating
   in the interleave.

   Note that while LIBNVDIMM describes system-physical-address ranges that may
   alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
   alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
   distinction. The different device-types are an implementation detail
   that userspace can exploit to implement policies like "only interface
   with address ranges from certain DIMMs". It is worth noting that when
   aliasing is present and a DIMM lacks a label, then no block device can
   be created by default as userspace needs to do at least one allocation
   of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
   registered, can be immediately attached to nd_pmem.

2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
   defined apertures. A set of apertures will access just one DIMM.
   Multiple windows (apertures) allow multiple concurrent accesses, much like
   tagged-command-queuing, and would likely be used by different threads or
   different CPUs.

   The NFIT specification defines a standard format for a BLK-aperture, but
   the spec also allows for vendor specific layouts, and non-NFIT BLK
   implementations may have other designs for BLK I/O. For this reason
   "nd_blk" calls back into platform-specific code to perform the I/O.

   One such implementation is defined in the "Driver Writer's Guide" and "DSM
   Interface Example".


Why BLK?
========

While PMEM provides direct byte-addressable CPU-load/store access to
NVDIMM storage, it does not provide the best system RAS (recovery,
availability, and serviceability) model. An access to a corrupted
system-physical-address address causes a CPU exception while an access
to a corrupted address through a BLK-aperture causes that block window
to raise an error status in a register. The latter is more aligned with
the standard error model that host-bus-adapter attached disks present.

Also, if an administrator ever wants to replace memory it is easier to
service a system at DIMM module boundaries. Compare this to PMEM where
data could be interleaved in an opaque hardware specific manner across
several DIMMs.

PMEM vs BLK
-----------

BLK-apertures solve these RAS problems, but their presence is also the
major contributing factor to the complexity of the ND subsystem. They
complicate the implementation because PMEM and BLK alias in DPA space.
Any given DIMM's DPA-range may contribute to one or more
system-physical-address sets of interleaved DIMMs, *and* may also be
accessed in its entirety through its BLK-aperture. Accessing a DPA
through a system-physical-address while simultaneously accessing the
same DPA through a BLK-aperture has undefined results. For this reason,
DIMMs with this dual interface configuration include a DSM function to
store/retrieve a LABEL. The LABEL effectively partitions the DPA-space
into exclusive system-physical-address and BLK-aperture accessible
regions. For simplicity a DIMM is allowed a PMEM "region" per each
interleave set in which it is a member. The remaining DPA space can be
carved into an arbitrary number of BLK devices with discontiguous
extents.

BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

One of the few reasons to allow multiple BLK namespaces per REGION is so
that each BLK-namespace can be configured with a BTT with unique atomic
sector sizes. While a PMEM device can host a BTT the LABEL specification
does not provide for a sector size to be specified for a PMEM namespace.

This is due to the expectation that the primary usage model for PMEM is
via DAX, and the BTT is incompatible with DAX. However, for the cases
where an application or filesystem still needs atomic sector update
guarantees it can register a BTT on a PMEM device or partition. See
LIBNVDIMM/NDCTL: Block Translation Table "btt"


Example NVDIMM Platform
=======================

For the remainder of this document the following diagram will be
referenced for any example sysfs layouts::


                               (a)               (b)         DIMM   BLK-REGION
            +-------------------+--------+--------+--------+
  +------+  |       pm0.0       | blk2.0 | pm1.0  | blk2.1 |    0    region2
  | imc0 +--+- - - region0- - - +--------+        +--------+
  +--+---+  |       pm0.0       | blk3.0 | pm1.0  | blk3.1 |    1    region3
     |      +-------------------+--------v        v--------+
  +--+---+                               |                 |
  | cpu0 |                                     region1
  +--+---+                               |                 |
     |      +----------------------------^        ^--------+
  +--+---+  |           blk4.0           | pm1.0  | blk4.0 |    2    region4
  | imc1 +--+----------------------------|        +--------+
  +------+  |           blk5.0           | pm1.0  | blk5.0 |    3    region5
            +----------------------------+--------+--------+

In this platform we have four DIMMs and two memory controllers in one
socket. Each unique interface (BLK or PMEM) to DPA space is identified
by a region device with a dynamically assigned id (REGION0 - REGION5).

1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
   single PMEM namespace is created in the REGION0-SPA-range that spans most
   of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
   interleaved system-physical-address range is reclaimed as BLK-aperture
   accessed space starting at DPA-offset (a) into each DIMM. In that
   reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
   REGION3 where "blk2.0" and "blk3.0" are just human readable names that
   could be set to any user-desired name in the LABEL.

2. In the last portion of DIMM0 and DIMM1 we have an interleaved
   system-physical-address range, REGION1, that spans those two DIMMs as
   well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
   named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
   each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
   "blk5.0".

3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
   interleaved system-physical-address range (i.e. the DPA address past
   offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
   Note, that this example shows that BLK-aperture namespaces don't need to
   be contiguous in DPA-space.

This bus is provided by the kernel under the device
/sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
the nfit_test.ko module is loaded. This not only tests LIBNVDIMM but the
acpi_nfit.ko driver as well.


LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
========================================================

What follows is a description of the LIBNVDIMM sysfs layout and a
corresponding object hierarchy diagram as viewed through the LIBNDCTL
API. The example sysfs paths and diagrams are relative to the Example
NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
test.

LIBNDCTL: Context
-----------------

Every API call in the LIBNDCTL library requires a context that holds the
logging parameters and other library instance state. The library is
based on the libabc template:

https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git

LIBNDCTL: instantiate a new library context example
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

::

	struct ndctl_ctx *ctx;

	if (ndctl_new(&ctx) == 0)
		return ctx;
	else
		return NULL;

LIBNVDIMM/LIBNDCTL: Bus
-----------------------

A bus has a 1:1 relationship with an NFIT. The current expectation for
ACPI based systems is that there is only ever one platform-global NFIT.
That said, it is trivial to register multiple NFITs, the specification
does not preclude it. The infrastructure supports multiple busses and
we use this capability to test multiple NFIT configurations in the unit
test.

LIBNVDIMM: control class device in /sys/class
---------------------------------------------

This character device accepts DSM messages to be passed to a DIMM
identified by its NFIT handle::

	/sys/class/nd/ndctl0
	|-- dev
	|-- device -> ../../../ndbus0
	|-- subsystem -> ../../../../../../../class/nd

337 | |||
338 | |||
339 | LIBNVDIMM: bus | ||
340 | -------------- | ||
341 | |||
342 | :: | ||
343 | |||
344 | struct nvdimm_bus *nvdimm_bus_register(struct device *parent, | ||
345 | struct nvdimm_bus_descriptor *nfit_desc); | ||
346 | |||
347 | :: | ||
348 | |||
349 | /sys/devices/platform/nfit_test.0/ndbus0 | ||
350 | |-- commands | ||
351 | |-- nd | ||
352 | |-- nfit | ||
353 | |-- nmem0 | ||
354 | |-- nmem1 | ||
355 | |-- nmem2 | ||
356 | |-- nmem3 | ||
357 | |-- power | ||
358 | |-- provider | ||
359 | |-- region0 | ||
360 | |-- region1 | ||
361 | |-- region2 | ||
362 | |-- region3 | ||
363 | |-- region4 | ||
364 | |-- region5 | ||
365 | |-- uevent | ||
366 | `-- wait_probe | ||
367 | |||
368 | LIBNDCTL: bus enumeration example | ||
369 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
370 | |||
371 | Find the bus handle that describes the bus from Example NVDIMM Platform:: | ||
372 | |||
373 | static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, | ||
374 | const char *provider) | ||
375 | { | ||
376 | struct ndctl_bus *bus; | ||
377 | |||
378 | ndctl_bus_foreach(ctx, bus) | ||
379 | if (strcmp(provider, ndctl_bus_get_provider(bus)) == 0) | ||
380 | return bus; | ||
381 | |||
382 | return NULL; | ||
383 | } | ||
384 | |||
385 | bus = get_bus_by_provider(ctx, "nfit_test.0"); | ||
386 | |||
387 | |||
388 | LIBNVDIMM/LIBNDCTL: DIMM (NMEM) | ||
389 | ------------------------------- | ||
390 | |||
391 | The DIMM device provides a character device for sending commands to | ||
392 | hardware, and it is a container for LABELs. If the DIMM is defined by | ||
393 | NFIT then an optional 'nfit' attribute sub-directory is available to add | ||
394 | NFIT-specifics. | ||
395 | |||
396 | Note that the kernel device name for "DIMMs" is "nmemX". The NFIT | ||
397 | describes these devices via "Memory Device to System Physical Address | ||
398 | Range Mapping Structure", and there is no requirement that they actually | ||
399 | be physical DIMMs, so we use a more generic name. | ||
400 | |||
401 | LIBNVDIMM: DIMM (NMEM) | ||
402 | ^^^^^^^^^^^^^^^^^^^^^^ | ||
403 | |||
404 | :: | ||
405 | |||
406 | struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, | ||
407 | const struct attribute_group **groups, unsigned long flags, | ||
408 | unsigned long *dsm_mask); | ||
409 | |||
410 | :: | ||
411 | |||
412 | /sys/devices/platform/nfit_test.0/ndbus0 | ||
413 | |-- nmem0 | ||
414 | | |-- available_slots | ||
415 | | |-- commands | ||
416 | | |-- dev | ||
417 | | |-- devtype | ||
418 | | |-- driver -> ../../../../../bus/nd/drivers/nvdimm | ||
419 | | |-- modalias | ||
420 | | |-- nfit | ||
421 | | | |-- device | ||
422 | | | |-- format | ||
423 | | | |-- handle | ||
424 | | | |-- phys_id | ||
425 | | | |-- rev_id | ||
426 | | | |-- serial | ||
427 | | | `-- vendor | ||
428 | | |-- state | ||
429 | | |-- subsystem -> ../../../../../bus/nd | ||
430 | | `-- uevent | ||
431 | |-- nmem1 | ||
432 | [..] | ||
433 | |||
434 | |||
435 | LIBNDCTL: DIMM enumeration example | ||
436 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
437 | |||
438 | Note, in this example we are assuming NFIT-defined DIMMs, which are | ||
439 | identified by an "nfit_handle", a 32-bit value where: | ||
440 | |||
441 | - Bit 3:0 DIMM number within the memory channel | ||
442 | - Bit 7:4 memory channel number | ||
443 | - Bit 11:8 memory controller ID | ||
444 | - Bit 15:12 socket ID (within scope of a Node controller if node | ||
445 | controller is present) | ||
446 | - Bit 27:16 Node Controller ID | ||
447 | - Bit 31:28 Reserved | ||
448 | |||
449 | :: | ||
450 | |||
451 | static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, | ||
452 | unsigned int handle) | ||
453 | { | ||
454 | struct ndctl_dimm *dimm; | ||
455 | |||
456 | ndctl_dimm_foreach(bus, dimm) | ||
457 | if (ndctl_dimm_get_handle(dimm) == handle) | ||
458 | return dimm; | ||
459 | |||
460 | return NULL; | ||
461 | } | ||
462 | |||
463 | #define DIMM_HANDLE(n, s, i, c, d) \ | ||
464 | (((n & 0xfff) << 16) | ((s & 0xf) << 12) | ((i & 0xf) << 8) \ | ||
465 | | ((c & 0xf) << 4) | (d & 0xf)) | ||
466 | |||
467 | dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); | ||
468 | |||
469 | LIBNVDIMM/LIBNDCTL: Region | ||
470 | -------------------------- | ||
471 | |||
472 | A generic REGION device is registered for each PMEM range or BLK-aperture | ||
473 | set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture | ||
474 | sets on the "nfit_test.0" bus. The primary role of regions is to be a | ||
475 | container of "mappings". A mapping is a tuple of <DIMM, | ||
476 | DPA-start-offset, length>. | ||
477 | |||
478 | LIBNVDIMM provides a built-in driver for these REGION devices. This driver | ||
479 | is responsible for reconciling the aliased DPA mappings across all | ||
480 | regions, parsing the LABEL, if present, and then emitting NAMESPACE | ||
481 | devices with the resolved/exclusive DPA-boundaries for the nd_pmem or | ||
482 | nd_blk device driver to consume. | ||
483 | |||
484 | In addition to the generic attributes of "mappings", "interleave_ways", | ||
485 | and "size", the REGION device also exports some convenience attributes. | ||
486 | "nstype" indicates the integer type of namespace-device this region | ||
487 | emits, "devtype" duplicates the DEVTYPE variable stored by udev at the | ||
488 | 'add' event, "modalias" duplicates the MODALIAS variable stored by udev | ||
489 | at the 'add' event, and finally, the optional "spa_index" is provided in | ||
490 | the case where the region is defined by a SPA. | ||
491 | |||
492 | LIBNVDIMM: region:: | ||
493 | |||
494 | struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, | ||
495 | struct nd_region_desc *ndr_desc); | ||
496 | struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, | ||
497 | struct nd_region_desc *ndr_desc); | ||
498 | |||
499 | :: | ||
500 | |||
501 | /sys/devices/platform/nfit_test.0/ndbus0 | ||
502 | |-- region0 | ||
503 | | |-- available_size | ||
504 | | |-- btt0 | ||
505 | | |-- btt_seed | ||
506 | | |-- devtype | ||
507 | | |-- driver -> ../../../../../bus/nd/drivers/nd_region | ||
508 | | |-- init_namespaces | ||
509 | | |-- mapping0 | ||
510 | | |-- mapping1 | ||
511 | | |-- mappings | ||
512 | | |-- modalias | ||
513 | | |-- namespace0.0 | ||
514 | | |-- namespace_seed | ||
515 | | |-- numa_node | ||
516 | | |-- nfit | ||
517 | | | `-- spa_index | ||
518 | | |-- nstype | ||
519 | | |-- set_cookie | ||
520 | | |-- size | ||
521 | | |-- subsystem -> ../../../../../bus/nd | ||
522 | | `-- uevent | ||
523 | |-- region1 | ||
524 | [..] | ||
525 | |||
526 | LIBNDCTL: region enumeration example | ||
527 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
528 | |||
529 | Sample region retrieval routines based on NFIT-unique data like | ||
530 | "spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for | ||
531 | BLK:: | ||
532 | |||
533 | static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, | ||
534 | unsigned int spa_index) | ||
535 | { | ||
536 | struct ndctl_region *region; | ||
537 | |||
538 | ndctl_region_foreach(bus, region) { | ||
539 | if (ndctl_region_get_type(region) != ND_DEVICE_REGION_PMEM) | ||
540 | continue; | ||
541 | if (ndctl_region_get_spa_index(region) == spa_index) | ||
542 | return region; | ||
543 | } | ||
544 | return NULL; | ||
545 | } | ||
546 | |||
547 | static struct ndctl_region *get_blk_region_by_dimm_handle(struct ndctl_bus *bus, | ||
548 | unsigned int handle) | ||
549 | { | ||
550 | struct ndctl_region *region; | ||
551 | |||
552 | ndctl_region_foreach(bus, region) { | ||
553 | struct ndctl_mapping *map; | ||
554 | |||
555 | if (ndctl_region_get_type(region) != ND_DEVICE_REGION_BLOCK) | ||
556 | continue; | ||
557 | ndctl_mapping_foreach(region, map) { | ||
558 | struct ndctl_dimm *dimm = ndctl_mapping_get_dimm(map); | ||
559 | |||
560 | if (ndctl_dimm_get_handle(dimm) == handle) | ||
561 | return region; | ||
562 | } | ||
563 | } | ||
564 | return NULL; | ||
565 | } | ||
566 | |||
567 | |||
568 | Why Not Encode the Region Type into the Region Name? | ||
569 | ---------------------------------------------------- | ||
570 | |||
571 | At first glance it seems that, since NFIT defines just the PMEM and BLK | ||
572 | interface types, we should simply name REGION devices with something derived | ||
573 | from those type names. However, the ND subsystem explicitly keeps the | ||
574 | REGION name generic and expects userspace to always consider the | ||
575 | region-attributes for four reasons: | ||
576 | |||
577 | 1. There are already more than two REGION and "namespace" types. For | ||
578 | PMEM there are two subtypes. As mentioned previously, we have PMEM where | ||
579 | the constituent DIMM devices are known, and anonymous PMEM. For BLK | ||
580 | regions the NFIT specification already anticipates vendor specific | ||
581 | implementations. The exact distinction of what a region contains is in | ||
582 | the region-attributes not the region-name or the region-devtype. | ||
583 | |||
584 | 2. A region with zero child-namespaces is a possible configuration. For | ||
585 | example, the NFIT allows for a DCR to be published without a | ||
586 | corresponding BLK-aperture. This equates to a DIMM that can only accept | ||
587 | control/configuration messages, but no i/o through a descendant block | ||
588 | device. Again, this "type" is advertised in the attributes ('mappings' | ||
589 | == 0) and the name does not tell you much. | ||
590 | |||
591 | 3. What if a third major interface type arises in the future? Outside | ||
592 | of vendor specific implementations, it's not difficult to envision a | ||
593 | third class of interface type beyond BLK and PMEM. With a generic name | ||
594 | for the REGION level of the device-hierarchy old userspace | ||
595 | implementations can still make sense of new kernel advertised | ||
596 | region-types. Userspace can always rely on the generic region | ||
597 | attributes like "mappings", "size", etc., and the expected child devices | ||
598 | named "namespace". This generic format of the device-model hierarchy | ||
599 | allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and | ||
600 | future-proof. | ||
601 | |||
602 | 4. There are more robust mechanisms for determining the major type of a | ||
603 | region than a device name. See the next section, How Do I Determine the | ||
604 | Major Type of a Region? | ||
605 | |||
606 | How Do I Determine the Major Type of a Region? | ||
607 | ---------------------------------------------- | ||
608 | |||
609 | Outside of the blanket recommendation of "use libndctl", or simply | ||
610 | looking at the kernel header (/usr/include/linux/ndctl.h) to decode the | ||
611 | "nstype" integer attribute, here are some other options. | ||
612 | |||
613 | 1. module alias lookup | ||
614 | ^^^^^^^^^^^^^^^^^^^^^^ | ||
615 | |||
616 | The whole point of region/namespace device type differentiation is to | ||
617 | decide which block-device driver will attach to a given LIBNVDIMM namespace. | ||
618 | One can simply use the modalias to look up the resulting module. It's | ||
619 | important to note that this method is robust in the presence of a | ||
620 | vendor-specific driver down the road. If a vendor-specific | ||
621 | implementation wants to supplant the standard nd_blk driver it can with | ||
622 | minimal impact to the rest of LIBNVDIMM. | ||
623 | |||
624 | In fact, a vendor may also want to have a vendor-specific region-driver | ||
625 | (outside of nd_region). For example, if a vendor defined its own LABEL | ||
626 | format it would need its own region driver to parse that LABEL and emit | ||
627 | the resulting namespaces. The output from module resolution is more | ||
628 | accurate than a region-name or region-devtype. | ||
629 | |||
630 | 2. udev | ||
631 | ^^^^^^^ | ||
632 | |||
633 | The kernel "devtype" is registered in the udev database:: | ||
634 | |||
635 | # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 | ||
636 | P: /devices/platform/nfit_test.0/ndbus0/region0 | ||
637 | E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 | ||
638 | E: DEVTYPE=nd_pmem | ||
639 | E: MODALIAS=nd:t2 | ||
640 | E: SUBSYSTEM=nd | ||
641 | |||
642 | # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 | ||
643 | P: /devices/platform/nfit_test.0/ndbus0/region4 | ||
644 | E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 | ||
645 | E: DEVTYPE=nd_blk | ||
646 | E: MODALIAS=nd:t3 | ||
647 | E: SUBSYSTEM=nd | ||
648 | |||
649 | ...and is available as a region attribute, but keep in mind that the | ||
650 | "devtype" does not indicate sub-type variations, and scripts should | ||
651 | really examine the other attributes. | ||
652 | |||
653 | 3. type specific attributes | ||
654 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
655 | |||
656 | As it currently stands a BLK-aperture region will never have a | ||
657 | "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A | ||
658 | BLK region with a "mappings" value of 0 is, as mentioned above, a DIMM | ||
659 | that does not allow I/O. A PMEM region with a "mappings" value of zero | ||
660 | is a simple system-physical-address range. | ||
661 | |||
662 | |||
663 | LIBNVDIMM/LIBNDCTL: Namespace | ||
664 | ----------------------------- | ||
665 | |||
666 | A REGION, after resolving DPA aliasing and LABEL specified boundaries, | ||
667 | surfaces one or more "namespace" devices. The arrival of a "namespace" | ||
668 | device currently triggers either the nd_blk or nd_pmem driver to load | ||
669 | and register a disk/block device. | ||
670 | |||
671 | LIBNVDIMM: namespace | ||
672 | ^^^^^^^^^^^^^^^^^^^^ | ||
673 | |||
674 | Here is a sample layout from the three major types of NAMESPACE where | ||
675 | namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid' | ||
676 | attribute), namespace2.0 represents a BLK namespace (note that it has a | ||
677 | 'sector_size' attribute), and namespace6.0 represents an anonymous | ||
678 | PMEM namespace (note that it has no 'uuid' attribute since it does not | ||
679 | support a LABEL):: | ||
680 | |||
681 | /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0 | ||
682 | |-- alt_name | ||
683 | |-- devtype | ||
684 | |-- dpa_extents | ||
685 | |-- force_raw | ||
686 | |-- modalias | ||
687 | |-- numa_node | ||
688 | |-- resource | ||
689 | |-- size | ||
690 | |-- subsystem -> ../../../../../../bus/nd | ||
691 | |-- type | ||
692 | |-- uevent | ||
693 | `-- uuid | ||
694 | /sys/devices/platform/nfit_test.0/ndbus0/region2/namespace2.0 | ||
695 | |-- alt_name | ||
696 | |-- devtype | ||
697 | |-- dpa_extents | ||
698 | |-- force_raw | ||
699 | |-- modalias | ||
700 | |-- numa_node | ||
701 | |-- sector_size | ||
702 | |-- size | ||
703 | |-- subsystem -> ../../../../../../bus/nd | ||
704 | |-- type | ||
705 | |-- uevent | ||
706 | `-- uuid | ||
707 | /sys/devices/platform/nfit_test.1/ndbus1/region6/namespace6.0 | ||
708 | |-- block | ||
709 | | `-- pmem0 | ||
710 | |-- devtype | ||
711 | |-- driver -> ../../../../../../bus/nd/drivers/pmem | ||
712 | |-- force_raw | ||
713 | |-- modalias | ||
714 | |-- numa_node | ||
715 | |-- resource | ||
716 | |-- size | ||
717 | |-- subsystem -> ../../../../../../bus/nd | ||
718 | |-- type | ||
719 | `-- uevent | ||
720 | |||
721 | LIBNDCTL: namespace enumeration example | ||
722 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
723 | Namespaces are indexed relative to their parent region; see the example | ||
724 | below. These indexes are mostly static from boot to boot, but the | ||
725 | subsystem makes no guarantees in this regard. For a static namespace | ||
726 | identifier use its 'uuid' attribute. | ||
727 | |||
728 | :: | ||
729 | |||
730 | static struct ndctl_namespace | ||
731 | *get_namespace_by_id(struct ndctl_region *region, unsigned int id) | ||
732 | { | ||
733 | struct ndctl_namespace *ndns; | ||
734 | |||
735 | ndctl_namespace_foreach(region, ndns) | ||
736 | if (ndctl_namespace_get_id(ndns) == id) | ||
737 | return ndns; | ||
738 | |||
739 | return NULL; | ||
740 | } | ||
741 | |||
742 | LIBNDCTL: namespace creation example | ||
743 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
744 | |||
745 | Idle namespaces are automatically created by the kernel if a given | ||
746 | region has enough available capacity to create a new namespace. | ||
747 | Namespace instantiation involves finding an idle namespace and | ||
748 | configuring it. For the most part the setting of namespace attributes | ||
749 | can occur in any order; the only constraint is that 'uuid' must be set | ||
750 | before 'size'. This enables the kernel to track DPA allocations | ||
751 | internally with a static identifier:: | ||
752 | |||
753 | static int configure_namespace(struct ndctl_region *region, | ||
754 | struct ndctl_namespace *ndns, | ||
755 | struct namespace_parameters *parameters) | ||
756 | { | ||
757 | char devname[50]; | ||
758 | |||
759 | snprintf(devname, sizeof(devname), "namespace%d.%d", | ||
760 | ndctl_region_get_id(region), parameters->id); | ||
761 | |||
762 | ndctl_namespace_set_alt_name(ndns, devname); | ||
763 | /* 'uuid' must be set prior to setting size! */ | ||
764 | ndctl_namespace_set_uuid(ndns, parameters->uuid); | ||
765 | ndctl_namespace_set_size(ndns, parameters->size); | ||
766 | /* unlike pmem namespaces, blk namespaces have a sector size */ | ||
767 | if (parameters->lbasize) | ||
768 | ndctl_namespace_set_sector_size(ndns, parameters->lbasize); | ||
769 | return ndctl_namespace_enable(ndns); | ||
770 | } | ||
771 | |||
772 | |||
773 | Why the Term "namespace"? | ||
774 | ^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
775 | |||
776 | 1. Why not "volume" for instance? "volume" ran the risk of confusing | ||
777 | ND (the libnvdimm subsystem) with a volume manager like device-mapper. | ||
778 | |||
779 | 2. The term originated to describe the sub-devices that can be created | ||
780 | within an NVMe controller (see the NVMe specification: | ||
781 | http://www.nvmexpress.org/specifications/), and NFIT namespaces are | ||
782 | meant to parallel the capabilities and configurability of | ||
783 | NVME-namespaces. | ||
784 | |||
785 | |||
786 | LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" | ||
787 | ------------------------------------------------- | ||
788 | |||
789 | A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked | ||
790 | block device driver that fronts either the whole block device or a | ||
791 | partition of a block device emitted by either a PMEM or BLK NAMESPACE. | ||
792 | |||
793 | LIBNVDIMM: btt layout | ||
794 | ^^^^^^^^^^^^^^^^^^^^^ | ||
795 | |||
796 | Every region will start out with at least one BTT device which is the | ||
797 | seed device. To activate it set the "namespace", "uuid", and | ||
798 | "sector_size" attributes and then bind the device to the nd_pmem or | ||
799 | nd_blk driver depending on the region type:: | ||
800 | |||
801 | /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ | ||
802 | |-- namespace | ||
803 | |-- delete | ||
804 | |-- devtype | ||
805 | |-- modalias | ||
806 | |-- numa_node | ||
807 | |-- sector_size | ||
808 | |-- subsystem -> ../../../../../bus/nd | ||
809 | |-- uevent | ||
810 | `-- uuid | ||
811 | |||
812 | LIBNDCTL: btt creation example | ||
813 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
814 | |||
815 | Similar to namespaces, an idle BTT device is automatically created per | ||
816 | region. Each time this "seed" btt device is configured and enabled a new | ||
817 | seed is created. Creating a BTT configuration involves two steps: | ||
818 | finding an idle BTT and assigning it to consume a PMEM or BLK namespace:: | ||
819 | |||
820 | static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) | ||
821 | { | ||
822 | struct ndctl_btt *btt; | ||
823 | |||
824 | ndctl_btt_foreach(region, btt) | ||
825 | if (!ndctl_btt_is_enabled(btt) | ||
826 | && !ndctl_btt_is_configured(btt)) | ||
827 | return btt; | ||
828 | |||
829 | return NULL; | ||
830 | } | ||
831 | |||
832 | static int configure_btt(struct ndctl_region *region, | ||
833 | struct btt_parameters *parameters) | ||
834 | { | ||
835 | struct ndctl_btt *btt = get_idle_btt(region); | ||
836 | |||
837 | ndctl_btt_set_uuid(btt, parameters->uuid); | ||
838 | ndctl_btt_set_sector_size(btt, parameters->sector_size); | ||
839 | ndctl_btt_set_namespace(btt, parameters->ndns); | ||
840 | /* turn off raw mode device */ | ||
841 | ndctl_namespace_disable(parameters->ndns); | ||
842 | /* turn on btt access */ | ||
843 | return ndctl_btt_enable(btt); | ||
844 | } | ||
845 | |||
846 | Once instantiated, a new inactive btt seed device will appear underneath | ||
847 | the region. | ||
848 | |||
849 | Once a "namespace" is removed from a BTT that instance of the BTT device | ||
850 | will be deleted or otherwise reset to default values. This deletion is | ||
851 | only at the device model level. In order to destroy a BTT the "info | ||
852 | block" needs to be destroyed. Note that to destroy a BTT the media | ||
853 | needs to be written in raw mode. By default, the kernel will autodetect | ||
854 | the presence of a BTT and disable raw mode. This autodetect behavior | ||
855 | can be suppressed by enabling raw mode for the namespace via the | ||
856 | ndctl_namespace_set_raw_mode() API. | ||
857 | |||
858 | |||
859 | Summary LIBNDCTL Diagram | ||
860 | ------------------------ | ||
861 | |||
862 | For the given example above, here is the view of the objects as seen by the | ||
863 | LIBNDCTL API:: | ||
864 | |||
865 | +---+ | ||
866 | |CTX| +---------+ +--------------+ +---------------+ | ||
867 | +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | | ||
868 | | | +---------+ +--------------+ +---------------+ | ||
869 | +-------+ | | +---------+ +--------------+ +---------------+ | ||
870 | | DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | | ||
871 | +-------+ | | | +---------+ +--------------+ +---------------+ | ||
872 | | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ | ||
873 | +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" | | ||
874 | | DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+ | ||
875 | +-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 | | ||
876 | | DIMM3 <-+ | +--------------+ +----------------------+ | ||
877 | +-------+ | +---------+ +--------------+ +---------------+ | ||
878 | +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" | | ||
879 | | +---------+ | +--------------+ +----------------------+ | ||
880 | | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 | | ||
881 | | +--------------+ +----------------------+ | ||
882 | | +---------+ +--------------+ +---------------+ | ||
883 | +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" | | ||
884 | | +---------+ +--------------+ +---------------+ | ||
885 | | +---------+ +--------------+ +----------------------+ | ||
886 | +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 | | ||
887 | +---------+ +--------------+ +---------------+------+ | ||
diff --git a/Documentation/driver-api/nvdimm/security.rst b/Documentation/driver-api/nvdimm/security.rst new file mode 100644 index 000000000000..ad9dea099b34 --- /dev/null +++ b/Documentation/driver-api/nvdimm/security.rst | |||
@@ -0,0 +1,143 @@ | |||
1 | =============== | ||
2 | NVDIMM Security | ||
3 | =============== | ||
4 | |||
5 | 1. Introduction | ||
6 | --------------- | ||
7 | |||
8 | The Intel Device Specific Methods (DSM) specification v1.8 [1] introduced | ||
9 | a set of security DSMs: "get security state", "set passphrase", | ||
10 | "disable passphrase", "unlock unit", "freeze lock", "secure erase", and | ||
11 | "overwrite". A security_ops | ||
12 | data structure has been added to struct dimm in order to support the security | ||
13 | operations, and generic APIs are exposed to allow vendor-neutral operations. | ||
14 | |||
15 | 2. Sysfs Interface | ||
16 | ------------------ | ||
17 | The "security" sysfs attribute is provided in the nvdimm sysfs directory. For | ||
18 | example: | ||
19 | /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/security | ||
20 | |||
21 | Reading that attribute ("show") will display the security state for | ||
22 | that DIMM. The following states are available: disabled, unlocked, locked, | ||
23 | frozen, and overwrite. If security is not supported, the sysfs attribute | ||
24 | will not be visible. | ||
25 | |||
26 | The "store" side of the attribute accepts several commands when written | ||
27 | to, in order to support the security functionalities: | ||
28 | update <old_keyid> <new_keyid> - enable or update passphrase. | ||
29 | disable <keyid> - disable enabled security and remove key. | ||
30 | freeze - freeze changing of security states. | ||
31 | erase <keyid> - delete existing user encryption key. | ||
32 | overwrite <keyid> - wipe the entire nvdimm. | ||
33 | master_update <keyid> <new_keyid> - enable or update master passphrase. | ||
34 | master_erase <keyid> - delete existing user encryption key. | ||
35 | |||
36 | 3. Key Management | ||
37 | ----------------- | ||
38 | |||
39 | The key is associated with the payload by the DIMM id. For example: | ||
40 | # cat /sys/devices/LNXSYSTM:00/LNXSYBUS:00/ACPI0012:00/ndbus0/nmem0/nfit/id | ||
41 | 8089-a2-1740-00000133 | ||
42 | The DIMM id would be provided along with the key payload (passphrase) to | ||
43 | the kernel. | ||
44 | |||
45 | The security keys are managed on the basis of a single key per DIMM. The | ||
46 | key "passphrase" is expected to be 32 bytes long. This is similar to the ATA | ||
47 | security specification [2]. A key is initially acquired via the request_key() | ||
48 | kernel API call during nvdimm unlock. It is up to the user to make sure that | ||
49 | all the keys are in the kernel user keyring for unlock. | ||
50 | |||
51 | An nvdimm encrypted-key of format enc32 has the description format of: | ||
52 | nvdimm:<bus-provider-specific-unique-id> | ||
53 | |||
54 | See file ``Documentation/security/keys/trusted-encrypted.rst`` for creating | ||
55 | encrypted-keys of enc32 format. TPM usage with a master trusted key is | ||
56 | preferred for sealing the encrypted-keys. | ||
57 | |||
58 | 4. Unlocking | ||
59 | ------------ | ||
60 | When the DIMMs are being enumerated by the kernel, the kernel will attempt to | ||
61 | retrieve the key from the kernel user keyring. This is the only time | ||
62 | a locked DIMM can be unlocked. Once unlocked, the DIMM will remain unlocked | ||
63 | until reboot. Typically an entity (e.g. a shell script) will inject all the | ||
64 | relevant encrypted-keys into the kernel user keyring during the initramfs phase. | ||
65 | This provides the unlock function access to all the related keys that contain | ||
66 | the passphrase for the respective nvdimms. It is also recommended that the | ||
67 | keys are injected before libnvdimm is loaded by modprobe. | ||
68 | |||
69 | 5. Update | ||
70 | --------- | ||
71 | When doing an update, it is expected that the existing key is removed from | ||
72 | the kernel user keyring and reinjected as a different (old) key. It's irrelevant | ||
73 | what the key description is for the old key since we are only interested in the | ||
74 | keyid when doing the update operation. It is also expected that the new key | ||
75 | is injected with the description format described earlier in this | ||
76 | document. The update command written to the sysfs attribute has the | ||
77 | format: | ||
78 | update <old keyid> <new keyid> | ||
79 | |||
80 | If there is no old keyid because security is being enabled for the | ||
81 | first time, then a 0 should be passed in. | ||
82 | |||
83 | 6. Freeze | ||
84 | --------- | ||
85 | The freeze operation does not require any keys. The security config can be | ||
86 | frozen by a user with root privilege. | ||
87 | |||
88 | 7. Disable | ||
89 | ---------- | ||
90 | The security disable command format is: | ||
91 | disable <keyid> | ||
92 | |||
93 | A key with the current passphrase payload that is tied to the nvdimm should be | ||
94 | in the kernel user keyring. | ||
95 | |||
96 | 8. Secure Erase | ||
97 | --------------- | ||
98 | The command format for doing a secure erase is: | ||
99 | erase <keyid> | ||
100 | |||
101 | A key with the current passphrase payload that is tied to the nvdimm should be | ||
102 | in the kernel user keyring. | ||
103 | |||
104 | 9. Overwrite | ||
105 | ------------ | ||
106 | The command format for doing an overwrite is: | ||
107 | overwrite <keyid> | ||
108 | |||
109 | Overwrite can be done without a key if security is not enabled. A key serial | ||
110 | of 0 can be passed in to indicate no key. | ||
111 | |||
112 | The sysfs attribute "security" can be polled to wait on overwrite completion. | ||
113 | Overwrite can last tens of minutes or more depending on nvdimm size. | ||
114 | |||
115 | An encrypted-key with the current user passphrase that is tied to the nvdimm | ||
116 | should be injected and its keyid should be passed in via sysfs. | ||
117 | |||
118 | 10. Master Update | ||
119 | ----------------- | ||
120 | The command format for doing a master update is: | ||
121 | master_update <old keyid> <new keyid> | ||
122 | |||
123 | The operating mechanism for master update is identical to update except the | ||
124 | master passphrase key is passed to the kernel. The master passphrase key | ||
125 | is just another encrypted-key. | ||
126 | |||
127 | This command is only available when security is disabled. | ||
128 | |||
129 | 11. Master Erase | ||
130 | ---------------- | ||
131 | The command format for doing a master erase is: | ||
132 | master_erase <current keyid> | ||
133 | |||
134 | This command has the same operating mechanism as erase except the master | ||
135 | passphrase key is passed to the kernel. The master passphrase key is just | ||
136 | another encrypted-key. | ||
137 | |||
138 | This command is only available when the master security is enabled, indicated | ||
139 | by the extended security status. | ||
140 | |||
141 | [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf | ||
142 | |||
143 | [2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf | ||