author Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 2019-04-18 13:21:14 -0400
committer Mauro Carvalho Chehab <mchehab+samsung@kernel.org> 2019-07-15 08:20:25 -0400
commit b0a4aa950c68b5010831ecfc450510c64e4d80ba
tree cd8bcde377e0a9c2cdee158b7cc9b871c9cef8d7
parent 6e58e2d81367308ffd891bd0b34d47e9104e7ae4
docs: nvdimm: convert to ReST
Rename the nvdimm documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system.

At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings.

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Acked-by: Dan Williams <dan.j.williams@intel.com>
-rw-r--r-- Documentation/nvdimm/btt.rst (renamed from Documentation/nvdimm/btt.txt) 144
-rw-r--r-- Documentation/nvdimm/index.rst 12
-rw-r--r-- Documentation/nvdimm/nvdimm.rst (renamed from Documentation/nvdimm/nvdimm.txt) 526
-rw-r--r-- Documentation/nvdimm/security.rst (renamed from Documentation/nvdimm/security.txt) 4
-rw-r--r-- drivers/nvdimm/Kconfig 2
5 files changed, 393 insertions, 295 deletions
diff --git a/Documentation/nvdimm/btt.txt b/Documentation/nvdimm/btt.rst
index e293fb664924..2d8269f834bd 100644
--- a/Documentation/nvdimm/btt.txt
+++ b/Documentation/nvdimm/btt.rst
@@ -1,9 +1,10 @@
1=============================
1BTT - Block Translation Table 2BTT - Block Translation Table
2============================= 3=============================
3 4
4 5
51. Introduction 61. Introduction
6--------------- 7===============
7 8
8Persistent memory based storage is able to perform IO at byte (or more 9Persistent memory based storage is able to perform IO at byte (or more
9accurately, cache line) granularity. However, we often want to expose such 10accurately, cache line) granularity. However, we often want to expose such
@@ -25,7 +26,7 @@ provides atomic sector updates.
25 26
26 27
272. Static Layout 282. Static Layout
28---------------- 29================
29 30
30The underlying storage on which a BTT can be laid out is not limited in any way. 31The underlying storage on which a BTT can be laid out is not limited in any way.
31The BTT, however, splits the available space into chunks of up to 512 GiB, 32The BTT, however, splits the available space into chunks of up to 512 GiB,
@@ -33,43 +34,43 @@ called "Arenas".
33 34
34Each arena follows the same layout for its metadata, and all references in an 35Each arena follows the same layout for its metadata, and all references in an
35arena are internal to it (with the exception of one field that points to the 36arena are internal to it (with the exception of one field that points to the
36next arena). The following depicts the "On-disk" metadata layout: 37next arena). The following depicts the "On-disk" metadata layout::
37 38
38 39
39 Backing Store +-------> Arena 40 Backing Store +-------> Arena
40+---------------+ | +------------------+ 41 +---------------+ | +------------------+
41| | | | Arena info block | 42 | | | | Arena info block |
42| Arena 0 +---+ | 4K | 43 | Arena 0 +---+ | 4K |
43| 512G | +------------------+ 44 | 512G | +------------------+
44| | | | 45 | | | |
45+---------------+ | | 46 +---------------+ | |
46| | | | 47 | | | |
47| Arena 1 | | Data Blocks | 48 | Arena 1 | | Data Blocks |
48| 512G | | | 49 | 512G | | |
49| | | | 50 | | | |
50+---------------+ | | 51 +---------------+ | |
51| . | | | 52 | . | | |
52| . | | | 53 | . | | |
53| . | | | 54 | . | | |
54| | | | 55 | | | |
55| | | | 56 | | | |
56+---------------+ +------------------+ 57 +---------------+ +------------------+
57 | | 58 | |
58 | BTT Map | 59 | BTT Map |
59 | | 60 | |
60 | | 61 | |
61 +------------------+ 62 +------------------+
62 | | 63 | |
63 | BTT Flog | 64 | BTT Flog |
64 | | 65 | |
65 +------------------+ 66 +------------------+
66 | Info block copy | 67 | Info block copy |
67 | 4K | 68 | 4K |
68 +------------------+ 69 +------------------+
69 70
70 71
713. Theory of Operation 723. Theory of Operation
72---------------------- 73======================
73 74
74 75
75a. The BTT Map 76a. The BTT Map
@@ -79,31 +80,37 @@ The map is a simple lookup/indirection table that maps an LBA to an internal
79block. Each map entry is 32 bits. The two most significant bits are special 80block. Each map entry is 32 bits. The two most significant bits are special
80flags, and the remaining form the internal block number. 81flags, and the remaining form the internal block number.
81 82
83======== =============================================================
82Bit Description 84Bit Description
8331 - 30 : Error and Zero flags - Used in the following way: 85======== =============================================================
84 Bit Description 8631 - 30 Error and Zero flags - Used in the following way:
85 31 30
86 -----------------------------------------------------------------------
87 00 Initial state. Reads return zeroes; Premap = Postmap
88 01 Zero state: Reads return zeroes
89 10 Error state: Reads fail; Writes clear 'E' bit
90 11 Normal Block – has valid postmap
91 87
88 == == ====================================================
89 31 30 Description
90 == == ====================================================
91 0 0 Initial state. Reads return zeroes; Premap = Postmap
92 0 1 Zero state: Reads return zeroes
93 1 0 Error state: Reads fail; Writes clear 'E' bit
94 1 1 Normal Block – has valid postmap
95 == == ====================================================
92 96
9329 - 0 : Mappings to internal 'postmap' blocks 9729 - 0 Mappings to internal 'postmap' blocks
98======== =============================================================
94 99
95 100
96Some of the terminology that will be subsequently used: 101Some of the terminology that will be subsequently used:
97 102
98External LBA : LBA as made visible to upper layers. 103============ ================================================================
99ABA : Arena Block Address - Block offset/number within an arena 104External LBA LBA as made visible to upper layers.
100Premap ABA : The block offset into an arena, which was decided upon by range 105ABA Arena Block Address - Block offset/number within an arena
106Premap ABA The block offset into an arena, which was decided upon by range
101 checking the External LBA 107 checking the External LBA
102Postmap ABA : The block number in the "Data Blocks" area obtained after 108Postmap ABA The block number in the "Data Blocks" area obtained after
103 indirection from the map 109 indirection from the map
104nfree : The number of free blocks that are maintained at any given time. 110nfree The number of free blocks that are maintained at any given time.
105 This is the number of concurrent writes that can happen to the 111 This is the number of concurrent writes that can happen to the
106 arena. 112 arena.
113============ ================================================================
107 114
108 115
109For example, after adding a BTT, we surface a disk of 1024G. We get a read for 116For example, after adding a BTT, we surface a disk of 1024G. We get a read for
@@ -121,19 +128,21 @@ i.e. Every write goes to a "free" block. A running list of free blocks is
121maintained in the form of the BTT flog. 'Flog' is a combination of the words 128maintained in the form of the BTT flog. 'Flog' is a combination of the words
122"free list" and "log". The flog contains 'nfree' entries, and an entry contains: 129"free list" and "log". The flog contains 'nfree' entries, and an entry contains:
123 130
124lba : The premap ABA that is being written to 131======== =====================================================================
125old_map : The old postmap ABA - after 'this' write completes, this will be a 132lba The premap ABA that is being written to
133old_map The old postmap ABA - after 'this' write completes, this will be a
126 free block. 134 free block.
127new_map : The new postmap ABA. The map will up updated to reflect this 135new_map The new postmap ABA. The map will up updated to reflect this
128 lba->postmap_aba mapping, but we log it here in case we have to 136 lba->postmap_aba mapping, but we log it here in case we have to
129 recover. 137 recover.
130seq : Sequence number to mark which of the 2 sections of this flog entry is 138seq Sequence number to mark which of the 2 sections of this flog entry is
131 valid/newest. It cycles between 01->10->11->01 (binary) under normal 139 valid/newest. It cycles between 01->10->11->01 (binary) under normal
132 operation, with 00 indicating an uninitialized state. 140 operation, with 00 indicating an uninitialized state.
133lba' : alternate lba entry 141lba' alternate lba entry
134old_map': alternate old postmap entry 142old_map' alternate old postmap entry
135new_map': alternate new postmap entry 143new_map' alternate new postmap entry
136seq' : alternate sequence number. 144seq' alternate sequence number.
145======== =====================================================================
137 146
138Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also 147Each of the above fields is 32-bit, making one entry 32 bytes. Entries are also
139padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are 148padded to 64 bytes to avoid cache line sharing or aliasing. Flog updates are
@@ -147,8 +156,10 @@ c. The concept of lanes
147 156
148While 'nfree' describes the number of concurrent IOs an arena can process 157While 'nfree' describes the number of concurrent IOs an arena can process
149concurrently, 'nlanes' is the number of IOs the BTT device as a whole can 158concurrently, 'nlanes' is the number of IOs the BTT device as a whole can
150process. 159process::
151 nlanes = min(nfree, num_cpus) 160
161 nlanes = min(nfree, num_cpus)
162
152A lane number is obtained at the start of any IO, and is used for indexing into 163A lane number is obtained at the start of any IO, and is used for indexing into
153all the on-disk and in-memory data structures for the duration of the IO. If 164all the on-disk and in-memory data structures for the duration of the IO. If
154there are more CPUs than the max number of available lanes, than lanes are 165there are more CPUs than the max number of available lanes, than lanes are
@@ -180,10 +191,10 @@ e. In-memory data structure: map locks
180-------------------------------------- 191--------------------------------------
181 192
182Consider a case where two writer threads are writing to the same LBA. There can 193Consider a case where two writer threads are writing to the same LBA. There can
183be a race in the following sequence of steps: 194be a race in the following sequence of steps::
184 195
185free[lane] = map[premap_aba] 196 free[lane] = map[premap_aba]
186map[premap_aba] = postmap_aba 197 map[premap_aba] = postmap_aba
187 198
188Both threads can update their respective free[lane] with the same old, freed 199Both threads can update their respective free[lane] with the same old, freed
189postmap_aba. This has made the layout inconsistent by losing a free entry, and 200postmap_aba. This has made the layout inconsistent by losing a free entry, and
@@ -202,6 +213,7 @@ On startup, we analyze the BTT flog to create our list of free blocks. We walk
202through all the entries, and for each lane, of the set of two possible 213through all the entries, and for each lane, of the set of two possible
203'sections', we always look at the most recent one only (based on the sequence 214'sections', we always look at the most recent one only (based on the sequence
204number). The reconstruction rules/steps are simple: 215number). The reconstruction rules/steps are simple:
216
205- Read map[log_entry.lba]. 217- Read map[log_entry.lba].
206- If log_entry.new matches the map entry, then log_entry.old is free. 218- If log_entry.new matches the map entry, then log_entry.old is free.
207- If log_entry.new does not match the map entry, then log_entry.new is free. 219- If log_entry.new does not match the map entry, then log_entry.new is free.
@@ -228,7 +240,7 @@ Write:
2281. Convert external LBA to Arena number + pre-map ABA 2401. Convert external LBA to Arena number + pre-map ABA
2292. Get a lane (and take lane_lock) 2412. Get a lane (and take lane_lock)
2303. Use lane to index into in-memory free list and obtain a new block, next flog 2423. Use lane to index into in-memory free list and obtain a new block, next flog
231 index, next sequence number 243 index, next sequence number
2324. Scan the RTT to check if free block is present, and spin/wait if it is. 2444. Scan the RTT to check if free block is present, and spin/wait if it is.
2335. Write data to this free block 2455. Write data to this free block
2346. Read map to get the existing post-map ABA entry for this pre-map ABA 2466. Read map to get the existing post-map ABA entry for this pre-map ABA
@@ -245,6 +257,7 @@ Write:
245An arena would be in an error state if any of the metadata is corrupted 257An arena would be in an error state if any of the metadata is corrupted
246irrecoverably, either due to a bug or a media error. The following conditions 258irrecoverably, either due to a bug or a media error. The following conditions
247indicate an error: 259indicate an error:
260
248- Info block checksum does not match (and recovering from the copy also fails) 261- Info block checksum does not match (and recovering from the copy also fails)
249- All internal available blocks are not uniquely and entirely addressed by the 262- All internal available blocks are not uniquely and entirely addressed by the
250 sum of mapped blocks and free blocks (from the BTT flog). 263 sum of mapped blocks and free blocks (from the BTT flog).
@@ -263,11 +276,10 @@ The BTT can be set up on any disk (namespace) exposed by the libnvdimm subsystem
263(pmem, or blk mode). The easiest way to set up such a namespace is using the 276(pmem, or blk mode). The easiest way to set up such a namespace is using the
264'ndctl' utility [1]: 277'ndctl' utility [1]:
265 278
266For example, the ndctl command line to setup a btt with a 4k sector size is: 279For example, the ndctl command line to setup a btt with a 4k sector size is::
267 280
268 ndctl create-namespace -f -e namespace0.0 -m sector -l 4k 281 ndctl create-namespace -f -e namespace0.0 -m sector -l 4k
269 282
270See ndctl create-namespace --help for more options. 283See ndctl create-namespace --help for more options.
271 284
272[1]: https://github.com/pmem/ndctl 285[1]: https://github.com/pmem/ndctl
273
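
The map-entry layout, flog reconstruction rules, sequence cycle, and lane count described in btt.rst above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the kernel's implementation: every function and constant name below is hypothetical. Only the bit layout (flags in bits 31-30, postmap ABA in bits 29-0), the two reconstruction rules, the 01->10->11->01 sequence cycle, and nlanes = min(nfree, num_cpus) are taken from the documentation text.

```python
# Illustrative model of the BTT structures described in btt.rst.
# All names are hypothetical; they do not come from the kernel driver.

FLAG_SHIFT = 30                   # bits 31-30 hold the Error and Zero flags
POSTMAP_MASK = (1 << 30) - 1      # bits 29-0 hold the postmap ABA

STATES = {
    0b00: "initial",  # reads return zeroes; premap == postmap
    0b01: "zero",     # reads return zeroes
    0b10: "error",    # reads fail; writes clear the 'E' bit
    0b11: "normal",   # entry holds a valid postmap ABA
}


def decode_map_entry(entry, premap_aba):
    """Return (state, postmap_aba) for a raw 32-bit map entry."""
    if not 0 <= entry < (1 << 32):
        raise ValueError("map entry must fit in 32 bits")
    state = STATES[entry >> FLAG_SHIFT]
    # In the initial state the map behaves as an identity mapping.
    postmap = premap_aba if state == "initial" else entry & POSTMAP_MASK
    return state, postmap


def free_block_from_flog(map_postmap, log_old, log_new):
    """Apply the reconstruction rule for one lane's newest flog section.

    If the map already points at log_new, the write completed and the
    old block is free; otherwise the new block was never published and
    is still free.
    """
    return log_old if map_postmap == log_new else log_new


def next_seq(seq):
    """Advance a flog sequence number: 01 -> 10 -> 11 -> 01 (binary)."""
    if seq not in (1, 2, 3):
        raise ValueError("00 marks an uninitialized flog section")
    return seq % 3 + 1


def nlanes(nfree, num_cpus):
    """Number of IOs the BTT device as a whole can process concurrently."""
    return min(nfree, num_cpus)
```

For instance, under this model `decode_map_entry(0xC0000005, 9)` reports a normal block at postmap ABA 5, while an all-zero entry for premap ABA 9 decodes to the initial state and maps to block 9 itself.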
diff --git a/Documentation/nvdimm/index.rst b/Documentation/nvdimm/index.rst
new file mode 100644
index 000000000000..1a3402d3775e
--- /dev/null
+++ b/Documentation/nvdimm/index.rst
@@ -0,0 +1,12 @@
1:orphan:
2
3===================================
4Non-Volatile Memory Device (NVDIMM)
5===================================
6
7.. toctree::
8 :maxdepth: 1
9
10 nvdimm
11 btt
12 security
diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.rst
index 1669f626b037..08f855cbb4e6 100644
--- a/Documentation/nvdimm/nvdimm.txt
+++ b/Documentation/nvdimm/nvdimm.rst
@@ -1,8 +1,14 @@
1 LIBNVDIMM: Non-Volatile Devices 1===============================
2 libnvdimm - kernel / libndctl - userspace helper library 2LIBNVDIMM: Non-Volatile Devices
3 linux-nvdimm@lists.01.org 3===============================
4 v13
5 4
5libnvdimm - kernel / libndctl - userspace helper library
6
7linux-nvdimm@lists.01.org
8
9Version 13
10
11.. contents:
6 12
7 Glossary 13 Glossary
8 Overview 14 Overview
@@ -40,49 +46,57 @@
40 46
41 47
42Glossary 48Glossary
43-------- 49========
44 50
45PMEM: A system-physical-address range where writes are persistent. A 51PMEM:
46block device composed of PMEM is capable of DAX. A PMEM address range 52 A system-physical-address range where writes are persistent. A
47may span an interleave of several DIMMs. 53 block device composed of PMEM is capable of DAX. A PMEM address range
48 54 may span an interleave of several DIMMs.
49BLK: A set of one or more programmable memory mapped apertures provided 55
50by a DIMM to access its media. This indirection precludes the 56BLK:
51performance benefit of interleaving, but enables DIMM-bounded failure 57 A set of one or more programmable memory mapped apertures provided
52modes. 58 by a DIMM to access its media. This indirection precludes the
53 59 performance benefit of interleaving, but enables DIMM-bounded failure
54DPA: DIMM Physical Address, is a DIMM-relative offset. With one DIMM in 60 modes.
55the system there would be a 1:1 system-physical-address:DPA association. 61
56Once more DIMMs are added a memory controller interleave must be 62DPA:
57decoded to determine the DPA associated with a given 63 DIMM Physical Address, is a DIMM-relative offset. With one DIMM in
58system-physical-address. BLK capacity always has a 1:1 relationship 64 the system there would be a 1:1 system-physical-address:DPA association.
59with a single-DIMM's DPA range. 65 Once more DIMMs are added a memory controller interleave must be
60 66 decoded to determine the DPA associated with a given
61DAX: File system extensions to bypass the page cache and block layer to 67 system-physical-address. BLK capacity always has a 1:1 relationship
62mmap persistent memory, from a PMEM block device, directly into a 68 with a single-DIMM's DPA range.
63process address space. 69
64 70DAX:
65DSM: Device Specific Method: ACPI method to to control specific 71 File system extensions to bypass the page cache and block layer to
66device - in this case the firmware. 72 mmap persistent memory, from a PMEM block device, directly into a
67 73 process address space.
68DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. 74
69It defines a vendor-id, device-id, and interface format for a given DIMM. 75DSM:
70 76 Device Specific Method: ACPI method to to control specific
71BTT: Block Translation Table: Persistent memory is byte addressable. 77 device - in this case the firmware.
72Existing software may have an expectation that the power-fail-atomicity 78
73of writes is at least one sector, 512 bytes. The BTT is an indirection 79DCR:
74table with atomic update semantics to front a PMEM/BLK block device 80 NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5.
75driver and present arbitrary atomic sector sizes. 81 It defines a vendor-id, device-id, and interface format for a given DIMM.
76 82
77LABEL: Metadata stored on a DIMM device that partitions and identifies 83BTT:
78(persistently names) storage between PMEM and BLK. It also partitions 84 Block Translation Table: Persistent memory is byte addressable.
79BLK storage to host BTTs with different parameters per BLK-partition. 85 Existing software may have an expectation that the power-fail-atomicity
80Note that traditional partition tables, GPT/MBR, are layered on top of a 86 of writes is at least one sector, 512 bytes. The BTT is an indirection
81BLK or PMEM device. 87 table with atomic update semantics to front a PMEM/BLK block device
88 driver and present arbitrary atomic sector sizes.
89
90LABEL:
91 Metadata stored on a DIMM device that partitions and identifies
92 (persistently names) storage between PMEM and BLK. It also partitions
93 BLK storage to host BTTs with different parameters per BLK-partition.
94 Note that traditional partition tables, GPT/MBR, are layered on top of a
95 BLK or PMEM device.
82 96
83 97
84Overview 98Overview
85-------- 99========
86 100
87The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely, 101The LIBNVDIMM subsystem provides support for three types of NVDIMMs, namely,
88PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM 102PMEM, BLK, and NVDIMM devices that can simultaneously support both PMEM
@@ -96,19 +110,30 @@ accessible via BLK. When that occurs a LABEL is needed to reserve DPA
96for exclusive access via one mode a time. 110for exclusive access via one mode a time.
97 111
98Supporting Documents 112Supporting Documents
99ACPI 6: http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf 113--------------------
100NVDIMM Namespace: http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf 114
101DSM Interface Example: http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf 115ACPI 6:
102Driver Writer's Guide: http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf 116 http://www.uefi.org/sites/default/files/resources/ACPI_6.0.pdf
117NVDIMM Namespace:
118 http://pmem.io/documents/NVDIMM_Namespace_Spec.pdf
119DSM Interface Example:
120 http://pmem.io/documents/NVDIMM_DSM_Interface_Example.pdf
121Driver Writer's Guide:
122 http://pmem.io/documents/NVDIMM_Driver_Writers_Guide.pdf
103 123
104Git Trees 124Git Trees
105LIBNVDIMM: https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git 125---------
106LIBNDCTL: https://github.com/pmem/ndctl.git 126
107PMEM: https://github.com/01org/prd 127LIBNVDIMM:
128 https://git.kernel.org/cgit/linux/kernel/git/djbw/nvdimm.git
129LIBNDCTL:
130 https://github.com/pmem/ndctl.git
131PMEM:
132 https://github.com/01org/prd
108 133
109 134
110LIBNVDIMM PMEM and BLK 135LIBNVDIMM PMEM and BLK
111------------------ 136======================
112 137
113Prior to the arrival of the NFIT, non-volatile memory was described to a 138Prior to the arrival of the NFIT, non-volatile memory was described to a
114system in various ad-hoc ways. Usually only the bare minimum was 139system in various ad-hoc ways. Usually only the bare minimum was
@@ -122,38 +147,39 @@ For each NVDIMM access method (PMEM, BLK), LIBNVDIMM provides a block
122device driver: 147device driver:
123 148
124 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This 149 1. PMEM (nd_pmem.ko): Drives a system-physical-address range. This
125 range is contiguous in system memory and may be interleaved (hardware 150 range is contiguous in system memory and may be interleaved (hardware
126 memory controller striped) across multiple DIMMs. When interleaved the 151 memory controller striped) across multiple DIMMs. When interleaved the
127 platform may optionally provide details of which DIMMs are participating 152 platform may optionally provide details of which DIMMs are participating
128 in the interleave. 153 in the interleave.
129 154
130 Note that while LIBNVDIMM describes system-physical-address ranges that may 155 Note that while LIBNVDIMM describes system-physical-address ranges that may
131 alias with BLK access as ND_NAMESPACE_PMEM ranges and those without 156 alias with BLK access as ND_NAMESPACE_PMEM ranges and those without
132 alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no 157 alias as ND_NAMESPACE_IO ranges, to the nd_pmem driver there is no
133 distinction. The different device-types are an implementation detail 158 distinction. The different device-types are an implementation detail
134 that userspace can exploit to implement policies like "only interface 159 that userspace can exploit to implement policies like "only interface
135 with address ranges from certain DIMMs". It is worth noting that when 160 with address ranges from certain DIMMs". It is worth noting that when
136 aliasing is present and a DIMM lacks a label, then no block device can 161 aliasing is present and a DIMM lacks a label, then no block device can
137 be created by default as userspace needs to do at least one allocation 162 be created by default as userspace needs to do at least one allocation
138 of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once 163 of DPA to the PMEM range. In contrast ND_NAMESPACE_IO ranges, once
139 registered, can be immediately attached to nd_pmem. 164 registered, can be immediately attached to nd_pmem.
140 165
141 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform 166 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform
142 defined apertures. A set of apertures will access just one DIMM. 167 defined apertures. A set of apertures will access just one DIMM.
143 Multiple windows (apertures) allow multiple concurrent accesses, much like 168 Multiple windows (apertures) allow multiple concurrent accesses, much like
144 tagged-command-queuing, and would likely be used by different threads or 169 tagged-command-queuing, and would likely be used by different threads or
145 different CPUs. 170 different CPUs.
171
172 The NFIT specification defines a standard format for a BLK-aperture, but
173 the spec also allows for vendor specific layouts, and non-NFIT BLK
174 implementations may have other designs for BLK I/O. For this reason
175 "nd_blk" calls back into platform-specific code to perform the I/O.
146 176
147 The NFIT specification defines a standard format for a BLK-aperture, but 177 One such implementation is defined in the "Driver Writer's Guide" and "DSM
148 the spec also allows for vendor specific layouts, and non-NFIT BLK 178 Interface Example".
149 implementations may have other designs for BLK I/O. For this reason
150 "nd_blk" calls back into platform-specific code to perform the I/O.
151 One such implementation is defined in the "Driver Writer's Guide" and "DSM
152 Interface Example".
153 179
154 180
155Why BLK? 181Why BLK?
156-------- 182========
157 183
158While PMEM provides direct byte-addressable CPU-load/store access to 184While PMEM provides direct byte-addressable CPU-load/store access to
159NVDIMM storage, it does not provide the best system RAS (recovery, 185NVDIMM storage, it does not provide the best system RAS (recovery,
@@ -162,12 +188,15 @@ system-physical-address address causes a CPU exception while an access
162to a corrupted address through an BLK-aperture causes that block window 188to a corrupted address through an BLK-aperture causes that block window
163to raise an error status in a register. The latter is more aligned with 189to raise an error status in a register. The latter is more aligned with
164the standard error model that host-bus-adapter attached disks present. 190the standard error model that host-bus-adapter attached disks present.
191
165Also, if an administrator ever wants to replace a memory it is easier to 192Also, if an administrator ever wants to replace a memory it is easier to
166service a system at DIMM module boundaries. Compare this to PMEM where 193service a system at DIMM module boundaries. Compare this to PMEM where
167data could be interleaved in an opaque hardware specific manner across 194data could be interleaved in an opaque hardware specific manner across
168several DIMMs. 195several DIMMs.
169 196
170PMEM vs BLK 197PMEM vs BLK
198-----------
199
171BLK-apertures solve these RAS problems, but their presence is also the 200BLK-apertures solve these RAS problems, but their presence is also the
172major contributing factor to the complexity of the ND subsystem. They 201major contributing factor to the complexity of the ND subsystem. They
173complicate the implementation because PMEM and BLK alias in DPA space. 202complicate the implementation because PMEM and BLK alias in DPA space.
@@ -185,13 +214,14 @@ carved into an arbitrary number of BLK devices with discontiguous
185extents. 214extents.
186 215
187BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX 216BLK-REGIONs, PMEM-REGIONs, Atomic Sectors, and DAX
188-------------------------------------------------- 217^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
189 218
190One of the few 219One of the few
191reasons to allow multiple BLK namespaces per REGION is so that each 220reasons to allow multiple BLK namespaces per REGION is so that each
192BLK-namespace can be configured with a BTT with unique atomic sector 221BLK-namespace can be configured with a BTT with unique atomic sector
193sizes. While a PMEM device can host a BTT the LABEL specification does 222sizes. While a PMEM device can host a BTT the LABEL specification does
194not provide for a sector size to be specified for a PMEM namespace. 223not provide for a sector size to be specified for a PMEM namespace.
224
195This is due to the expectation that the primary usage model for PMEM is 225This is due to the expectation that the primary usage model for PMEM is
196via DAX, and the BTT is incompatible with DAX. However, for the cases 226via DAX, and the BTT is incompatible with DAX. However, for the cases
197where an application or filesystem still needs atomic sector update 227where an application or filesystem still needs atomic sector update
@@ -200,52 +230,52 @@ LIBNVDIMM/NDCTL: Block Translation Table "btt"
200 230
201 231
202Example NVDIMM Platform 232Example NVDIMM Platform
203----------------------- 233=======================
204 234
205For the remainder of this document the following diagram will be 235For the remainder of this document the following diagram will be
206referenced for any example sysfs layouts. 236referenced for any example sysfs layouts::
207 237
208 238
209 (a) (b) DIMM BLK-REGION 239 (a) (b) DIMM BLK-REGION
210 +-------------------+--------+--------+--------+ 240 +-------------------+--------+--------+--------+
211+------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2 241 +------+ | pm0.0 | blk2.0 | pm1.0 | blk2.1 | 0 region2
212| imc0 +--+- - - region0- - - +--------+ +--------+ 242 | imc0 +--+- - - region0- - - +--------+ +--------+
213+--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3 243 +--+---+ | pm0.0 | blk3.0 | pm1.0 | blk3.1 | 1 region3
214 | +-------------------+--------v v--------+ 244 | +-------------------+--------v v--------+
215+--+---+ | | 245 +--+---+ | |
216| cpu0 | region1 246 | cpu0 | region1
217+--+---+ | | 247 +--+---+ | |
218 | +----------------------------^ ^--------+ 248 | +----------------------------^ ^--------+
219+--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4 249 +--+---+ | blk4.0 | pm1.0 | blk4.0 | 2 region4
220| imc1 +--+----------------------------| +--------+ 250 | imc1 +--+----------------------------| +--------+
221+------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5 251 +------+ | blk5.0 | pm1.0 | blk5.0 | 3 region5
222 +----------------------------+--------+--------+ 252 +----------------------------+--------+--------+
223 253
224In this platform we have four DIMMs and two memory controllers in one 254In this platform we have four DIMMs and two memory controllers in one
225socket. Each unique interface (BLK or PMEM) to DPA space is identified 255socket. Each unique interface (BLK or PMEM) to DPA space is identified
226by a region device with a dynamically assigned id (REGION0 - REGION5). 256by a region device with a dynamically assigned id (REGION0 - REGION5).
227 257
228 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A 258 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A
229 single PMEM namespace is created in the REGION0-SPA-range that spans most 259 single PMEM namespace is created in the REGION0-SPA-range that spans most
230 of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that 260 of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that
231 interleaved system-physical-address range is reclaimed as BLK-aperture 261 interleaved system-physical-address range is reclaimed as BLK-aperture
232 accessed space starting at DPA-offset (a) into each DIMM. In that 262 accessed space starting at DPA-offset (a) into each DIMM. In that
233 reclaimed space we create two BLK-aperture "namespaces" from REGION2 and 263 reclaimed space we create two BLK-aperture "namespaces" from REGION2 and
234 REGION3 where "blk2.0" and "blk3.0" are just human readable names that 264 REGION3 where "blk2.0" and "blk3.0" are just human readable names that
235 could be set to any user-desired name in the LABEL. 265 could be set to any user-desired name in the LABEL.
236 266
237 2. In the last portion of DIMM0 and DIMM1 we have an interleaved 267 2. In the last portion of DIMM0 and DIMM1 we have an interleaved
238 system-physical-address range, REGION1, that spans those two DIMMs as 268 system-physical-address range, REGION1, that spans those two DIMMs as
239 well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace 269 well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace
240 named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for 270 named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for
241 each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and 271 each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and
242 "blk5.0". 272 "blk5.0".
243 273
244 3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1 274 3. The portions of DIMM2 and DIMM3 that do not participate in the REGION1
245 interleaved system-physical-address range (i.e. the DPA addresses past 275 interleaved system-physical-address range (i.e. the DPA addresses past
246 offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces. 276 offset (b)) are also included in the "blk4.0" and "blk5.0" namespaces.
247 Note that this example shows that BLK-aperture namespaces don't need to 277 Note that this example shows that BLK-aperture namespaces don't need to
248 be contiguous in DPA-space. 278 be contiguous in DPA-space.
249 279
250 This bus is provided by the kernel under the device 280 This bus is provided by the kernel under the device
251 /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and 281 /sys/devices/platform/nfit_test.0 when CONFIG_NFIT_TEST is enabled and
@@ -254,7 +284,7 @@ by a region device with a dynamically assigned id (REGION0 - REGION5).
254 284
255 285
256LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API 286LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API
257---------------------------------------------------- 287========================================================
258 288
259What follows is a description of the LIBNVDIMM sysfs layout and a 289What follows is a description of the LIBNVDIMM sysfs layout and a
260corresponding object hierarchy diagram as viewed through the LIBNDCTL 290corresponding object hierarchy diagram as viewed through the LIBNDCTL
@@ -263,12 +293,18 @@ NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit
263test. 293test.
264 294
265LIBNDCTL: Context 295LIBNDCTL: Context
296-----------------
297
266Every API call in the LIBNDCTL library requires a context that holds the 298Every API call in the LIBNDCTL library requires a context that holds the
267logging parameters and other library instance state. The library is 299logging parameters and other library instance state. The library is
268based on the libabc template: 300based on the libabc template:
269https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git 301
302 https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git
270 303
271LIBNDCTL: instantiate a new library context example 304LIBNDCTL: instantiate a new library context example
305^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
306
307::
272 308
273 struct ndctl_ctx *ctx; 309 struct ndctl_ctx *ctx;
274 310
@@ -278,7 +314,7 @@ LIBNDCTL: instantiate a new library context example
278 return NULL; 314 return NULL;
279 315
280LIBNVDIMM/LIBNDCTL: Bus 316LIBNVDIMM/LIBNDCTL: Bus
281------------------- 317-----------------------
282 318
283A bus has a 1:1 relationship with an NFIT. The current expectation for 319A bus has a 1:1 relationship with an NFIT. The current expectation for
284ACPI based systems is that there is only ever one platform-global NFIT. 320ACPI based systems is that there is only ever one platform-global NFIT.
@@ -288,9 +324,10 @@ we use this capability to test multiple NFIT configurations in the unit
288test. 324test.
289 325
290LIBNVDIMM: control class device in /sys/class 326LIBNVDIMM: control class device in /sys/class
327---------------------------------------------
291 328
292This character device accepts DSM messages to be passed to a DIMM 329This character device accepts DSM messages to be passed to a DIMM
293identified by its NFIT handle. 330identified by its NFIT handle::
294 331
295 /sys/class/nd/ndctl0 332 /sys/class/nd/ndctl0
296 |-- dev 333 |-- dev
@@ -300,10 +337,15 @@ identified by its NFIT handle.
300 337
301 338
302LIBNVDIMM: bus 339LIBNVDIMM: bus
340--------------
341
342::
303 343
304 struct nvdimm_bus *nvdimm_bus_register(struct device *parent, 344 struct nvdimm_bus *nvdimm_bus_register(struct device *parent,
305 struct nvdimm_bus_descriptor *nfit_desc); 345 struct nvdimm_bus_descriptor *nfit_desc);
306 346
347::
348
307 /sys/devices/platform/nfit_test.0/ndbus0 349 /sys/devices/platform/nfit_test.0/ndbus0
308 |-- commands 350 |-- commands
309 |-- nd 351 |-- nd
@@ -324,7 +366,9 @@ LIBNVDIMM: bus
324 `-- wait_probe 366 `-- wait_probe
325 367
326LIBNDCTL: bus enumeration example 368LIBNDCTL: bus enumeration example
327Find the bus handle that describes the bus from Example NVDIMM Platform 369^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
370
371Find the bus handle that describes the bus from Example NVDIMM Platform::
328 372
329 static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx, 373 static struct ndctl_bus *get_bus_by_provider(struct ndctl_ctx *ctx,
330 const char *provider) 374 const char *provider)
@@ -342,7 +386,7 @@ Find the bus handle that describes the bus from Example NVDIMM Platform
342 386
343 387
344LIBNVDIMM/LIBNDCTL: DIMM (NMEM) 388LIBNVDIMM/LIBNDCTL: DIMM (NMEM)
345--------------------------- 389-------------------------------
346 390
347The DIMM device provides a character device for sending commands to 391The DIMM device provides a character device for sending commands to
348hardware, and it is a container for LABELs. If the DIMM is defined by 392hardware, and it is a container for LABELs. If the DIMM is defined by
@@ -355,11 +399,16 @@ Range Mapping Structure", and there is no requirement that they actually
355be physical DIMMs, so we use a more generic name. 399be physical DIMMs, so we use a more generic name.
356 400
357LIBNVDIMM: DIMM (NMEM) 401LIBNVDIMM: DIMM (NMEM)
402^^^^^^^^^^^^^^^^^^^^^^
403
404::
358 405
359 struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data, 406 struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus, void *provider_data,
360 const struct attribute_group **groups, unsigned long flags, 407 const struct attribute_group **groups, unsigned long flags,
361 unsigned long *dsm_mask); 408 unsigned long *dsm_mask);
362 409
410::
411
363 /sys/devices/platform/nfit_test.0/ndbus0 412 /sys/devices/platform/nfit_test.0/ndbus0
364 |-- nmem0 413 |-- nmem0
365 | |-- available_slots 414 | |-- available_slots
@@ -384,15 +433,20 @@ LIBNVDIMM: DIMM (NMEM)
384 433
385 434
386LIBNDCTL: DIMM enumeration example 435LIBNDCTL: DIMM enumeration example
436^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
387 437
388Note, in this example we are assuming NFIT-defined DIMMs which are 438Note, in this example we are assuming NFIT-defined DIMMs which are
389identified by an "nfit_handle", a 32-bit value where: 439identified by an "nfit_handle", a 32-bit value where:
390Bit 3:0 DIMM number within the memory channel 440
391Bit 7:4 memory channel number 441 - Bit 3:0 DIMM number within the memory channel
392Bit 11:8 memory controller ID 442 - Bit 7:4 memory channel number
393Bit 15:12 socket ID (within scope of a Node controller if node controller is present) 443 - Bit 11:8 memory controller ID
394Bit 27:16 Node Controller ID 444 - Bit 15:12 socket ID (within scope of a Node controller if node
395Bit 31:28 Reserved 445 controller is present)
446 - Bit 27:16 Node Controller ID
447 - Bit 31:28 Reserved
448
449::
396 450
397 static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus, 451 static struct ndctl_dimm *get_dimm_by_handle(struct ndctl_bus *bus,
398 unsigned int handle) 452 unsigned int handle)
@@ -413,7 +467,7 @@ Bit 31:28 Reserved
413 dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0)); 467 dimm = get_dimm_by_handle(bus, DIMM_HANDLE(0, 0, 0, 0, 0));
414 468
415LIBNVDIMM/LIBNDCTL: Region 469LIBNVDIMM/LIBNDCTL: Region
416---------------------- 470--------------------------
417 471
418A generic REGION device is registered for each PMEM range or BLK-aperture 472A generic REGION device is registered for each PMEM range or BLK-aperture
419set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture 473set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture
@@ -435,13 +489,15 @@ emits, "devtype" duplicates the DEVTYPE variable stored by udev at the
435at the 'add' event, and finally, the optional "spa_index" is provided in 489at the 'add' event, and finally, the optional "spa_index" is provided in
436the case where the region is defined by a SPA. 490the case where the region is defined by a SPA.
437 491
438LIBNVDIMM: region 492LIBNVDIMM: region::
439 493
440 struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus, 494 struct nd_region *nvdimm_pmem_region_create(struct nvdimm_bus *nvdimm_bus,
441 struct nd_region_desc *ndr_desc); 495 struct nd_region_desc *ndr_desc);
442 struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus, 496 struct nd_region *nvdimm_blk_region_create(struct nvdimm_bus *nvdimm_bus,
443 struct nd_region_desc *ndr_desc); 497 struct nd_region_desc *ndr_desc);
444 498
499::
500
445 /sys/devices/platform/nfit_test.0/ndbus0 501 /sys/devices/platform/nfit_test.0/ndbus0
446 |-- region0 502 |-- region0
447 | |-- available_size 503 | |-- available_size
@@ -468,10 +524,11 @@ LIBNVDIMM: region
468 [..] 524 [..]
469 525
470LIBNDCTL: region enumeration example 526LIBNDCTL: region enumeration example
527^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
471 528
472Sample region retrieval routines based on NFIT-unique data like 529Sample region retrieval routines based on NFIT-unique data like
473"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for 530"spa_index" (interleave set id) for PMEM and "nfit_handle" (dimm id) for
474BLK. 531BLK::
475 532
476 static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus, 533 static struct ndctl_region *get_pmem_region_by_spa_index(struct ndctl_bus *bus,
477 unsigned int spa_index) 534 unsigned int spa_index)
@@ -518,33 +575,33 @@ REGION name generic and expects userspace to always consider the
518region-attributes for four reasons: 575region-attributes for four reasons:
519 576
520 1. There are already more than two REGION and "namespace" types. For 577 1. There are already more than two REGION and "namespace" types. For
521 PMEM there are two subtypes. As mentioned previously we have PMEM where 578 PMEM there are two subtypes. As mentioned previously we have PMEM where
522 the constituent DIMM devices are known and anonymous PMEM. For BLK 579 the constituent DIMM devices are known and anonymous PMEM. For BLK
523 regions the NFIT specification already anticipates vendor specific 580 regions the NFIT specification already anticipates vendor specific
524 implementations. The exact distinction of what a region contains is in 581 implementations. The exact distinction of what a region contains is in
525 the region-attributes not the region-name or the region-devtype. 582 the region-attributes not the region-name or the region-devtype.
526 583
527 2. A region with zero child-namespaces is a possible configuration. For 584 2. A region with zero child-namespaces is a possible configuration. For
528 example, the NFIT allows for a DCR to be published without a 585 example, the NFIT allows for a DCR to be published without a
529 corresponding BLK-aperture. This equates to a DIMM that can only accept 586 corresponding BLK-aperture. This equates to a DIMM that can only accept
530 control/configuration messages, but no i/o through a descendant block 587 control/configuration messages, but no i/o through a descendant block
531 device. Again, this "type" is advertised in the attributes ('mappings' 588 device. Again, this "type" is advertised in the attributes ('mappings'
532 == 0) and the name does not tell you much. 589 == 0) and the name does not tell you much.
533 590
534 3. What if a third major interface type arises in the future? Outside 591 3. What if a third major interface type arises in the future? Outside
535 of vendor specific implementations, it's not difficult to envision a 592 of vendor specific implementations, it's not difficult to envision a
536 third class of interface type beyond BLK and PMEM. With a generic name 593 third class of interface type beyond BLK and PMEM. With a generic name
537 for the REGION level of the device-hierarchy old userspace 594 for the REGION level of the device-hierarchy old userspace
538 implementations can still make sense of new kernel advertised 595 implementations can still make sense of new kernel advertised
539 region-types. Userspace can always rely on the generic region 596 region-types. Userspace can always rely on the generic region
540 attributes like "mappings", "size", etc and the expected child devices 597 attributes like "mappings", "size", etc and the expected child devices
541 named "namespace". This generic format of the device-model hierarchy 598 named "namespace". This generic format of the device-model hierarchy
542 allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and 599 allows the LIBNVDIMM and LIBNDCTL implementations to be more uniform and
543 future-proof. 600 future-proof.
544 601
545 4. There are more robust mechanisms for determining the major type of a 602 4. There are more robust mechanisms for determining the major type of a
546 region than a device name. See the next section, How Do I Determine the 603 region than a device name. See the next section, How Do I Determine the
547 Major Type of a Region? 604 Major Type of a Region?
548 605
549How Do I Determine the Major Type of a Region? 606How Do I Determine the Major Type of a Region?
550---------------------------------------------- 607----------------------------------------------
@@ -553,7 +610,8 @@ Outside of the blanket recommendation of "use libndctl", or simply
553looking at the kernel header (/usr/include/linux/ndctl.h) to decode the 610looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
554"nstype" integer attribute, here are some other options. 611"nstype" integer attribute, here are some other options.
555 612
556 1. module alias lookup: 6131. module alias lookup
614^^^^^^^^^^^^^^^^^^^^^^
557 615
558 The whole point of region/namespace device type differentiation is to 616 The whole point of region/namespace device type differentiation is to
559 decide which block-device driver will attach to a given LIBNVDIMM namespace. 617 decide which block-device driver will attach to a given LIBNVDIMM namespace.
@@ -569,28 +627,31 @@ looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
569 the resulting namespaces. The output from module resolution is more 627 the resulting namespaces. The output from module resolution is more
570 accurate than a region-name or region-devtype. 628 accurate than a region-name or region-devtype.
571 629
572 2. udev: 6302. udev
631^^^^^^^
632
633 The kernel "devtype" is registered in the udev database::
573 634
574 The kernel "devtype" is registered in the udev database 635 # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0
575 # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region0 636 P: /devices/platform/nfit_test.0/ndbus0/region0
576 P: /devices/platform/nfit_test.0/ndbus0/region0 637 E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0
577 E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region0 638 E: DEVTYPE=nd_pmem
578 E: DEVTYPE=nd_pmem 639 E: MODALIAS=nd:t2
579 E: MODALIAS=nd:t2 640 E: SUBSYSTEM=nd
580 E: SUBSYSTEM=nd
581 641
582 # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4 642 # udevadm info --path=/devices/platform/nfit_test.0/ndbus0/region4
583 P: /devices/platform/nfit_test.0/ndbus0/region4 643 P: /devices/platform/nfit_test.0/ndbus0/region4
584 E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4 644 E: DEVPATH=/devices/platform/nfit_test.0/ndbus0/region4
585 E: DEVTYPE=nd_blk 645 E: DEVTYPE=nd_blk
586 E: MODALIAS=nd:t3 646 E: MODALIAS=nd:t3
587 E: SUBSYSTEM=nd 647 E: SUBSYSTEM=nd
588 648
589 ...and is available as a region attribute, but keep in mind that the 649 ...and is available as a region attribute, but keep in mind that the
590 "devtype" does not indicate sub-type variations and scripts should 650 "devtype" does not indicate sub-type variations and scripts should
591 really be understanding the other attributes. 651 really be understanding the other attributes.
592 652
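The udev output above pairs each region's MODALIAS ("nd:t2", "nd:t3") with its DEVTYPE ("nd_pmem", "nd_blk"). A script-like consumer could decode that string as sketched below. The function names `nd_modalias_type()` and `nd_region_devtype()` are hypothetical, and the mapping covers only the two values shown in the udev output; other type numbers are deliberately left unmapped rather than guessed.

```c
#include <ctype.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Return N from a "nd:tN" MODALIAS string, or -1 if the string is malformed. */
static int nd_modalias_type(const char *modalias)
{
	static const char prefix[] = "nd:t";
	char *end;
	long type;

	if (!modalias || strncmp(modalias, prefix, sizeof(prefix) - 1) != 0)
		return -1;
	modalias += sizeof(prefix) - 1;
	/* Require at least one digit so that "nd:t" and "nd:t-1" are rejected. */
	if (!isdigit((unsigned char)*modalias))
		return -1;
	type = strtol(modalias, &end, 10);
	if (*end != '\0' || type > INT_MAX)
		return -1;
	return (int)type;
}

/* Map the two region types seen in the udev output to their DEVTYPE names. */
static const char *nd_region_devtype(int type)
{
	switch (type) {
	case 2:
		return "nd_pmem";
	case 3:
		return "nd_blk";
	default:
		return NULL; /* not shown in the udev output; left unmapped */
	}
}
```

As the surrounding text warns, the resulting type only identifies the major kind of region; sub-type variations still have to be read from the other region attributes.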
593 3. type specific attributes: 6533. type specific attributes
654^^^^^^^^^^^^^^^^^^^^^^^^^^^
594 655
595 As it currently stands a BLK-aperture region will never have a 656 As it currently stands a BLK-aperture region will never have a
596 "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A 657 "nfit/spa_index" attribute, but neither will a non-NFIT PMEM region. A
@@ -600,7 +661,7 @@ looking at the kernel header (/usr/include/linux/ndctl.h) to decode the
600 661
601 662
602LIBNVDIMM/LIBNDCTL: Namespace 663LIBNVDIMM/LIBNDCTL: Namespace
603------------------------- 664-----------------------------
604 665
605A REGION, after resolving DPA aliasing and LABEL specified boundaries, 666A REGION, after resolving DPA aliasing and LABEL specified boundaries,
606surfaces one or more "namespace" devices. The arrival of a "namespace" 667surfaces one or more "namespace" devices. The arrival of a "namespace"
@@ -608,12 +669,14 @@ device currently triggers either the nd_blk or nd_pmem driver to load
608and register a disk/block device. 669and register a disk/block device.
609 670
610LIBNVDIMM: namespace 671LIBNVDIMM: namespace
672^^^^^^^^^^^^^^^^^^^^
673
611Here is a sample layout from the three major types of NAMESPACE where 674Here is a sample layout from the three major types of NAMESPACE where
612namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid' 675namespace0.0 represents DIMM-info-backed PMEM (note that it has a 'uuid'
613attribute), namespace2.0 represents a BLK namespace (note it has a 676attribute), namespace2.0 represents a BLK namespace (note it has a
614'sector_size' attribute), and namespace6.0 represents an anonymous 677'sector_size' attribute), and namespace6.0 represents an anonymous
615PMEM namespace (note that it has no 'uuid' attribute because it does not support a 678PMEM namespace (note that it has no 'uuid' attribute because it does not support a
616LABEL). 679LABEL)::
617 680
618 /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0 681 /sys/devices/platform/nfit_test.0/ndbus0/region0/namespace0.0
619 |-- alt_name 682 |-- alt_name
@@ -656,76 +719,84 @@ LABEL).
656 `-- uevent 719 `-- uevent
657 720
658LIBNDCTL: namespace enumeration example 721LIBNDCTL: namespace enumeration example
722^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
659Namespaces are indexed relative to their parent region, example below. 723Namespaces are indexed relative to their parent region, example below.
660These indexes are mostly static from boot to boot, but the subsystem makes 724These indexes are mostly static from boot to boot, but the subsystem makes
661no guarantees in this regard. For a static namespace identifier use its 725no guarantees in this regard. For a static namespace identifier use its
662'uuid' attribute. 726'uuid' attribute.
663 727
664static struct ndctl_namespace *get_namespace_by_id(struct ndctl_region *region, 728::
665 unsigned int id)
666{
667 struct ndctl_namespace *ndns;
668 729
669 ndctl_namespace_foreach(region, ndns) 730 static struct ndctl_namespace
670 if (ndctl_namespace_get_id(ndns) == id) 731 *get_namespace_by_id(struct ndctl_region *region, unsigned int id)
671 return ndns; 732 {
733 struct ndctl_namespace *ndns;
672 734
673 return NULL; 735 ndctl_namespace_foreach(region, ndns)
674} 736 if (ndctl_namespace_get_id(ndns) == id)
737 return ndns;
738
739 return NULL;
740 }
675 741
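Because namespaces are indexed relative to their parent region, a sysfs device name such as "namespace2.1" carries both the region id and the per-region namespace id. The sketch below splits such a name into its two parts. `parse_namespace_name()` is a hypothetical helper written for illustration, not a libndctl call, and the result is only as stable as the indexes themselves; the 'uuid' attribute remains the static identifier.

```c
#include <stdio.h>

/*
 * Split "namespace<region>.<id>" into its two indexes.
 * Returns 0 on success and -1 if the name does not match that pattern.
 */
static int parse_namespace_name(const char *name, unsigned int *region_id,
				unsigned int *ns_id)
{
	unsigned int region, id;
	char trailing;

	if (!name || !region_id || !ns_id)
		return -1;
	/* Exactly two conversions must succeed; a third means trailing junk. */
	if (sscanf(name, "namespace%u.%u%c", &region, &id, &trailing) != 2)
		return -1;
	*region_id = region;
	*ns_id = id;
	return 0;
}
```

With libndctl available, the parsed per-region id would be the value to pass to a lookup like `get_namespace_by_id()` from the example above, after the parent region has been found.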
676LIBNDCTL: namespace creation example 742LIBNDCTL: namespace creation example
743^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
744
677Idle namespaces are automatically created by the kernel if a given 745Idle namespaces are automatically created by the kernel if a given
678region has enough available capacity to create a new namespace. 746region has enough available capacity to create a new namespace.
679Namespace instantiation involves finding an idle namespace and 747Namespace instantiation involves finding an idle namespace and
680configuring it. For the most part the setting of namespace attributes 748configuring it. For the most part the setting of namespace attributes
681can occur in any order, the only constraint is that 'uuid' must be set 749can occur in any order, the only constraint is that 'uuid' must be set
682before 'size'. This enables the kernel to track DPA allocations 750before 'size'. This enables the kernel to track DPA allocations
683internally with a static identifier. 751internally with a static identifier::
684 752
685static int configure_namespace(struct ndctl_region *region, 753 static int configure_namespace(struct ndctl_region *region,
686 struct ndctl_namespace *ndns, 754 struct ndctl_namespace *ndns,
687 struct namespace_parameters *parameters) 755 struct namespace_parameters *parameters)
688{ 756 {
689 char devname[50]; 757 char devname[50];
690 758
691 snprintf(devname, sizeof(devname), "namespace%d.%d", 759 snprintf(devname, sizeof(devname), "namespace%d.%d",
692 ndctl_region_get_id(region), parameters->id); 760 ndctl_region_get_id(region), parameters->id);
693 761
694 ndctl_namespace_set_alt_name(ndns, devname); 762 ndctl_namespace_set_alt_name(ndns, devname);
695 /* 'uuid' must be set prior to setting size! */ 763 /* 'uuid' must be set prior to setting size! */
696 ndctl_namespace_set_uuid(ndns, parameters->uuid); 764 ndctl_namespace_set_uuid(ndns, parameters->uuid);
697 ndctl_namespace_set_size(ndns, parameters->size); 765 ndctl_namespace_set_size(ndns, parameters->size);
698 /* unlike pmem namespaces, blk namespaces have a sector size */ 766 /* unlike pmem namespaces, blk namespaces have a sector size */
699 if (parameters->lbasize) 767 if (parameters->lbasize)
700 ndctl_namespace_set_sector_size(ndns, parameters->lbasize); 768 ndctl_namespace_set_sector_size(ndns, parameters->lbasize);
701 ndctl_namespace_enable(ndns); 769 ndctl_namespace_enable(ndns);
702} 770 }
703 771
704 772
705Why the Term "namespace"? 773Why the Term "namespace"?
774^^^^^^^^^^^^^^^^^^^^^^^^^
706 775
707 1. Why not "volume" for instance? "volume" ran the risk of confusing 776 1. Why not "volume" for instance? "volume" ran the risk of confusing
708 ND (libnvdimm subsystem) with a volume manager like device-mapper. 777 ND (libnvdimm subsystem) with a volume manager like device-mapper.
709 778
710 2. The term originated to describe the sub-devices that can be created 779 2. The term originated to describe the sub-devices that can be created
711 within a NVME controller (see the nvme specification: 780 within a NVME controller (see the nvme specification:
712 http://www.nvmexpress.org/specifications/), and NFIT namespaces are 781 http://www.nvmexpress.org/specifications/), and NFIT namespaces are
713 meant to parallel the capabilities and configurability of 782 meant to parallel the capabilities and configurability of
714 NVME-namespaces. 783 NVME-namespaces.
715 784
716 785
717LIBNVDIMM/LIBNDCTL: Block Translation Table "btt" 786LIBNVDIMM/LIBNDCTL: Block Translation Table "btt"
718--------------------------------------------- 787-------------------------------------------------
719 788
720A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked 789A BTT (design document: http://pmem.io/2014/09/23/btt.html) is a stacked
721block device driver that fronts either the whole block device or a 790block device driver that fronts either the whole block device or a
722partition of a block device emitted by either a PMEM or BLK NAMESPACE. 791partition of a block device emitted by either a PMEM or BLK NAMESPACE.
723 792
724LIBNVDIMM: btt layout 793LIBNVDIMM: btt layout
794^^^^^^^^^^^^^^^^^^^^^
795
725Every region will start out with at least one BTT device which is the 796Every region will start out with at least one BTT device which is the
726seed device. To activate it set the "namespace", "uuid", and 797seed device. To activate it set the "namespace", "uuid", and
727"sector_size" attributes and then bind the device to the nd_pmem or 798"sector_size" attributes and then bind the device to the nd_pmem or
728nd_blk driver depending on the region type. 799nd_blk driver depending on the region type::
729 800
730 /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/ 801 /sys/devices/platform/nfit_test.1/ndbus0/region0/btt0/
731 |-- namespace 802 |-- namespace
@@ -739,10 +810,12 @@ nd_blk driver depending on the region type.
739 `-- uuid 810 `-- uuid
740 811
741LIBNDCTL: btt creation example 812LIBNDCTL: btt creation example
813^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
814
742Similar to namespaces an idle BTT device is automatically created per 815Similar to namespaces an idle BTT device is automatically created per
743region. Each time this "seed" btt device is configured and enabled a new 816region. Each time this "seed" btt device is configured and enabled a new
744seed is created. Creating a BTT configuration involves two steps of 817seed is created. Creating a BTT configuration involves two steps of
745finding an idle BTT and assigning it to consume a PMEM or BLK namespace. 818finding an idle BTT and assigning it to consume a PMEM or BLK namespace::
746 819
747 static struct ndctl_btt *get_idle_btt(struct ndctl_region *region) 820 static struct ndctl_btt *get_idle_btt(struct ndctl_region *region)
748 { 821 {
@@ -787,29 +860,28 @@ Summary LIBNDCTL Diagram
787------------------------ 860------------------------
788 861
789For the given example above, here is the view of the objects as seen by the 862For the given example above, here is the view of the objects as seen by the
790LIBNDCTL API: 863LIBNDCTL API::
791 +---+ 864
792 |CTX| +---------+ +--------------+ +---------------+ 865 +---+
793 +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | 866 |CTX| +---------+ +--------------+ +---------------+
794 | | +---------+ +--------------+ +---------------+ 867 +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" |
795+-------+ | | +---------+ +--------------+ +---------------+ 868 | | +---------+ +--------------+ +---------------+
796| DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" | 869 +-------+ | | +---------+ +--------------+ +---------------+
797+-------+ | | | +---------+ +--------------+ +---------------+ 870 | DIMM0 <-+ | +-> REGION1 +---> NAMESPACE1.0 +--> PMEM6 "pm1.0" |
798| DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+ 871 +-------+ | | | +---------+ +--------------+ +---------------+
799+-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" | 872 | DIMM1 <-+ +-v--+ | +---------+ +--------------+ +---------------+
800| DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+ 873 +-------+ +-+BUS0+---> REGION2 +-+-> NAMESPACE2.0 +--> ND6 "blk2.0" |
801+-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 | 874 | DIMM2 <-+ +----+ | +---------+ | +--------------+ +----------------------+
802| DIMM3 <-+ | +--------------+ +----------------------+ 875 +-------+ | | +-> NAMESPACE2.1 +--> ND5 "blk2.1" | BTT2 |
803+-------+ | +---------+ +--------------+ +---------------+ 876 | DIMM3 <-+ | +--------------+ +----------------------+
804 +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" | 877 +-------+ | +---------+ +--------------+ +---------------+
805 | +---------+ | +--------------+ +----------------------+ 878 +-> REGION3 +-+-> NAMESPACE3.0 +--> ND4 "blk3.0" |
806 | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 | 879 | +---------+ | +--------------+ +----------------------+
807 | +--------------+ +----------------------+ 880 | +-> NAMESPACE3.1 +--> ND3 "blk3.1" | BTT1 |
808 | +---------+ +--------------+ +---------------+ 881 | +--------------+ +----------------------+
809 +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" | 882 | +---------+ +--------------+ +---------------+
810 | +---------+ +--------------+ +---------------+ 883 +-> REGION4 +---> NAMESPACE4.0 +--> ND2 "blk4.0" |
811 | +---------+ +--------------+ +----------------------+ 884 | +---------+ +--------------+ +---------------+
812 +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 | 885 | +---------+ +--------------+ +----------------------+
813 +---------+ +--------------+ +---------------+------+ 886 +-> REGION5 +---> NAMESPACE5.0 +--> ND1 "blk5.0" | BTT0 |
814 887 +---------+ +--------------+ +---------------+------+
815
diff --git a/Documentation/nvdimm/security.txt b/Documentation/nvdimm/security.rst
index 4c36c05ca98e..ad9dea099b34 100644
--- a/Documentation/nvdimm/security.txt
+++ b/Documentation/nvdimm/security.rst
@@ -1,4 +1,5 @@
1NVDIMM SECURITY 1===============
2NVDIMM Security
2=============== 3===============
3 4
41. Introduction 51. Introduction
@@ -138,4 +139,5 @@ This command is only available when the master security is enabled, indicated
138by the extended security status. 139by the extended security status.
139 140
140[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf 141[1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
142
141[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf 143[2]: http://www.t13.org/documents/UploadedDocuments/docs2006/e05179r4-ACS-SecurityClarifications.pdf
diff --git a/drivers/nvdimm/Kconfig b/drivers/nvdimm/Kconfig
index 54500798f23a..e89c1c332407 100644
--- a/drivers/nvdimm/Kconfig
+++ b/drivers/nvdimm/Kconfig
@@ -33,7 +33,7 @@ config BLK_DEV_PMEM
33 Documentation/admin-guide/kernel-parameters.rst). This driver converts 33 Documentation/admin-guide/kernel-parameters.rst). This driver converts
34 these persistent memory ranges into block devices that are 34 these persistent memory ranges into block devices that are
35 capable of DAX (direct-access) file system mappings. See 35 capable of DAX (direct-access) file system mappings. See
36 Documentation/nvdimm/nvdimm.txt for more details. 36 Documentation/nvdimm/nvdimm.rst for more details.
37 37
38 Say Y if you want to use an NVDIMM 38 Say Y if you want to use an NVDIMM
39 39