author | Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> | 2015-11-10 19:10:45 -0500 |
---|---|---|
committer | Dan Williams <dan.j.williams@intel.com> | 2015-11-12 12:55:23 -0500 |
commit | 8de5dff8bae634497f4413bc3067389f2ed267da | |
tree | 14307943d7a9889c6140f2b79fa1a71f8fa745d4 /Documentation | |
parent | 589e75d15702dc720b363a92f984876704864946 |
libnvdimm: documentation clarifications
A bunch of changes that I hope will help in understanding it
better for first-time readers.
Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/nvdimm/nvdimm.txt | 49 |
1 files changed, 28 insertions, 21 deletions
diff --git a/Documentation/nvdimm/nvdimm.txt b/Documentation/nvdimm/nvdimm.txt
index 197a0b6b0582..e894de69915a 100644
--- a/Documentation/nvdimm/nvdimm.txt
+++ b/Documentation/nvdimm/nvdimm.txt
@@ -62,6 +62,12 @@ DAX: File system extensions to bypass the page cache and block layer to | |||
62 | mmap persistent memory, from a PMEM block device, directly into a | 62 | mmap persistent memory, from a PMEM block device, directly into a |
63 | process address space. | 63 | process address space. |
64 | 64 | ||
65 | DSM: Device Specific Method: ACPI method to to control specific | ||
66 | device - in this case the firmware. | ||
67 | |||
68 | DCR: NVDIMM Control Region Structure defined in ACPI 6 Section 5.2.25.5. | ||
69 | It defines a vendor-id, device-id, and interface format for a given DIMM. | ||
70 | |||
65 | BTT: Block Translation Table: Persistent memory is byte addressable. | 71 | BTT: Block Translation Table: Persistent memory is byte addressable. |
66 | Existing software may have an expectation that the power-fail-atomicity | 72 | Existing software may have an expectation that the power-fail-atomicity |
67 | of writes is at least one sector, 512 bytes. The BTT is an indirection | 73 | of writes is at least one sector, 512 bytes. The BTT is an indirection |
@@ -133,16 +139,16 @@ device driver: | |||
133 | registered, can be immediately attached to nd_pmem. | 139 | registered, can be immediately attached to nd_pmem. |
134 | 140 | ||
135 | 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform | 141 | 2. BLK (nd_blk.ko): This driver performs I/O using a set of platform |
136 | defined apertures. A set of apertures will all access just one DIMM. | 142 | defined apertures. A set of apertures will access just one DIMM. |
137 | Multiple windows allow multiple concurrent accesses, much like | 143 | Multiple windows (apertures) allow multiple concurrent accesses, much like |
138 | tagged-command-queuing, and would likely be used by different threads or | 144 | tagged-command-queuing, and would likely be used by different threads or |
139 | different CPUs. | 145 | different CPUs. |
140 | 146 | ||
141 | The NFIT specification defines a standard format for a BLK-aperture, but | 147 | The NFIT specification defines a standard format for a BLK-aperture, but |
142 | the spec also allows for vendor specific layouts, and non-NFIT BLK | 148 | the spec also allows for vendor specific layouts, and non-NFIT BLK |
143 | implementations may other designs for BLK I/O. For this reason "nd_blk" | 149 | implementations may have other designs for BLK I/O. For this reason |
144 | calls back into platform-specific code to perform the I/O. One such | 150 | "nd_blk" calls back into platform-specific code to perform the I/O. |
145 | implementation is defined in the "Driver Writer's Guide" and "DSM | 151 | One such implementation is defined in the "Driver Writer's Guide" and "DSM |
146 | Interface Example". | 152 | Interface Example". |
147 | 153 | ||
148 | 154 | ||
@@ -152,7 +158,7 @@ Why BLK? | |||
152 | While PMEM provides direct byte-addressable CPU-load/store access to | 158 | While PMEM provides direct byte-addressable CPU-load/store access to |
153 | NVDIMM storage, it does not provide the best system RAS (recovery, | 159 | NVDIMM storage, it does not provide the best system RAS (recovery, |
154 | availability, and serviceability) model. An access to a corrupted | 160 | availability, and serviceability) model. An access to a corrupted |
155 | system-physical-address address causes a cpu exception while an access | 161 | system-physical-address address causes a CPU exception while an access |
156 | to a corrupted address through an BLK-aperture causes that block window | 162 | to a corrupted address through an BLK-aperture causes that block window |
157 | to raise an error status in a register. The latter is more aligned with | 163 | to raise an error status in a register. The latter is more aligned with |
158 | the standard error model that host-bus-adapter attached disks present. | 164 | the standard error model that host-bus-adapter attached disks present. |
@@ -162,7 +168,7 @@ data could be interleaved in an opaque hardware specific manner across | |||
162 | several DIMMs. | 168 | several DIMMs. |
163 | 169 | ||
164 | PMEM vs BLK | 170 | PMEM vs BLK |
165 | BLK-apertures solve this RAS problem, but their presence is also the | 171 | BLK-apertures solve these RAS problems, but their presence is also the |
166 | major contributing factor to the complexity of the ND subsystem. They | 172 | major contributing factor to the complexity of the ND subsystem. They |
167 | complicate the implementation because PMEM and BLK alias in DPA space. | 173 | complicate the implementation because PMEM and BLK alias in DPA space. |
168 | Any given DIMM's DPA-range may contribute to one or more | 174 | Any given DIMM's DPA-range may contribute to one or more |
@@ -220,8 +226,8 @@ socket. Each unique interface (BLK or PMEM) to DPA space is identified | |||
220 | by a region device with a dynamically assigned id (REGION0 - REGION5). | 226 | by a region device with a dynamically assigned id (REGION0 - REGION5). |
221 | 227 | ||
222 | 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A | 228 | 1. The first portion of DIMM0 and DIMM1 are interleaved as REGION0. A |
223 | single PMEM namespace is created in the REGION0-SPA-range that spans | 229 | single PMEM namespace is created in the REGION0-SPA-range that spans most |
224 | DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that | 230 | of DIMM0 and DIMM1 with a user-specified name of "pm0.0". Some of that |
225 | interleaved system-physical-address range is reclaimed as BLK-aperture | 231 | interleaved system-physical-address range is reclaimed as BLK-aperture |
226 | accessed space starting at DPA-offset (a) into each DIMM. In that | 232 | accessed space starting at DPA-offset (a) into each DIMM. In that |
227 | reclaimed space we create two BLK-aperture "namespaces" from REGION2 and | 233 | reclaimed space we create two BLK-aperture "namespaces" from REGION2 and |
@@ -230,13 +236,13 @@ by a region device with a dynamically assigned id (REGION0 - REGION5). | |||
230 | 236 | ||
231 | 2. In the last portion of DIMM0 and DIMM1 we have an interleaved | 237 | 2. In the last portion of DIMM0 and DIMM1 we have an interleaved |
232 | system-physical-address range, REGION1, that spans those two DIMMs as | 238 | system-physical-address range, REGION1, that spans those two DIMMs as |
233 | well as DIMM2 and DIMM3. Some of REGION1 allocated to a PMEM namespace | 239 | well as DIMM2 and DIMM3. Some of REGION1 is allocated to a PMEM namespace |
234 | named "pm1.0" the rest is reclaimed in 4 BLK-aperture namespaces (for | 240 | named "pm1.0", the rest is reclaimed in 4 BLK-aperture namespaces (for |
235 | each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and | 241 | each DIMM in the interleave set), "blk2.1", "blk3.1", "blk4.0", and |
236 | "blk5.0". | 242 | "blk5.0". |
237 | 243 | ||
238 | 3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1 | 244 | 3. The portion of DIMM2 and DIMM3 that do not participate in the REGION1 |
239 | interleaved system-physical-address range (i.e. the DPA address below | 245 | interleaved system-physical-address range (i.e. the DPA address past |
240 | offset (b) are also included in the "blk4.0" and "blk5.0" namespaces. | 246 | offset (b) are also included in the "blk4.0" and "blk5.0" namespaces. |
241 | Note, that this example shows that BLK-aperture namespaces don't need to | 247 | Note, that this example shows that BLK-aperture namespaces don't need to |
242 | be contiguous in DPA-space. | 248 | be contiguous in DPA-space. |
@@ -252,15 +258,15 @@ LIBNVDIMM Kernel Device Model and LIBNDCTL Userspace API | |||
252 | 258 | ||
253 | What follows is a description of the LIBNVDIMM sysfs layout and a | 259 | What follows is a description of the LIBNVDIMM sysfs layout and a |
254 | corresponding object hierarchy diagram as viewed through the LIBNDCTL | 260 | corresponding object hierarchy diagram as viewed through the LIBNDCTL |
255 | api. The example sysfs paths and diagrams are relative to the Example | 261 | API. The example sysfs paths and diagrams are relative to the Example |
256 | NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit | 262 | NVDIMM Platform which is also the LIBNVDIMM bus used in the LIBNDCTL unit |
257 | test. | 263 | test. |
258 | 264 | ||
259 | LIBNDCTL: Context | 265 | LIBNDCTL: Context |
260 | Every api call in the LIBNDCTL library requires a context that holds the | 266 | Every API call in the LIBNDCTL library requires a context that holds the |
261 | logging parameters and other library instance state. The library is | 267 | logging parameters and other library instance state. The library is |
262 | based on the libabc template: | 268 | based on the libabc template: |
263 | https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git/ | 269 | https://git.kernel.org/cgit/linux/kernel/git/kay/libabc.git |
264 | 270 | ||
265 | LIBNDCTL: instantiate a new library context example | 271 | LIBNDCTL: instantiate a new library context example |
266 | 272 | ||
@@ -409,7 +415,7 @@ Bit 31:28 Reserved | |||
409 | LIBNVDIMM/LIBNDCTL: Region | 415 | LIBNVDIMM/LIBNDCTL: Region |
410 | ---------------------- | 416 | ---------------------- |
411 | 417 | ||
412 | A generic REGION device is registered for each PMEM range orBLK-aperture | 418 | A generic REGION device is registered for each PMEM range or BLK-aperture |
413 | set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture | 419 | set. Per the example there are 6 regions: 2 PMEM and 4 BLK-aperture |
414 | sets on the "nfit_test.0" bus. The primary role of regions are to be a | 420 | sets on the "nfit_test.0" bus. The primary role of regions are to be a |
415 | container of "mappings". A mapping is a tuple of <DIMM, | 421 | container of "mappings". A mapping is a tuple of <DIMM, |
@@ -509,7 +515,7 @@ At first glance it seems since NFIT defines just PMEM and BLK interface | |||
509 | types that we should simply name REGION devices with something derived | 515 | types that we should simply name REGION devices with something derived |
510 | from those type names. However, the ND subsystem explicitly keeps the | 516 | from those type names. However, the ND subsystem explicitly keeps the |
511 | REGION name generic and expects userspace to always consider the | 517 | REGION name generic and expects userspace to always consider the |
512 | region-attributes for 4 reasons: | 518 | region-attributes for four reasons: |
513 | 519 | ||
514 | 1. There are already more than two REGION and "namespace" types. For | 520 | 1. There are already more than two REGION and "namespace" types. For |
515 | PMEM there are two subtypes. As mentioned previously we have PMEM where | 521 | PMEM there are two subtypes. As mentioned previously we have PMEM where |
@@ -698,8 +704,8 @@ static int configure_namespace(struct ndctl_region *region, | |||
698 | 704 | ||
699 | Why the Term "namespace"? | 705 | Why the Term "namespace"? |
700 | 706 | ||
701 | 1. Why not "volume" for instance? "volume" ran the risk of confusing ND | 707 | 1. Why not "volume" for instance? "volume" ran the risk of confusing |
702 | as a volume manager like device-mapper. | 708 | ND (libnvdimm subsystem) to a volume manager like device-mapper. |
703 | 709 | ||
704 | 2. The term originated to describe the sub-devices that can be created | 710 | 2. The term originated to describe the sub-devices that can be created |
705 | within a NVME controller (see the nvme specification: | 711 | within a NVME controller (see the nvme specification: |
@@ -774,13 +780,14 @@ block" needs to be destroyed. Note, that to destroy a BTT the media | |||
774 | needs to be written in raw mode. By default, the kernel will autodetect | 780 | needs to be written in raw mode. By default, the kernel will autodetect |
775 | the presence of a BTT and disable raw mode. This autodetect behavior | 781 | the presence of a BTT and disable raw mode. This autodetect behavior |
776 | can be suppressed by enabling raw mode for the namespace via the | 782 | can be suppressed by enabling raw mode for the namespace via the |
777 | ndctl_namespace_set_raw_mode() api. | 783 | ndctl_namespace_set_raw_mode() API. |
778 | 784 | ||
779 | 785 | ||
780 | Summary LIBNDCTL Diagram | 786 | Summary LIBNDCTL Diagram |
781 | ------------------------ | 787 | ------------------------ |
782 | 788 | ||
783 | For the given example above, here is the view of the objects as seen by the LIBNDCTL api: | 789 | For the given example above, here is the view of the objects as seen by the |
790 | LIBNDCTL API: | ||
784 | +---+ | 791 | +---+ |
785 | |CTX| +---------+ +--------------+ +---------------+ | 792 | |CTX| +---------+ +--------------+ +---------------+ |
786 | +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | | 793 | +-+-+ +-> REGION0 +---> NAMESPACE0.0 +--> PMEM8 "pm0.0" | |