Diffstat (limited to 'Documentation/admin-guide/device-mapper/dm-zoned.rst')
 -rw-r--r--   Documentation/admin-guide/device-mapper/dm-zoned.rst   146
 1 file changed, 146 insertions, 0 deletions
diff --git a/Documentation/admin-guide/device-mapper/dm-zoned.rst b/Documentation/admin-guide/device-mapper/dm-zoned.rst
new file mode 100644
index 000000000000..07f56ebc1730
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/dm-zoned.rst
@@ -0,0 +1,146 @@

========
dm-zoned
========

The dm-zoned device mapper target exposes a zoned block device (ZBC and
ZAC compliant devices) as a regular block device without any write
pattern constraints. In effect, it implements a drive-managed zoned
block device which hides from the user (a file system or an application
doing raw block device accesses) the sequential write constraints of
host-managed zoned block devices and can mitigate the potential
device-side performance degradation due to excessive random writes on
host-aware zoned block devices.

For a more detailed description of the zoned block device models and
their constraints see (for SCSI devices):

http://www.t10.org/drafts.htm#ZBC_Family

and (for ATA devices):

http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf

The dm-zoned implementation is simple and minimizes system overhead (CPU
and memory usage as well as storage capacity loss). For a 10TB
host-managed disk with 256 MB zones, dm-zoned memory usage per disk
instance is at most 4.5 MB and as few as 5 zones will be used
internally for storing metadata and performing reclaim operations.
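
To put these figures in perspective, the small standalone C program
below (purely illustrative, not part of dm-zoned) works out the zone
count, the number of 4096-byte blocks per zone and the size of a
per-zone validity bitmap for the 10TB, 256 MB zone example::

    /* Rough sizing for the 10 TB drive, 256 MB zone example (illustrative only). */
    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            const uint64_t capacity   = 10000ULL * 1000 * 1000 * 1000; /* 10 TB          */
            const uint64_t zone_size  = 256ULL << 20;                  /* 256 MB zones   */
            const uint64_t block_size = 4096;                          /* dm-zoned block */

            uint64_t nr_zones        = capacity / zone_size;
            uint64_t blocks_per_zone = zone_size / block_size;
            uint64_t bitmap_bytes    = blocks_per_zone / 8; /* 1 validity bit per block */

            printf("zones:           %llu\n", (unsigned long long)nr_zones);
            printf("blocks per zone: %llu\n", (unsigned long long)blocks_per_zone);
            printf("bitmap per zone: %llu bytes\n", (unsigned long long)bitmap_bytes);
            return 0;
    }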

dm-zoned target devices are formatted and checked using the dmzadm
utility available at:

https://github.com/hgst/dm-zoned-tools

Algorithm
=========

dm-zoned implements an on-disk buffering scheme to handle non-sequential
write accesses to the sequential zones of a zoned block device.
Conventional zones are used for caching as well as for storing internal
metadata.

The zones of the device are separated into 2 types:

1) Metadata zones: these are conventional zones used to store metadata.
Metadata zones are not reported as usable capacity to the user.

2) Data zones: all remaining zones, the vast majority of which will be
sequential zones used exclusively to store user data. The conventional
zones of the device may also be used for buffering user random writes.
Data in these zones may be directly mapped to the conventional zone, but
later moved to a sequential zone so that the conventional zone can be
reused for buffering incoming random writes. A sketch of the per-zone
state implied by these two roles is given after this list.
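
The split between these roles implies a small amount of bookkeeping per
zone. The following C sketch is purely illustrative (type and field
names are hypothetical, not the driver's own structures); it only shows
the kind of per-zone state the target has to track::

    /* Illustrative only: the kind of per-zone state dm-zoned must track. */
    #include <stdint.h>

    enum zone_role {
            ZONE_META,      /* conventional zone holding dm-zoned metadata   */
            ZONE_BUFFER,    /* conventional zone buffering random writes     */
            ZONE_DATA,      /* zone currently mapping a chunk of user data   */
            ZONE_FREE,      /* unused, available for allocation              */
    };

    struct zone_desc {
            uint32_t       id;         /* zone number on the device          */
            enum zone_role role;
            int            sequential; /* 1: sequential-write-required zone  */
            uint64_t       wp_block;   /* write pointer, in 4096-byte blocks */
            uint32_t       chunk;      /* chunk mapped to this zone, if any  */
    };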

dm-zoned exposes a logical device with a sector size of 4096 bytes,
irrespective of the physical sector size of the backend zoned block
device being used. This allows reducing the amount of metadata needed to
manage valid blocks (blocks written).

The on-disk metadata format is as follows:

1) The first block of the first conventional zone found contains the
super block which describes the on-disk amount and position of the
metadata blocks.

2) Following the super block, a set of blocks is used to describe the
mapping of the logical device blocks. The mapping is done per chunk of
blocks, with the chunk size equal to the zone size. The mapping table is
indexed by chunk number and each mapping entry indicates the zone number
of the device storing the chunk of data. Each mapping entry may also
indicate the zone number of a conventional zone used to buffer random
modifications to the data zone.

3) A set of blocks used to store bitmaps indicating the validity of
blocks in the data zones follows the mapping table. A valid block is
defined as a block that was written and not discarded. For a buffered
data chunk, a block is only ever valid in either the data zone mapping
the chunk or in the buffer zone of the chunk, never in both. An
illustrative sketch of these three metadata areas follows.
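
Purely as an illustration, the three areas can be pictured with C
structures like the ones below. This is not the actual dm-zoned on-disk
format, which is defined by the driver and the dmzadm tool and carries
additional fields not shown here::

    /* Illustrative layout only; not the actual dm-zoned on-disk format. */
    #include <stdint.h>

    /* 1) Super block: describes the amount and position of metadata blocks. */
    struct example_super {
            uint64_t generation;       /* identifies the newest metadata set  */
            uint32_t nr_map_blocks;    /* blocks holding the mapping table    */
            uint32_t nr_bitmap_blocks; /* blocks holding the validity bitmaps */
            /* ... plus further fields in the real format ... */
    };

    /* 2) Mapping table entry, indexed by chunk number. */
    struct example_map_entry {
            uint32_t data_zone;   /* zone storing the chunk's data            */
            uint32_t buffer_zone; /* conventional zone buffering random       */
                                  /* writes to the chunk, if any              */
    };

    /*
     * 3) Validity bitmaps: one bit per 4096-byte block of each data zone,
     *    set while the block is written and not discarded.
     */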

For a logical chunk mapped to a conventional zone, all write operations
are processed by directly writing to the zone. If the zone mapping the
chunk is a sequential zone, the write operation is processed directly
only if the write offset within the logical chunk is equal to the write
pointer offset within the sequential data zone (i.e. the write operation
is aligned on the zone write pointer). Otherwise, write operations are
processed indirectly using a buffer zone. In that case, an unused
conventional zone is allocated and assigned to the chunk being
accessed. Writing a block to the buffer zone of a chunk will
automatically invalidate the same block in the sequential zone mapping
the chunk. If all blocks of the sequential zone become invalid, the zone
is freed and the chunk buffer zone becomes the primary zone mapping the
chunk, resulting in native random write performance similar to a regular
block device.
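
The decision above can be condensed into the following hedged C sketch;
the types and the route_write() helper are hypothetical and only restate
the routing rules in code form (allocation of a zone for a chunk's first
write is not shown)::

    /*
     * Hedged sketch of the write routing decision; the types and the
     * route_write() helper are hypothetical, only the logic mirrors the
     * text above.
     */
    #include <stdint.h>
    #include <stdbool.h>

    struct zone {
            bool     sequential; /* sequential-write-required zone?           */
            uint64_t wp;         /* write pointer offset, in 4096-byte blocks */
    };

    struct chunk_map {
            uint32_t data_zone;   /* zone currently mapping the chunk      */
            uint32_t buffer_zone; /* buffer zone, if the chunk is buffered */
    };

    enum write_route { WRITE_DIRECT, WRITE_BUFFERED };

    /* Decide how a write to block 'offset' of a mapped chunk is handled. */
    enum write_route route_write(const struct zone *zones,
                                 const struct chunk_map *map, uint64_t offset)
    {
            const struct zone *dz = &zones[map->data_zone];

            if (!dz->sequential)   /* conventional data zone: write in place */
                    return WRITE_DIRECT;
            if (offset == dz->wp)  /* aligned on the zone write pointer      */
                    return WRITE_DIRECT;
            return WRITE_BUFFERED; /* stage the write in a buffer zone       */
    }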

Read operations are processed according to the block validity
information provided by the bitmaps. Valid blocks are read either from
the sequential zone mapping a chunk, or, if the chunk is buffered, from
the assigned buffer zone. If the accessed chunk has no mapping, or the
accessed blocks are invalid, the read buffer is zeroed and the read
operation terminated.
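
The corresponding read-side lookup might look like the hedged sketch
below; route_read() and block_valid() are hypothetical names, and only
the order of the checks follows the description above::

    /* Hedged sketch of the read-side lookup; names are hypothetical. */
    #include <stdint.h>
    #include <stdbool.h>

    #define UNMAPPED UINT32_MAX

    struct chunk_map {
            uint32_t data_zone;   /* UNMAPPED if the chunk was never written */
            uint32_t buffer_zone; /* UNMAPPED if the chunk is not buffered   */
    };

    /* One validity bit per 4096-byte block of a zone. */
    static bool block_valid(const uint8_t *bitmap, uint64_t block)
    {
            return bitmap[block / 8] & (1u << (block % 8));
    }

    /*
     * Return the zone to read 'block' of a chunk from, or UNMAPPED if the
     * caller must simply zero the read buffer and complete the request.
     */
    uint32_t route_read(const struct chunk_map *map,
                        const uint8_t *data_bitmap, const uint8_t *buf_bitmap,
                        uint64_t block)
    {
            if (map->data_zone != UNMAPPED && block_valid(data_bitmap, block))
                    return map->data_zone;
            if (map->buffer_zone != UNMAPPED && block_valid(buf_bitmap, block))
                    return map->buffer_zone;
            return UNMAPPED;
    }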

After some time, the limited number of conventional zones available may
be exhausted (all of them used to map chunks or to buffer sequential
zones) and unaligned writes to unbuffered chunks become impossible. To
avoid this situation, a reclaim process regularly scans used
conventional zones and tries to reclaim the least recently used zones by
copying the valid blocks of the buffer zone to a free sequential zone.
Once the copy completes, the chunk mapping is updated to point to the
sequential zone and the buffer zone is freed for reuse.
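
One reclaim step can be summarized with the hedged sketch below; the
choice of the least recently used zone and the actual copying of valid
blocks are left out, and all names are hypothetical::

    /* Hedged sketch of the final part of one reclaim step. */
    #include <stdint.h>

    #define UNMAPPED UINT32_MAX

    struct chunk_map {
            uint32_t data_zone;   /* zone the chunk resolves to            */
            uint32_t buffer_zone; /* conventional zone buffering the chunk */
    };

    /*
     * Called once the valid blocks of the chunk's buffer zone have been
     * copied into the free sequential zone 'new_zone'.
     */
    void reclaim_finish(struct chunk_map *map, uint32_t new_zone)
    {
            map->data_zone   = new_zone;  /* point the chunk at the new zone */
            map->buffer_zone = UNMAPPED;  /* the buffer zone can be reused   */
    }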

Metadata Protection
===================

To protect metadata against corruption in case of sudden power loss or
system crash, 2 sets of metadata zones are used. One set, the primary
set, is used as the main metadata region, while the secondary set is
used as a staging area. Modified metadata is first written to the
secondary set and validated by updating the super block of the secondary
set; a generation counter is used to indicate that this set contains the
newest metadata. Once this operation completes, in-place updates of the
metadata blocks can be done in the primary metadata set. This ensures
that one of the sets is always consistent (all modifications committed
or none at all). Flush operations are used as a commit point. Upon
reception of a flush request, metadata modification activity is
temporarily blocked (for both incoming BIO processing and the reclaim
process) and all dirty metadata blocks are staged and updated. Normal
operation is then resumed. Flushing metadata thus only temporarily
delays write and discard requests. Read requests can be processed
concurrently while a metadata flush is being executed.
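
The commit sequence can be condensed into the hedged C sketch below; the
helper functions are empty stubs with hypothetical names, and only the
ordering of the three steps reflects the scheme described above::

    /* Schematic of the metadata commit point; helper names are hypothetical. */
    #include <stdint.h>

    enum md_set { PRIMARY_SET, SECONDARY_SET };

    static uint64_t generation;

    static void write_dirty_blocks(enum md_set set) { (void)set; /* stub */ }
    static void write_super(enum md_set set, uint64_t gen) { (void)set; (void)gen; /* stub */ }

    /* Called on a flush request, with metadata updates temporarily blocked. */
    void commit_metadata(void)
    {
            /* 1. Stage all dirty metadata blocks in the secondary set.     */
            write_dirty_blocks(SECONDARY_SET);

            /* 2. Validate the staging set: write its super block with an   */
            /*    incremented generation counter.                           */
            write_super(SECONDARY_SET, ++generation);

            /* 3. Now the primary set can be updated in place; if power is  */
            /*    lost here, the secondary set is still fully consistent.   */
            write_dirty_blocks(PRIMARY_SET);
    }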

Usage
=====

A zoned block device must first be formatted using the dmzadm tool. This
will analyze the device zone configuration, determine where to place the
metadata sets on the device and initialize the metadata sets.

Ex::

    dmzadm --format /dev/sdxx

For a formatted device, the target can be created normally with the
dmsetup utility. The only parameter that dm-zoned requires is the
underlying zoned block device name. Ex::

    echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
    dmsetup create dmz-`basename ${dev}`
