1 files changed, 146 insertions, 0 deletions
diff --git a/Documentation/device-mapper/dm-zoned.rst b/Documentation/device-mapper/dm-zoned.rst
new file mode 100644
index 000000000000..07f56ebc1730
--- /dev/null
+++ b/Documentation/device-mapper/dm-zoned.rst
@@ -0,0 +1,146 @@
+========
+dm-zoned
+========
+
+The dm-zoned device mapper target exposes a zoned block device (ZBC and
+ZAC compliant devices) as a regular block device without any write
+pattern constraints. In effect, it implements a drive-managed zoned
+block device which hides from the user (a file system or an application
+doing raw block device accesses) the sequential write constraints of
+host-managed zoned block devices and can mitigate the potential
+device-side performance degradation due to excessive random writes on
+host-aware zoned block devices.
+
+For a more detailed description of the zoned block device models and
+their constraints see (for SCSI devices):
+
+http://www.t10.org/drafts.htm#ZBC_Family
+
+and (for ATA devices):
+
+http://www.t13.org/Documents/UploadedDocuments/docs2015/di537r05-Zoned_Device_ATA_Command_Set_ZAC.pdf
+
+The dm-zoned implementation is simple and minimizes system overhead (CPU
+and memory usage as well as storage capacity loss). For a 10TB
+host-managed disk with 256 MB zones, dm-zoned memory usage per disk
+instance is at most 4.5 MB and as few as 5 zones will be used
+internally for storing metadata and performing reclaim operations.
+
+dm-zoned target devices are formatted and checked using the dmzadm
+utility available at:
+
+https://github.com/hgst/dm-zoned-tools
+
+Algorithm
+=========
+
+dm-zoned implements an on-disk buffering scheme to handle non-sequential
+write accesses to the sequential zones of a zoned block device.
+Conventional zones are used for caching as well as for storing internal
+metadata.
+
+The zones of the device are separated into 2 types:
+
+1) Metadata zones: these are conventional zones used to store metadata.
+Metadata zones are not reported as useable capacity to the user.
+
+2) Data zones: all remaining zones, the vast majority of which will be
+sequential zones used exclusively to store user data. The conventional
+zones of the device may also be used for buffering user random writes.
+Data in these zones may be directly mapped to the conventional zone, but
+later moved to a sequential zone so that the conventional zone can be
+reused for buffering incoming random writes.
+
+dm-zoned exposes a logical device with a sector size of 4096 bytes,
+irrespective of the physical sector size of the backend zoned block
+device being used. This allows reducing the amount of metadata needed to
+manage valid blocks (blocks written).
+
+The on-disk metadata format is as follows:
+
+1) The first block of the first conventional zone found contains the
+super block which describes the on-disk amount and position of metadata
+blocks.
+
+2) Following the super block, a set of blocks is used to describe the
+mapping of the logical device blocks. The mapping is done per chunk of
+blocks, with the chunk size equal to the zone size of the zoned block
+device. The mapping table is indexed by chunk number and each mapping
+entry indicates the zone number of the device storing the chunk of data.
+Each mapping entry may also indicate the zone number of a conventional
+zone used to buffer random modifications to the data zone.
+
+3) A set of blocks used to store bitmaps indicating the validity of
+blocks in the data zones follows the mapping table. A valid block is
+defined as a block that was written and not discarded. For a buffered
+data chunk, a block is valid either in the data zone mapping the chunk
+or in the buffer zone of the chunk, never in both.
+
+For a logical chunk mapped to a conventional zone, all write operations
+are processed by directly writing to the zone. If the mapping zone is a
+sequential zone, the write operation is processed directly only if the
+write offset within the logical chunk is equal to the write pointer
+offset within the sequential data zone (i.e. the write operation is
+aligned on the zone write pointer). Otherwise, write operations are
+processed indirectly using a buffer zone. In that case, an unused
+conventional zone is allocated and assigned to the chunk being
+accessed. Writing a block to the buffer zone of a chunk will
+automatically invalidate the same block in the sequential zone mapping
+the chunk. If all blocks of the sequential zone become invalid, the zone
+is freed and the chunk buffer zone becomes the primary zone mapping the
+chunk, resulting in native random write performance similar to a regular
+block device.
+
+Read operations are processed according to the block validity
+information provided by the bitmaps. Valid blocks are read either from
+the sequential zone mapping a chunk, or if the chunk is buffered, from
+the buffer zone assigned. If the accessed chunk has no mapping, or the
+accessed blocks are invalid, the read buffer is zeroed and the read
+operation terminated.
+
+After some time, the limited number of conventional zones available may
+be exhausted (all used to map chunks or buffer sequential zones) and
+unaligned writes to unbuffered chunks become impossible. To avoid this
+situation, a reclaim process regularly scans used conventional zones and
+tries to reclaim the least recently used zones by copying the valid
+blocks of the buffer zone to a free sequential zone. Once the copy
+completes, the chunk mapping is updated to point to the sequential zone
+and the buffer zone freed for reuse.
+
+Metadata Protection
+===================
+
+To protect metadata against corruption in case of sudden power loss or
+system crash, 2 sets of metadata zones are used. One set, the primary
+set, is used as the main metadata region, while the secondary set is
+used as a staging area. Modified metadata is first written to the
+secondary set and validated by updating the super block in the secondary
+set; a generation counter is used to indicate that this set contains the
+newest metadata. Once this operation completes, in-place metadata block
+updates can be done in the primary metadata set. This ensures that one
+of the sets is always consistent (all modifications committed or none
+at all). Flush operations are used as a commit point. Upon reception of
+a flush request, metadata modification activity is temporarily blocked
+(for both incoming BIO processing and reclaim process) and all dirty
+metadata blocks are staged and updated. Normal operation is then
+resumed. Flushing metadata thus only temporarily delays write and
+discard requests. Read requests can be processed concurrently while
+a metadata flush is being executed.
+
+Usage
+=====
+
+A zoned block device must first be formatted using the dmzadm tool. This
+will analyze the device zone configuration, determine where to place the
+metadata sets on the device and initialize the metadata sets.
+
+Ex::
+
+        dmzadm --format /dev/sdxx
+
+For a formatted device, the target can be created normally with the
+dmsetup utility. The only parameter that dm-zoned requires is the
+underlying zoned block device name. Ex::
+
+        echo "0 `blockdev --getsize ${dev}` zoned ${dev}" | \
+        dmsetup create dmz-`basename ${dev}`
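
As an illustration of the Usage example in the patch, the table line passed to dmsetup can be assembled by a small helper. This is only a sketch and not part of the patch: the helper names dmz_table_line and dmz_name are hypothetical, and blockdev --getsz is used in place of the older --getsize spelling (both report the size in 512-byte sectors).

```shell
#!/bin/sh
# Hypothetical helpers (not provided by dmsetup or dmzadm) that build the
# pieces of the "dmsetup create" invocation shown in the Usage section.

# Print the dm-zoned table line: "<start> <length> zoned <device>".
#   $1: path of the formatted zoned block device
#   $2: device size in 512-byte sectors
dmz_table_line() {
    printf '0 %s zoned %s\n' "$2" "$1"
}

# Print the mapped device name used in the patch example: dmz-<basename>.
#   $1: path of the zoned block device
dmz_name() {
    printf 'dmz-%s\n' "$(basename "$1")"
}

# Intended use (needs root and a device already formatted with dmzadm,
# so it is left commented out here):
#   dev=/dev/sdxx
#   dmz_table_line "$dev" "$(blockdev --getsz "$dev")" | \
#       dmsetup create "$(dmz_name "$dev")"
```

Separating the string building from the privileged commands makes it possible to check the table line before any device is created.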

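The write handling described in the Algorithm section of the patch can be summarised as a three-way decision. The sketch below is an illustration under stated assumptions, not the driver's code: the function name dmz_write_path and its string arguments are hypothetical, offsets are expressed in blocks, and chunks with no mapping yet are not modelled.

```shell
#!/bin/sh
# Hypothetical model of the write-path choice described in the Algorithm
# section. It prints "direct" or "buffered".
#   $1: type of the zone mapping the chunk: "conventional" or "sequential"
#   $2: write offset within the logical chunk, in blocks
#   $3: write pointer offset within the sequential zone, in blocks
#       (ignored for conventional zones)
dmz_write_path() {
    if [ "$1" = "conventional" ]; then
        # Conventional zones accept writes at any offset.
        echo direct
    elif [ "$2" -eq "$3" ]; then
        # The write is aligned on the zone write pointer.
        echo direct
    else
        # Unaligned write: a conventional buffer zone is needed.
        echo buffered
    fi
}

dmz_write_path conventional 100 0   # prints "direct"
dmz_write_path sequential 8 8       # prints "direct"
dmz_write_path sequential 4 8       # prints "buffered"
```

In the "buffered" case the text states that an unused conventional zone is assigned to the chunk and that the written blocks are invalidated in the sequential zone; the sketch only covers the decision itself.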