md-cluster: Design Documentation

Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
author: Goldwyn Rodrigues <rgoldwyn@suse.com> 2014-06-10 17:31:01 -0400
committer: Goldwyn Rodrigues <rgoldwyn@suse.com> 2015-02-23 08:16:46 -0500
commit: b8d834488fd7c0c5a79cd2bab112c37a3d3292b9 (patch)
tree: 2b07c77e9c39e6f3d34438021a3826f51e083a56 /Documentation
parent: c517d838eb7d07bbe9507871fab3931deccff539 (diff)
1 files changed, 176 insertions, 0 deletions
diff --git a/Documentation/md-cluster.txt b/Documentation/md-cluster.txt
new file mode 100644
index 000000000000..de1af7db3355
--- /dev/null
+++ b/Documentation/md-cluster.txt
@@ -0,0 +1,176 @@
+The cluster MD is a shared-device RAID for a cluster.
+1. On-disk format
+Separate write-intent bitmaps are used for each cluster node.
+The bitmaps record all writes that may have been started on that node,
+and may not yet have finished. The on-disk layout is:
+0                    4k                     8k                    12k
+-------------------------------------------------------------------
+| idle                | md super            | bm super [0] + bits |
+| bm bits[0, contd]   | bm super[1] + bits  | bm bits[1, contd]   |
+| bm super[2] + bits  | bm bits [2, contd]  | bm super[3] + bits  |
+| bm bits [3, contd]  |                     |                     |
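Reading the diagram as a row-major grid of 4k cells, each node's bitmap superblock starts two cells after the previous one. A minimal sketch of that arithmetic, assuming every node's bitmap occupies exactly the two cells drawn above (the stride is taken from the diagram only and is not stated in the text); the constant and function names are hypothetical:

```python
CELL = 4096                      # one cell in the diagram (4k)
FIRST_BITMAP_OFFSET = 2 * CELL   # the "idle" and "md super" cells come first
BITMAP_STRIDE = 2 * CELL         # assumption: "super + bits" cell plus one "contd" cell


def bitmap_super_offset(slot: int) -> int:
    """Byte offset of bm super[slot] under the diagram's illustrative layout."""
    if slot < 0:
        raise ValueError("bitmap slots start from zero")
    return FIRST_BITMAP_OFFSET + slot * BITMAP_STRIDE
```

For slot 3 this gives 32k, which matches the last cell of the third row in the diagram.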
+During "normal" functioning we assume the filesystem ensures that only one
+node writes to any given block at a time, so a write request will:
+ - set the appropriate bit (if not already set)
+ - commit the write to all mirrors
+ - schedule the bit to be cleared after a timeout.
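The three steps of a write request can be sketched as a small in-memory model. This is an illustration only, not the md implementation: the class name, the chunk size, the clear delay and the use of dicts as mirrors are all assumptions made for the example.

```python
class WriteIntentBitmap:
    """Toy model of one node's write-intent bitmap (hypothetical names)."""

    def __init__(self, chunk_size=64, clear_delay=5.0):
        self.chunk_size = chunk_size    # blocks covered by one bit (assumed)
        self.clear_delay = clear_delay  # seconds before a bit may be cleared (assumed)
        self.clear_at = {}              # bit -> earliest clear time; key present means bit set

    def write(self, mirrors, block, data, now):
        bit = block // self.chunk_size
        self.clear_at[bit] = None       # set the bit; no clear scheduled while the write is in flight
        for mirror in mirrors:          # commit the write to all mirrors
            mirror[block] = data
        self.clear_at[bit] = now + self.clear_delay  # schedule the bit to be cleared

    def expire(self, now):
        """Clear every bit whose timeout has passed."""
        due = [b for b, t in self.clear_at.items() if t is not None and t <= now]
        for bit in due:
            del self.clear_at[bit]

    def is_set(self, bit):
        return bit in self.clear_at
```

A second write to the same chunk before the timeout simply pushes the clear time back, which is why the first step only sets the bit "if not already set".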
+Reads are just handled normally.  It is up to the filesystem to
+ensure one node doesn't read from a location where another node (or the same
+node) is writing.
+2. DLM Locks for management
+There are two locks for managing the device:
+2.1 Bitmap lock resource (bm_lockres)
+ The bm_lockres protects individual node bitmaps. They are named in the
+ form bitmap001 for node 1, bitmap002 for node 2, and so on. When a node
+ joins the cluster, it acquires the lock in PW mode and holds it for as
+ long as the node is part of the cluster. The lock resource
+ number is based on the slot number returned by the DLM subsystem. Since
+ DLM starts node count from one and bitmap slots start from zero, one is
+ subtracted from the DLM slot number to arrive at the bitmap slot number.
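The naming and the off-by-one conversion can be sketched as follows. The helper names are hypothetical, and the example assumes the number in the lock name is the one-based node number, as in the "bitmap001 for node 1" example above:

```python
def bitmap_slot(dlm_slot: int) -> int:
    """Convert a one-based DLM slot number to a zero-based bitmap slot."""
    if dlm_slot < 1:
        raise ValueError("DLM slot numbers start from one")
    return dlm_slot - 1


def bitmap_lock_name(node: int) -> str:
    """Lock resource name in the three-digit form used in the text."""
    return f"bitmap{node:03d}"
```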
+3. Communication
+Each node has to communicate with other nodes when starting or ending a
+resync, and when updating the metadata superblock.
+3.1 Message Types
+ There are 3 types of messages which are passed:
+ 3.1.1 METADATA_UPDATED: informs other nodes that the metadata has been
+   updated, and that each node must re-read the md superblock. This is
+   performed synchronously.
+ 3.1.2 RESYNC: informs other nodes that a resync is initiated or ended
+   so that each node may suspend or resume the region.
+3.2 Communication mechanism
+ The DLM LVB (lock value block) is used to communicate between the nodes
+ of the cluster. Three resources are used for the purpose:
+  3.2.1 Token: The resource which protects the entire communication
+   system. The node having the token resource is allowed to
+   communicate.
+  3.2.2 Message: The lock resource which carries the data to
+   communicate.
+  3.2.3 Ack: The resource whose acquisition means the message has been
+   acknowledged by all nodes in the cluster. The BAST of the resource
+   is used to inform the receiving node that a node wants to communicate.
+The algorithm is:
+ 1. receive status
+   sender                         receiver                   receiver
+   ACK:CR                          ACK:CR                     ACK:CR
+ 2. sender get EX of TOKEN
+    sender get EX of MESSAGE
+    sender                        receiver                 receiver
+    TOKEN:EX                       ACK:CR                   ACK:CR
+    MESSAGE:EX
+    ACK:CR
+    Sender checks that it still needs to send a message. Messages received
+    or other events that happened while waiting for the TOKEN may have made
+    this message inappropriate or redundant.
+ 3. sender write LVB.
+    sender down-convert MESSAGE from EX to CR
+    sender try to get EX of ACK
+    [ wait until all receivers have *processed* the MESSAGE ]
+                                     [ triggered by bast of ACK ]
+                                     receiver get CR of MESSAGE
+                                     receiver read LVB
+                                     receiver processes the message
+                                     [ wait finish ]
+                                     receiver release ACK
+   sender                         receiver                   receiver
+   TOKEN:EX                       MESSAGE:CR                 MESSAGE:CR
+   MESSAGE:CR
+   ACK:EX
+ 4. triggered by grant of EX on ACK (indicating all receivers have processed
+    message)
+    sender down-convert ACK from EX to CR
+    sender release MESSAGE
+    sender release TOKEN
+                               receiver upconvert to EX of MESSAGE
+                               receiver get CR of ACK
+                               receiver release MESSAGE
+   sender                      receiver                   receiver
+   ACK:CR                       ACK:CR                     ACK:CR
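The lock transitions above can be traced with a toy, single-process model. It is a sketch under stated assumptions, not the DLM API: the Resource class and send() are hypothetical, only the CR and EX modes are modelled, blocking waits and BASTs are replaced by explicit checks, and the receivers' up-convert of MESSAGE to EX in step 4 is left out.

```python
def compatible(a, b):
    """CR is compatible with CR; EX conflicts with any other held lock."""
    return a != "EX" and b != "EX"


class Resource:
    def __init__(self, name):
        self.name = name
        self.holders = {}   # node -> granted mode
        self.lvb = None     # lock value block carried by the resource

    def try_lock(self, node, mode):
        """Grant or convert node's lock if it conflicts with no other holder."""
        others = [m for n, m in self.holders.items() if n != node]
        if all(compatible(mode, m) for m in others):
            self.holders[node] = mode
            return True
        return False

    def unlock(self, node):
        self.holders.pop(node, None)


def send(sender, receivers, payload):
    token, message, ack = Resource("TOKEN"), Resource("MESSAGE"), Resource("ACK")
    # 1. receive status: every node holds ACK in CR
    for node in [sender, *receivers]:
        assert ack.try_lock(node, "CR")
    # 2. sender gets EX of TOKEN and MESSAGE
    assert token.try_lock(sender, "EX")
    assert message.try_lock(sender, "EX")
    # 3. sender writes the LVB, down-converts MESSAGE and asks for EX of ACK
    message.lvb = payload
    assert message.try_lock(sender, "CR")
    received = {}
    for node in receivers:
        # EX of ACK stays blocked while any receiver still holds CR
        assert not ack.try_lock(sender, "EX")
        assert message.try_lock(node, "CR")   # receiver gets CR of MESSAGE
        received[node] = message.lvb          # receiver reads the LVB
        ack.unlock(node)                      # receiver releases ACK
    assert ack.try_lock(sender, "EX")         # all receivers have processed it
    # 4. sender returns to the receive status
    assert ack.try_lock(sender, "CR")
    message.unlock(sender)
    token.unlock(sender)
    for node in receivers:
        assert ack.try_lock(node, "CR")       # receiver gets CR of ACK
        message.unlock(node)                  # receiver releases MESSAGE
    return received, ack.holders, message.holders, token.holders
```

Calling send("n1", ["n2", "n3"], b"RESYNC") leaves every node holding ACK in CR again, which is the receive status the walkthrough starts from.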
+4. Handling Failures
+4.1 Node Failure
+ When a node fails, the DLM informs the cluster of the failed node's slot.
+ The node receiving the notification starts a cluster recovery thread. The
+ cluster recovery thread:
+        - acquires the bitmap<number> lock of the failed node
+        - opens the bitmap
+        - reads the bitmap of the failed node
+        - copies the set bits to the local node's bitmap
+        - clears the bitmap of the failed node
+        - releases bitmap<number> lock of the failed node
+        - initiates resync of the bitmap on the current node
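The bitmap handling in the recovery steps amounts to a bitwise OR followed by a clear. A minimal sketch, assuming each bitmap is modelled as a Python integer bitmask and leaving out the locking and the resync itself; the function name is hypothetical:

```python
def recover_failed_node(bitmaps, failed_slot, local_slot):
    """Merge the failed node's set bits into the local bitmap.

    bitmaps maps a bitmap slot to an int bitmask. Returns the bits that
    were taken over, i.e. the regions the local node now has to resync.
    """
    failed_bits = bitmaps[failed_slot]   # read the bitmap of the failed node
    bitmaps[local_slot] |= failed_bits   # copy the set bits to the local node
    bitmaps[failed_slot] = 0             # clear the bitmap of the failed node
    return failed_bits
```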
+ The resync process is the regular md resync. However, in a clustered
+ environment, the node performing a resync needs to tell the other nodes
+ which areas are suspended. Before a resync starts, the node
+ sends out RESYNC_START with the (lo,hi) range of the area which needs
+ to be suspended. Each node maintains a suspend_list, which contains
+ the list of ranges which are currently suspended. On receiving
+ RESYNC_START, the node adds the range to the suspend_list. Similarly,
+ when the node performing the resync finishes, it sends RESYNC_FINISHED
+ to the other nodes, and they remove the corresponding entry from
+ the suspend_list.
+ A helper function, should_suspend(), can be used to check if a particular
+ I/O range should be suspended or not.
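One way to picture the suspend_list and should_suspend() is as a list of ranges with an overlap test. This is a sketch, not the kernel data structure: it assumes inclusive (lo, hi) bounds, assumes RESYNC_FINISHED names the same range as the matching RESYNC_START, and uses a hypothetical SuspendList class.

```python
class SuspendList:
    """Toy model of a node's list of suspended ranges (hypothetical class)."""

    def __init__(self):
        self.ranges = []

    def resync_start(self, lo, hi):
        """Handle RESYNC_START: remember the suspended range."""
        self.ranges.append((lo, hi))

    def resync_finished(self, lo, hi):
        """Handle RESYNC_FINISHED: drop the corresponding entry, if any."""
        if (lo, hi) in self.ranges:
            self.ranges.remove((lo, hi))

    def should_suspend(self, lo, hi):
        """True if the I/O range [lo, hi] overlaps any suspended range."""
        return any(lo <= s_hi and s_lo <= hi for s_lo, s_hi in self.ranges)
```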
+4.2 Device Failure
+ Device failures are handled and communicated with the metadata update
+ routine.
+5. Adding a new Device
+For adding a new device, it is necessary that all nodes "see" the new device
+to be added. For this, the following algorithm is used:
+    1. Node 1 issues mdadm --manage /dev/mdX --add /dev/sdYY which issues
+       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CLUSTER_ADD)
+    2. Node 1 sends NEWDISK with uuid and slot number
+    3. Other nodes issue kobject_uevent_env with uuid and slot number
+       (Steps 4,5 could be a udev rule)
+    4. In userspace, the node searches for the disk, perhaps
+       using blkid -t SUB_UUID=""
+    5. Other nodes issue either of the following depending on whether the disk
+       was found:
+       ioctl(ADD_NEW_DISK with disc.state set to MD_DISK_CANDIDATE and
+                disc.number set to slot number)
+       ioctl(CLUSTERED_DISK_NACK)
+    6. Other nodes drop lock on no-new-devs (CR) if device is found
+    7. Node 1 attempts EX lock on no-new-devs
+    8. If node 1 gets the lock, it sends METADATA_UPDATED after unmarking the disk
+       as SpareLocal
+    9. If node 1 does not get the no-new-devs lock, it fails the operation and
+       sends METADATA_UPDATED
+    10. Other nodes learn whether the disk was added from the
+        METADATA_UPDATED message that follows.
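The outcome of steps 6 to 9 depends only on whether every other node found the disk and dropped its CR lock on no-new-devs. A minimal sketch of that decision, assuming each other node holds no-new-devs in CR beforehand; the function name and its input are hypothetical:

```python
def node1_adds_disk(found_on_node):
    """Decide whether node 1 can add the disk.

    found_on_node maps each other node to True if it found the new disk.
    """
    cr_holders = set(found_on_node)      # assumed: every other node holds CR
    for node, found in found_on_node.items():
        if found:
            cr_holders.discard(node)     # step 6: drop the lock if the disk was found
    return not cr_holders                # step 7: EX is granted only if no CR remains
```

In either case node 1 sends METADATA_UPDATED afterwards, so the return value only tells which of steps 8 and 9 applies.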