author     Linus Torvalds <torvalds@linux-foundation.org>  2013-07-11 16:05:40 -0400
committer  Linus Torvalds <torvalds@linux-foundation.org>  2013-07-11 16:05:40 -0400
commit     9903883f1dd6e86f286b7bfa6e4b423f98c1cd9e
tree       63c907110eac32c31a1786ebff3e7d9257e61c9b /Documentation
parent     36805aaea5ae3cf1bb32f1643e0a800bb69f0d5b
parent     9d0eb0ab432aaa9160cf2675aee73b3900b9bc18
Merge tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm

Pull device-mapper changes from Alasdair G Kergon:
 "Add a device-mapper target called dm-switch to provide a multipath
  framework for storage arrays that dynamically reconfigure their
  preferred paths for different device regions.

  Fix a bug in the verity target that prevented its use with some
  specific sizes of devices.

  Improve some locking mechanisms in the device-mapper core and bufio.

  Add Mike Snitzer as a device-mapper maintainer.

  A few more clean-ups and fixes"

* tag 'dm-3.11-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm:
  dm: add switch target
  dm: update maintainers
  dm: optimize reorder structure
  dm: optimize use SRCU and RCU
  dm bufio: submit writes outside lock
  dm cache: fix arm link errors with inline
  dm verity: use __ffs and __fls
  dm flakey: correct ctr alloc failure mesg
  dm verity: remove pointless comparison
  dm: use __GFP_HIGHMEM in __vmalloc
  dm verity: fix inability to use a few specific devices sizes
  dm ioctl: set noio flag to avoid __vmalloc deadlock
  dm mpath: fix ioctl deadlock when no paths

Diffstat (limited to 'Documentation')
-rw-r--r-- | Documentation/device-mapper/switch.txt | 126
1 files changed, 126 insertions, 0 deletions
diff --git a/Documentation/device-mapper/switch.txt b/Documentation/device-mapper/switch.txt
new file mode 100644
index 000000000000..2fa749387be8
--- /dev/null
+++ b/Documentation/device-mapper/switch.txt
@@ -0,0 +1,126 @@
dm-switch
=========

The device-mapper switch target creates a device that supports an
arbitrary mapping of fixed-size regions of I/O across a fixed set of
paths. The path used for any specific region can be switched
dynamically by sending the target a message.

It maps I/O to underlying block devices efficiently when there is a large
number of fixed-size address regions but there is no simple pattern
that would allow for a compact representation of the mapping, such as
the one dm-stripe uses.

Background
----------

Dell EqualLogic and some other iSCSI storage arrays use a distributed
frameless architecture. In this architecture, the storage group
consists of a number of distinct storage arrays ("members"), each having
independent controllers, disk storage and network adapters. When a LUN
is created, it is spread across multiple members. The details of the
spreading are hidden from initiators connected to this storage system.
The storage group exposes a single target discovery portal, no matter
how many members are being used. When iSCSI sessions are created, each
session is connected to an Ethernet port on a single member. Data to a
LUN can be sent on any iSCSI session, and if the blocks being accessed
are stored on another member, the I/O will be forwarded as required. This
forwarding is invisible to the initiator. The storage layout is also
dynamic, and the blocks stored on disk may be moved from member to
member as needed to balance the load.

This architecture simplifies the management and configuration of both
the storage group and initiators. In a multipathing configuration, it
is possible to set up multiple iSCSI sessions to use multiple network
interfaces on both the host and target to take advantage of the
increased network bandwidth. An initiator could use a simple
round-robin algorithm to send I/O across all paths and let the storage
array members forward it as necessary, but there is a performance
advantage to sending data directly to the correct member.

A device-mapper table already lets you map different regions of a
device onto different targets. However, in this architecture the LUN is
spread with an address region size on the order of tens of megabytes,
which means the resulting table could have more than a million entries
and consume far too much memory.

Using this device-mapper switch target we can now build a two-layer
device hierarchy:

    Upper Tier – Determine which array member the I/O should be sent to.
    Lower Tier – Load balance amongst paths to a particular member.

The lower tier consists of a single dm multipath device for each member.
Each of these multipath devices contains the set of paths directly to
the array member in one priority group, and leverages existing path
selectors to load balance amongst these paths. We also build a
non-preferred priority group containing paths to other array members for
failover.

The upper tier consists of a single dm-switch device. This device uses
a bitmap to look up the location of the I/O and choose the appropriate
lower tier device to route the I/O. By using a bitmap we are able to
use 4 bits for each address range in a 16-member group (which is very
large for us). This is a much denser representation than the dm table
b-tree can achieve.
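
To make the density argument concrete, here is a minimal sketch of a packed
region table that stores one 4-bit path number per region. It is
illustrative only: the class name PackedRegionTable, the two-entries-per-byte
layout, and the example sizes (a 16 TiB LUN with 16 MiB regions) are
assumptions made for this sketch, not the kernel's actual data structure.

```python
class PackedRegionTable:
    """Illustrative region table: one 4-bit path number per region."""

    def __init__(self, num_regions):
        self.num_regions = num_regions
        # Two 4-bit entries fit in each byte; round up for an odd count.
        self.data = bytearray((num_regions + 1) // 2)

    def _check(self, region):
        if not 0 <= region < self.num_regions:
            raise IndexError("region out of range")

    def set(self, region, path_nr):
        self._check(region)
        if not 0 <= path_nr < 16:
            raise ValueError("path number must fit in 4 bits")
        byte, high = divmod(region, 2)
        shift = 4 * high
        # Clear the old 4-bit entry, then store the new path number.
        cleared = self.data[byte] & ~(0xF << shift) & 0xFF
        self.data[byte] = cleared | (path_nr << shift)

    def get(self, region):
        self._check(region)
        byte, high = divmod(region, 2)
        return (self.data[byte] >> (4 * high)) & 0xF


# Assumed sizes for illustration only.
LUN_BYTES = 16 * 2**40      # 16 TiB LUN
REGION_BYTES = 16 * 2**20   # 16 MiB regions
num_regions = LUN_BYTES // REGION_BYTES

table = PackedRegionTable(num_regions)
table.set(5, 3)             # route region 5 to path 3
table_bytes = len(table.data)
```

With these assumed sizes the table covers 1,048,576 regions in 512 KiB,
whereas an ordinary device-mapper table could need up to one target line
per region.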

Construction Parameters
=======================

    <num_paths> <region_size> <num_optional_args> [<optional_args>...]
    [<dev_path> <offset>]+

<num_paths>
    The number of paths across which to distribute the I/O.

<region_size>
    The number of 512-byte sectors in a region. Each region can be redirected
    to any of the available paths.

<num_optional_args>
    The number of optional arguments. Currently, no optional arguments
    are supported and so this must be zero.

<dev_path>
    The block device that represents a specific path to the device.

<offset>
    The offset of the start of data on the specific <dev_path> (in units
    of 512-byte sectors). This number is added to the sector number when
    forwarding the request to the specific path. Typically it is zero.
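
The parameter order above can be sketched as a small helper that assembles
a table line for dmsetup. The function name switch_table_line, its
arguments, and the example length of 2097152 sectors are hypothetical; only
the parameter layout is taken from the description above.

```python
def switch_table_line(start, length, region_size, paths):
    """Build a dm-switch table line (hypothetical helper).

    start, length -- start and length of the target in 512-byte sectors
    region_size   -- number of 512-byte sectors per region
    paths         -- list of (dev_path, offset) tuples, one per path
    """
    if region_size <= 0:
        raise ValueError("region_size must be positive")
    if not paths:
        raise ValueError("at least one path is required")
    num_optional_args = 0  # no optional arguments are currently supported
    fields = [start, length, "switch", len(paths), region_size,
              num_optional_args]
    for dev_path, offset in paths:
        fields += [dev_path, offset]
    return " ".join(str(field) for field in fields)


# Assumed example: three paths, 128-sector regions, 2097152-sector device.
line = switch_table_line(
    0, 2097152, 128,
    [("/dev/vg1/switch0", 0), ("/dev/vg1/switch1", 0), ("/dev/vg1/switch2", 0)],
)
```

The resulting string could be passed to dmsetup create via --table.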

Messages
========

set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>...

Modify the region table by specifying which regions are redirected to
which paths.

<index>
    The region number (region size was specified in constructor parameters).
    If index is omitted, the next region (previous index + 1) is used.
    Expressed in hexadecimal (WITHOUT any prefix like 0x).

<path_nr>
    The path number in the range 0 ... (<num_paths> - 1).
    Expressed in hexadecimal (WITHOUT any prefix like 0x).
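
The message format above can be sketched as an encoder that turns
(region, path) pairs into set_region_mappings arguments. The function name
encode_region_mappings is hypothetical; the hexadecimal formatting without
a prefix and the omission of consecutive indices follow the rules stated
above.

```python
def encode_region_mappings(mappings):
    """Encode (region, path_nr) pairs as set_region_mappings arguments.

    Both numbers are written in hexadecimal without a 0x prefix. When a
    region directly follows the previous one, its index is omitted.
    """
    args = []
    previous = None
    for region, path_nr in mappings:
        if region < 0 or path_nr < 0:
            raise ValueError("region and path number must be non-negative")
        if previous is not None and region == previous + 1:
            args.append(":%x" % path_nr)
        else:
            args.append("%x:%x" % (region, path_nr))
        previous = region
    return " ".join(args)


consecutive = encode_region_mappings(
    [(0, 0), (1, 1), (2, 2), (3, 0), (4, 1), (5, 2), (6, 1)]
)
sparse = encode_region_mappings([(10, 1), (11, 2), (255, 10)])
```

Note that region 10 is written as "a" and region 255 as "ff", since decimal
values would be misread by the target.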

Status
======

No status line is reported.

Example
=======

Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with
the same size.

Create a switch device with 64 KiB region size (128 sectors of 512 bytes):
    dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0`
        switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0"

Set mappings for the first 7 entries to point to devices switch0, switch1,
switch2, switch0, switch1, switch2, switch1:
    dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1
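
To see how these mappings route I/O, the sketch below resolves a sector on
the switch device to an underlying device. It assumes that the region
number is the sector number divided by the region size, and that the
path's offset is added to the sector number, as described under
Construction Parameters. The names route, PATHS and REGION_TO_PATH are
illustrative.

```python
REGION_SIZE = 128  # sectors per region, as in the table above
PATHS = [
    ("/dev/vg1/switch0", 0),
    ("/dev/vg1/switch1", 0),
    ("/dev/vg1/switch2", 0),
]
# Mappings set by the message above: region index -> path number.
# Regions beyond the first seven are left out because the example
# does not specify them.
REGION_TO_PATH = {0: 0, 1: 1, 2: 2, 3: 0, 4: 1, 5: 2, 6: 1}


def route(sector):
    """Return (device, sector on that device) for a switch-device sector."""
    region = sector // REGION_SIZE
    dev_path, offset = PATHS[REGION_TO_PATH[region]]
    return dev_path, sector + offset


first = route(0)          # region 0
middle = route(300)       # region 2
last = route(6 * 128)     # region 6
```

Under these assumptions, sector 300 falls in region 2 and goes to
/dev/vg1/switch2, while sector 768 falls in region 6 and goes to
/dev/vg1/switch1.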