Diffstat (limited to 'Documentation/admin-guide/device-mapper/switch.rst')
| -rw-r--r-- | Documentation/admin-guide/device-mapper/switch.rst | 141 |
1 files changed, 141 insertions, 0 deletions
diff --git a/Documentation/admin-guide/device-mapper/switch.rst b/Documentation/admin-guide/device-mapper/switch.rst
new file mode 100644
index 000000000000..7dde06be1a4f
--- /dev/null
+++ b/Documentation/admin-guide/device-mapper/switch.rst
| @@ -0,0 +1,141 @@ | |||
| 1 | ========= | ||
| 2 | dm-switch | ||
| 3 | ========= | ||
| 4 | |||
| 5 | The device-mapper switch target creates a device that supports an | ||
| 6 | arbitrary mapping of fixed-size regions of I/O across a fixed set of | ||
| 7 | paths. The path used for any specific region can be switched | ||
| 8 | dynamically by sending the target a message. | ||
| 9 | |||
| 10 | It maps I/O to underlying block devices efficiently when there is a large | ||
| 11 | number of fixed-size address regions but no simple pattern that would | ||
| 12 | allow a compact representation of the mapping, such as the one dm-stripe | ||
| 13 | uses. | ||
| 14 | |||
| 15 | Background | ||
| 16 | ---------- | ||
| 17 | |||
| 18 | Dell EqualLogic and some other iSCSI storage arrays use a distributed | ||
| 19 | frameless architecture. In this architecture, the storage group | ||
| 20 | consists of a number of distinct storage arrays ("members") each having | ||
| 21 | independent controllers, disk storage and network adapters. When a LUN | ||
| 22 | is created it is spread across multiple members. The details of the | ||
| 23 | spreading are hidden from initiators connected to this storage system. | ||
| 24 | The storage group exposes a single target discovery portal, no matter | ||
| 25 | how many members are being used. When iSCSI sessions are created, each | ||
| 26 | session is connected to an Ethernet port on a single member. Data to a LUN | ||
| 27 | can be sent on any iSCSI session, and if the blocks being accessed are | ||
| 28 | stored on another member the I/O will be forwarded as required. This | ||
| 29 | forwarding is invisible to the initiator. The storage layout is also | ||
| 30 | dynamic, and the blocks stored on disk may be moved from member to | ||
| 31 | member as needed to balance the load. | ||
| 32 | |||
| 33 | This architecture simplifies the management and configuration of both | ||
| 34 | the storage group and initiators. In a multipathing configuration, it | ||
| 35 | is possible to set up multiple iSCSI sessions to use multiple network | ||
| 36 | interfaces on both the host and target to take advantage of the | ||
| 37 | increased network bandwidth. An initiator could use a simple round | ||
| 38 | robin algorithm to send I/O across all paths and let the storage array | ||
| 39 | members forward it as necessary, but there is a performance advantage to | ||
| 40 | sending data directly to the correct member. | ||
| 41 | |||
| 42 | A device-mapper table already lets you map different regions of a | ||
| 43 | device onto different targets. However, in this architecture the LUN is | ||
| 44 | spread with an address region size on the order of tens of megabytes, | ||
| 45 | which means the resulting table could have more than a million entries | ||
| 46 | and consume far too much memory. | ||
| 47 | |||
| 48 | Using this device-mapper switch target we can now build a two-layer | ||
| 49 | device hierarchy: | ||
| 50 | |||
| 51 | Upper Tier - Determine which array member the I/O should be sent to. | ||
| 52 | Lower Tier - Load balance amongst paths to a particular member. | ||
| 53 | |||
| 54 | The lower tier consists of a single dm multipath device for each member. | ||
| 55 | Each of these multipath devices contains the set of paths directly to | ||
| 56 | the array member in one priority group, and leverages existing path | ||
| 57 | selectors to load balance amongst these paths. We also build a | ||
| 58 | non-preferred priority group containing paths to other array members for | ||
| 59 | failover reasons. | ||
| 60 | |||
| 61 | The upper tier consists of a single dm-switch device. This device uses | ||
| 62 | a bitmap to look up the location of the I/O and choose the appropriate | ||
| 63 | lower tier device to route the I/O. By using a bitmap we are able to | ||
| 64 | use 4 bits for each address range in a 16-member group (which is a very | ||
| 65 | large group for this architecture). This is a much denser representation | ||
| 66 | than the dm table b-tree can achieve. | ||
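
As a rough illustration of the density argument above, the following shell sketch computes how many bits one region entry needs for a given number of paths, and how large a packed table of such entries would be. The helper names (`bits_per_region`, `table_bytes`) and the one-million-region figure are assumptions made for this example; they are not taken from the dm-switch source, and the kernel's actual packing may differ.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# bits_per_region NUM_PATHS
# Smallest number of bits (at least 1) that can hold a path index in the
# range 0 .. NUM_PATHS-1.
bits_per_region() {
    bpr_paths=$1
    bpr_bits=1
    while [ $((1 << bpr_bits)) -lt "$bpr_paths" ]; do
        bpr_bits=$((bpr_bits + 1))
    done
    echo "$bpr_bits"
}

# table_bytes NUM_REGIONS BITS_PER_REGION
# Size in bytes of a densely packed table, rounded up to a whole byte.
table_bytes() {
    echo $(( ($1 * $2 + 7) / 8 ))
}

bits_per_region 16        # prints 4, matching the 16-member case above
table_bytes 1048576 4     # prints 524288 (512 KiB for 2^20 regions)
```

Under these assumptions a million-entry mapping fits in about half a megabyte, which is the kind of saving the bitmap is meant to provide.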
| 67 | |||
| 68 | Construction Parameters | ||
| 69 | ======================= | ||
| 70 | |||
| 71 | <num_paths> <region_size> <num_optional_args> [<optional_args>...] [<dev_path> <offset>]+ | ||
| 72 | <num_paths> | ||
| 73 | The number of paths across which to distribute the I/O. | ||
| 74 | |||
| 75 | <region_size> | ||
| 76 | The number of 512-byte sectors in a region. Each region can be redirected | ||
| 77 | to any of the available paths. | ||
| 78 | |||
| 79 | <num_optional_args> | ||
| 80 | The number of optional arguments. Currently, no optional arguments | ||
| 81 | are supported and so this must be zero. | ||
| 82 | |||
| 83 | <dev_path> | ||
| 84 | The block device that represents a specific path to the device. | ||
| 85 | |||
| 86 | <offset> | ||
| 87 | The offset of the start of data on the specific <dev_path> (in units | ||
| 88 | of 512-byte sectors). This number is added to the sector number when | ||
| 89 | forwarding the request to the specific path. Typically it is zero. | ||
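
The relationship between `<region_size>`, the region index and `<offset>` can be sketched with plain shell arithmetic. This is only a model of the description above: it assumes the region index is the sector number (relative to the start of the switch device) divided by `<region_size>`, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, for illustration only.

# region_of_sector SECTOR REGION_SIZE
# Region index (decimal) containing the given 512-byte sector, assuming
# regions are laid out back to back from sector 0.
region_of_sector() {
    echo $(( $1 / $2 ))
}

# region_index_hex SECTOR REGION_SIZE
# Same index in the unprefixed hexadecimal form used by set_region_mappings.
region_index_hex() {
    printf '%x\n' "$(region_of_sector "$1" "$2")"
}

# forwarded_sector SECTOR OFFSET
# Sector on the chosen <dev_path>: the <offset> is added to the sector number.
forwarded_sector() {
    echo $(( $1 + $2 ))
}

region_of_sector 300 128       # prints 2
region_index_hex 524288 128    # prints 1000 (region 4096 in decimal)
forwarded_sector 300 0         # prints 300
```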
| 90 | |||
| 91 | Messages | ||
| 92 | ======== | ||
| 93 | |||
| 94 | set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>... | ||
| 95 | |||
| 96 | Modify the region table by specifying which regions are redirected to | ||
| 97 | which paths. | ||
| 98 | |||
| 99 | <index> | ||
| 100 | The region number (the region size was specified in the construction parameters). | ||
| 101 | If index is omitted, the next region (previous index + 1) is used. | ||
| 102 | Expressed in hexadecimal (WITHOUT any prefix like 0x). | ||
| 103 | |||
| 104 | <path_nr> | ||
| 105 | The path number in the range 0 ... (<num_paths> - 1). | ||
| 106 | Expressed in hexadecimal (WITHOUT any prefix like 0x). | ||
| 107 | |||
| 108 | R<n>,<m> | ||
| 109 | This parameter allows repetitive patterns to be loaded quickly. <n> and <m> | ||
| 110 | are hexadecimal numbers. The last <n> mappings are repeated in the next <m> | ||
| 111 | slots. | ||
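
To make the `R<n>,<m>` semantics concrete, the sketch below expands a repeat specification into the explicit list of path numbers it stands for. It models only the description above (the last `<n>` mappings are cycled through the next `<m>` slots); `expand_repeat` is a hypothetical helper and does not reproduce the kernel's argument validation.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# expand_repeat N_HEX M_HEX PATH_NR...
# Prints the given path numbers followed by M more entries that cycle
# through the last N of them. N and M are unprefixed hexadecimal.
expand_repeat() {
    er_n=$((0x$1))
    er_m=$((0x$2))
    shift 2
    # The pattern must be non-empty and no longer than the list so far.
    [ "$er_n" -ge 1 ] && [ "$er_n" -le "$#" ] || return 1
    er_out="$*"
    # Keep only the last N entries as the pattern to repeat.
    while [ "$#" -gt "$er_n" ]; do
        shift
    done
    er_i=0
    while [ "$er_i" -lt "$er_m" ]; do
        # Pick pattern element (i mod N), using 1-based positional parameters.
        eval "er_entry=\${$(( er_i % er_n + 1 ))}"
        er_out="$er_out $er_entry"
        er_i=$((er_i + 1))
    done
    echo "$er_out"
}

# "1 2" followed by R2,10: two explicit entries plus 0x10 = 16 repeated ones.
expand_repeat 2 10 1 2
```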
| 112 | |||
| 113 | Status | ||
| 114 | ====== | ||
| 115 | |||
| 116 | No status line is reported. | ||
| 117 | |||
| 118 | Example | ||
| 119 | ======= | ||
| 120 | |||
| 121 | Assume that you have volumes vg1/switch0, vg1/switch1 and vg1/switch2, all of | ||
| 122 | the same size. | ||
| 123 | |||
| 124 | Create a switch device with a 64 KiB region size (128 sectors of 512 bytes):: | ||
| 125 | |||
| 126 | dmsetup create switch --table "0 `blockdev --getsz /dev/vg1/switch0` | ||
| 127 | switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0" | ||
| 128 | |||
| 129 | Set mappings for the first 7 entries to point to devices switch0, switch1, | ||
| 130 | switch2, switch0, switch1, switch2, switch1:: | ||
| 131 | |||
| 132 | dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1 | ||
| 133 | |||
| 134 | Set a repetitive mapping. This command:: | ||
| 135 | |||
| 136 | dmsetup message switch 0 set_region_mappings 1000:1 :2 R2,10 | ||
| 137 | |||
| 138 | is equivalent to:: | ||
| 139 | |||
| 140 | dmsetup message switch 0 set_region_mappings 1000:1 :2 :1 :2 :1 :2 :1 :2 \ | ||
| 141 | :1 :2 :1 :2 :1 :2 :1 :2 :1 :2 | ||
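
Building on the message syntax shown in these examples, a simple round-robin layout can be generated rather than typed by hand. The sketch below is one possible way to do it, assuming regions are assigned to paths 0, 1, 2, ... in turn; `round_robin_args` is a hypothetical helper, and it only prints the argument string without calling `dmsetup`.

```shell
#!/bin/sh
# Hypothetical helper, for illustration only.

# round_robin_args NUM_PATHS NUM_REGIONS
# Prints set_region_mappings arguments that assign regions 0..NUM_REGIONS-1
# to paths in round-robin order. All numbers are emitted in unprefixed hex.
round_robin_args() {
    rr_paths=$1
    rr_regions=$2
    [ "$rr_paths" -ge 1 ] && [ "$rr_regions" -ge 1 ] || return 1
    # Explicit entries: one per path, but never more than there are regions.
    rr_first=$rr_paths
    if [ "$rr_first" -gt "$rr_regions" ]; then
        rr_first=$rr_regions
    fi
    rr_args="0:0"
    rr_p=1
    while [ "$rr_p" -lt "$rr_first" ]; do
        rr_args="$rr_args $(printf ':%x' "$rr_p")"
        rr_p=$((rr_p + 1))
    done
    # Cover the remaining regions with a single R<n>,<m> repeat.
    rr_rest=$((rr_regions - rr_first))
    if [ "$rr_rest" -gt 0 ]; then
        rr_args="$rr_args $(printf 'R%x,%x' "$rr_paths" "$rr_rest")"
    fi
    echo "$rr_args"
}

round_robin_args 3 7    # prints: 0:0 :1 :2 R3,4
```

The output could then be passed to the message shown above, for example `dmsetup message switch 0 set_region_mappings $(round_robin_args 3 7)`, which would map the first 7 regions to paths 0, 1, 2, 0, 1, 2, 0.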
