author     Linus Torvalds <torvalds@linux-foundation.org>  2018-01-31 14:05:47 -0500
committer  Linus Torvalds <torvalds@linux-foundation.org>  2018-01-31 14:05:47 -0500
commit     0be600a5add76e8e8b9e1119f2a7426ff849aca8 (patch)
tree       d5fcc2b119f03143f9bed1b9aa5cb85458c8bd03
parent     040639b7fcf73ee39c15d38257f652a2048e96f2 (diff)
parent     9614e2ba9161c7f5419f4212fa6057d2a65f6ae6 (diff)
Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:

 - DM core fixes to ensure that bio submission follows a depth-first
   tree walk; this is critical to allow forward progress without the
   need to use the bioset's BIOSET_NEED_RESCUER.

 - Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.

 - DM core cleanups and improvements to make bio-based DM more
   efficient (e.g. reduced memory footprint as well as leveraging
   per-bio-data more).

 - Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
   the more direct IO submission path in the block layer; this mode is
   used by DM multipath and also optimizes targets like DM thin-pool
   that stack directly on an NVMe data device.

 - DM multipath improvements to factor out legacy SCSI-only
   (e.g. scsi_dh) code paths to allow for more optimized support for
   NVMe multipath.

 - A fix for DM multipath path selectors (service-time and
   queue-length) to select paths in a more balanced way; largely
   academic but doesn't hurt.

 - Numerous DM raid target fixes and improvements.

 - Add a new DM "unstriped" target that enables Intel to work around
   firmware limitations in some NVMe drives that are striped internally
   (this target also works when stacked above the DM "striped" target).

 - Various Documentation fixes and improvements.

 - Misc cleanups and fixes across various DM infrastructure and targets
   (e.g. bufio, flakey, log-writes, snapshot).

* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
  dm cache: Documentation: update default migration_throttling value
  dm mpath selector: more evenly distribute ties
  dm unstripe: fix target length versus number of stripes size check
  dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
  dm table: fix NVMe bio-based dm_table_determine_type() validation
  dm: various cleanups to md->queue initialization code
  dm mpath: delay the retry of a request if the target responded as busy
  dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
  dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
  dm log writes: fix max length used for kstrndup
  dm: backfill missing calls to mutex_destroy()
  dm snapshot: use mutex instead of rw_semaphore
  dm flakey: check for null arg_name in parse_features()
  dm thin: extend thinpool status format string with omitted fields
  dm thin: fixes in thin-provisioning.txt
  dm thin: document representation of <highest mapped sector> when there is none
  dm thin: fix documentation relative to low water mark threshold
  dm cache: be consistent in specifying sectors and SI units in cache.txt
  dm cache: delete obsoleted paragraph in cache.txt
  dm cache: fix grammar in cache-policies.txt
  ...
-rw-r--r--  Documentation/device-mapper/cache-policies.txt     4
-rw-r--r--  Documentation/device-mapper/cache.txt               9
-rw-r--r--  Documentation/device-mapper/dm-raid.txt             5
-rw-r--r--  Documentation/device-mapper/snapshot.txt            4
-rw-r--r--  Documentation/device-mapper/thin-provisioning.txt  14
-rw-r--r--  Documentation/device-mapper/unstriped.txt          124
-rw-r--r--  drivers/md/Kconfig                                   7
-rw-r--r--  drivers/md/Makefile                                  1
-rw-r--r--  drivers/md/dm-bufio.c                               37
-rw-r--r--  drivers/md/dm-core.h                                 5
-rw-r--r--  drivers/md/dm-crypt.c                                5
-rw-r--r--  drivers/md/dm-delay.c                                2
-rw-r--r--  drivers/md/dm-flakey.c                               5
-rw-r--r--  drivers/md/dm-io.c                                   3
-rw-r--r--  drivers/md/dm-kcopyd.c                               6
-rw-r--r--  drivers/md/dm-log-writes.c                           2
-rw-r--r--  drivers/md/dm-mpath.c                              297
-rw-r--r--  drivers/md/dm-queue-length.c                         6
-rw-r--r--  drivers/md/dm-raid.c                               380
-rw-r--r--  drivers/md/dm-rq.c                                   6
-rw-r--r--  drivers/md/dm-service-time.c                         6
-rw-r--r--  drivers/md/dm-snap.c                                84
-rw-r--r--  drivers/md/dm-stats.c                                1
-rw-r--r--  drivers/md/dm-table.c                              114
-rw-r--r--  drivers/md/dm-thin.c                                 9
-rw-r--r--  drivers/md/dm-unstripe.c                           219
-rw-r--r--  drivers/md/dm-zoned-metadata.c                       3
-rw-r--r--  drivers/md/dm-zoned-target.c                         3
-rw-r--r--  drivers/md/dm.c                                    659
-rw-r--r--  drivers/md/dm.h                                      4
-rw-r--r--  include/linux/device-mapper.h                       56
31 files changed, 1409 insertions, 671 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt
index d3ca8af21a31..86786d87d9a8 100644
--- a/Documentation/device-mapper/cache-policies.txt
+++ b/Documentation/device-mapper/cache-policies.txt
@@ -60,7 +60,7 @@ Memory usage:
60The mq policy used a lot of memory; 88 bytes per cache block on a 64 60The mq policy used a lot of memory; 88 bytes per cache block on a 64
61bit machine. 61bit machine.
62 62
63smq uses 28bit indexes to implement it's data structures rather than 63smq uses 28bit indexes to implement its data structures rather than
64pointers. It avoids storing an explicit hit count for each block. It 64pointers. It avoids storing an explicit hit count for each block. It
65has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of 65has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
66the entries (each hotspot block covers a larger area than a single 66the entries (each hotspot block covers a larger area than a single
@@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions.
84 84
85Adaptability: 85Adaptability:
86The mq policy maintained a hit count for each cache block. For a 86The mq policy maintained a hit count for each cache block. For a
87different block to get promoted to the cache it's hit count has to 87different block to get promoted to the cache its hit count has to
88exceed the lowest currently in the cache. This meant it could take a 88exceed the lowest currently in the cache. This meant it could take a
89long time for the cache to adapt between varying IO patterns. 89long time for the cache to adapt between varying IO patterns.
90 90
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
index cdfd0feb294e..ff0841711fd5 100644
--- a/Documentation/device-mapper/cache.txt
+++ b/Documentation/device-mapper/cache.txt
@@ -59,7 +59,7 @@ Fixed block size
59The origin is divided up into blocks of a fixed size. This block size 59The origin is divided up into blocks of a fixed size. This block size
60is configurable when you first create the cache. Typically we've been 60is configurable when you first create the cache. Typically we've been
61using block sizes of 256KB - 1024KB. The block size must be between 64 61using block sizes of 256KB - 1024KB. The block size must be between 64
62(32KB) and 2097152 (1GB) and a multiple of 64 (32KB). 62sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
63 63
64Having a fixed block size simplifies the target a lot. But it is 64Having a fixed block size simplifies the target a lot. But it is
65something of a compromise. For instance, a small part of a block may be 65something of a compromise. For instance, a small part of a block may be
@@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments.
119 119
120For the time being, a message "migration_threshold <#sectors>" 120For the time being, a message "migration_threshold <#sectors>"
121can be used to set the maximum number of sectors being migrated, 121can be used to set the maximum number of sectors being migrated,
122the default being 204800 sectors (or 100MB). 122the default being 2048 sectors (1MB).
123 123
124Updating on-disk metadata 124Updating on-disk metadata
125------------------------- 125-------------------------
@@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small. Like the
143dirty flags this data is lost if there's a crash so a safe fallback 143dirty flags this data is lost if there's a crash so a safe fallback
144value should always be possible. 144value should always be possible.
145 145
146For instance, the 'mq' policy, which is currently the default policy,
147uses this facility to store the hit count of the cache blocks. If
148there's a crash this information will be lost, which means the cache
149may be less efficient until those hit counts are regenerated.
150
151Policy hints affect performance, not correctness. 146Policy hints affect performance, not correctness.
152 147
153Policy messaging 148Policy messaging
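The migration_threshold knob discussed in the hunk above is adjusted at
runtime with a target message; a small sketch (cache device name
hypothetical):

  # raise the migration ceiling from the 2048-sector default back to 100MB
  dmsetup message my-cache 0 migration_threshold 204800
  # the current value should be reported among the core args of:
  dmsetup status my-cache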
diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt
index 32df07e29f68..390c145f01d7 100644
--- a/Documentation/device-mapper/dm-raid.txt
+++ b/Documentation/device-mapper/dm-raid.txt
@@ -343,5 +343,8 @@ Version History
3431.11.0 Fix table line argument order 3431.11.0 Fix table line argument order
344 (wrong raid10_copies/raid10_format sequence) 344 (wrong raid10_copies/raid10_format sequence)
3451.11.1 Add raid4/5/6 journal write-back support via journal_mode option 3451.11.1 Add raid4/5/6 journal write-back support via journal_mode option
3461.12.1 fix for MD deadlock between mddev_suspend() and md_write_start() available 3461.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
3471.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A') 3471.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
3481.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an
349 state races.
3501.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt
index ad6949bff2e3..b8bbb516f989 100644
--- a/Documentation/device-mapper/snapshot.txt
+++ b/Documentation/device-mapper/snapshot.txt
@@ -49,6 +49,10 @@ The difference between persistent and transient is with transient
49snapshots less metadata must be saved on disk - they can be kept in 49snapshots less metadata must be saved on disk - they can be kept in
50memory by the kernel. 50memory by the kernel.
51 51
52When loading or unloading the snapshot target, the corresponding
53snapshot-origin or snapshot-merge target must be suspended. A failure to
54suspend the origin target could result in data corruption.
55
52 56
53* snapshot-merge <origin> <COW device> <persistent> <chunksize> 57* snapshot-merge <origin> <COW device> <persistent> <chunksize>
54 58
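The suspend requirement added above amounts to bracketing the snapshot
table load or removal with a suspend/resume of the origin mapping; a
sketch with hypothetical volume names:

  dmsetup suspend base-origin
  echo "0 2097152 snapshot /dev/vg00/base /dev/vg00/cow P 16" | dmsetup create snap0
  dmsetup resume base-origin

  # and symmetrically when tearing the snapshot down:
  dmsetup suspend base-origin
  dmsetup remove snap0
  dmsetup resume base-origin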
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
index 1699a55b7b70..4bcd4b7f79f9 100644
--- a/Documentation/device-mapper/thin-provisioning.txt
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size. If
112free space on the data device drops below this level then a dm event 112free space on the data device drops below this level then a dm event
113will be triggered which a userspace daemon should catch allowing it to 113will be triggered which a userspace daemon should catch allowing it to
114extend the pool device. Only one such event will be sent. 114extend the pool device. Only one such event will be sent.
115Resuming a device with a new table itself triggers an event so the 115
116userspace daemon can use this to detect a situation where a new table 116No special event is triggered if a just resumed device's free space is below
117already exceeds the threshold. 117the low water mark. However, resuming a device always triggers an
118event; a userspace daemon should verify that free space exceeds the low
119water mark when handling this event.
118 120
119A low water mark for the metadata device is maintained in the kernel and 121A low water mark for the metadata device is maintained in the kernel and
120will trigger a dm event if free space on the metadata device drops below 122will trigger a dm event if free space on the metadata device drops below
@@ -274,7 +276,8 @@ ii) Status
274 276
275 <transaction id> <used metadata blocks>/<total metadata blocks> 277 <transaction id> <used metadata blocks>/<total metadata blocks>
276 <used data blocks>/<total data blocks> <held metadata root> 278 <used data blocks>/<total data blocks> <held metadata root>
277 [no_]discard_passdown ro|rw 279 ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
280 needs_check|-
278 281
279 transaction id: 282 transaction id:
280 A 64-bit number used by userspace to help synchronise with metadata 283 A 64-bit number used by userspace to help synchronise with metadata
@@ -394,3 +397,6 @@ ii) Status
394 If the pool has encountered device errors and failed, the status 397 If the pool has encountered device errors and failed, the status
395 will just contain the string 'Fail'. The userspace recovery 398 will just contain the string 'Fail'. The userspace recovery
396 tools should then be used. 399 tools should then be used.
400
401 In the case where <nr mapped sectors> is 0, there is no highest
402 mapped sector and the value of <highest mapped sector> is unspecified.
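The verification recommended above, that free space still exceeds the low
water mark when the dm event fires, can be sketched against this status
line (pool name and threshold hypothetical; "dmsetup status" prefixes the
fields shown here with start, length and target name):

  POOL=pool
  LOW_WATER_MARK=1024    # data blocks, as given in the pool's table line

  read -r _ _ _ _ _ data _ < <(dmsetup status "$POOL")
  used=${data%/*} total=${data#*/}
  if [ $((total - used)) -lt "$LOW_WATER_MARK" ]; then
      echo "$POOL: free data blocks below low water mark" >&2
  fi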
diff --git a/Documentation/device-mapper/unstriped.txt b/Documentation/device-mapper/unstriped.txt
new file mode 100644
index 000000000000..0b2a306c54ee
--- /dev/null
+++ b/Documentation/device-mapper/unstriped.txt
@@ -0,0 +1,124 @@
1Introduction
2============
3
4The device-mapper "unstriped" target provides a transparent mechanism to
5unstripe a device-mapper "striped" target to access the underlying disks
6without having to touch the true backing block-device. It can also be
7used to unstripe a hardware RAID-0 to access backing disks.
8
9Parameters:
10<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
11
12<number of stripes>
13 The number of stripes in the RAID 0.
14
15<chunk size>
16 The amount of 512B sectors in the chunk striping.
17
18<dev_path>
19 The block device you wish to unstripe.
20
21<stripe #>
22 The stripe number within the device that corresponds to physical
23 drive you wish to unstripe. This must be 0 indexed.
24
25
26Why use this module?
27====================
28
29An example of undoing an existing dm-stripe
30-------------------------------------------
31
32This small bash script will setup 4 loop devices and use the existing
33striped target to combine the 4 devices into one. It then will use
34the unstriped target ontop of the striped device to access the
35individual backing loop devices. We write data to the newly exposed
36unstriped devices and verify the data written matches the correct
37underlying device on the striped array.
38
39#!/bin/bash
40
41MEMBER_SIZE=$((128 * 1024 * 1024))
42NUM=4
43SEQ_END=$((${NUM}-1))
44CHUNK=256
45BS=4096
46
47RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
48DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
49COUNT=$((${MEMBER_SIZE} / ${BS}))
50
51for i in $(seq 0 ${SEQ_END}); do
52 dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
53 losetup /dev/loop${i} member-${i}
54 DM_PARMS+=" /dev/loop${i} 0"
55done
56
57echo $DM_PARMS | dmsetup create raid0
58for i in $(seq 0 ${SEQ_END}); do
59 echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
60done;
61
62for i in $(seq 0 ${SEQ_END}); do
63 dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
64 diff /dev/mapper/set-${i} member-${i}
65done;
66
67for i in $(seq 0 ${SEQ_END}); do
68 dmsetup remove set-${i}
69done
70
71dmsetup remove raid0
72
73for i in $(seq 0 ${SEQ_END}); do
74 losetup -d /dev/loop${i}
75 rm -f member-${i}
76done
77
78Another example
79---------------
80
81Intel NVMe drives contain two cores on the physical device.
82Each core of the drive has segregated access to its LBA range.
83The current LBA model has a RAID 0 128k chunk on each core, resulting
84in a 256k stripe across the two cores:
85
86 Core 0: Core 1:
87 __________ __________
88 | LBA 512| | LBA 768|
89 | LBA 0 | | LBA 256|
90 ---------- ----------
91
92The purpose of this unstriping is to provide better QoS in noisy
93neighbor environments. When two partitions are created on the
94aggregate drive without this unstriping, reads on one partition
95can affect writes on another partition. This is because the partitions
96are striped across the two cores. When we unstripe this hardware RAID 0
97and make partitions on each new exposed device the two partitions are now
98physically separated.
99
100With the dm-unstriped target we're able to segregate an fio script that
101has read and write jobs that are independent of each other. Compared to
102when we run the test on a combined drive with partitions, we were able
103to get a 92% reduction in read latency using this device mapper target.
104
105
106Example dmsetup usage
107=====================
108
109unstriped ontop of Intel NVMe device that has 2 cores
110-----------------------------------------------------
111dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
112dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
113
114There will now be two devices that expose Intel NVMe core 0 and 1
115respectively:
116/dev/mapper/nvmset0
117/dev/mapper/nvmset1
118
119unstriped ontop of striped with 4 drives using 128K chunk size
120--------------------------------------------------------------
121dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
122dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
123dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
124dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
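To cover a whole striped device rather than the illustrative 512-sector
length used above, the target length can be derived from the member size,
e.g. (a sketch, device name hypothetical):

  DEV=/dev/nvme0n1
  STRIPES=2
  CHUNK=256                                   # 512B sectors, i.e. 128K
  LEN=$(( $(blockdev --getsz ${DEV}) / STRIPES ))   # each set gets 1/STRIPES of the capacity

  dmsetup create nvmset0 --table "0 ${LEN} unstriped ${STRIPES} ${CHUNK} 0 ${DEV} 0"
  dmsetup create nvmset1 --table "0 ${LEN} unstriped ${STRIPES} ${CHUNK} 1 ${DEV} 0"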
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 83b9362be09c..2c8ac3688815 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -269,6 +269,13 @@ config DM_BIO_PRISON
269 269
270source "drivers/md/persistent-data/Kconfig" 270source "drivers/md/persistent-data/Kconfig"
271 271
272config DM_UNSTRIPED
273 tristate "Unstriped target"
274 depends on BLK_DEV_DM
275 ---help---
276 Unstripes I/O so it is issued solely on a single drive in a HW
277 RAID0 or dm-striped target.
278
272config DM_CRYPT 279config DM_CRYPT
273 tristate "Crypt target support" 280 tristate "Crypt target support"
274 depends on BLK_DEV_DM 281 depends on BLK_DEV_DM
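The new target builds like any other DM target module; in a kernel config
fragment this is just:

  CONFIG_DM_UNSTRIPED=m

(or =y to build it in), alongside the CONFIG_BLK_DEV_DM it depends on.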
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index f701bb211783..63255f3ebd97 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_BCACHE) += bcache/
43obj-$(CONFIG_BLK_DEV_MD) += md-mod.o 43obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
44obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o 44obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
45obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o 45obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
46obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
46obj-$(CONFIG_DM_BUFIO) += dm-bufio.o 47obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
47obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o 48obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
48obj-$(CONFIG_DM_CRYPT) += dm-crypt.o 49obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index c546b567f3b5..414c9af54ded 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -662,7 +662,7 @@ static void submit_io(struct dm_buffer *b, int rw, bio_end_io_t *end_io)
662 662
663 sector = (b->block << b->c->sectors_per_block_bits) + b->c->start; 663 sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
664 664
665 if (rw != WRITE) { 665 if (rw != REQ_OP_WRITE) {
666 n_sectors = 1 << b->c->sectors_per_block_bits; 666 n_sectors = 1 << b->c->sectors_per_block_bits;
667 offset = 0; 667 offset = 0;
668 } else { 668 } else {
@@ -740,7 +740,7 @@ static void __write_dirty_buffer(struct dm_buffer *b,
740 b->write_end = b->dirty_end; 740 b->write_end = b->dirty_end;
741 741
742 if (!write_list) 742 if (!write_list)
743 submit_io(b, WRITE, write_endio); 743 submit_io(b, REQ_OP_WRITE, write_endio);
744 else 744 else
745 list_add_tail(&b->write_list, write_list); 745 list_add_tail(&b->write_list, write_list);
746} 746}
@@ -753,7 +753,7 @@ static void __flush_write_list(struct list_head *write_list)
753 struct dm_buffer *b = 753 struct dm_buffer *b =
754 list_entry(write_list->next, struct dm_buffer, write_list); 754 list_entry(write_list->next, struct dm_buffer, write_list);
755 list_del(&b->write_list); 755 list_del(&b->write_list);
756 submit_io(b, WRITE, write_endio); 756 submit_io(b, REQ_OP_WRITE, write_endio);
757 cond_resched(); 757 cond_resched();
758 } 758 }
759 blk_finish_plug(&plug); 759 blk_finish_plug(&plug);
@@ -1123,7 +1123,7 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
1123 return NULL; 1123 return NULL;
1124 1124
1125 if (need_submit) 1125 if (need_submit)
1126 submit_io(b, READ, read_endio); 1126 submit_io(b, REQ_OP_READ, read_endio);
1127 1127
1128 wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE); 1128 wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
1129 1129
@@ -1193,7 +1193,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
1193 dm_bufio_unlock(c); 1193 dm_bufio_unlock(c);
1194 1194
1195 if (need_submit) 1195 if (need_submit)
1196 submit_io(b, READ, read_endio); 1196 submit_io(b, REQ_OP_READ, read_endio);
1197 dm_bufio_release(b); 1197 dm_bufio_release(b);
1198 1198
1199 cond_resched(); 1199 cond_resched();
@@ -1454,7 +1454,7 @@ retry:
1454 old_block = b->block; 1454 old_block = b->block;
1455 __unlink_buffer(b); 1455 __unlink_buffer(b);
1456 __link_buffer(b, new_block, b->list_mode); 1456 __link_buffer(b, new_block, b->list_mode);
1457 submit_io(b, WRITE, write_endio); 1457 submit_io(b, REQ_OP_WRITE, write_endio);
1458 wait_on_bit_io(&b->state, B_WRITING, 1458 wait_on_bit_io(&b->state, B_WRITING,
1459 TASK_UNINTERRUPTIBLE); 1459 TASK_UNINTERRUPTIBLE);
1460 __unlink_buffer(b); 1460 __unlink_buffer(b);
@@ -1716,7 +1716,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
1716 if (!DM_BUFIO_CACHE_NAME(c)) { 1716 if (!DM_BUFIO_CACHE_NAME(c)) {
1717 r = -ENOMEM; 1717 r = -ENOMEM;
1718 mutex_unlock(&dm_bufio_clients_lock); 1718 mutex_unlock(&dm_bufio_clients_lock);
1719 goto bad_cache; 1719 goto bad;
1720 } 1720 }
1721 } 1721 }
1722 1722
@@ -1727,7 +1727,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
1727 if (!DM_BUFIO_CACHE(c)) { 1727 if (!DM_BUFIO_CACHE(c)) {
1728 r = -ENOMEM; 1728 r = -ENOMEM;
1729 mutex_unlock(&dm_bufio_clients_lock); 1729 mutex_unlock(&dm_bufio_clients_lock);
1730 goto bad_cache; 1730 goto bad;
1731 } 1731 }
1732 } 1732 }
1733 } 1733 }
@@ -1738,27 +1738,28 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
1738 1738
1739 if (!b) { 1739 if (!b) {
1740 r = -ENOMEM; 1740 r = -ENOMEM;
1741 goto bad_buffer; 1741 goto bad;
1742 } 1742 }
1743 __free_buffer_wake(b); 1743 __free_buffer_wake(b);
1744 } 1744 }
1745 1745
1746 c->shrinker.count_objects = dm_bufio_shrink_count;
1747 c->shrinker.scan_objects = dm_bufio_shrink_scan;
1748 c->shrinker.seeks = 1;
1749 c->shrinker.batch = 0;
1750 r = register_shrinker(&c->shrinker);
1751 if (r)
1752 goto bad;
1753
1746 mutex_lock(&dm_bufio_clients_lock); 1754 mutex_lock(&dm_bufio_clients_lock);
1747 dm_bufio_client_count++; 1755 dm_bufio_client_count++;
1748 list_add(&c->client_list, &dm_bufio_all_clients); 1756 list_add(&c->client_list, &dm_bufio_all_clients);
1749 __cache_size_refresh(); 1757 __cache_size_refresh();
1750 mutex_unlock(&dm_bufio_clients_lock); 1758 mutex_unlock(&dm_bufio_clients_lock);
1751 1759
1752 c->shrinker.count_objects = dm_bufio_shrink_count;
1753 c->shrinker.scan_objects = dm_bufio_shrink_scan;
1754 c->shrinker.seeks = 1;
1755 c->shrinker.batch = 0;
1756 register_shrinker(&c->shrinker);
1757
1758 return c; 1760 return c;
1759 1761
1760bad_buffer: 1762bad:
1761bad_cache:
1762 while (!list_empty(&c->reserved_buffers)) { 1763 while (!list_empty(&c->reserved_buffers)) {
1763 struct dm_buffer *b = list_entry(c->reserved_buffers.next, 1764 struct dm_buffer *b = list_entry(c->reserved_buffers.next,
1764 struct dm_buffer, lru_list); 1765 struct dm_buffer, lru_list);
@@ -1767,6 +1768,7 @@ bad_cache:
1767 } 1768 }
1768 dm_io_client_destroy(c->dm_io); 1769 dm_io_client_destroy(c->dm_io);
1769bad_dm_io: 1770bad_dm_io:
1771 mutex_destroy(&c->lock);
1770 kfree(c); 1772 kfree(c);
1771bad_client: 1773bad_client:
1772 return ERR_PTR(r); 1774 return ERR_PTR(r);
@@ -1811,6 +1813,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
1811 BUG_ON(c->n_buffers[i]); 1813 BUG_ON(c->n_buffers[i]);
1812 1814
1813 dm_io_client_destroy(c->dm_io); 1815 dm_io_client_destroy(c->dm_io);
1816 mutex_destroy(&c->lock);
1814 kfree(c); 1817 kfree(c);
1815} 1818}
1816EXPORT_SYMBOL_GPL(dm_bufio_client_destroy); 1819EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 6a14f945783c..3222e21cbbf8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -91,8 +91,7 @@ struct mapped_device {
91 /* 91 /*
92 * io objects are allocated from here. 92 * io objects are allocated from here.
93 */ 93 */
94 mempool_t *io_pool; 94 struct bio_set *io_bs;
95
96 struct bio_set *bs; 95 struct bio_set *bs;
97 96
98 /* 97 /*
@@ -130,8 +129,6 @@ struct mapped_device {
130 struct srcu_struct io_barrier; 129 struct srcu_struct io_barrier;
131}; 130};
132 131
133void dm_init_md_queue(struct mapped_device *md);
134void dm_init_normal_md_queue(struct mapped_device *md);
135int md_in_flight(struct mapped_device *md); 132int md_in_flight(struct mapped_device *md);
136void disable_write_same(struct mapped_device *md); 133void disable_write_same(struct mapped_device *md);
137void disable_write_zeroes(struct mapped_device *md); 134void disable_write_zeroes(struct mapped_device *md);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 2ad429100d25..8168f737590e 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2193,6 +2193,8 @@ static void crypt_dtr(struct dm_target *ti)
2193 kzfree(cc->cipher_auth); 2193 kzfree(cc->cipher_auth);
2194 kzfree(cc->authenc_key); 2194 kzfree(cc->authenc_key);
2195 2195
2196 mutex_destroy(&cc->bio_alloc_lock);
2197
2196 /* Must zero key material before freeing */ 2198 /* Must zero key material before freeing */
2197 kzfree(cc); 2199 kzfree(cc);
2198} 2200}
@@ -2702,8 +2704,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
2702 goto bad; 2704 goto bad;
2703 } 2705 }
2704 2706
2705 cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS | 2707 cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
2706 BIOSET_NEED_RESCUER));
2707 if (!cc->bs) { 2708 if (!cc->bs) {
2708 ti->error = "Cannot allocate crypt bioset"; 2709 ti->error = "Cannot allocate crypt bioset";
2709 goto bad; 2710 goto bad;
diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
index 288386bfbfb5..1783d80c9cad 100644
--- a/drivers/md/dm-delay.c
+++ b/drivers/md/dm-delay.c
@@ -229,6 +229,8 @@ static void delay_dtr(struct dm_target *ti)
229 if (dc->dev_write) 229 if (dc->dev_write)
230 dm_put_device(ti, dc->dev_write); 230 dm_put_device(ti, dc->dev_write);
231 231
232 mutex_destroy(&dc->timer_lock);
233
232 kfree(dc); 234 kfree(dc);
233} 235}
234 236
diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index b82cb1ab1eaa..1b907b15f5c3 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -70,6 +70,11 @@ static int parse_features(struct dm_arg_set *as, struct flakey_c *fc,
70 arg_name = dm_shift_arg(as); 70 arg_name = dm_shift_arg(as);
71 argc--; 71 argc--;
72 72
73 if (!arg_name) {
74 ti->error = "Insufficient feature arguments";
75 return -EINVAL;
76 }
77
73 /* 78 /*
74 * drop_writes 79 * drop_writes
75 */ 80 */
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index b4357ed4d541..a8d914d5abbe 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,8 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
58 if (!client->pool) 58 if (!client->pool)
59 goto bad; 59 goto bad;
60 60
61 client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS | 61 client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
62 BIOSET_NEED_RESCUER));
63 if (!client->bios) 62 if (!client->bios)
64 goto bad; 63 goto bad;
65 64
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index eb45cc3df31d..e6e7c686646d 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -477,8 +477,10 @@ static int run_complete_job(struct kcopyd_job *job)
477 * If this is the master job, the sub jobs have already 477 * If this is the master job, the sub jobs have already
478 * completed so we can free everything. 478 * completed so we can free everything.
479 */ 479 */
480 if (job->master_job == job) 480 if (job->master_job == job) {
481 mutex_destroy(&job->lock);
481 mempool_free(job, kc->job_pool); 482 mempool_free(job, kc->job_pool);
483 }
482 fn(read_err, write_err, context); 484 fn(read_err, write_err, context);
483 485
484 if (atomic_dec_and_test(&kc->nr_jobs)) 486 if (atomic_dec_and_test(&kc->nr_jobs))
@@ -750,6 +752,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
750 * followed by SPLIT_COUNT sub jobs. 752 * followed by SPLIT_COUNT sub jobs.
751 */ 753 */
752 job = mempool_alloc(kc->job_pool, GFP_NOIO); 754 job = mempool_alloc(kc->job_pool, GFP_NOIO);
755 mutex_init(&job->lock);
753 756
754 /* 757 /*
755 * set up for the read. 758 * set up for the read.
@@ -811,7 +814,6 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
811 if (job->source.count <= SUB_JOB_SIZE) 814 if (job->source.count <= SUB_JOB_SIZE)
812 dispatch_job(job); 815 dispatch_job(job);
813 else { 816 else {
814 mutex_init(&job->lock);
815 job->progress = 0; 817 job->progress = 0;
816 split_job(job); 818 split_job(job);
817 } 819 }
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 189badbeddaf..3362d866793b 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -594,7 +594,7 @@ static int log_mark(struct log_writes_c *lc, char *data)
594 return -ENOMEM; 594 return -ENOMEM;
595 } 595 }
596 596
597 block->data = kstrndup(data, maxsize, GFP_KERNEL); 597 block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);
598 if (!block->data) { 598 if (!block->data) {
599 DMERR("Error copying mark data"); 599 DMERR("Error copying mark data");
600 kfree(block); 600 kfree(block);
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index ef57c6d1c887..7d3e572072f5 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -64,36 +64,30 @@ struct priority_group {
64 64
65/* Multipath context */ 65/* Multipath context */
66struct multipath { 66struct multipath {
67 struct list_head list; 67 unsigned long flags; /* Multipath state flags */
68 struct dm_target *ti;
69
70 const char *hw_handler_name;
71 char *hw_handler_params;
72 68
73 spinlock_t lock; 69 spinlock_t lock;
74 70 enum dm_queue_mode queue_mode;
75 unsigned nr_priority_groups;
76 struct list_head priority_groups;
77
78 wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
79 71
80 struct pgpath *current_pgpath; 72 struct pgpath *current_pgpath;
81 struct priority_group *current_pg; 73 struct priority_group *current_pg;
82 struct priority_group *next_pg; /* Switch to this PG if set */ 74 struct priority_group *next_pg; /* Switch to this PG if set */
83 75
84 unsigned long flags; /* Multipath state flags */ 76 atomic_t nr_valid_paths; /* Total number of usable paths */
77 unsigned nr_priority_groups;
78 struct list_head priority_groups;
85 79
80 const char *hw_handler_name;
81 char *hw_handler_params;
82 wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
86 unsigned pg_init_retries; /* Number of times to retry pg_init */ 83 unsigned pg_init_retries; /* Number of times to retry pg_init */
87 unsigned pg_init_delay_msecs; /* Number of msecs before pg_init retry */ 84 unsigned pg_init_delay_msecs; /* Number of msecs before pg_init retry */
88
89 atomic_t nr_valid_paths; /* Total number of usable paths */
90 atomic_t pg_init_in_progress; /* Only one pg_init allowed at once */ 85 atomic_t pg_init_in_progress; /* Only one pg_init allowed at once */
91 atomic_t pg_init_count; /* Number of times pg_init called */ 86 atomic_t pg_init_count; /* Number of times pg_init called */
92 87
93 enum dm_queue_mode queue_mode;
94
95 struct mutex work_mutex; 88 struct mutex work_mutex;
96 struct work_struct trigger_event; 89 struct work_struct trigger_event;
90 struct dm_target *ti;
97 91
98 struct work_struct process_queued_bios; 92 struct work_struct process_queued_bios;
99 struct bio_list queued_bios; 93 struct bio_list queued_bios;
@@ -135,10 +129,10 @@ static struct pgpath *alloc_pgpath(void)
135{ 129{
136 struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL); 130 struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
137 131
138 if (pgpath) { 132 if (!pgpath)
139 pgpath->is_active = true; 133 return NULL;
140 INIT_DELAYED_WORK(&pgpath->activate_path, activate_path_work); 134
141 } 135 pgpath->is_active = true;
142 136
143 return pgpath; 137 return pgpath;
144} 138}
@@ -193,13 +187,8 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
193 if (m) { 187 if (m) {
194 INIT_LIST_HEAD(&m->priority_groups); 188 INIT_LIST_HEAD(&m->priority_groups);
195 spin_lock_init(&m->lock); 189 spin_lock_init(&m->lock);
196 set_bit(MPATHF_QUEUE_IO, &m->flags);
197 atomic_set(&m->nr_valid_paths, 0); 190 atomic_set(&m->nr_valid_paths, 0);
198 atomic_set(&m->pg_init_in_progress, 0);
199 atomic_set(&m->pg_init_count, 0);
200 m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
201 INIT_WORK(&m->trigger_event, trigger_event); 191 INIT_WORK(&m->trigger_event, trigger_event);
202 init_waitqueue_head(&m->pg_init_wait);
203 mutex_init(&m->work_mutex); 192 mutex_init(&m->work_mutex);
204 193
205 m->queue_mode = DM_TYPE_NONE; 194 m->queue_mode = DM_TYPE_NONE;
@@ -221,13 +210,26 @@ static int alloc_multipath_stage2(struct dm_target *ti, struct multipath *m)
221 m->queue_mode = DM_TYPE_MQ_REQUEST_BASED; 210 m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
222 else 211 else
223 m->queue_mode = DM_TYPE_REQUEST_BASED; 212 m->queue_mode = DM_TYPE_REQUEST_BASED;
224 } else if (m->queue_mode == DM_TYPE_BIO_BASED) { 213
214 } else if (m->queue_mode == DM_TYPE_BIO_BASED ||
215 m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
225 INIT_WORK(&m->process_queued_bios, process_queued_bios); 216 INIT_WORK(&m->process_queued_bios, process_queued_bios);
226 /* 217
227 * bio-based doesn't support any direct scsi_dh management; 218 if (m->queue_mode == DM_TYPE_BIO_BASED) {
228 * it just discovers if a scsi_dh is attached. 219 /*
229 */ 220 * bio-based doesn't support any direct scsi_dh management;
230 set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags); 221 * it just discovers if a scsi_dh is attached.
222 */
223 set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
224 }
225 }
226
227 if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
228 set_bit(MPATHF_QUEUE_IO, &m->flags);
229 atomic_set(&m->pg_init_in_progress, 0);
230 atomic_set(&m->pg_init_count, 0);
231 m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
232 init_waitqueue_head(&m->pg_init_wait);
231 } 233 }
232 234
233 dm_table_set_type(ti->table, m->queue_mode); 235 dm_table_set_type(ti->table, m->queue_mode);
@@ -246,6 +248,7 @@ static void free_multipath(struct multipath *m)
246 248
247 kfree(m->hw_handler_name); 249 kfree(m->hw_handler_name);
248 kfree(m->hw_handler_params); 250 kfree(m->hw_handler_params);
251 mutex_destroy(&m->work_mutex);
249 kfree(m); 252 kfree(m);
250} 253}
251 254
@@ -264,29 +267,23 @@ static struct dm_mpath_io *get_mpio_from_bio(struct bio *bio)
264 return dm_per_bio_data(bio, multipath_per_bio_data_size()); 267 return dm_per_bio_data(bio, multipath_per_bio_data_size());
265} 268}
266 269
267static struct dm_bio_details *get_bio_details_from_bio(struct bio *bio) 270static struct dm_bio_details *get_bio_details_from_mpio(struct dm_mpath_io *mpio)
268{ 271{
269 /* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */ 272 /* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */
270 struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
271 void *bio_details = mpio + 1; 273 void *bio_details = mpio + 1;
272
273 return bio_details; 274 return bio_details;
274} 275}
275 276
276static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p, 277static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p)
277 struct dm_bio_details **bio_details_p)
278{ 278{
279 struct dm_mpath_io *mpio = get_mpio_from_bio(bio); 279 struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
280 struct dm_bio_details *bio_details = get_bio_details_from_bio(bio); 280 struct dm_bio_details *bio_details = get_bio_details_from_mpio(mpio);
281 281
282 memset(mpio, 0, sizeof(*mpio)); 282 mpio->nr_bytes = bio->bi_iter.bi_size;
283 memset(bio_details, 0, sizeof(*bio_details)); 283 mpio->pgpath = NULL;
284 dm_bio_record(bio_details, bio); 284 *mpio_p = mpio;
285 285
286 if (mpio_p) 286 dm_bio_record(bio_details, bio);
287 *mpio_p = mpio;
288 if (bio_details_p)
289 *bio_details_p = bio_details;
290} 287}
291 288
292/*----------------------------------------------- 289/*-----------------------------------------------
@@ -340,6 +337,9 @@ static void __switch_pg(struct multipath *m, struct priority_group *pg)
340{ 337{
341 m->current_pg = pg; 338 m->current_pg = pg;
342 339
340 if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
341 return;
342
343 /* Must we initialise the PG first, and queue I/O till it's ready? */ 343 /* Must we initialise the PG first, and queue I/O till it's ready? */
344 if (m->hw_handler_name) { 344 if (m->hw_handler_name) {
345 set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags); 345 set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
@@ -385,7 +385,8 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
385 unsigned bypassed = 1; 385 unsigned bypassed = 1;
386 386
387 if (!atomic_read(&m->nr_valid_paths)) { 387 if (!atomic_read(&m->nr_valid_paths)) {
388 clear_bit(MPATHF_QUEUE_IO, &m->flags); 388 if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
389 clear_bit(MPATHF_QUEUE_IO, &m->flags);
389 goto failed; 390 goto failed;
390 } 391 }
391 392
@@ -516,12 +517,10 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
516 return DM_MAPIO_KILL; 517 return DM_MAPIO_KILL;
517 } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) || 518 } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
518 test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) { 519 test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
519 if (pg_init_all_paths(m)) 520 pg_init_all_paths(m);
520 return DM_MAPIO_DELAY_REQUEUE; 521 return DM_MAPIO_DELAY_REQUEUE;
521 return DM_MAPIO_REQUEUE;
522 } 522 }
523 523
524 memset(mpio, 0, sizeof(*mpio));
525 mpio->pgpath = pgpath; 524 mpio->pgpath = pgpath;
526 mpio->nr_bytes = nr_bytes; 525 mpio->nr_bytes = nr_bytes;
527 526
@@ -530,12 +529,23 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
530 clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC); 529 clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
531 if (IS_ERR(clone)) { 530 if (IS_ERR(clone)) {
532 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */ 531 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
533 bool queue_dying = blk_queue_dying(q); 532 if (blk_queue_dying(q)) {
534 if (queue_dying) {
535 atomic_inc(&m->pg_init_in_progress); 533 atomic_inc(&m->pg_init_in_progress);
536 activate_or_offline_path(pgpath); 534 activate_or_offline_path(pgpath);
535 return DM_MAPIO_DELAY_REQUEUE;
537 } 536 }
538 return DM_MAPIO_DELAY_REQUEUE; 537
538 /*
539 * blk-mq's SCHED_RESTART can cover this requeue, so we
540 * needn't deal with it by DELAY_REQUEUE. More importantly,
541 * we have to return DM_MAPIO_REQUEUE so that blk-mq can
542 * get the queue busy feedback (via BLK_STS_RESOURCE),
543 * otherwise I/O merging can suffer.
544 */
545 if (q->mq_ops)
546 return DM_MAPIO_REQUEUE;
547 else
548 return DM_MAPIO_DELAY_REQUEUE;
539 } 549 }
540 clone->bio = clone->biotail = NULL; 550 clone->bio = clone->biotail = NULL;
541 clone->rq_disk = bdev->bd_disk; 551 clone->rq_disk = bdev->bd_disk;
@@ -557,9 +567,9 @@ static void multipath_release_clone(struct request *clone)
557/* 567/*
558 * Map cloned bios (bio-based multipath) 568 * Map cloned bios (bio-based multipath)
559 */ 569 */
560static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_mpath_io *mpio) 570
571static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
561{ 572{
562 size_t nr_bytes = bio->bi_iter.bi_size;
563 struct pgpath *pgpath; 573 struct pgpath *pgpath;
564 unsigned long flags; 574 unsigned long flags;
565 bool queue_io; 575 bool queue_io;
@@ -568,7 +578,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
568 pgpath = READ_ONCE(m->current_pgpath); 578 pgpath = READ_ONCE(m->current_pgpath);
569 queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags); 579 queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
570 if (!pgpath || !queue_io) 580 if (!pgpath || !queue_io)
571 pgpath = choose_pgpath(m, nr_bytes); 581 pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
572 582
573 if ((pgpath && queue_io) || 583 if ((pgpath && queue_io) ||
574 (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) { 584 (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
@@ -576,14 +586,62 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
576 spin_lock_irqsave(&m->lock, flags); 586 spin_lock_irqsave(&m->lock, flags);
577 bio_list_add(&m->queued_bios, bio); 587 bio_list_add(&m->queued_bios, bio);
578 spin_unlock_irqrestore(&m->lock, flags); 588 spin_unlock_irqrestore(&m->lock, flags);
589
579 /* PG_INIT_REQUIRED cannot be set without QUEUE_IO */ 590 /* PG_INIT_REQUIRED cannot be set without QUEUE_IO */
580 if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) 591 if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
581 pg_init_all_paths(m); 592 pg_init_all_paths(m);
582 else if (!queue_io) 593 else if (!queue_io)
583 queue_work(kmultipathd, &m->process_queued_bios); 594 queue_work(kmultipathd, &m->process_queued_bios);
584 return DM_MAPIO_SUBMITTED; 595
596 return ERR_PTR(-EAGAIN);
585 } 597 }
586 598
599 return pgpath;
600}
601
602static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
603{
604 struct pgpath *pgpath;
605 unsigned long flags;
606
607 /* Do we need to select a new pgpath? */
608 /*
609 * FIXME: currently only switching path if no path (due to failure, etc)
610 * - which negates the point of using a path selector
611 */
612 pgpath = READ_ONCE(m->current_pgpath);
613 if (!pgpath)
614 pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
615
616 if (!pgpath) {
617 if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
618 /* Queue for the daemon to resubmit */
619 spin_lock_irqsave(&m->lock, flags);
620 bio_list_add(&m->queued_bios, bio);
621 spin_unlock_irqrestore(&m->lock, flags);
622 queue_work(kmultipathd, &m->process_queued_bios);
623
624 return ERR_PTR(-EAGAIN);
625 }
626 return NULL;
627 }
628
629 return pgpath;
630}
631
632static int __multipath_map_bio(struct multipath *m, struct bio *bio,
633 struct dm_mpath_io *mpio)
634{
635 struct pgpath *pgpath;
636
637 if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
638 pgpath = __map_bio_nvme(m, bio);
639 else
640 pgpath = __map_bio(m, bio);
641
642 if (IS_ERR(pgpath))
643 return DM_MAPIO_SUBMITTED;
644
587 if (!pgpath) { 645 if (!pgpath) {
588 if (must_push_back_bio(m)) 646 if (must_push_back_bio(m))
589 return DM_MAPIO_REQUEUE; 647 return DM_MAPIO_REQUEUE;
@@ -592,7 +650,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
592 } 650 }
593 651
594 mpio->pgpath = pgpath; 652 mpio->pgpath = pgpath;
595 mpio->nr_bytes = nr_bytes;
596 653
597 bio->bi_status = 0; 654 bio->bi_status = 0;
598 bio_set_dev(bio, pgpath->path.dev->bdev); 655 bio_set_dev(bio, pgpath->path.dev->bdev);
@@ -601,7 +658,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
601 if (pgpath->pg->ps.type->start_io) 658 if (pgpath->pg->ps.type->start_io)
602 pgpath->pg->ps.type->start_io(&pgpath->pg->ps, 659 pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
603 &pgpath->path, 660 &pgpath->path,
604 nr_bytes); 661 mpio->nr_bytes);
605 return DM_MAPIO_REMAPPED; 662 return DM_MAPIO_REMAPPED;
606} 663}
607 664
@@ -610,8 +667,7 @@ static int multipath_map_bio(struct dm_target *ti, struct bio *bio)
610 struct multipath *m = ti->private; 667 struct multipath *m = ti->private;
611 struct dm_mpath_io *mpio = NULL; 668 struct dm_mpath_io *mpio = NULL;
612 669
613 multipath_init_per_bio_data(bio, &mpio, NULL); 670 multipath_init_per_bio_data(bio, &mpio);
614
615 return __multipath_map_bio(m, bio, mpio); 671 return __multipath_map_bio(m, bio, mpio);
616} 672}
617 673
@@ -619,7 +675,8 @@ static void process_queued_io_list(struct multipath *m)
619{ 675{
620 if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED) 676 if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED)
621 dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table)); 677 dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table));
622 else if (m->queue_mode == DM_TYPE_BIO_BASED) 678 else if (m->queue_mode == DM_TYPE_BIO_BASED ||
679 m->queue_mode == DM_TYPE_NVME_BIO_BASED)
623 queue_work(kmultipathd, &m->process_queued_bios); 680 queue_work(kmultipathd, &m->process_queued_bios);
624} 681}
625 682
@@ -649,7 +706,9 @@ static void process_queued_bios(struct work_struct *work)
649 706
650 blk_start_plug(&plug); 707 blk_start_plug(&plug);
651 while ((bio = bio_list_pop(&bios))) { 708 while ((bio = bio_list_pop(&bios))) {
652 r = __multipath_map_bio(m, bio, get_mpio_from_bio(bio)); 709 struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
710 dm_bio_restore(get_bio_details_from_mpio(mpio), bio);
711 r = __multipath_map_bio(m, bio, mpio);
653 switch (r) { 712 switch (r) {
654 case DM_MAPIO_KILL: 713 case DM_MAPIO_KILL:
655 bio->bi_status = BLK_STS_IOERR; 714 bio->bi_status = BLK_STS_IOERR;
@@ -752,34 +811,11 @@ static int parse_path_selector(struct dm_arg_set *as, struct priority_group *pg,
752 return 0; 811 return 0;
753} 812}
754 813
755static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps, 814static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **error)
756 struct dm_target *ti)
757{ 815{
758 int r; 816 struct request_queue *q = bdev_get_queue(bdev);
759 struct pgpath *p;
760 struct multipath *m = ti->private;
761 struct request_queue *q = NULL;
762 const char *attached_handler_name; 817 const char *attached_handler_name;
763 818 int r;
764 /* we need at least a path arg */
765 if (as->argc < 1) {
766 ti->error = "no device given";
767 return ERR_PTR(-EINVAL);
768 }
769
770 p = alloc_pgpath();
771 if (!p)
772 return ERR_PTR(-ENOMEM);
773
774 r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
775 &p->path.dev);
776 if (r) {
777 ti->error = "error getting device";
778 goto bad;
779 }
780
781 if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
782 q = bdev_get_queue(p->path.dev->bdev);
783 819
784 if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) { 820 if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
785retain: 821retain:
@@ -811,26 +847,59 @@ retain:
811 char b[BDEVNAME_SIZE]; 847 char b[BDEVNAME_SIZE];
812 848
813 printk(KERN_INFO "dm-mpath: retaining handler on device %s\n", 849 printk(KERN_INFO "dm-mpath: retaining handler on device %s\n",
814 bdevname(p->path.dev->bdev, b)); 850 bdevname(bdev, b));
815 goto retain; 851 goto retain;
816 } 852 }
817 if (r < 0) { 853 if (r < 0) {
818 ti->error = "error attaching hardware handler"; 854 *error = "error attaching hardware handler";
819 dm_put_device(ti, p->path.dev); 855 return r;
820 goto bad;
821 } 856 }
822 857
823 if (m->hw_handler_params) { 858 if (m->hw_handler_params) {
824 r = scsi_dh_set_params(q, m->hw_handler_params); 859 r = scsi_dh_set_params(q, m->hw_handler_params);
825 if (r < 0) { 860 if (r < 0) {
826 ti->error = "unable to set hardware " 861 *error = "unable to set hardware handler parameters";
827 "handler parameters"; 862 return r;
828 dm_put_device(ti, p->path.dev);
829 goto bad;
830 } 863 }
831 } 864 }
832 } 865 }
833 866
867 return 0;
868}
869
870static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
871 struct dm_target *ti)
872{
873 int r;
874 struct pgpath *p;
875 struct multipath *m = ti->private;
876
877 /* we need at least a path arg */
878 if (as->argc < 1) {
879 ti->error = "no device given";
880 return ERR_PTR(-EINVAL);
881 }
882
883 p = alloc_pgpath();
884 if (!p)
885 return ERR_PTR(-ENOMEM);
886
887 r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
888 &p->path.dev);
889 if (r) {
890 ti->error = "error getting device";
891 goto bad;
892 }
893
894 if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
895 INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
896 r = setup_scsi_dh(p->path.dev->bdev, m, &ti->error);
897 if (r) {
898 dm_put_device(ti, p->path.dev);
899 goto bad;
900 }
901 }
902
834 r = ps->type->add_path(ps, &p->path, as->argc, as->argv, &ti->error); 903 r = ps->type->add_path(ps, &p->path, as->argc, as->argv, &ti->error);
835 if (r) { 904 if (r) {
836 dm_put_device(ti, p->path.dev); 905 dm_put_device(ti, p->path.dev);
@@ -838,7 +907,6 @@ retain:
838 } 907 }
839 908
840 return p; 909 return p;
841
842 bad: 910 bad:
843 free_pgpath(p); 911 free_pgpath(p);
844 return ERR_PTR(r); 912 return ERR_PTR(r);
@@ -933,7 +1001,8 @@ static int parse_hw_handler(struct dm_arg_set *as, struct multipath *m)
933 if (!hw_argc) 1001 if (!hw_argc)
934 return 0; 1002 return 0;
935 1003
936 if (m->queue_mode == DM_TYPE_BIO_BASED) { 1004 if (m->queue_mode == DM_TYPE_BIO_BASED ||
1005 m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
937 dm_consume_args(as, hw_argc); 1006 dm_consume_args(as, hw_argc);
938 DMERR("bio-based multipath doesn't allow hardware handler args"); 1007 DMERR("bio-based multipath doesn't allow hardware handler args");
939 return 0; 1008 return 0;
@@ -1022,6 +1091,8 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
1022 1091
1023 if (!strcasecmp(queue_mode_name, "bio")) 1092 if (!strcasecmp(queue_mode_name, "bio"))
1024 m->queue_mode = DM_TYPE_BIO_BASED; 1093 m->queue_mode = DM_TYPE_BIO_BASED;
1094 else if (!strcasecmp(queue_mode_name, "nvme"))
1095 m->queue_mode = DM_TYPE_NVME_BIO_BASED;
1025 else if (!strcasecmp(queue_mode_name, "rq")) 1096 else if (!strcasecmp(queue_mode_name, "rq"))
1026 m->queue_mode = DM_TYPE_REQUEST_BASED; 1097 m->queue_mode = DM_TYPE_REQUEST_BASED;
1027 else if (!strcasecmp(queue_mode_name, "mq")) 1098 else if (!strcasecmp(queue_mode_name, "mq"))
@@ -1122,7 +1193,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
1122 ti->num_discard_bios = 1; 1193 ti->num_discard_bios = 1;
1123 ti->num_write_same_bios = 1; 1194 ti->num_write_same_bios = 1;
1124 ti->num_write_zeroes_bios = 1; 1195 ti->num_write_zeroes_bios = 1;
1125 if (m->queue_mode == DM_TYPE_BIO_BASED) 1196 if (m->queue_mode == DM_TYPE_BIO_BASED || m->queue_mode == DM_TYPE_NVME_BIO_BASED)
1126 ti->per_io_data_size = multipath_per_bio_data_size(); 1197 ti->per_io_data_size = multipath_per_bio_data_size();
1127 else 1198 else
1128 ti->per_io_data_size = sizeof(struct dm_mpath_io); 1199 ti->per_io_data_size = sizeof(struct dm_mpath_io);
@@ -1151,16 +1222,19 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
1151 1222
1152static void flush_multipath_work(struct multipath *m) 1223static void flush_multipath_work(struct multipath *m)
1153{ 1224{
1154 set_bit(MPATHF_PG_INIT_DISABLED, &m->flags); 1225 if (m->hw_handler_name) {
1155 smp_mb__after_atomic(); 1226 set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
1227 smp_mb__after_atomic();
1228
1229 flush_workqueue(kmpath_handlerd);
1230 multipath_wait_for_pg_init_completion(m);
1231
1232 clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
1233 smp_mb__after_atomic();
1234 }
1156 1235
1157 flush_workqueue(kmpath_handlerd);
1158 multipath_wait_for_pg_init_completion(m);
1159 flush_workqueue(kmultipathd); 1236 flush_workqueue(kmultipathd);
1160 flush_work(&m->trigger_event); 1237 flush_work(&m->trigger_event);
1161
1162 clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
1163 smp_mb__after_atomic();
1164} 1238}
1165 1239
1166static void multipath_dtr(struct dm_target *ti) 1240static void multipath_dtr(struct dm_target *ti)
@@ -1496,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
1496 if (error && blk_path_error(error)) { 1570 if (error && blk_path_error(error)) {
1497 struct multipath *m = ti->private; 1571 struct multipath *m = ti->private;
1498 1572
1499 r = DM_ENDIO_REQUEUE; 1573 if (error == BLK_STS_RESOURCE)
1574 r = DM_ENDIO_DELAY_REQUEUE;
1575 else
1576 r = DM_ENDIO_REQUEUE;
1500 1577
1501 if (pgpath) 1578 if (pgpath)
1502 fail_path(pgpath); 1579 fail_path(pgpath);
@@ -1521,7 +1598,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone,
1521} 1598}
1522 1599
1523static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, 1600static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
1524 blk_status_t *error) 1601 blk_status_t *error)
1525{ 1602{
1526 struct multipath *m = ti->private; 1603 struct multipath *m = ti->private;
1527 struct dm_mpath_io *mpio = get_mpio_from_bio(clone); 1604 struct dm_mpath_io *mpio = get_mpio_from_bio(clone);
@@ -1546,9 +1623,6 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone,
1546 goto done; 1623 goto done;
1547 } 1624 }
1548 1625
1549 /* Queue for the daemon to resubmit */
1550 dm_bio_restore(get_bio_details_from_bio(clone), clone);
1551
1552 spin_lock_irqsave(&m->lock, flags); 1626 spin_lock_irqsave(&m->lock, flags);
1553 bio_list_add(&m->queued_bios, clone); 1627 bio_list_add(&m->queued_bios, clone);
1554 spin_unlock_irqrestore(&m->lock, flags); 1628 spin_unlock_irqrestore(&m->lock, flags);
@@ -1656,6 +1730,9 @@ static void multipath_status(struct dm_target *ti, status_type_t type,
1656 case DM_TYPE_BIO_BASED: 1730 case DM_TYPE_BIO_BASED:
1657 DMEMIT("queue_mode bio "); 1731 DMEMIT("queue_mode bio ");
1658 break; 1732 break;
1733 case DM_TYPE_NVME_BIO_BASED:
1734 DMEMIT("queue_mode nvme ");
1735 break;
1659 case DM_TYPE_MQ_REQUEST_BASED: 1736 case DM_TYPE_MQ_REQUEST_BASED:
1660 DMEMIT("queue_mode mq "); 1737 DMEMIT("queue_mode mq ");
1661 break; 1738 break;
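The "nvme" queue_mode parsed above is requested from the multipath table's
feature arguments. A hedged sketch of such a table line (size, device path
and selector purely illustrative; as the FIXME above notes, path selection
in this mode is currently minimal):

  dmsetup create mpath-nvme --table '0 2097152 multipath 2 queue_mode nvme 0 1 1 service-time 0 1 0 /dev/nvme1n1'

That is: two feature words ("queue_mode nvme"), no hardware handler
arguments, one priority group starting at group 1, service-time with no
selector or per-path arguments, and a single path.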
diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c
index 23f178641794..969c4f1a3633 100644
--- a/drivers/md/dm-queue-length.c
+++ b/drivers/md/dm-queue-length.c
@@ -195,9 +195,6 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
195 if (list_empty(&s->valid_paths)) 195 if (list_empty(&s->valid_paths))
196 goto out; 196 goto out;
197 197
198 /* Change preferred (first in list) path to evenly balance. */
199 list_move_tail(s->valid_paths.next, &s->valid_paths);
200
201 list_for_each_entry(pi, &s->valid_paths, list) { 198 list_for_each_entry(pi, &s->valid_paths, list) {
202 if (!best || 199 if (!best ||
203 (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) 200 (atomic_read(&pi->qlen) < atomic_read(&best->qlen)))
@@ -210,6 +207,9 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes)
210 if (!best) 207 if (!best)
211 goto out; 208 goto out;
212 209
210 /* Move most recently used to least preferred to evenly balance. */
211 list_move_tail(&best->list, &s->valid_paths);
212
213 ret = best->path; 213 ret = best->path;
214out: 214out:
215 spin_unlock_irqrestore(&s->lock, flags); 215 spin_unlock_irqrestore(&s->lock, flags);
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index e5ef0757fe23..7ef469e902c6 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -29,6 +29,9 @@
29 */ 29 */
30#define MIN_RAID456_JOURNAL_SPACE (4*2048) 30#define MIN_RAID456_JOURNAL_SPACE (4*2048)
31 31
32/* Global list of all raid sets */
33static LIST_HEAD(raid_sets);
34
32static bool devices_handle_discard_safely = false; 35static bool devices_handle_discard_safely = false;
33 36
34/* 37/*
@@ -105,8 +108,6 @@ struct raid_dev {
105#define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV) 108#define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV)
106#define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE) 109#define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE)
107 110
108#define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET)
109
110/* 111/*
111 * Definitions of various constructor flags to 112 * Definitions of various constructor flags to
112 * be used in checks of valid / invalid flags 113 * be used in checks of valid / invalid flags
@@ -209,6 +210,8 @@ struct raid_dev {
209#define RT_FLAG_UPDATE_SBS 3 210#define RT_FLAG_UPDATE_SBS 3
210#define RT_FLAG_RESHAPE_RS 4 211#define RT_FLAG_RESHAPE_RS 4
211#define RT_FLAG_RS_SUSPENDED 5 212#define RT_FLAG_RS_SUSPENDED 5
213#define RT_FLAG_RS_IN_SYNC 6
214#define RT_FLAG_RS_RESYNCING 7
212 215
213/* Array elements of 64 bit needed for rebuild/failed disk bits */ 216/* Array elements of 64 bit needed for rebuild/failed disk bits */
214#define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8) 217#define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8)
@@ -224,8 +227,8 @@ struct rs_layout {
224 227
225struct raid_set { 228struct raid_set {
226 struct dm_target *ti; 229 struct dm_target *ti;
230 struct list_head list;
227 231
228 uint32_t bitmap_loaded;
229 uint32_t stripe_cache_entries; 232 uint32_t stripe_cache_entries;
230 unsigned long ctr_flags; 233 unsigned long ctr_flags;
231 unsigned long runtime_flags; 234 unsigned long runtime_flags;
@@ -270,6 +273,19 @@ static void rs_config_restore(struct raid_set *rs, struct rs_layout *l)
270 mddev->new_chunk_sectors = l->new_chunk_sectors; 273 mddev->new_chunk_sectors = l->new_chunk_sectors;
271} 274}
272 275
276/* Find any raid_set in active slot for @rs on global list */
277static struct raid_set *rs_find_active(struct raid_set *rs)
278{
279 struct raid_set *r;
280 struct mapped_device *md = dm_table_get_md(rs->ti->table);
281
282 list_for_each_entry(r, &raid_sets, list)
283 if (r != rs && dm_table_get_md(r->ti->table) == md)
284 return r;
285
286 return NULL;
287}
288
273/* raid10 algorithms (i.e. formats) */ 289/* raid10 algorithms (i.e. formats) */
274#define ALGORITHM_RAID10_DEFAULT 0 290#define ALGORITHM_RAID10_DEFAULT 0
275#define ALGORITHM_RAID10_NEAR 1 291#define ALGORITHM_RAID10_NEAR 1
@@ -572,7 +588,7 @@ static const char *raid10_md_layout_to_format(int layout)
572} 588}
573 589
574/* Return md raid10 algorithm for @name */ 590/* Return md raid10 algorithm for @name */
575static int raid10_name_to_format(const char *name) 591static const int raid10_name_to_format(const char *name)
576{ 592{
577 if (!strcasecmp(name, "near")) 593 if (!strcasecmp(name, "near"))
578 return ALGORITHM_RAID10_NEAR; 594 return ALGORITHM_RAID10_NEAR;
@@ -675,15 +691,11 @@ static struct raid_type *get_raid_type_by_ll(const int level, const int layout)
675 return NULL; 691 return NULL;
676} 692}
677 693
678/* 694/* Adjust rdev sectors */
679 * Conditionally change bdev capacity of @rs 695static void rs_set_rdev_sectors(struct raid_set *rs)
680 * in case of a disk add/remove reshape
681 */
682static void rs_set_capacity(struct raid_set *rs)
683{ 696{
684 struct mddev *mddev = &rs->md; 697 struct mddev *mddev = &rs->md;
685 struct md_rdev *rdev; 698 struct md_rdev *rdev;
686 struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
687 699
688 /* 700 /*
689 * raid10 sets rdev->sector to the device size, which 701 * raid10 sets rdev->sector to the device size, which
@@ -692,8 +704,16 @@ static void rs_set_capacity(struct raid_set *rs)
692 rdev_for_each(rdev, mddev) 704 rdev_for_each(rdev, mddev)
693 if (!test_bit(Journal, &rdev->flags)) 705 if (!test_bit(Journal, &rdev->flags))
694 rdev->sectors = mddev->dev_sectors; 706 rdev->sectors = mddev->dev_sectors;
707}
695 708
696 set_capacity(gendisk, mddev->array_sectors); 709/*
710 * Change bdev capacity of @rs in case of a disk add/remove reshape
711 */
712static void rs_set_capacity(struct raid_set *rs)
713{
714 struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table));
715
716 set_capacity(gendisk, rs->md.array_sectors);
697 revalidate_disk(gendisk); 717 revalidate_disk(gendisk);
698} 718}
699 719
@@ -744,6 +764,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
744 764
745 mddev_init(&rs->md); 765 mddev_init(&rs->md);
746 766
767 INIT_LIST_HEAD(&rs->list);
747 rs->raid_disks = raid_devs; 768 rs->raid_disks = raid_devs;
748 rs->delta_disks = 0; 769 rs->delta_disks = 0;
749 770
@@ -761,6 +782,9 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
761 for (i = 0; i < raid_devs; i++) 782 for (i = 0; i < raid_devs; i++)
762 md_rdev_init(&rs->dev[i].rdev); 783 md_rdev_init(&rs->dev[i].rdev);
763 784
785 /* Add @rs to global list. */
786 list_add(&rs->list, &raid_sets);
787
764 /* 788 /*
765 * Remaining items to be initialized by further RAID params: 789 * Remaining items to be initialized by further RAID params:
766 * rs->md.persistent 790 * rs->md.persistent
@@ -773,6 +797,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r
773 return rs; 797 return rs;
774} 798}
775 799
800/* Free all @rs allocations and remove it from global list. */
776static void raid_set_free(struct raid_set *rs) 801static void raid_set_free(struct raid_set *rs)
777{ 802{
778 int i; 803 int i;
@@ -790,6 +815,8 @@ static void raid_set_free(struct raid_set *rs)
790 dm_put_device(rs->ti, rs->dev[i].data_dev); 815 dm_put_device(rs->ti, rs->dev[i].data_dev);
791 } 816 }
792 817
818 list_del(&rs->list);
819
793 kfree(rs); 820 kfree(rs);
794} 821}
795 822
@@ -1002,7 +1029,7 @@ static int validate_raid_redundancy(struct raid_set *rs)
1002 !rs->dev[i].rdev.sb_page) 1029 !rs->dev[i].rdev.sb_page)
1003 rebuild_cnt++; 1030 rebuild_cnt++;
1004 1031
1005 switch (rs->raid_type->level) { 1032 switch (rs->md.level) {
1006 case 0: 1033 case 0:
1007 break; 1034 break;
1008 case 1: 1035 case 1:
@@ -1017,6 +1044,11 @@ static int validate_raid_redundancy(struct raid_set *rs)
1017 break; 1044 break;
1018 case 10: 1045 case 10:
1019 copies = raid10_md_layout_to_copies(rs->md.new_layout); 1046 copies = raid10_md_layout_to_copies(rs->md.new_layout);
1047 if (copies < 2) {
1048 DMERR("Bogus raid10 data copies < 2!");
1049 return -EINVAL;
1050 }
1051
1020 if (rebuild_cnt < copies) 1052 if (rebuild_cnt < copies)
1021 break; 1053 break;
1022 1054
@@ -1576,6 +1608,24 @@ static sector_t __rdev_sectors(struct raid_set *rs)
1576 return 0; 1608 return 0;
1577} 1609}
1578 1610
1611/* Check that calculated dev_sectors fits all component devices. */
1612static int _check_data_dev_sectors(struct raid_set *rs)
1613{
1614 sector_t ds = ~0;
1615 struct md_rdev *rdev;
1616
1617 rdev_for_each(rdev, &rs->md)
1618 if (!test_bit(Journal, &rdev->flags) && rdev->bdev) {
1619 ds = min(ds, to_sector(i_size_read(rdev->bdev->bd_inode)));
1620 if (ds < rs->md.dev_sectors) {
1621 rs->ti->error = "Component device(s) too small";
1622 return -EINVAL;
1623 }
1624 }
1625
1626 return 0;
1627}
1628
1579/* Calculate the sectors per device and per array used for @rs */ 1629/* Calculate the sectors per device and per array used for @rs */
1580static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) 1630static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
1581{ 1631{
@@ -1625,7 +1675,7 @@ static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev)
1625 mddev->array_sectors = array_sectors; 1675 mddev->array_sectors = array_sectors;
1626 mddev->dev_sectors = dev_sectors; 1676 mddev->dev_sectors = dev_sectors;
1627 1677
1628 return 0; 1678 return _check_data_dev_sectors(rs);
1629bad: 1679bad:
1630 rs->ti->error = "Target length not divisible by number of data devices"; 1680 rs->ti->error = "Target length not divisible by number of data devices";
1631 return -EINVAL; 1681 return -EINVAL;
@@ -1674,8 +1724,11 @@ static void do_table_event(struct work_struct *ws)
1674 struct raid_set *rs = container_of(ws, struct raid_set, md.event_work); 1724 struct raid_set *rs = container_of(ws, struct raid_set, md.event_work);
1675 1725
1676 smp_rmb(); /* Make sure we access most actual mddev properties */ 1726 smp_rmb(); /* Make sure we access most actual mddev properties */
1677 if (!rs_is_reshaping(rs)) 1727 if (!rs_is_reshaping(rs)) {
1728 if (rs_is_raid10(rs))
1729 rs_set_rdev_sectors(rs);
1678 rs_set_capacity(rs); 1730 rs_set_capacity(rs);
1731 }
1679 dm_table_event(rs->ti->table); 1732 dm_table_event(rs->ti->table);
1680} 1733}
1681 1734
@@ -1860,7 +1913,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
1860 if (rs_takeover_requested(rs)) 1913 if (rs_takeover_requested(rs))
1861 return false; 1914 return false;
1862 1915
1863 if (!mddev->level) 1916 if (rs_is_raid0(rs))
1864 return false; 1917 return false;
1865 1918
1866 change = mddev->new_layout != mddev->layout || 1919 change = mddev->new_layout != mddev->layout ||
@@ -1868,7 +1921,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
1868 rs->delta_disks; 1921 rs->delta_disks;
1869 1922
1870 /* Historical case to support raid1 reshape without delta disks */ 1923 /* Historical case to support raid1 reshape without delta disks */
1871 if (mddev->level == 1) { 1924 if (rs_is_raid1(rs)) {
1872 if (rs->delta_disks) 1925 if (rs->delta_disks)
1873 return !!rs->delta_disks; 1926 return !!rs->delta_disks;
1874 1927
@@ -1876,7 +1929,7 @@ static bool rs_reshape_requested(struct raid_set *rs)
1876 mddev->raid_disks != rs->raid_disks; 1929 mddev->raid_disks != rs->raid_disks;
1877 } 1930 }
1878 1931
1879 if (mddev->level == 10) 1932 if (rs_is_raid10(rs))
1880 return change && 1933 return change &&
1881 !__is_raid10_far(mddev->new_layout) && 1934 !__is_raid10_far(mddev->new_layout) &&
1882 rs->delta_disks >= 0; 1935 rs->delta_disks >= 0;
@@ -2340,7 +2393,7 @@ static int super_init_validation(struct raid_set *rs, struct md_rdev *rdev)
2340 DMERR("new device%s provided without 'rebuild'", 2393 DMERR("new device%s provided without 'rebuild'",
2341 new_devs > 1 ? "s" : ""); 2394 new_devs > 1 ? "s" : "");
2342 return -EINVAL; 2395 return -EINVAL;
2343 } else if (rs_is_recovering(rs)) { 2396 } else if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags) && rs_is_recovering(rs)) {
2344 DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)", 2397 DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)",
2345 (unsigned long long) mddev->recovery_cp); 2398 (unsigned long long) mddev->recovery_cp);
2346 return -EINVAL; 2399 return -EINVAL;
@@ -2640,12 +2693,19 @@ static int rs_adjust_data_offsets(struct raid_set *rs)
2640 * Make sure we got a minimum amount of free sectors per device 2693 * Make sure we got a minimum amount of free sectors per device
2641 */ 2694 */
2642 if (rs->data_offset && 2695 if (rs->data_offset &&
2643 to_sector(i_size_read(rdev->bdev->bd_inode)) - rdev->sectors < MIN_FREE_RESHAPE_SPACE) { 2696 to_sector(i_size_read(rdev->bdev->bd_inode)) - rs->md.dev_sectors < MIN_FREE_RESHAPE_SPACE) {
2644 rs->ti->error = data_offset ? "No space for forward reshape" : 2697 rs->ti->error = data_offset ? "No space for forward reshape" :
2645 "No space for backward reshape"; 2698 "No space for backward reshape";
2646 return -ENOSPC; 2699 return -ENOSPC;
2647 } 2700 }
2648out: 2701out:
2702 /*
2703 * Raise recovery_cp in case data_offset != 0 to
2704 * avoid false recovery positives in the constructor.
2705 */
2706 if (rs->md.recovery_cp < rs->md.dev_sectors)
2707 rs->md.recovery_cp += rs->dev[0].rdev.data_offset;
2708
2649 /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */ 2709 /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */
2650 rdev_for_each(rdev, &rs->md) { 2710 rdev_for_each(rdev, &rs->md) {
2651 if (!test_bit(Journal, &rdev->flags)) { 2711 if (!test_bit(Journal, &rdev->flags)) {
@@ -2682,14 +2742,14 @@ static int rs_setup_takeover(struct raid_set *rs)
2682 sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset; 2742 sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset;
2683 2743
2684 if (rt_is_raid10(rs->raid_type)) { 2744 if (rt_is_raid10(rs->raid_type)) {
2685 if (mddev->level == 0) { 2745 if (rs_is_raid0(rs)) {
2686 /* Userspace reordered disks -> adjust raid_disk indexes */ 2746 /* Userspace reordered disks -> adjust raid_disk indexes */
2687 __reorder_raid_disk_indexes(rs); 2747 __reorder_raid_disk_indexes(rs);
2688 2748
2689 /* raid0 -> raid10_far layout */ 2749 /* raid0 -> raid10_far layout */
2690 mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR, 2750 mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR,
2691 rs->raid10_copies); 2751 rs->raid10_copies);
2692 } else if (mddev->level == 1) 2752 } else if (rs_is_raid1(rs))
2693 /* raid1 -> raid10_near layout */ 2753 /* raid1 -> raid10_near layout */
2694 mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR, 2754 mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR,
2695 rs->raid_disks); 2755 rs->raid_disks);
@@ -2777,6 +2837,23 @@ static int rs_prepare_reshape(struct raid_set *rs)
2777 return 0; 2837 return 0;
2778} 2838}
2779 2839
2840/* Get reshape sectors from data_offsets or raid set */
2841static sector_t _get_reshape_sectors(struct raid_set *rs)
2842{
2843 struct md_rdev *rdev;
2844 sector_t reshape_sectors = 0;
2845
2846 rdev_for_each(rdev, &rs->md)
2847 if (!test_bit(Journal, &rdev->flags)) {
2848 reshape_sectors = (rdev->data_offset > rdev->new_data_offset) ?
2849 rdev->data_offset - rdev->new_data_offset :
2850 rdev->new_data_offset - rdev->data_offset;
2851 break;
2852 }
2853
2854 return max(reshape_sectors, (sector_t) rs->data_offset);
2855}
2856
2780/* 2857/*
2781 * 2858 *
2782 * - change raid layout 2859 * - change raid layout
@@ -2788,6 +2865,7 @@ static int rs_setup_reshape(struct raid_set *rs)
2788{ 2865{
2789 int r = 0; 2866 int r = 0;
2790 unsigned int cur_raid_devs, d; 2867 unsigned int cur_raid_devs, d;
2868 sector_t reshape_sectors = _get_reshape_sectors(rs);
2791 struct mddev *mddev = &rs->md; 2869 struct mddev *mddev = &rs->md;
2792 struct md_rdev *rdev; 2870 struct md_rdev *rdev;
2793 2871
@@ -2804,13 +2882,13 @@ static int rs_setup_reshape(struct raid_set *rs)
2804 /* 2882 /*
2805 * Adjust array size: 2883 * Adjust array size:
2806 * 2884 *
2807 * - in case of adding disks, array size has 2885 * - in case of adding disk(s), array size has
2808 * to grow after the disk adding reshape, 2886 * to grow after the disk adding reshape,
2809 * which'll happen in the event handler; 2887 * which'll happen in the event handler;
2810 * reshape will happen forward, so space has to 2888 * reshape will happen forward, so space has to
2811 * be available at the beginning of each disk 2889 * be available at the beginning of each disk
2812 * 2890 *
2813 * - in case of removing disks, array size 2891 * - in case of removing disk(s), array size
2814 * has to shrink before starting the reshape, 2892 * has to shrink before starting the reshape,
2815 * which'll happen here; 2893 * which'll happen here;
2816 * reshape will happen backward, so space has to 2894 * reshape will happen backward, so space has to
@@ -2841,7 +2919,7 @@ static int rs_setup_reshape(struct raid_set *rs)
2841 rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector; 2919 rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector;
2842 } 2920 }
2843 2921
2844 mddev->reshape_backwards = 0; /* adding disks -> forward reshape */ 2922 mddev->reshape_backwards = 0; /* adding disk(s) -> forward reshape */
2845 2923
2846 /* Remove disk(s) */ 2924 /* Remove disk(s) */
2847 } else if (rs->delta_disks < 0) { 2925 } else if (rs->delta_disks < 0) {
@@ -2874,6 +2952,15 @@ static int rs_setup_reshape(struct raid_set *rs)
2874 mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1; 2952 mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1;
2875 } 2953 }
2876 2954
2955 /*
2956 * Adjust device size for forward reshape
2957 * because md_finish_reshape() reduces it.
2958 */
2959 if (!mddev->reshape_backwards)
2960 rdev_for_each(rdev, &rs->md)
2961 if (!test_bit(Journal, &rdev->flags))
2962 rdev->sectors += reshape_sectors;
2963
2877 return r; 2964 return r;
2878} 2965}
2879 2966
@@ -2890,7 +2977,7 @@ static void configure_discard_support(struct raid_set *rs)
2890 /* 2977 /*
2891 * XXX: RAID level 4,5,6 require zeroing for safety. 2978 * XXX: RAID level 4,5,6 require zeroing for safety.
2892 */ 2979 */
2893 raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6); 2980 raid456 = rs_is_raid456(rs);
2894 2981
2895 for (i = 0; i < rs->raid_disks; i++) { 2982 for (i = 0; i < rs->raid_disks; i++) {
2896 struct request_queue *q; 2983 struct request_queue *q;
@@ -2915,7 +3002,7 @@ static void configure_discard_support(struct raid_set *rs)
2915 * RAID1 and RAID10 personalities require bio splitting, 3002 * RAID1 and RAID10 personalities require bio splitting,
2916 * RAID0/4/5/6 don't and process large discard bios properly. 3003 * RAID0/4/5/6 don't and process large discard bios properly.
2917 */ 3004 */
2918 ti->split_discard_bios = !!(rs->md.level == 1 || rs->md.level == 10); 3005 ti->split_discard_bios = !!(rs_is_raid1(rs) || rs_is_raid10(rs));
2919 ti->num_discard_bios = 1; 3006 ti->num_discard_bios = 1;
2920} 3007}
2921 3008
@@ -2935,10 +3022,10 @@ static void configure_discard_support(struct raid_set *rs)
2935static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) 3022static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
2936{ 3023{
2937 int r; 3024 int r;
2938 bool resize; 3025 bool resize = false;
2939 struct raid_type *rt; 3026 struct raid_type *rt;
2940 unsigned int num_raid_params, num_raid_devs; 3027 unsigned int num_raid_params, num_raid_devs;
2941 sector_t calculated_dev_sectors, rdev_sectors; 3028 sector_t calculated_dev_sectors, rdev_sectors, reshape_sectors;
2942 struct raid_set *rs = NULL; 3029 struct raid_set *rs = NULL;
2943 const char *arg; 3030 const char *arg;
2944 struct rs_layout rs_layout; 3031 struct rs_layout rs_layout;
@@ -3021,7 +3108,10 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
3021 goto bad; 3108 goto bad;
3022 } 3109 }
3023 3110
3024 resize = calculated_dev_sectors != rdev_sectors; 3111
3112 reshape_sectors = _get_reshape_sectors(rs);
3113 if (calculated_dev_sectors != rdev_sectors)
3114 resize = calculated_dev_sectors != (reshape_sectors ? rdev_sectors - reshape_sectors : rdev_sectors);
3025 3115
3026 INIT_WORK(&rs->md.event_work, do_table_event); 3116 INIT_WORK(&rs->md.event_work, do_table_event);
3027 ti->private = rs; 3117 ti->private = rs;
@@ -3105,19 +3195,22 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
3105 goto bad; 3195 goto bad;
3106 } 3196 }
3107 3197
3108 /* 3198 /* Out-of-place space has to be available to allow for a reshape unless raid1! */
3109 * We can only prepare for a reshape here, because the 3199 if (reshape_sectors || rs_is_raid1(rs)) {
3110 * raid set needs to run to provide the repective reshape 3200 /*
3111 * check functions via its MD personality instance. 3201 * We can only prepare for a reshape here, because the
3112 * 3202 * raid set needs to run to provide the repective reshape
3113 * So do the reshape check after md_run() succeeded. 3203 * check functions via its MD personality instance.
3114 */ 3204 *
3115 r = rs_prepare_reshape(rs); 3205 * So do the reshape check after md_run() succeeded.
3116 if (r) 3206 */
3117 return r; 3207 r = rs_prepare_reshape(rs);
3208 if (r)
3209 return r;
3118 3210
3119 /* Reshaping ain't recovery, so disable recovery */ 3211 /* Reshaping ain't recovery, so disable recovery */
3120 rs_setup_recovery(rs, MaxSector); 3212 rs_setup_recovery(rs, MaxSector);
3213 }
3121 rs_set_cur(rs); 3214 rs_set_cur(rs);
3122 } else { 3215 } else {
3123 /* May not set recovery when a device rebuild is requested */ 3216 /* May not set recovery when a device rebuild is requested */
@@ -3144,7 +3237,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv)
3144 mddev_lock_nointr(&rs->md); 3237 mddev_lock_nointr(&rs->md);
3145 r = md_run(&rs->md); 3238 r = md_run(&rs->md);
3146 rs->md.in_sync = 0; /* Assume already marked dirty */ 3239 rs->md.in_sync = 0; /* Assume already marked dirty */
3147
3148 if (r) { 3240 if (r) {
3149 ti->error = "Failed to run raid array"; 3241 ti->error = "Failed to run raid array";
3150 mddev_unlock(&rs->md); 3242 mddev_unlock(&rs->md);
@@ -3248,25 +3340,27 @@ static int raid_map(struct dm_target *ti, struct bio *bio)
3248} 3340}
3249 3341
3250/* Return string describing the current sync action of @mddev */ 3342/* Return string describing the current sync action of @mddev */
3251static const char *decipher_sync_action(struct mddev *mddev) 3343static const char *decipher_sync_action(struct mddev *mddev, unsigned long recovery)
3252{ 3344{
3253 if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) 3345 if (test_bit(MD_RECOVERY_FROZEN, &recovery))
3254 return "frozen"; 3346 return "frozen";
3255 3347
3256 if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || 3348 /* The MD sync thread can be done with io but still be running */
3257 (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))) { 3349 if (!test_bit(MD_RECOVERY_DONE, &recovery) &&
3258 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) 3350 (test_bit(MD_RECOVERY_RUNNING, &recovery) ||
3351 (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery)))) {
3352 if (test_bit(MD_RECOVERY_RESHAPE, &recovery))
3259 return "reshape"; 3353 return "reshape";
3260 3354
3261 if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { 3355 if (test_bit(MD_RECOVERY_SYNC, &recovery)) {
3262 if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) 3356 if (!test_bit(MD_RECOVERY_REQUESTED, &recovery))
3263 return "resync"; 3357 return "resync";
3264 else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) 3358 else if (test_bit(MD_RECOVERY_CHECK, &recovery))
3265 return "check"; 3359 return "check";
3266 return "repair"; 3360 return "repair";
3267 } 3361 }
3268 3362
3269 if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) 3363 if (test_bit(MD_RECOVERY_RECOVER, &recovery))
3270 return "recover"; 3364 return "recover";
3271 } 3365 }
3272 3366
@@ -3283,7 +3377,7 @@ static const char *decipher_sync_action(struct mddev *mddev)
3283 * 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device 3377 * 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device
3284 * '-' = Non-existing device (i.e. uspace passed '- -' into the ctr) 3378 * '-' = Non-existing device (i.e. uspace passed '- -' into the ctr)
3285 */ 3379 */
3286static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync) 3380static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev)
3287{ 3381{
3288 if (!rdev->bdev) 3382 if (!rdev->bdev)
3289 return "-"; 3383 return "-";
@@ -3291,85 +3385,108 @@ static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev,
3291 return "D"; 3385 return "D";
3292 else if (test_bit(Journal, &rdev->flags)) 3386 else if (test_bit(Journal, &rdev->flags))
3293 return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a"; 3387 return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a";
3294 else if (!array_in_sync || !test_bit(In_sync, &rdev->flags)) 3388 else if (test_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags) ||
3389 (!test_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags) &&
3390 !test_bit(In_sync, &rdev->flags)))
3295 return "a"; 3391 return "a";
3296 else 3392 else
3297 return "A"; 3393 return "A";
3298} 3394}
3299 3395
3300/* Helper to return resync/reshape progress for @rs and @array_in_sync */ 3396/* Helper to return resync/reshape progress for @rs and runtime flags for raid set in sync / resynching */
3301static sector_t rs_get_progress(struct raid_set *rs, 3397static sector_t rs_get_progress(struct raid_set *rs, unsigned long recovery,
3302 sector_t resync_max_sectors, bool *array_in_sync) 3398 sector_t resync_max_sectors)
3303{ 3399{
3304 sector_t r, curr_resync_completed; 3400 sector_t r;
3305 struct mddev *mddev = &rs->md; 3401 struct mddev *mddev = &rs->md;
3306 3402
3307 curr_resync_completed = mddev->curr_resync_completed ?: mddev->recovery_cp; 3403 clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
3308 *array_in_sync = false; 3404 clear_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
3309 3405
3310 if (rs_is_raid0(rs)) { 3406 if (rs_is_raid0(rs)) {
3311 r = resync_max_sectors; 3407 r = resync_max_sectors;
3312 *array_in_sync = true; 3408 set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
3313 3409
3314 } else { 3410 } else {
3315 r = mddev->reshape_position; 3411 if (test_bit(MD_RECOVERY_NEEDED, &recovery) ||
3316 3412 test_bit(MD_RECOVERY_RESHAPE, &recovery) ||
3317 /* Reshape is relative to the array size */ 3413 test_bit(MD_RECOVERY_RUNNING, &recovery))
3318 if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) || 3414 r = mddev->curr_resync_completed;
3319 r != MaxSector) {
3320 if (r == MaxSector) {
3321 *array_in_sync = true;
3322 r = resync_max_sectors;
3323 } else {
3324 /* Got to reverse on backward reshape */
3325 if (mddev->reshape_backwards)
3326 r = mddev->array_sectors - r;
3327
3328 /* Devide by # of data stripes */
3329 sector_div(r, mddev_data_stripes(rs));
3330 }
3331
3332 /* Sync is relative to the component device size */
3333 } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery))
3334 r = curr_resync_completed;
3335 else 3415 else
3336 r = mddev->recovery_cp; 3416 r = mddev->recovery_cp;
3337 3417
3338 if ((r == MaxSector) || 3418 if (r >= resync_max_sectors &&
3339 (test_bit(MD_RECOVERY_DONE, &mddev->recovery) && 3419 (!test_bit(MD_RECOVERY_REQUESTED, &recovery) ||
3340 (mddev->curr_resync_completed == resync_max_sectors))) { 3420 (!test_bit(MD_RECOVERY_FROZEN, &recovery) &&
3421 !test_bit(MD_RECOVERY_NEEDED, &recovery) &&
3422 !test_bit(MD_RECOVERY_RUNNING, &recovery)))) {
3341 /* 3423 /*
3342 * Sync complete. 3424 * Sync complete.
3343 */ 3425 */
3344 *array_in_sync = true; 3426 /* In case we have finished recovering, the array is in sync. */
3345 r = resync_max_sectors; 3427 if (test_bit(MD_RECOVERY_RECOVER, &recovery))
3346 } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { 3428 set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
3429
3430 } else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) {
3431 /*
3432 * In case we are recovering, the array is not in sync
3433 * and health chars should show the recovering legs.
3434 */
3435 ;
3436
3437 } else if (test_bit(MD_RECOVERY_SYNC, &recovery) &&
3438 !test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
3439 /*
3440 * If "resync" is occurring, the raid set
3441 * is or may be out of sync hence the health
3442 * characters shall be 'a'.
3443 */
3444 set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
3445
3446 } else if (test_bit(MD_RECOVERY_RESHAPE, &recovery) &&
3447 !test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
3448 /*
3449 * If "reshape" is occurring, the raid set
3450 * is or may be out of sync hence the health
3451 * characters shall be 'a'.
3452 */
3453 set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
3454
3455 } else if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) {
3347 /* 3456 /*
3348 * If "check" or "repair" is occurring, the raid set has 3457 * If "check" or "repair" is occurring, the raid set has
3349 * undergone an initial sync and the health characters 3458 * undergone an initial sync and the health characters
3350 * should not be 'a' anymore. 3459 * should not be 'a' anymore.
3351 */ 3460 */
3352 *array_in_sync = true; 3461 set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
3462
3353 } else { 3463 } else {
3354 struct md_rdev *rdev; 3464 struct md_rdev *rdev;
3355 3465
3356 /* 3466 /*
3467 * We are idle and recovery is needed, prevent 'A' chars race
3468 * caused by components still set to in-sync by constructor.
3469 */
3470 if (test_bit(MD_RECOVERY_NEEDED, &recovery))
3471 set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags);
3472
3473 /*
3357 * The raid set may be doing an initial sync, or it may 3474 * The raid set may be doing an initial sync, or it may
3358 * be rebuilding individual components. If all the 3475 * be rebuilding individual components. If all the
3359 * devices are In_sync, then it is the raid set that is 3476 * devices are In_sync, then it is the raid set that is
3360 * being initialized. 3477 * being initialized.
3361 */ 3478 */
3479 set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
3362 rdev_for_each(rdev, mddev) 3480 rdev_for_each(rdev, mddev)
3363 if (!test_bit(Journal, &rdev->flags) && 3481 if (!test_bit(Journal, &rdev->flags) &&
3364 !test_bit(In_sync, &rdev->flags)) 3482 !test_bit(In_sync, &rdev->flags)) {
3365 *array_in_sync = true; 3483 clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags);
3366#if 0 3484 break;
3367 r = 0; /* HM FIXME: TESTME: https://bugzilla.redhat.com/show_bug.cgi?id=1210637 ? */ 3485 }
3368#endif
3369 } 3486 }
3370 } 3487 }
3371 3488
3372 return r; 3489 return min(r, resync_max_sectors);
3373} 3490}
3374 3491
3375/* Helper to return @dev name or "-" if !@dev */ 3492/* Helper to return @dev name or "-" if !@dev */
@@ -3385,7 +3502,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
3385 struct mddev *mddev = &rs->md; 3502 struct mddev *mddev = &rs->md;
3386 struct r5conf *conf = mddev->private; 3503 struct r5conf *conf = mddev->private;
3387 int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0; 3504 int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0;
3388 bool array_in_sync; 3505 unsigned long recovery;
3389 unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */ 3506 unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */
3390 unsigned int sz = 0; 3507 unsigned int sz = 0;
3391 unsigned int rebuild_disks; 3508 unsigned int rebuild_disks;
@@ -3405,17 +3522,18 @@ static void raid_status(struct dm_target *ti, status_type_t type,
3405 3522
3406 /* Access most recent mddev properties for status output */ 3523 /* Access most recent mddev properties for status output */
3407 smp_rmb(); 3524 smp_rmb();
3525 recovery = rs->md.recovery;
3408 /* Get sensible max sectors even if raid set not yet started */ 3526 /* Get sensible max sectors even if raid set not yet started */
3409 resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ? 3527 resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ?
3410 mddev->resync_max_sectors : mddev->dev_sectors; 3528 mddev->resync_max_sectors : mddev->dev_sectors;
3411 progress = rs_get_progress(rs, resync_max_sectors, &array_in_sync); 3529 progress = rs_get_progress(rs, recovery, resync_max_sectors);
3412 resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? 3530 resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ?
3413 atomic64_read(&mddev->resync_mismatches) : 0; 3531 atomic64_read(&mddev->resync_mismatches) : 0;
3414 sync_action = decipher_sync_action(&rs->md); 3532 sync_action = decipher_sync_action(&rs->md, recovery);
3415 3533
3416 /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */ 3534 /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */
3417 for (i = 0; i < rs->raid_disks; i++) 3535 for (i = 0; i < rs->raid_disks; i++)
3418 DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync)); 3536 DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev));
3419 3537
3420 /* 3538 /*
3421 * In-sync/Reshape ratio: 3539 * In-sync/Reshape ratio:
@@ -3466,7 +3584,7 @@ static void raid_status(struct dm_target *ti, status_type_t type,
3466 * v1.10.0+: 3584 * v1.10.0+:
3467 */ 3585 */
3468 DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? 3586 DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ?
3469 __raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-"); 3587 __raid_dev_status(rs, &rs->journal_dev.rdev) : "-");
3470 break; 3588 break;
3471 3589
3472 case STATUSTYPE_TABLE: 3590 case STATUSTYPE_TABLE:
@@ -3622,24 +3740,19 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits)
3622 blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs)); 3740 blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs));
3623} 3741}
3624 3742
3625static void raid_presuspend(struct dm_target *ti)
3626{
3627 struct raid_set *rs = ti->private;
3628
3629 md_stop_writes(&rs->md);
3630}
3631
3632static void raid_postsuspend(struct dm_target *ti) 3743static void raid_postsuspend(struct dm_target *ti)
3633{ 3744{
3634 struct raid_set *rs = ti->private; 3745 struct raid_set *rs = ti->private;
3635 3746
3636 if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { 3747 if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
3748 /* Writes have to be stopped before suspending to avoid deadlocks. */
3749 if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery))
3750 md_stop_writes(&rs->md);
3751
3637 mddev_lock_nointr(&rs->md); 3752 mddev_lock_nointr(&rs->md);
3638 mddev_suspend(&rs->md); 3753 mddev_suspend(&rs->md);
3639 mddev_unlock(&rs->md); 3754 mddev_unlock(&rs->md);
3640 } 3755 }
3641
3642 rs->md.ro = 1;
3643} 3756}
3644 3757
3645static void attempt_restore_of_faulty_devices(struct raid_set *rs) 3758static void attempt_restore_of_faulty_devices(struct raid_set *rs)
@@ -3816,10 +3929,33 @@ static int raid_preresume(struct dm_target *ti)
3816 struct raid_set *rs = ti->private; 3929 struct raid_set *rs = ti->private;
3817 struct mddev *mddev = &rs->md; 3930 struct mddev *mddev = &rs->md;
3818 3931
3819 /* This is a resume after a suspend of the set -> it's already started */ 3932 /* This is a resume after a suspend of the set -> it's already started. */
3820 if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags)) 3933 if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags))
3821 return 0; 3934 return 0;
3822 3935
3936 if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags)) {
3937 struct raid_set *rs_active = rs_find_active(rs);
3938
3939 if (rs_active) {
3940 /*
3941 * In case no rebuilds have been requested
3942 * and an active table slot exists, copy
3943 * current resynchronization completed and
3944 * reshape position pointers across from
3945 * suspended raid set in the active slot.
3946 *
3947 * This resumes the new mapping at current
3948 * offsets to continue recover/reshape without
3949 * necessarily redoing a raid set partially or
3950 * causing data corruption in case of a reshape.
3951 */
3952 if (rs_active->md.curr_resync_completed != MaxSector)
3953 mddev->curr_resync_completed = rs_active->md.curr_resync_completed;
3954 if (rs_active->md.reshape_position != MaxSector)
3955 mddev->reshape_position = rs_active->md.reshape_position;
3956 }
3957 }
3958
3823 /* 3959 /*
3824 * The superblocks need to be updated on disk if the 3960 * The superblocks need to be updated on disk if the
3825 * array is new or new devices got added (thus zeroed 3961 * array is new or new devices got added (thus zeroed
@@ -3851,11 +3987,10 @@ static int raid_preresume(struct dm_target *ti)
3851 mddev->resync_min = mddev->recovery_cp; 3987 mddev->resync_min = mddev->recovery_cp;
3852 } 3988 }
3853 3989
3854 rs_set_capacity(rs);
3855
3856 /* Check for any reshape request unless new raid set */ 3990 /* Check for any reshape request unless new raid set */
3857 if (test_and_clear_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) { 3991 if (test_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) {
3858 /* Initiate a reshape. */ 3992 /* Initiate a reshape. */
3993 rs_set_rdev_sectors(rs);
3859 mddev_lock_nointr(mddev); 3994 mddev_lock_nointr(mddev);
3860 r = rs_start_reshape(rs); 3995 r = rs_start_reshape(rs);
3861 mddev_unlock(mddev); 3996 mddev_unlock(mddev);
@@ -3881,21 +4016,15 @@ static void raid_resume(struct dm_target *ti)
3881 attempt_restore_of_faulty_devices(rs); 4016 attempt_restore_of_faulty_devices(rs);
3882 } 4017 }
3883 4018
3884 mddev->ro = 0;
3885 mddev->in_sync = 0;
3886
3887 /*
3888 * Keep the RAID set frozen if reshape/rebuild flags are set.
3889 * The RAID set is unfrozen once the next table load/resume,
3890 * which clears the reshape/rebuild flags, occurs.
3891 * This ensures that the constructor for the inactive table
3892 * retrieves an up-to-date reshape_position.
3893 */
3894 if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS))
3895 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
3896
3897 if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { 4019 if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) {
4020 /* Only reduce raid set size before running a disk removing reshape. */
4021 if (mddev->delta_disks < 0)
4022 rs_set_capacity(rs);
4023
3898 mddev_lock_nointr(mddev); 4024 mddev_lock_nointr(mddev);
4025 clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery);
4026 mddev->ro = 0;
4027 mddev->in_sync = 0;
3899 mddev_resume(mddev); 4028 mddev_resume(mddev);
3900 mddev_unlock(mddev); 4029 mddev_unlock(mddev);
3901 } 4030 }
@@ -3903,7 +4032,7 @@ static void raid_resume(struct dm_target *ti)
3903 4032
3904static struct target_type raid_target = { 4033static struct target_type raid_target = {
3905 .name = "raid", 4034 .name = "raid",
3906 .version = {1, 13, 0}, 4035 .version = {1, 13, 2},
3907 .module = THIS_MODULE, 4036 .module = THIS_MODULE,
3908 .ctr = raid_ctr, 4037 .ctr = raid_ctr,
3909 .dtr = raid_dtr, 4038 .dtr = raid_dtr,
@@ -3912,7 +4041,6 @@ static struct target_type raid_target = {
3912 .message = raid_message, 4041 .message = raid_message,
3913 .iterate_devices = raid_iterate_devices, 4042 .iterate_devices = raid_iterate_devices,
3914 .io_hints = raid_io_hints, 4043 .io_hints = raid_io_hints,
3915 .presuspend = raid_presuspend,
3916 .postsuspend = raid_postsuspend, 4044 .postsuspend = raid_postsuspend,
3917 .preresume = raid_preresume, 4045 .preresume = raid_preresume,
3918 .resume = raid_resume, 4046 .resume = raid_resume,
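Several of the dm-raid status changes above stop reading mddev->recovery repeatedly and instead pass one sampled copy of the flag word into decipher_sync_action(), rs_get_progress() and, via the new RT_FLAG_RS_IN_SYNC / RT_FLAG_RS_RESYNCING runtime flags, __raid_dev_status(), so the whole status line is derived from a single consistent view. Below is a userspace sketch of that pattern, not kernel code; the bit names are invented stand-ins for the MD_RECOVERY_* flags, and the trailing "idle" string is assumed to match the unchanged tail of decipher_sync_action().

/*
 * Userspace model: sample the recovery flag word once, then derive the
 * sync-action string (and anything else) from that single snapshot so
 * concurrent flag updates cannot produce a mixed view.
 */
#include <stdio.h>

enum {
    REC_FROZEN, REC_RUNNING, REC_NEEDED, REC_DONE,
    REC_RESHAPE, REC_SYNC, REC_REQUESTED, REC_CHECK, REC_RECOVER,
};

static int test_bit(int nr, unsigned long word)
{
    return (word >> nr) & 1UL;
}

/* Mirrors the structure of the reworked decipher_sync_action(), on a plain value. */
static const char *sync_action(unsigned long recovery, int read_only)
{
    if (test_bit(REC_FROZEN, recovery))
        return "frozen";

    /* The sync thread can be done with io but still be running. */
    if (!test_bit(REC_DONE, recovery) &&
        (test_bit(REC_RUNNING, recovery) ||
         (!read_only && test_bit(REC_NEEDED, recovery)))) {
        if (test_bit(REC_RESHAPE, recovery))
            return "reshape";
        if (test_bit(REC_SYNC, recovery)) {
            if (!test_bit(REC_REQUESTED, recovery))
                return "resync";
            return test_bit(REC_CHECK, recovery) ? "check" : "repair";
        }
        if (test_bit(REC_RECOVER, recovery))
            return "recover";
    }
    return "idle";
}

int main(void)
{
    /* Sample the (possibly concurrently updated) flag word exactly once ... */
    unsigned long shared_flags = (1UL << REC_RUNNING) | (1UL << REC_SYNC);
    unsigned long recovery = shared_flags;   /* local snapshot */

    /* ... then derive every piece of status from the same snapshot. */
    printf("sync_action=%s\n", sync_action(recovery, 0)); /* prints "resync" */
    return 0;
}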
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index b7d175e94a02..aeaaaef43eff 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped)
315 /* The target wants to requeue the I/O */ 315 /* The target wants to requeue the I/O */
316 dm_requeue_original_request(tio, false); 316 dm_requeue_original_request(tio, false);
317 break; 317 break;
318 case DM_ENDIO_DELAY_REQUEUE:
319 /* The target wants to requeue the I/O after a delay */
320 dm_requeue_original_request(tio, true);
321 break;
318 default: 322 default:
319 DMWARN("unimplemented target endio return value: %d", r); 323 DMWARN("unimplemented target endio return value: %d", r);
320 BUG(); 324 BUG();
@@ -713,7 +717,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t)
713 /* disable dm_old_request_fn's merge heuristic by default */ 717 /* disable dm_old_request_fn's merge heuristic by default */
714 md->seq_rq_merge_deadline_usecs = 0; 718 md->seq_rq_merge_deadline_usecs = 0;
715 719
716 dm_init_normal_md_queue(md);
717 blk_queue_softirq_done(md->queue, dm_softirq_done); 720 blk_queue_softirq_done(md->queue, dm_softirq_done);
718 721
719 /* Initialize the request-based DM worker thread */ 722 /* Initialize the request-based DM worker thread */
@@ -821,7 +824,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t)
821 err = PTR_ERR(q); 824 err = PTR_ERR(q);
822 goto out_tag_set; 825 goto out_tag_set;
823 } 826 }
824 dm_init_md_queue(md);
825 827
826 return 0; 828 return 0;
827 829
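The dm-rq.c hunk above adds DM_ENDIO_DELAY_REQUEUE handling to dm_done(), letting a request-based target (dm-mpath when a path reports itself busy) ask for the clone to be requeued after a delay rather than immediately. The sketch below is a userspace model of that dispatch only, not kernel code; the enum values and the requeue helper are illustrative assumptions.

/*
 * Userspace model of the new DM_ENDIO_DELAY_REQUEUE case: a requeue now
 * carries a "delayed" flag so the core can back off before retrying.
 */
#include <stdio.h>

enum dm_endio { DM_ENDIO_DONE, DM_ENDIO_REQUEUE, DM_ENDIO_DELAY_REQUEUE };

static void requeue_request(int delayed)
{
    printf("requeue request%s\n", delayed ? " after a delay" : " immediately");
}

static void dispatch_endio(enum dm_endio r)
{
    switch (r) {
    case DM_ENDIO_DONE:
        printf("complete request\n");
        break;
    case DM_ENDIO_REQUEUE:
        requeue_request(0);          /* retry right away */
        break;
    case DM_ENDIO_DELAY_REQUEUE:
        requeue_request(1);          /* target said "busy": retry later */
        break;
    default:
        printf("unexpected endio value %d\n", r);
        break;
    }
}

int main(void)
{
    dispatch_endio(DM_ENDIO_DONE);
    dispatch_endio(DM_ENDIO_REQUEUE);
    dispatch_endio(DM_ENDIO_DELAY_REQUEUE);
    return 0;
}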
diff --git a/drivers/md/dm-service-time.c b/drivers/md/dm-service-time.c
index 7b8642045c55..f006a9005593 100644
--- a/drivers/md/dm-service-time.c
+++ b/drivers/md/dm-service-time.c
@@ -282,9 +282,6 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
282 if (list_empty(&s->valid_paths)) 282 if (list_empty(&s->valid_paths))
283 goto out; 283 goto out;
284 284
285 /* Change preferred (first in list) path to evenly balance. */
286 list_move_tail(s->valid_paths.next, &s->valid_paths);
287
288 list_for_each_entry(pi, &s->valid_paths, list) 285 list_for_each_entry(pi, &s->valid_paths, list)
289 if (!best || (st_compare_load(pi, best, nr_bytes) < 0)) 286 if (!best || (st_compare_load(pi, best, nr_bytes) < 0))
290 best = pi; 287 best = pi;
@@ -292,6 +289,9 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes)
292 if (!best) 289 if (!best)
293 goto out; 290 goto out;
294 291
292 /* Move most recently used to least preferred to evenly balance. */
293 list_move_tail(&best->list, &s->valid_paths);
294
295 ret = best->path; 295 ret = best->path;
296out: 296out:
297 spin_unlock_irqrestore(&s->lock, flags); 297 spin_unlock_irqrestore(&s->lock, flags);
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c
index a0613bd8ed00..216035be5661 100644
--- a/drivers/md/dm-snap.c
+++ b/drivers/md/dm-snap.c
@@ -47,7 +47,7 @@ struct dm_exception_table {
47}; 47};
48 48
49struct dm_snapshot { 49struct dm_snapshot {
50 struct rw_semaphore lock; 50 struct mutex lock;
51 51
52 struct dm_dev *origin; 52 struct dm_dev *origin;
53 struct dm_dev *cow; 53 struct dm_dev *cow;
@@ -439,9 +439,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap,
439 if (!bdev_equal(s->cow->bdev, snap->cow->bdev)) 439 if (!bdev_equal(s->cow->bdev, snap->cow->bdev))
440 continue; 440 continue;
441 441
442 down_read(&s->lock); 442 mutex_lock(&s->lock);
443 active = s->active; 443 active = s->active;
444 up_read(&s->lock); 444 mutex_unlock(&s->lock);
445 445
446 if (active) { 446 if (active) {
447 if (snap_src) 447 if (snap_src)
@@ -909,7 +909,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
909 int r; 909 int r;
910 chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1; 910 chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1;
911 911
912 down_write(&s->lock); 912 mutex_lock(&s->lock);
913 913
914 /* 914 /*
915 * Process chunks (and associated exceptions) in reverse order 915 * Process chunks (and associated exceptions) in reverse order
@@ -924,7 +924,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s)
924 b = __release_queued_bios_after_merge(s); 924 b = __release_queued_bios_after_merge(s);
925 925
926out: 926out:
927 up_write(&s->lock); 927 mutex_unlock(&s->lock);
928 if (b) 928 if (b)
929 flush_bios(b); 929 flush_bios(b);
930 930
@@ -983,9 +983,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
983 if (linear_chunks < 0) { 983 if (linear_chunks < 0) {
984 DMERR("Read error in exception store: " 984 DMERR("Read error in exception store: "
985 "shutting down merge"); 985 "shutting down merge");
986 down_write(&s->lock); 986 mutex_lock(&s->lock);
987 s->merge_failed = 1; 987 s->merge_failed = 1;
988 up_write(&s->lock); 988 mutex_unlock(&s->lock);
989 } 989 }
990 goto shut; 990 goto shut;
991 } 991 }
@@ -1026,10 +1026,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s)
1026 previous_count = read_pending_exceptions_done_count(); 1026 previous_count = read_pending_exceptions_done_count();
1027 } 1027 }
1028 1028
1029 down_write(&s->lock); 1029 mutex_lock(&s->lock);
1030 s->first_merging_chunk = old_chunk; 1030 s->first_merging_chunk = old_chunk;
1031 s->num_merging_chunks = linear_chunks; 1031 s->num_merging_chunks = linear_chunks;
1032 up_write(&s->lock); 1032 mutex_unlock(&s->lock);
1033 1033
1034 /* Wait until writes to all 'linear_chunks' drain */ 1034 /* Wait until writes to all 'linear_chunks' drain */
1035 for (i = 0; i < linear_chunks; i++) 1035 for (i = 0; i < linear_chunks; i++)
@@ -1071,10 +1071,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context)
1071 return; 1071 return;
1072 1072
1073shut: 1073shut:
1074 down_write(&s->lock); 1074 mutex_lock(&s->lock);
1075 s->merge_failed = 1; 1075 s->merge_failed = 1;
1076 b = __release_queued_bios_after_merge(s); 1076 b = __release_queued_bios_after_merge(s);
1077 up_write(&s->lock); 1077 mutex_unlock(&s->lock);
1078 error_bios(b); 1078 error_bios(b);
1079 1079
1080 merge_shutdown(s); 1080 merge_shutdown(s);
@@ -1173,7 +1173,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv)
1173 s->exception_start_sequence = 0; 1173 s->exception_start_sequence = 0;
1174 s->exception_complete_sequence = 0; 1174 s->exception_complete_sequence = 0;
1175 INIT_LIST_HEAD(&s->out_of_order_list); 1175 INIT_LIST_HEAD(&s->out_of_order_list);
1176 init_rwsem(&s->lock); 1176 mutex_init(&s->lock);
1177 INIT_LIST_HEAD(&s->list); 1177 INIT_LIST_HEAD(&s->list);
1178 spin_lock_init(&s->pe_lock); 1178 spin_lock_init(&s->pe_lock);
1179 s->state_bits = 0; 1179 s->state_bits = 0;
@@ -1338,9 +1338,9 @@ static void snapshot_dtr(struct dm_target *ti)
1338 /* Check whether exception handover must be cancelled */ 1338 /* Check whether exception handover must be cancelled */
1339 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); 1339 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
1340 if (snap_src && snap_dest && (s == snap_src)) { 1340 if (snap_src && snap_dest && (s == snap_src)) {
1341 down_write(&snap_dest->lock); 1341 mutex_lock(&snap_dest->lock);
1342 snap_dest->valid = 0; 1342 snap_dest->valid = 0;
1343 up_write(&snap_dest->lock); 1343 mutex_unlock(&snap_dest->lock);
1344 DMERR("Cancelling snapshot handover."); 1344 DMERR("Cancelling snapshot handover.");
1345 } 1345 }
1346 up_read(&_origins_lock); 1346 up_read(&_origins_lock);
@@ -1371,6 +1371,8 @@ static void snapshot_dtr(struct dm_target *ti)
1371 1371
1372 dm_exception_store_destroy(s->store); 1372 dm_exception_store_destroy(s->store);
1373 1373
1374 mutex_destroy(&s->lock);
1375
1374 dm_put_device(ti, s->cow); 1376 dm_put_device(ti, s->cow);
1375 1377
1376 dm_put_device(ti, s->origin); 1378 dm_put_device(ti, s->origin);
@@ -1458,7 +1460,7 @@ static void pending_complete(void *context, int success)
1458 1460
1459 if (!success) { 1461 if (!success) {
1460 /* Read/write error - snapshot is unusable */ 1462 /* Read/write error - snapshot is unusable */
1461 down_write(&s->lock); 1463 mutex_lock(&s->lock);
1462 __invalidate_snapshot(s, -EIO); 1464 __invalidate_snapshot(s, -EIO);
1463 error = 1; 1465 error = 1;
1464 goto out; 1466 goto out;
@@ -1466,14 +1468,14 @@ static void pending_complete(void *context, int success)
1466 1468
1467 e = alloc_completed_exception(GFP_NOIO); 1469 e = alloc_completed_exception(GFP_NOIO);
1468 if (!e) { 1470 if (!e) {
1469 down_write(&s->lock); 1471 mutex_lock(&s->lock);
1470 __invalidate_snapshot(s, -ENOMEM); 1472 __invalidate_snapshot(s, -ENOMEM);
1471 error = 1; 1473 error = 1;
1472 goto out; 1474 goto out;
1473 } 1475 }
1474 *e = pe->e; 1476 *e = pe->e;
1475 1477
1476 down_write(&s->lock); 1478 mutex_lock(&s->lock);
1477 if (!s->valid) { 1479 if (!s->valid) {
1478 free_completed_exception(e); 1480 free_completed_exception(e);
1479 error = 1; 1481 error = 1;
@@ -1498,7 +1500,7 @@ out:
1498 full_bio->bi_end_io = pe->full_bio_end_io; 1500 full_bio->bi_end_io = pe->full_bio_end_io;
1499 increment_pending_exceptions_done_count(); 1501 increment_pending_exceptions_done_count();
1500 1502
1501 up_write(&s->lock); 1503 mutex_unlock(&s->lock);
1502 1504
1503 /* Submit any pending write bios */ 1505 /* Submit any pending write bios */
1504 if (error) { 1506 if (error) {
@@ -1694,7 +1696,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
1694 1696
1695 /* FIXME: should only take write lock if we need 1697 /* FIXME: should only take write lock if we need
1696 * to copy an exception */ 1698 * to copy an exception */
1697 down_write(&s->lock); 1699 mutex_lock(&s->lock);
1698 1700
1699 if (!s->valid || (unlikely(s->snapshot_overflowed) && 1701 if (!s->valid || (unlikely(s->snapshot_overflowed) &&
1700 bio_data_dir(bio) == WRITE)) { 1702 bio_data_dir(bio) == WRITE)) {
@@ -1717,9 +1719,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
1717 if (bio_data_dir(bio) == WRITE) { 1719 if (bio_data_dir(bio) == WRITE) {
1718 pe = __lookup_pending_exception(s, chunk); 1720 pe = __lookup_pending_exception(s, chunk);
1719 if (!pe) { 1721 if (!pe) {
1720 up_write(&s->lock); 1722 mutex_unlock(&s->lock);
1721 pe = alloc_pending_exception(s); 1723 pe = alloc_pending_exception(s);
1722 down_write(&s->lock); 1724 mutex_lock(&s->lock);
1723 1725
1724 if (!s->valid || s->snapshot_overflowed) { 1726 if (!s->valid || s->snapshot_overflowed) {
1725 free_pending_exception(pe); 1727 free_pending_exception(pe);
@@ -1754,7 +1756,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
1754 bio->bi_iter.bi_size == 1756 bio->bi_iter.bi_size ==
1755 (s->store->chunk_size << SECTOR_SHIFT)) { 1757 (s->store->chunk_size << SECTOR_SHIFT)) {
1756 pe->started = 1; 1758 pe->started = 1;
1757 up_write(&s->lock); 1759 mutex_unlock(&s->lock);
1758 start_full_bio(pe, bio); 1760 start_full_bio(pe, bio);
1759 goto out; 1761 goto out;
1760 } 1762 }
@@ -1764,7 +1766,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
1764 if (!pe->started) { 1766 if (!pe->started) {
1765 /* this is protected by snap->lock */ 1767 /* this is protected by snap->lock */
1766 pe->started = 1; 1768 pe->started = 1;
1767 up_write(&s->lock); 1769 mutex_unlock(&s->lock);
1768 start_copy(pe); 1770 start_copy(pe);
1769 goto out; 1771 goto out;
1770 } 1772 }
@@ -1774,7 +1776,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio)
1774 } 1776 }
1775 1777
1776out_unlock: 1778out_unlock:
1777 up_write(&s->lock); 1779 mutex_unlock(&s->lock);
1778out: 1780out:
1779 return r; 1781 return r;
1780} 1782}
@@ -1810,7 +1812,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio)
1810 1812
1811 chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector); 1813 chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector);
1812 1814
1813 down_write(&s->lock); 1815 mutex_lock(&s->lock);
1814 1816
1815 /* Full merging snapshots are redirected to the origin */ 1817 /* Full merging snapshots are redirected to the origin */
1816 if (!s->valid) 1818 if (!s->valid)
@@ -1841,12 +1843,12 @@ redirect_to_origin:
1841 bio_set_dev(bio, s->origin->bdev); 1843 bio_set_dev(bio, s->origin->bdev);
1842 1844
1843 if (bio_data_dir(bio) == WRITE) { 1845 if (bio_data_dir(bio) == WRITE) {
1844 up_write(&s->lock); 1846 mutex_unlock(&s->lock);
1845 return do_origin(s->origin, bio); 1847 return do_origin(s->origin, bio);
1846 } 1848 }
1847 1849
1848out_unlock: 1850out_unlock:
1849 up_write(&s->lock); 1851 mutex_unlock(&s->lock);
1850 1852
1851 return r; 1853 return r;
1852} 1854}
@@ -1878,7 +1880,7 @@ static int snapshot_preresume(struct dm_target *ti)
1878 down_read(&_origins_lock); 1880 down_read(&_origins_lock);
1879 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); 1881 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
1880 if (snap_src && snap_dest) { 1882 if (snap_src && snap_dest) {
1881 down_read(&snap_src->lock); 1883 mutex_lock(&snap_src->lock);
1882 if (s == snap_src) { 1884 if (s == snap_src) {
1883 DMERR("Unable to resume snapshot source until " 1885 DMERR("Unable to resume snapshot source until "
1884 "handover completes."); 1886 "handover completes.");
@@ -1888,7 +1890,7 @@ static int snapshot_preresume(struct dm_target *ti)
1888 "source is suspended."); 1890 "source is suspended.");
1889 r = -EINVAL; 1891 r = -EINVAL;
1890 } 1892 }
1891 up_read(&snap_src->lock); 1893 mutex_unlock(&snap_src->lock);
1892 } 1894 }
1893 up_read(&_origins_lock); 1895 up_read(&_origins_lock);
1894 1896
@@ -1934,11 +1936,11 @@ static void snapshot_resume(struct dm_target *ti)
1934 1936
1935 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); 1937 (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL);
1936 if (snap_src && snap_dest) { 1938 if (snap_src && snap_dest) {
1937 down_write(&snap_src->lock); 1939 mutex_lock(&snap_src->lock);
1938 down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING); 1940 mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING);
1939 __handover_exceptions(snap_src, snap_dest); 1941 __handover_exceptions(snap_src, snap_dest);
1940 up_write(&snap_dest->lock); 1942 mutex_unlock(&snap_dest->lock);
1941 up_write(&snap_src->lock); 1943 mutex_unlock(&snap_src->lock);
1942 } 1944 }
1943 1945
1944 up_read(&_origins_lock); 1946 up_read(&_origins_lock);
@@ -1953,9 +1955,9 @@ static void snapshot_resume(struct dm_target *ti)
1953 /* Now we have correct chunk size, reregister */ 1955 /* Now we have correct chunk size, reregister */
1954 reregister_snapshot(s); 1956 reregister_snapshot(s);
1955 1957
1956 down_write(&s->lock); 1958 mutex_lock(&s->lock);
1957 s->active = 1; 1959 s->active = 1;
1958 up_write(&s->lock); 1960 mutex_unlock(&s->lock);
1959} 1961}
1960 1962
1961static uint32_t get_origin_minimum_chunksize(struct block_device *bdev) 1963static uint32_t get_origin_minimum_chunksize(struct block_device *bdev)
@@ -1995,7 +1997,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
1995 switch (type) { 1997 switch (type) {
1996 case STATUSTYPE_INFO: 1998 case STATUSTYPE_INFO:
1997 1999
1998 down_write(&snap->lock); 2000 mutex_lock(&snap->lock);
1999 2001
2000 if (!snap->valid) 2002 if (!snap->valid)
2001 DMEMIT("Invalid"); 2003 DMEMIT("Invalid");
@@ -2020,7 +2022,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type,
2020 DMEMIT("Unknown"); 2022 DMEMIT("Unknown");
2021 } 2023 }
2022 2024
2023 up_write(&snap->lock); 2025 mutex_unlock(&snap->lock);
2024 2026
2025 break; 2027 break;
2026 2028
@@ -2086,7 +2088,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
2086 if (dm_target_is_snapshot_merge(snap->ti)) 2088 if (dm_target_is_snapshot_merge(snap->ti))
2087 continue; 2089 continue;
2088 2090
2089 down_write(&snap->lock); 2091 mutex_lock(&snap->lock);
2090 2092
2091 /* Only deal with valid and active snapshots */ 2093 /* Only deal with valid and active snapshots */
2092 if (!snap->valid || !snap->active) 2094 if (!snap->valid || !snap->active)
@@ -2113,9 +2115,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
2113 2115
2114 pe = __lookup_pending_exception(snap, chunk); 2116 pe = __lookup_pending_exception(snap, chunk);
2115 if (!pe) { 2117 if (!pe) {
2116 up_write(&snap->lock); 2118 mutex_unlock(&snap->lock);
2117 pe = alloc_pending_exception(snap); 2119 pe = alloc_pending_exception(snap);
2118 down_write(&snap->lock); 2120 mutex_lock(&snap->lock);
2119 2121
2120 if (!snap->valid) { 2122 if (!snap->valid) {
2121 free_pending_exception(pe); 2123 free_pending_exception(pe);
@@ -2158,7 +2160,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector,
2158 } 2160 }
2159 2161
2160next_snapshot: 2162next_snapshot:
2161 up_write(&snap->lock); 2163 mutex_unlock(&snap->lock);
2162 2164
2163 if (pe_to_start_now) { 2165 if (pe_to_start_now) {
2164 start_copy(pe_to_start_now); 2166 start_copy(pe_to_start_now);
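The dm-snap.c changes above replace the per-snapshot rw_semaphore with a mutex (nearly every user took it exclusively anyway) and backfill the missing destroy call in snapshot_dtr(); dm-stats receives the same mutex_destroy() backfill just below. The sketch that follows models the resulting init/lock/unlock/destroy lifecycle in userspace, with pthread mutexes standing in for the kernel mutex API; the structure and function names are simplified stand-ins.

/*
 * Userspace model (pthread instead of the kernel mutex API) of the dm-snap
 * locking change: plain mutex instead of rw_semaphore, plus an explicit
 * destroy on the teardown path.
 */
#include <pthread.h>
#include <stdio.h>

struct snapshot {
    pthread_mutex_t lock;
    int valid;
};

static void snapshot_ctr(struct snapshot *s)
{
    pthread_mutex_init(&s->lock, NULL);  /* was: init_rwsem() */
    s->valid = 1;
}

static void snapshot_map(struct snapshot *s)
{
    pthread_mutex_lock(&s->lock);        /* was: down_write() / down_read() */
    if (s->valid)
        printf("remap bio under the snapshot lock\n");
    pthread_mutex_unlock(&s->lock);      /* was: up_write() / up_read() */
}

static void snapshot_dtr(struct snapshot *s)
{
    pthread_mutex_destroy(&s->lock);     /* backfilled: no destroy call before */
}

int main(void)
{
    struct snapshot s;

    snapshot_ctr(&s);
    snapshot_map(&s);
    snapshot_dtr(&s);
    return 0;
}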
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
index 29bc51084c82..56059fb56e2d 100644
--- a/drivers/md/dm-stats.c
+++ b/drivers/md/dm-stats.c
@@ -228,6 +228,7 @@ void dm_stats_cleanup(struct dm_stats *stats)
228 dm_stat_free(&s->rcu_head); 228 dm_stat_free(&s->rcu_head);
229 } 229 }
230 free_percpu(stats->last); 230 free_percpu(stats->last);
231 mutex_destroy(&stats->mutex);
231} 232}
232 233
233static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end, 234static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index aaffd0c0ee9a..5fe7ec356c33 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -866,7 +866,8 @@ EXPORT_SYMBOL(dm_consume_args);
866static bool __table_type_bio_based(enum dm_queue_mode table_type) 866static bool __table_type_bio_based(enum dm_queue_mode table_type)
867{ 867{
868 return (table_type == DM_TYPE_BIO_BASED || 868 return (table_type == DM_TYPE_BIO_BASED ||
869 table_type == DM_TYPE_DAX_BIO_BASED); 869 table_type == DM_TYPE_DAX_BIO_BASED ||
870 table_type == DM_TYPE_NVME_BIO_BASED);
870} 871}
871 872
872static bool __table_type_request_based(enum dm_queue_mode table_type) 873static bool __table_type_request_based(enum dm_queue_mode table_type)
@@ -909,13 +910,33 @@ static bool dm_table_supports_dax(struct dm_table *t)
909 return true; 910 return true;
910} 911}
911 912
913static bool dm_table_does_not_support_partial_completion(struct dm_table *t);
914
915struct verify_rq_based_data {
916 unsigned sq_count;
917 unsigned mq_count;
918};
919
920static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev,
921 sector_t start, sector_t len, void *data)
922{
923 struct request_queue *q = bdev_get_queue(dev->bdev);
924 struct verify_rq_based_data *v = data;
925
926 if (q->mq_ops)
927 v->mq_count++;
928 else
929 v->sq_count++;
930
931 return queue_is_rq_based(q);
932}
933
912static int dm_table_determine_type(struct dm_table *t) 934static int dm_table_determine_type(struct dm_table *t)
913{ 935{
914 unsigned i; 936 unsigned i;
915 unsigned bio_based = 0, request_based = 0, hybrid = 0; 937 unsigned bio_based = 0, request_based = 0, hybrid = 0;
916 unsigned sq_count = 0, mq_count = 0; 938 struct verify_rq_based_data v = {.sq_count = 0, .mq_count = 0};
917 struct dm_target *tgt; 939 struct dm_target *tgt;
918 struct dm_dev_internal *dd;
919 struct list_head *devices = dm_table_get_devices(t); 940 struct list_head *devices = dm_table_get_devices(t);
920 enum dm_queue_mode live_md_type = dm_get_md_type(t->md); 941 enum dm_queue_mode live_md_type = dm_get_md_type(t->md);
921 942
@@ -923,6 +944,14 @@ static int dm_table_determine_type(struct dm_table *t)
923 /* target already set the table's type */ 944 /* target already set the table's type */
924 if (t->type == DM_TYPE_BIO_BASED) 945 if (t->type == DM_TYPE_BIO_BASED)
925 return 0; 946 return 0;
947 else if (t->type == DM_TYPE_NVME_BIO_BASED) {
948 if (!dm_table_does_not_support_partial_completion(t)) {
949 DMERR("nvme bio-based is only possible with devices"
950 " that don't support partial completion");
951 return -EINVAL;
952 }
953 /* Fallthru, also verify all devices are blk-mq */
954 }
926 BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED); 955 BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED);
927 goto verify_rq_based; 956 goto verify_rq_based;
928 } 957 }
@@ -937,8 +966,8 @@ static int dm_table_determine_type(struct dm_table *t)
937 bio_based = 1; 966 bio_based = 1;
938 967
939 if (bio_based && request_based) { 968 if (bio_based && request_based) {
940 DMWARN("Inconsistent table: different target types" 969 DMERR("Inconsistent table: different target types"
941 " can't be mixed up"); 970 " can't be mixed up");
942 return -EINVAL; 971 return -EINVAL;
943 } 972 }
944 } 973 }
@@ -959,8 +988,18 @@ static int dm_table_determine_type(struct dm_table *t)
959 /* We must use this table as bio-based */ 988 /* We must use this table as bio-based */
960 t->type = DM_TYPE_BIO_BASED; 989 t->type = DM_TYPE_BIO_BASED;
961 if (dm_table_supports_dax(t) || 990 if (dm_table_supports_dax(t) ||
962 (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) 991 (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) {
963 t->type = DM_TYPE_DAX_BIO_BASED; 992 t->type = DM_TYPE_DAX_BIO_BASED;
993 } else {
994 /* Check if upgrading to NVMe bio-based is valid or required */
995 tgt = dm_table_get_immutable_target(t);
996 if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) {
997 t->type = DM_TYPE_NVME_BIO_BASED;
998 goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */
999 } else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) {
1000 t->type = DM_TYPE_NVME_BIO_BASED;
1001 }
1002 }
964 return 0; 1003 return 0;
965 } 1004 }
966 1005
@@ -980,7 +1019,8 @@ verify_rq_based:
980 * (e.g. request completion process for partial completion.) 1019 * (e.g. request completion process for partial completion.)
981 */ 1020 */
982 if (t->num_targets > 1) { 1021 if (t->num_targets > 1) {
983 DMWARN("Request-based dm doesn't support multiple targets yet"); 1022 DMERR("%s DM doesn't support multiple targets",
1023 t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based");
984 return -EINVAL; 1024 return -EINVAL;
985 } 1025 }
986 1026
@@ -997,28 +1037,29 @@ verify_rq_based:
997 return 0; 1037 return 0;
998 } 1038 }
999 1039
1000 /* Non-request-stackable devices can't be used for request-based dm */ 1040 tgt = dm_table_get_immutable_target(t);
1001 list_for_each_entry(dd, devices, list) { 1041 if (!tgt) {
1002 struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev); 1042 DMERR("table load rejected: immutable target is required");
1003 1043 return -EINVAL;
1004 if (!queue_is_rq_based(q)) { 1044 } else if (tgt->max_io_len) {
1005 DMERR("table load rejected: including" 1045 DMERR("table load rejected: immutable target that splits IO is not supported");
1006 " non-request-stackable devices"); 1046 return -EINVAL;
1007 return -EINVAL; 1047 }
1008 }
1009 1048
1010 if (q->mq_ops) 1049 /* Non-request-stackable devices can't be used for request-based dm */
1011 mq_count++; 1050 if (!tgt->type->iterate_devices ||
1012 else 1051 !tgt->type->iterate_devices(tgt, device_is_rq_based, &v)) {
1013 sq_count++; 1052 DMERR("table load rejected: including non-request-stackable devices");
1053 return -EINVAL;
1014 } 1054 }
1015 if (sq_count && mq_count) { 1055 if (v.sq_count && v.mq_count) {
1016 DMERR("table load rejected: not all devices are blk-mq request-stackable"); 1056 DMERR("table load rejected: not all devices are blk-mq request-stackable");
1017 return -EINVAL; 1057 return -EINVAL;
1018 } 1058 }
1019 t->all_blk_mq = mq_count > 0; 1059 t->all_blk_mq = v.mq_count > 0;
1020 1060
1021 if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) { 1061 if (!t->all_blk_mq &&
1062 (t->type == DM_TYPE_MQ_REQUEST_BASED || t->type == DM_TYPE_NVME_BIO_BASED)) {
1022 DMERR("table load rejected: all devices are not blk-mq request-stackable"); 1063 DMERR("table load rejected: all devices are not blk-mq request-stackable");
1023 return -EINVAL; 1064 return -EINVAL;
1024 } 1065 }
@@ -1079,7 +1120,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
1079{ 1120{
1080 enum dm_queue_mode type = dm_table_get_type(t); 1121 enum dm_queue_mode type = dm_table_get_type(t);
1081 unsigned per_io_data_size = 0; 1122 unsigned per_io_data_size = 0;
1082 struct dm_target *tgt; 1123 unsigned min_pool_size = 0;
1124 struct dm_target *ti;
1083 unsigned i; 1125 unsigned i;
1084 1126
1085 if (unlikely(type == DM_TYPE_NONE)) { 1127 if (unlikely(type == DM_TYPE_NONE)) {
@@ -1089,11 +1131,13 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device *
1089 1131
1090 if (__table_type_bio_based(type)) 1132 if (__table_type_bio_based(type))
1091 for (i = 0; i < t->num_targets; i++) { 1133 for (i = 0; i < t->num_targets; i++) {
1092 tgt = t->targets + i; 1134 ti = t->targets + i;
1093 per_io_data_size = max(per_io_data_size, tgt->per_io_data_size); 1135 per_io_data_size = max(per_io_data_size, ti->per_io_data_size);
1136 min_pool_size = max(min_pool_size, ti->num_flush_bios);
1094 } 1137 }
1095 1138
1096 t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_io_data_size); 1139 t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported,
1140 per_io_data_size, min_pool_size);
1097 if (!t->mempools) 1141 if (!t->mempools)
1098 return -ENOMEM; 1142 return -ENOMEM;
1099 1143
@@ -1705,6 +1749,20 @@ static bool dm_table_all_devices_attribute(struct dm_table *t,
1705 return true; 1749 return true;
1706} 1750}
1707 1751
1752static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev,
1753 sector_t start, sector_t len, void *data)
1754{
1755 char b[BDEVNAME_SIZE];
1756
1757 /* For now, NVMe devices are the only devices of this class */
1758 return (strncmp(bdevname(dev->bdev, b), "nvme", 3) == 0);
1759}
1760
1761static bool dm_table_does_not_support_partial_completion(struct dm_table *t)
1762{
1763 return dm_table_all_devices_attribute(t, device_no_partial_completion);
1764}
1765
1708static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, 1766static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev,
1709 sector_t start, sector_t len, void *data) 1767 sector_t start, sector_t len, void *data)
1710{ 1768{
@@ -1820,6 +1878,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
1820 } 1878 }
1821 blk_queue_write_cache(q, wc, fua); 1879 blk_queue_write_cache(q, wc, fua);
1822 1880
1881 if (dm_table_supports_dax(t))
1882 queue_flag_set_unlocked(QUEUE_FLAG_DAX, q);
1823 if (dm_table_supports_dax_write_cache(t)) 1883 if (dm_table_supports_dax_write_cache(t))
1824 dax_write_cache(t->md->dax_dev, true); 1884 dax_write_cache(t->md->dax_dev, true);
1825 1885
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index f91d771fff4b..629c555890c1 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -492,6 +492,11 @@ static void pool_table_init(void)
492 INIT_LIST_HEAD(&dm_thin_pool_table.pools); 492 INIT_LIST_HEAD(&dm_thin_pool_table.pools);
493} 493}
494 494
495static void pool_table_exit(void)
496{
497 mutex_destroy(&dm_thin_pool_table.mutex);
498}
499
495static void __pool_table_insert(struct pool *pool) 500static void __pool_table_insert(struct pool *pool)
496{ 501{
497 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex)); 502 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
@@ -1717,7 +1722,7 @@ static void __remap_and_issue_shared_cell(void *context,
1717 bio_op(bio) == REQ_OP_DISCARD) 1722 bio_op(bio) == REQ_OP_DISCARD)
1718 bio_list_add(&info->defer_bios, bio); 1723 bio_list_add(&info->defer_bios, bio);
1719 else { 1724 else {
1720 struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));; 1725 struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));
1721 1726
1722 h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds); 1727 h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds);
1723 inc_all_io_entry(info->tc->pool, bio); 1728 inc_all_io_entry(info->tc->pool, bio);
@@ -4387,6 +4392,8 @@ static void dm_thin_exit(void)
4387 dm_unregister_target(&pool_target); 4392 dm_unregister_target(&pool_target);
4388 4393
4389 kmem_cache_destroy(_new_mapping_cache); 4394 kmem_cache_destroy(_new_mapping_cache);
4395
4396 pool_table_exit();
4390} 4397}
4391 4398
4392module_init(dm_thin_init); 4399module_init(dm_thin_init);
diff --git a/drivers/md/dm-unstripe.c b/drivers/md/dm-unstripe.c
new file mode 100644
index 000000000000..65f838fa2e99
--- /dev/null
+++ b/drivers/md/dm-unstripe.c
@@ -0,0 +1,219 @@
1/*
2 * Copyright (C) 2017 Intel Corporation.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm.h"
8
9#include <linux/module.h>
10#include <linux/init.h>
11#include <linux/blkdev.h>
12#include <linux/bio.h>
13#include <linux/slab.h>
14#include <linux/bitops.h>
15#include <linux/device-mapper.h>
16
17struct unstripe_c {
18 struct dm_dev *dev;
19 sector_t physical_start;
20
21 uint32_t stripes;
22
23 uint32_t unstripe;
24 sector_t unstripe_width;
25 sector_t unstripe_offset;
26
27 uint32_t chunk_size;
28 u8 chunk_shift;
29};
30
31#define DM_MSG_PREFIX "unstriped"
32
33static void cleanup_unstripe(struct unstripe_c *uc, struct dm_target *ti)
34{
35 if (uc->dev)
36 dm_put_device(ti, uc->dev);
37 kfree(uc);
38}
39
40/*
41 * Construct an unstriped mapping.
42 * <number of stripes> <chunk size> <stripe #> <dev_path> <offset>
43 */
44static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv)
45{
46 struct unstripe_c *uc;
47 sector_t tmp_len;
48 unsigned long long start;
49 char dummy;
50
51 if (argc != 5) {
52 ti->error = "Invalid number of arguments";
53 return -EINVAL;
54 }
55
56 uc = kzalloc(sizeof(*uc), GFP_KERNEL);
57 if (!uc) {
58 ti->error = "Memory allocation for unstriped context failed";
59 return -ENOMEM;
60 }
61
62 if (kstrtouint(argv[0], 10, &uc->stripes) || !uc->stripes) {
63 ti->error = "Invalid stripe count";
64 goto err;
65 }
66
67 if (kstrtouint(argv[1], 10, &uc->chunk_size) || !uc->chunk_size) {
68 ti->error = "Invalid chunk_size";
69 goto err;
70 }
71
72 // FIXME: must support non power of 2 chunk_size, dm-stripe.c does
73 if (!is_power_of_2(uc->chunk_size)) {
74 ti->error = "Non power of 2 chunk_size is not supported yet";
75 goto err;
76 }
77
78 if (kstrtouint(argv[2], 10, &uc->unstripe)) {
79 ti->error = "Invalid stripe number";
80 goto err;
81 }
82
83 if (uc->unstripe > uc->stripes && uc->stripes > 1) {
84 ti->error = "Please provide stripe between [0, # of stripes]";
85 goto err;
86 }
87
88 if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table), &uc->dev)) {
89 ti->error = "Couldn't get striped device";
90 goto err;
91 }
92
93 if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) {
94 ti->error = "Invalid striped device offset";
95 goto err;
96 }
97 uc->physical_start = start;
98
99 uc->unstripe_offset = uc->unstripe * uc->chunk_size;
100 uc->unstripe_width = (uc->stripes - 1) * uc->chunk_size;
101 uc->chunk_shift = fls(uc->chunk_size) - 1;
102
103 tmp_len = ti->len;
104 if (sector_div(tmp_len, uc->chunk_size)) {
105 ti->error = "Target length not divisible by chunk size";
106 goto err;
107 }
108
109 if (dm_set_target_max_io_len(ti, uc->chunk_size)) {
110 ti->error = "Failed to set max io len";
111 goto err;
112 }
113
114 ti->private = uc;
115 return 0;
116err:
117 cleanup_unstripe(uc, ti);
118 return -EINVAL;
119}
120
121static void unstripe_dtr(struct dm_target *ti)
122{
123 struct unstripe_c *uc = ti->private;
124
125 cleanup_unstripe(uc, ti);
126}
127
128static sector_t map_to_core(struct dm_target *ti, struct bio *bio)
129{
130 struct unstripe_c *uc = ti->private;
131 sector_t sector = bio->bi_iter.bi_sector;
132
133 /* Shift us up to the right "row" on the stripe */
134 sector += uc->unstripe_width * (sector >> uc->chunk_shift);
135
136 /* Account for what stripe we're operating on */
137 sector += uc->unstripe_offset;
138
139 return sector;
140}
141
142static int unstripe_map(struct dm_target *ti, struct bio *bio)
143{
144 struct unstripe_c *uc = ti->private;
145
146 bio_set_dev(bio, uc->dev->bdev);
147 bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start;
148
149 return DM_MAPIO_REMAPPED;
150}
151
152static void unstripe_status(struct dm_target *ti, status_type_t type,
153 unsigned int status_flags, char *result, unsigned int maxlen)
154{
155 struct unstripe_c *uc = ti->private;
156 unsigned int sz = 0;
157
158 switch (type) {
159 case STATUSTYPE_INFO:
160 break;
161
162 case STATUSTYPE_TABLE:
163 DMEMIT("%d %llu %d %s %llu",
164 uc->stripes, (unsigned long long)uc->chunk_size, uc->unstripe,
165 uc->dev->name, (unsigned long long)uc->physical_start);
166 break;
167 }
168}
169
170static int unstripe_iterate_devices(struct dm_target *ti,
171 iterate_devices_callout_fn fn, void *data)
172{
173 struct unstripe_c *uc = ti->private;
174
175 return fn(ti, uc->dev, uc->physical_start, ti->len, data);
176}
177
178static void unstripe_io_hints(struct dm_target *ti,
179 struct queue_limits *limits)
180{
181 struct unstripe_c *uc = ti->private;
182
183 limits->chunk_sectors = uc->chunk_size;
184}
185
186static struct target_type unstripe_target = {
187 .name = "unstriped",
188 .version = {1, 0, 0},
189 .module = THIS_MODULE,
190 .ctr = unstripe_ctr,
191 .dtr = unstripe_dtr,
192 .map = unstripe_map,
193 .status = unstripe_status,
194 .iterate_devices = unstripe_iterate_devices,
195 .io_hints = unstripe_io_hints,
196};
197
198static int __init dm_unstripe_init(void)
199{
200 int r;
201
202 r = dm_register_target(&unstripe_target);
203 if (r < 0)
204 DMERR("target registration failed");
205
206 return r;
207}
208
209static void __exit dm_unstripe_exit(void)
210{
211 dm_unregister_target(&unstripe_target);
212}
213
214module_init(dm_unstripe_init);
215module_exit(dm_unstripe_exit);
216
217MODULE_DESCRIPTION(DM_NAME " unstriped target");
218MODULE_AUTHOR("Scott Bauer <scott.bauer@intel.com>");
219MODULE_LICENSE("GPL");
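
The new target's remapping arithmetic can be checked in isolation. Below is a user-space sketch of map_to_core() with one worked example; the parameter values are illustrative, and in the real table line they arrive as <number of stripes> <chunk size> <stripe #> <dev_path> <offset>.

    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t sector_t;

    static sector_t unstripe_map(sector_t sector, uint32_t stripes,
                                 uint32_t chunk_size, uint32_t unstripe)
    {
            /* chunk_size must be a power of two, as the ctr enforces */
            unsigned chunk_shift = __builtin_ctz(chunk_size);
            sector_t unstripe_offset = (sector_t)unstripe * chunk_size;
            sector_t unstripe_width = (sector_t)(stripes - 1) * chunk_size;

            /* Shift up to the right "row", then over to the chosen stripe */
            sector += unstripe_width * (sector >> chunk_shift);
            return sector + unstripe_offset;
    }

    int main(void)
    {
            /*
             * 4 internal stripes, 8-sector chunks, expose stripe #1:
             * logical sector 20 is chunk 2 (offset 4) of stripe 1, which is
             * chunk 9 of the striped device, i.e. sector 9 * 8 + 4 = 76.
             */
            printf("%llu\n", (unsigned long long)unstripe_map(20, 4, 8, 1));
            return 0;
    }
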
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c
index 70485de37b66..969954915566 100644
--- a/drivers/md/dm-zoned-metadata.c
+++ b/drivers/md/dm-zoned-metadata.c
@@ -2333,6 +2333,9 @@ static void dmz_cleanup_metadata(struct dmz_metadata *zmd)
2333 2333
2334 /* Free the zone descriptors */ 2334 /* Free the zone descriptors */
2335 dmz_drop_zones(zmd); 2335 dmz_drop_zones(zmd);
2336
2337 mutex_destroy(&zmd->mblk_flush_lock);
2338 mutex_destroy(&zmd->map_lock);
2336} 2339}
2337 2340
2338/* 2341/*
diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c
index 6d7bda6f8190..caff02caf083 100644
--- a/drivers/md/dm-zoned-target.c
+++ b/drivers/md/dm-zoned-target.c
@@ -827,6 +827,7 @@ err_fwq:
827err_cwq: 827err_cwq:
828 destroy_workqueue(dmz->chunk_wq); 828 destroy_workqueue(dmz->chunk_wq);
829err_bio: 829err_bio:
830 mutex_destroy(&dmz->chunk_lock);
830 bioset_free(dmz->bio_set); 831 bioset_free(dmz->bio_set);
831err_meta: 832err_meta:
832 dmz_dtr_metadata(dmz->metadata); 833 dmz_dtr_metadata(dmz->metadata);
@@ -861,6 +862,8 @@ static void dmz_dtr(struct dm_target *ti)
861 862
862 dmz_put_zoned_device(ti); 863 dmz_put_zoned_device(ti);
863 864
865 mutex_destroy(&dmz->chunk_lock);
866
864 kfree(dmz); 867 kfree(dmz);
865} 868}
866 869
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 8c26bfc35335..d6de00f367ef 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -60,18 +60,73 @@ void dm_issue_global_event(void)
60} 60}
61 61
62/* 62/*
63 * One of these is allocated per bio. 63 * One of these is allocated (on-stack) per original bio.
64 */ 64 */
65struct clone_info {
66 struct dm_table *map;
67 struct bio *bio;
68 struct dm_io *io;
69 sector_t sector;
70 unsigned sector_count;
71};
72
73/*
74 * One of these is allocated per clone bio.
75 */
76#define DM_TIO_MAGIC 7282014
77struct dm_target_io {
78 unsigned magic;
79 struct dm_io *io;
80 struct dm_target *ti;
81 unsigned target_bio_nr;
82 unsigned *len_ptr;
83 bool inside_dm_io;
84 struct bio clone;
85};
86
87/*
88 * One of these is allocated per original bio.
89 * It contains the first clone used for that original.
90 */
91#define DM_IO_MAGIC 5191977
65struct dm_io { 92struct dm_io {
93 unsigned magic;
66 struct mapped_device *md; 94 struct mapped_device *md;
67 blk_status_t status; 95 blk_status_t status;
68 atomic_t io_count; 96 atomic_t io_count;
69 struct bio *bio; 97 struct bio *orig_bio;
70 unsigned long start_time; 98 unsigned long start_time;
71 spinlock_t endio_lock; 99 spinlock_t endio_lock;
72 struct dm_stats_aux stats_aux; 100 struct dm_stats_aux stats_aux;
101 /* last member of dm_target_io is 'struct bio' */
102 struct dm_target_io tio;
73}; 103};
74 104
105void *dm_per_bio_data(struct bio *bio, size_t data_size)
106{
107 struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
108 if (!tio->inside_dm_io)
109 return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
110 return (char *)bio - offsetof(struct dm_target_io, clone) - offsetof(struct dm_io, tio) - data_size;
111}
112EXPORT_SYMBOL_GPL(dm_per_bio_data);
113
114struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
115{
116 struct dm_io *io = (struct dm_io *)((char *)data + data_size);
117 if (io->magic == DM_IO_MAGIC)
118 return (struct bio *)((char *)io + offsetof(struct dm_io, tio) + offsetof(struct dm_target_io, clone));
119 BUG_ON(io->magic != DM_TIO_MAGIC);
120 return (struct bio *)((char *)io + offsetof(struct dm_target_io, clone));
121}
122EXPORT_SYMBOL_GPL(dm_bio_from_per_bio_data);
123
124unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
125{
126 return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
127}
128EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr);
129
75#define MINOR_ALLOCED ((void *)-1) 130#define MINOR_ALLOCED ((void *)-1)
76 131
77/* 132/*
@@ -93,8 +148,8 @@ static int dm_numa_node = DM_NUMA_NODE;
93 * For mempools pre-allocation at the table loading time. 148 * For mempools pre-allocation at the table loading time.
94 */ 149 */
95struct dm_md_mempools { 150struct dm_md_mempools {
96 mempool_t *io_pool;
97 struct bio_set *bs; 151 struct bio_set *bs;
152 struct bio_set *io_bs;
98}; 153};
99 154
100struct table_device { 155struct table_device {
@@ -103,7 +158,6 @@ struct table_device {
103 struct dm_dev dm_dev; 158 struct dm_dev dm_dev;
104}; 159};
105 160
106static struct kmem_cache *_io_cache;
107static struct kmem_cache *_rq_tio_cache; 161static struct kmem_cache *_rq_tio_cache;
108static struct kmem_cache *_rq_cache; 162static struct kmem_cache *_rq_cache;
109 163
@@ -170,14 +224,9 @@ static int __init local_init(void)
170{ 224{
171 int r = -ENOMEM; 225 int r = -ENOMEM;
172 226
173 /* allocate a slab for the dm_ios */
174 _io_cache = KMEM_CACHE(dm_io, 0);
175 if (!_io_cache)
176 return r;
177
178 _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0); 227 _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0);
179 if (!_rq_tio_cache) 228 if (!_rq_tio_cache)
180 goto out_free_io_cache; 229 return r;
181 230
182 _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request), 231 _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request),
183 __alignof__(struct request), 0, NULL); 232 __alignof__(struct request), 0, NULL);
@@ -212,8 +261,6 @@ out_free_rq_cache:
212 kmem_cache_destroy(_rq_cache); 261 kmem_cache_destroy(_rq_cache);
213out_free_rq_tio_cache: 262out_free_rq_tio_cache:
214 kmem_cache_destroy(_rq_tio_cache); 263 kmem_cache_destroy(_rq_tio_cache);
215out_free_io_cache:
216 kmem_cache_destroy(_io_cache);
217 264
218 return r; 265 return r;
219} 266}
@@ -225,7 +272,6 @@ static void local_exit(void)
225 272
226 kmem_cache_destroy(_rq_cache); 273 kmem_cache_destroy(_rq_cache);
227 kmem_cache_destroy(_rq_tio_cache); 274 kmem_cache_destroy(_rq_tio_cache);
228 kmem_cache_destroy(_io_cache);
229 unregister_blkdev(_major, _name); 275 unregister_blkdev(_major, _name);
230 dm_uevent_exit(); 276 dm_uevent_exit();
231 277
@@ -486,18 +532,69 @@ out:
486 return r; 532 return r;
487} 533}
488 534
489static struct dm_io *alloc_io(struct mapped_device *md) 535static void start_io_acct(struct dm_io *io);
536
537static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio)
490{ 538{
491 return mempool_alloc(md->io_pool, GFP_NOIO); 539 struct dm_io *io;
540 struct dm_target_io *tio;
541 struct bio *clone;
542
543 clone = bio_alloc_bioset(GFP_NOIO, 0, md->io_bs);
544 if (!clone)
545 return NULL;
546
547 tio = container_of(clone, struct dm_target_io, clone);
548 tio->inside_dm_io = true;
549 tio->io = NULL;
550
551 io = container_of(tio, struct dm_io, tio);
552 io->magic = DM_IO_MAGIC;
553 io->status = 0;
554 atomic_set(&io->io_count, 1);
555 io->orig_bio = bio;
556 io->md = md;
557 spin_lock_init(&io->endio_lock);
558
559 start_io_acct(io);
560
561 return io;
492} 562}
493 563
494static void free_io(struct mapped_device *md, struct dm_io *io) 564static void free_io(struct mapped_device *md, struct dm_io *io)
495{ 565{
496 mempool_free(io, md->io_pool); 566 bio_put(&io->tio.clone);
567}
568
569static struct dm_target_io *alloc_tio(struct clone_info *ci, struct dm_target *ti,
570 unsigned target_bio_nr, gfp_t gfp_mask)
571{
572 struct dm_target_io *tio;
573
574 if (!ci->io->tio.io) {
575 /* the dm_target_io embedded in ci->io is available */
576 tio = &ci->io->tio;
577 } else {
578 struct bio *clone = bio_alloc_bioset(gfp_mask, 0, ci->io->md->bs);
579 if (!clone)
580 return NULL;
581
582 tio = container_of(clone, struct dm_target_io, clone);
583 tio->inside_dm_io = false;
584 }
585
586 tio->magic = DM_TIO_MAGIC;
587 tio->io = ci->io;
588 tio->ti = ti;
589 tio->target_bio_nr = target_bio_nr;
590
591 return tio;
497} 592}
498 593
499static void free_tio(struct dm_target_io *tio) 594static void free_tio(struct dm_target_io *tio)
500{ 595{
596 if (tio->inside_dm_io)
597 return;
501 bio_put(&tio->clone); 598 bio_put(&tio->clone);
502} 599}
503 600
@@ -510,17 +607,15 @@ int md_in_flight(struct mapped_device *md)
510static void start_io_acct(struct dm_io *io) 607static void start_io_acct(struct dm_io *io)
511{ 608{
512 struct mapped_device *md = io->md; 609 struct mapped_device *md = io->md;
513 struct bio *bio = io->bio; 610 struct bio *bio = io->orig_bio;
514 int cpu;
515 int rw = bio_data_dir(bio); 611 int rw = bio_data_dir(bio);
516 612
517 io->start_time = jiffies; 613 io->start_time = jiffies;
518 614
519 cpu = part_stat_lock(); 615 generic_start_io_acct(md->queue, rw, bio_sectors(bio), &dm_disk(md)->part0);
520 part_round_stats(md->queue, cpu, &dm_disk(md)->part0); 616
521 part_stat_unlock();
522 atomic_set(&dm_disk(md)->part0.in_flight[rw], 617 atomic_set(&dm_disk(md)->part0.in_flight[rw],
523 atomic_inc_return(&md->pending[rw])); 618 atomic_inc_return(&md->pending[rw]));
524 619
525 if (unlikely(dm_stats_used(&md->stats))) 620 if (unlikely(dm_stats_used(&md->stats)))
526 dm_stats_account_io(&md->stats, bio_data_dir(bio), 621 dm_stats_account_io(&md->stats, bio_data_dir(bio),
@@ -531,7 +626,7 @@ static void start_io_acct(struct dm_io *io)
531static void end_io_acct(struct dm_io *io) 626static void end_io_acct(struct dm_io *io)
532{ 627{
533 struct mapped_device *md = io->md; 628 struct mapped_device *md = io->md;
534 struct bio *bio = io->bio; 629 struct bio *bio = io->orig_bio;
535 unsigned long duration = jiffies - io->start_time; 630 unsigned long duration = jiffies - io->start_time;
536 int pending; 631 int pending;
537 int rw = bio_data_dir(bio); 632 int rw = bio_data_dir(bio);
@@ -752,15 +847,6 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo)
752 return 0; 847 return 0;
753} 848}
754 849
755/*-----------------------------------------------------------------
756 * CRUD START:
757 * A more elegant soln is in the works that uses the queue
758 * merge fn, unfortunately there are a couple of changes to
759 * the block layer that I want to make for this. So in the
760 * interests of getting something for people to use I give
761 * you this clearly demarcated crap.
762 *---------------------------------------------------------------*/
763
764static int __noflush_suspending(struct mapped_device *md) 850static int __noflush_suspending(struct mapped_device *md)
765{ 851{
766 return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); 852 return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags);
@@ -780,8 +866,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
780 /* Push-back supersedes any I/O errors */ 866 /* Push-back supersedes any I/O errors */
781 if (unlikely(error)) { 867 if (unlikely(error)) {
782 spin_lock_irqsave(&io->endio_lock, flags); 868 spin_lock_irqsave(&io->endio_lock, flags);
783 if (!(io->status == BLK_STS_DM_REQUEUE && 869 if (!(io->status == BLK_STS_DM_REQUEUE && __noflush_suspending(md)))
784 __noflush_suspending(md)))
785 io->status = error; 870 io->status = error;
786 spin_unlock_irqrestore(&io->endio_lock, flags); 871 spin_unlock_irqrestore(&io->endio_lock, flags);
787 } 872 }
@@ -793,7 +878,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
793 */ 878 */
794 spin_lock_irqsave(&md->deferred_lock, flags); 879 spin_lock_irqsave(&md->deferred_lock, flags);
795 if (__noflush_suspending(md)) 880 if (__noflush_suspending(md))
796 bio_list_add_head(&md->deferred, io->bio); 881 /* NOTE early return due to BLK_STS_DM_REQUEUE below */
882 bio_list_add_head(&md->deferred, io->orig_bio);
797 else 883 else
798 /* noflush suspend was interrupted. */ 884 /* noflush suspend was interrupted. */
799 io->status = BLK_STS_IOERR; 885 io->status = BLK_STS_IOERR;
@@ -801,7 +887,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error)
801 } 887 }
802 888
803 io_error = io->status; 889 io_error = io->status;
804 bio = io->bio; 890 bio = io->orig_bio;
805 end_io_acct(io); 891 end_io_acct(io);
806 free_io(md, io); 892 free_io(md, io);
807 893
@@ -847,7 +933,7 @@ static void clone_endio(struct bio *bio)
847 struct mapped_device *md = tio->io->md; 933 struct mapped_device *md = tio->io->md;
848 dm_endio_fn endio = tio->ti->type->end_io; 934 dm_endio_fn endio = tio->ti->type->end_io;
849 935
850 if (unlikely(error == BLK_STS_TARGET)) { 936 if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) {
851 if (bio_op(bio) == REQ_OP_WRITE_SAME && 937 if (bio_op(bio) == REQ_OP_WRITE_SAME &&
852 !bio->bi_disk->queue->limits.max_write_same_sectors) 938 !bio->bi_disk->queue->limits.max_write_same_sectors)
853 disable_write_same(md); 939 disable_write_same(md);
@@ -1005,7 +1091,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff,
1005 1091
1006/* 1092/*
1007 * A target may call dm_accept_partial_bio only from the map routine. It is 1093 * A target may call dm_accept_partial_bio only from the map routine. It is
1008 * allowed for all bio types except REQ_PREFLUSH. 1094 * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET.
1009 * 1095 *
1010 * dm_accept_partial_bio informs the dm that the target only wants to process 1096 * dm_accept_partial_bio informs the dm that the target only wants to process
1011 * additional n_sectors sectors of the bio and the rest of the data should be 1097 * additional n_sectors sectors of the bio and the rest of the data should be
@@ -1055,7 +1141,7 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
1055{ 1141{
1056#ifdef CONFIG_BLK_DEV_ZONED 1142#ifdef CONFIG_BLK_DEV_ZONED
1057 struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); 1143 struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone);
1058 struct bio *report_bio = tio->io->bio; 1144 struct bio *report_bio = tio->io->orig_bio;
1059 struct blk_zone_report_hdr *hdr = NULL; 1145 struct blk_zone_report_hdr *hdr = NULL;
1060 struct blk_zone *zone; 1146 struct blk_zone *zone;
1061 unsigned int nr_rep = 0; 1147 unsigned int nr_rep = 0;
@@ -1122,67 +1208,15 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start)
1122} 1208}
1123EXPORT_SYMBOL_GPL(dm_remap_zone_report); 1209EXPORT_SYMBOL_GPL(dm_remap_zone_report);
1124 1210
1125/* 1211static blk_qc_t __map_bio(struct dm_target_io *tio)
1126 * Flush current->bio_list when the target map method blocks.
1127 * This fixes deadlocks in snapshot and possibly in other targets.
1128 */
1129struct dm_offload {
1130 struct blk_plug plug;
1131 struct blk_plug_cb cb;
1132};
1133
1134static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule)
1135{
1136 struct dm_offload *o = container_of(cb, struct dm_offload, cb);
1137 struct bio_list list;
1138 struct bio *bio;
1139 int i;
1140
1141 INIT_LIST_HEAD(&o->cb.list);
1142
1143 if (unlikely(!current->bio_list))
1144 return;
1145
1146 for (i = 0; i < 2; i++) {
1147 list = current->bio_list[i];
1148 bio_list_init(&current->bio_list[i]);
1149
1150 while ((bio = bio_list_pop(&list))) {
1151 struct bio_set *bs = bio->bi_pool;
1152 if (unlikely(!bs) || bs == fs_bio_set ||
1153 !bs->rescue_workqueue) {
1154 bio_list_add(&current->bio_list[i], bio);
1155 continue;
1156 }
1157
1158 spin_lock(&bs->rescue_lock);
1159 bio_list_add(&bs->rescue_list, bio);
1160 queue_work(bs->rescue_workqueue, &bs->rescue_work);
1161 spin_unlock(&bs->rescue_lock);
1162 }
1163 }
1164}
1165
1166static void dm_offload_start(struct dm_offload *o)
1167{
1168 blk_start_plug(&o->plug);
1169 o->cb.callback = flush_current_bio_list;
1170 list_add(&o->cb.list, &current->plug->cb_list);
1171}
1172
1173static void dm_offload_end(struct dm_offload *o)
1174{
1175 list_del(&o->cb.list);
1176 blk_finish_plug(&o->plug);
1177}
1178
1179static void __map_bio(struct dm_target_io *tio)
1180{ 1212{
1181 int r; 1213 int r;
1182 sector_t sector; 1214 sector_t sector;
1183 struct dm_offload o;
1184 struct bio *clone = &tio->clone; 1215 struct bio *clone = &tio->clone;
1216 struct dm_io *io = tio->io;
1217 struct mapped_device *md = io->md;
1185 struct dm_target *ti = tio->ti; 1218 struct dm_target *ti = tio->ti;
1219 blk_qc_t ret = BLK_QC_T_NONE;
1186 1220
1187 clone->bi_end_io = clone_endio; 1221 clone->bi_end_io = clone_endio;
1188 1222
@@ -1191,44 +1225,37 @@ static void __map_bio(struct dm_target_io *tio)
1191 * anything, the target has assumed ownership of 1225 * anything, the target has assumed ownership of
1192 * this io. 1226 * this io.
1193 */ 1227 */
1194 atomic_inc(&tio->io->io_count); 1228 atomic_inc(&io->io_count);
1195 sector = clone->bi_iter.bi_sector; 1229 sector = clone->bi_iter.bi_sector;
1196 1230
1197 dm_offload_start(&o);
1198 r = ti->type->map(ti, clone); 1231 r = ti->type->map(ti, clone);
1199 dm_offload_end(&o);
1200
1201 switch (r) { 1232 switch (r) {
1202 case DM_MAPIO_SUBMITTED: 1233 case DM_MAPIO_SUBMITTED:
1203 break; 1234 break;
1204 case DM_MAPIO_REMAPPED: 1235 case DM_MAPIO_REMAPPED:
1205 /* the bio has been remapped so dispatch it */ 1236 /* the bio has been remapped so dispatch it */
1206 trace_block_bio_remap(clone->bi_disk->queue, clone, 1237 trace_block_bio_remap(clone->bi_disk->queue, clone,
1207 bio_dev(tio->io->bio), sector); 1238 bio_dev(io->orig_bio), sector);
1208 generic_make_request(clone); 1239 if (md->type == DM_TYPE_NVME_BIO_BASED)
1240 ret = direct_make_request(clone);
1241 else
1242 ret = generic_make_request(clone);
1209 break; 1243 break;
1210 case DM_MAPIO_KILL: 1244 case DM_MAPIO_KILL:
1211 dec_pending(tio->io, BLK_STS_IOERR);
1212 free_tio(tio); 1245 free_tio(tio);
1246 dec_pending(io, BLK_STS_IOERR);
1213 break; 1247 break;
1214 case DM_MAPIO_REQUEUE: 1248 case DM_MAPIO_REQUEUE:
1215 dec_pending(tio->io, BLK_STS_DM_REQUEUE);
1216 free_tio(tio); 1249 free_tio(tio);
1250 dec_pending(io, BLK_STS_DM_REQUEUE);
1217 break; 1251 break;
1218 default: 1252 default:
1219 DMWARN("unimplemented target map return value: %d", r); 1253 DMWARN("unimplemented target map return value: %d", r);
1220 BUG(); 1254 BUG();
1221 } 1255 }
1222}
1223 1256
1224struct clone_info { 1257 return ret;
1225 struct mapped_device *md; 1258}
1226 struct dm_table *map;
1227 struct bio *bio;
1228 struct dm_io *io;
1229 sector_t sector;
1230 unsigned sector_count;
1231};
1232 1259
1233static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len) 1260static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len)
1234{ 1261{
@@ -1272,28 +1299,49 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio,
1272 return 0; 1299 return 0;
1273} 1300}
1274 1301
1275static struct dm_target_io *alloc_tio(struct clone_info *ci, 1302static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci,
1276 struct dm_target *ti, 1303 struct dm_target *ti, unsigned num_bios)
1277 unsigned target_bio_nr)
1278{ 1304{
1279 struct dm_target_io *tio; 1305 struct dm_target_io *tio;
1280 struct bio *clone; 1306 int try;
1281 1307
1282 clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs); 1308 if (!num_bios)
1283 tio = container_of(clone, struct dm_target_io, clone); 1309 return;
1284 1310
1285 tio->io = ci->io; 1311 if (num_bios == 1) {
1286 tio->ti = ti; 1312 tio = alloc_tio(ci, ti, 0, GFP_NOIO);
1287 tio->target_bio_nr = target_bio_nr; 1313 bio_list_add(blist, &tio->clone);
1314 return;
1315 }
1288 1316
1289 return tio; 1317 for (try = 0; try < 2; try++) {
1318 int bio_nr;
1319 struct bio *bio;
1320
1321 if (try)
1322 mutex_lock(&ci->io->md->table_devices_lock);
1323 for (bio_nr = 0; bio_nr < num_bios; bio_nr++) {
1324 tio = alloc_tio(ci, ti, bio_nr, try ? GFP_NOIO : GFP_NOWAIT);
1325 if (!tio)
1326 break;
1327
1328 bio_list_add(blist, &tio->clone);
1329 }
1330 if (try)
1331 mutex_unlock(&ci->io->md->table_devices_lock);
1332 if (bio_nr == num_bios)
1333 return;
1334
1335 while ((bio = bio_list_pop(blist))) {
1336 tio = container_of(bio, struct dm_target_io, clone);
1337 free_tio(tio);
1338 }
1339 }
1290} 1340}
1291 1341
1292static void __clone_and_map_simple_bio(struct clone_info *ci, 1342static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci,
1293 struct dm_target *ti, 1343 struct dm_target_io *tio, unsigned *len)
1294 unsigned target_bio_nr, unsigned *len)
1295{ 1344{
1296 struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr);
1297 struct bio *clone = &tio->clone; 1345 struct bio *clone = &tio->clone;
1298 1346
1299 tio->len_ptr = len; 1347 tio->len_ptr = len;
@@ -1302,16 +1350,22 @@ static void __clone_and_map_simple_bio(struct clone_info *ci,
1302 if (len) 1350 if (len)
1303 bio_setup_sector(clone, ci->sector, *len); 1351 bio_setup_sector(clone, ci->sector, *len);
1304 1352
1305 __map_bio(tio); 1353 return __map_bio(tio);
1306} 1354}
1307 1355
1308static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, 1356static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti,
1309 unsigned num_bios, unsigned *len) 1357 unsigned num_bios, unsigned *len)
1310{ 1358{
1311 unsigned target_bio_nr; 1359 struct bio_list blist = BIO_EMPTY_LIST;
1360 struct bio *bio;
1361 struct dm_target_io *tio;
1362
1363 alloc_multiple_bios(&blist, ci, ti, num_bios);
1312 1364
1313 for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++) 1365 while ((bio = bio_list_pop(&blist))) {
1314 __clone_and_map_simple_bio(ci, ti, target_bio_nr, len); 1366 tio = container_of(bio, struct dm_target_io, clone);
1367 (void) __clone_and_map_simple_bio(ci, tio, len);
1368 }
1315} 1369}
1316 1370
1317static int __send_empty_flush(struct clone_info *ci) 1371static int __send_empty_flush(struct clone_info *ci)
@@ -1327,32 +1381,22 @@ static int __send_empty_flush(struct clone_info *ci)
1327} 1381}
1328 1382
1329static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, 1383static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti,
1330 sector_t sector, unsigned *len) 1384 sector_t sector, unsigned *len)
1331{ 1385{
1332 struct bio *bio = ci->bio; 1386 struct bio *bio = ci->bio;
1333 struct dm_target_io *tio; 1387 struct dm_target_io *tio;
1334 unsigned target_bio_nr; 1388 int r;
1335 unsigned num_target_bios = 1;
1336 int r = 0;
1337 1389
1338 /* 1390 tio = alloc_tio(ci, ti, 0, GFP_NOIO);
1339 * Does the target want to receive duplicate copies of the bio? 1391 tio->len_ptr = len;
1340 */ 1392 r = clone_bio(tio, bio, sector, *len);
1341 if (bio_data_dir(bio) == WRITE && ti->num_write_bios) 1393 if (r < 0) {
1342 num_target_bios = ti->num_write_bios(ti, bio); 1394 free_tio(tio);
1343 1395 return r;
1344 for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) {
1345 tio = alloc_tio(ci, ti, target_bio_nr);
1346 tio->len_ptr = len;
1347 r = clone_bio(tio, bio, sector, *len);
1348 if (r < 0) {
1349 free_tio(tio);
1350 break;
1351 }
1352 __map_bio(tio);
1353 } 1396 }
1397 (void) __map_bio(tio);
1354 1398
1355 return r; 1399 return 0;
1356} 1400}
1357 1401
1358typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); 1402typedef unsigned (*get_num_bios_fn)(struct dm_target *ti);
@@ -1379,56 +1423,50 @@ static bool is_split_required_for_discard(struct dm_target *ti)
1379 return ti->split_discard_bios; 1423 return ti->split_discard_bios;
1380} 1424}
1381 1425
1382static int __send_changing_extent_only(struct clone_info *ci, 1426static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti,
1383 get_num_bios_fn get_num_bios, 1427 get_num_bios_fn get_num_bios,
1384 is_split_required_fn is_split_required) 1428 is_split_required_fn is_split_required)
1385{ 1429{
1386 struct dm_target *ti;
1387 unsigned len; 1430 unsigned len;
1388 unsigned num_bios; 1431 unsigned num_bios;
1389 1432
1390 do { 1433 /*
1391 ti = dm_table_find_target(ci->map, ci->sector); 1434 * Even though the device advertised support for this type of
1392 if (!dm_target_is_valid(ti)) 1435 * request, that does not mean every target supports it, and
1393 return -EIO; 1436 * reconfiguration might also have changed that since the
1394 1437 * check was performed.
1395 /* 1438 */
1396 * Even though the device advertised support for this type of 1439 num_bios = get_num_bios ? get_num_bios(ti) : 0;
1397 * request, that does not mean every target supports it, and 1440 if (!num_bios)
1398 * reconfiguration might also have changed that since the 1441 return -EOPNOTSUPP;
1399 * check was performed.
1400 */
1401 num_bios = get_num_bios ? get_num_bios(ti) : 0;
1402 if (!num_bios)
1403 return -EOPNOTSUPP;
1404 1442
1405 if (is_split_required && !is_split_required(ti)) 1443 if (is_split_required && !is_split_required(ti))
1406 len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti)); 1444 len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti));
1407 else 1445 else
1408 len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti)); 1446 len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti));
1409 1447
1410 __send_duplicate_bios(ci, ti, num_bios, &len); 1448 __send_duplicate_bios(ci, ti, num_bios, &len);
1411 1449
1412 ci->sector += len; 1450 ci->sector += len;
1413 } while (ci->sector_count -= len); 1451 ci->sector_count -= len;
1414 1452
1415 return 0; 1453 return 0;
1416} 1454}
1417 1455
1418static int __send_discard(struct clone_info *ci) 1456static int __send_discard(struct clone_info *ci, struct dm_target *ti)
1419{ 1457{
1420 return __send_changing_extent_only(ci, get_num_discard_bios, 1458 return __send_changing_extent_only(ci, ti, get_num_discard_bios,
1421 is_split_required_for_discard); 1459 is_split_required_for_discard);
1422} 1460}
1423 1461
1424static int __send_write_same(struct clone_info *ci) 1462static int __send_write_same(struct clone_info *ci, struct dm_target *ti)
1425{ 1463{
1426 return __send_changing_extent_only(ci, get_num_write_same_bios, NULL); 1464 return __send_changing_extent_only(ci, ti, get_num_write_same_bios, NULL);
1427} 1465}
1428 1466
1429static int __send_write_zeroes(struct clone_info *ci) 1467static int __send_write_zeroes(struct clone_info *ci, struct dm_target *ti)
1430{ 1468{
1431 return __send_changing_extent_only(ci, get_num_write_zeroes_bios, NULL); 1469 return __send_changing_extent_only(ci, ti, get_num_write_zeroes_bios, NULL);
1432} 1470}
1433 1471
1434/* 1472/*
@@ -1441,17 +1479,17 @@ static int __split_and_process_non_flush(struct clone_info *ci)
1441 unsigned len; 1479 unsigned len;
1442 int r; 1480 int r;
1443 1481
1444 if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
1445 return __send_discard(ci);
1446 else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
1447 return __send_write_same(ci);
1448 else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
1449 return __send_write_zeroes(ci);
1450
1451 ti = dm_table_find_target(ci->map, ci->sector); 1482 ti = dm_table_find_target(ci->map, ci->sector);
1452 if (!dm_target_is_valid(ti)) 1483 if (!dm_target_is_valid(ti))
1453 return -EIO; 1484 return -EIO;
1454 1485
1486 if (unlikely(bio_op(bio) == REQ_OP_DISCARD))
1487 return __send_discard(ci, ti);
1488 else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME))
1489 return __send_write_same(ci, ti);
1490 else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES))
1491 return __send_write_zeroes(ci, ti);
1492
1455 if (bio_op(bio) == REQ_OP_ZONE_REPORT) 1493 if (bio_op(bio) == REQ_OP_ZONE_REPORT)
1456 len = ci->sector_count; 1494 len = ci->sector_count;
1457 else 1495 else
@@ -1468,34 +1506,33 @@ static int __split_and_process_non_flush(struct clone_info *ci)
1468 return 0; 1506 return 0;
1469} 1507}
1470 1508
1509static void init_clone_info(struct clone_info *ci, struct mapped_device *md,
1510 struct dm_table *map, struct bio *bio)
1511{
1512 ci->map = map;
1513 ci->io = alloc_io(md, bio);
1514 ci->sector = bio->bi_iter.bi_sector;
1515}
1516
1471/* 1517/*
1472 * Entry point to split a bio into clones and submit them to the targets. 1518 * Entry point to split a bio into clones and submit them to the targets.
1473 */ 1519 */
1474static void __split_and_process_bio(struct mapped_device *md, 1520static blk_qc_t __split_and_process_bio(struct mapped_device *md,
1475 struct dm_table *map, struct bio *bio) 1521 struct dm_table *map, struct bio *bio)
1476{ 1522{
1477 struct clone_info ci; 1523 struct clone_info ci;
1524 blk_qc_t ret = BLK_QC_T_NONE;
1478 int error = 0; 1525 int error = 0;
1479 1526
1480 if (unlikely(!map)) { 1527 if (unlikely(!map)) {
1481 bio_io_error(bio); 1528 bio_io_error(bio);
1482 return; 1529 return ret;
1483 } 1530 }
1484 1531
1485 ci.map = map; 1532 init_clone_info(&ci, md, map, bio);
1486 ci.md = md;
1487 ci.io = alloc_io(md);
1488 ci.io->status = 0;
1489 atomic_set(&ci.io->io_count, 1);
1490 ci.io->bio = bio;
1491 ci.io->md = md;
1492 spin_lock_init(&ci.io->endio_lock);
1493 ci.sector = bio->bi_iter.bi_sector;
1494
1495 start_io_acct(ci.io);
1496 1533
1497 if (bio->bi_opf & REQ_PREFLUSH) { 1534 if (bio->bi_opf & REQ_PREFLUSH) {
1498 ci.bio = &ci.md->flush_bio; 1535 ci.bio = &ci.io->md->flush_bio;
1499 ci.sector_count = 0; 1536 ci.sector_count = 0;
1500 error = __send_empty_flush(&ci); 1537 error = __send_empty_flush(&ci);
1501 /* dec_pending submits any data associated with flush */ 1538 /* dec_pending submits any data associated with flush */
@@ -1506,32 +1543,95 @@ static void __split_and_process_bio(struct mapped_device *md,
1506 } else { 1543 } else {
1507 ci.bio = bio; 1544 ci.bio = bio;
1508 ci.sector_count = bio_sectors(bio); 1545 ci.sector_count = bio_sectors(bio);
1509 while (ci.sector_count && !error) 1546 while (ci.sector_count && !error) {
1510 error = __split_and_process_non_flush(&ci); 1547 error = __split_and_process_non_flush(&ci);
1548 if (current->bio_list && ci.sector_count && !error) {
1549 /*
1550 * Remainder must be passed to generic_make_request()
1551 * so that it gets handled *after* bios already submitted
1552 * have been completely processed.
1553 * We take a clone of the original to store in
1554 * ci.io->orig_bio to be used by end_io_acct() and
1555 * for dec_pending to use for completion handling.
1556 * As this path is not used for REQ_OP_ZONE_REPORT,
1557 * the usage of io->orig_bio in dm_remap_zone_report()
1558 * won't be affected by this reassignment.
1559 */
1560 struct bio *b = bio_clone_bioset(bio, GFP_NOIO,
1561 md->queue->bio_split);
1562 ci.io->orig_bio = b;
1563 bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9);
1564 bio_chain(b, bio);
1565 ret = generic_make_request(bio);
1566 break;
1567 }
1568 }
1511 } 1569 }
1512 1570
1513 /* drop the extra reference count */ 1571 /* drop the extra reference count */
1514 dec_pending(ci.io, errno_to_blk_status(error)); 1572 dec_pending(ci.io, errno_to_blk_status(error));
1573 return ret;
1515} 1574}
1516/*-----------------------------------------------------------------
1517 * CRUD END
1518 *---------------------------------------------------------------*/
1519 1575
1520/* 1576/*
1521 * The request function that just remaps the bio built up by 1577 * Optimized variant of __split_and_process_bio that leverages the
1522 * dm_merge_bvec. 1578 * fact that targets that use it do _not_ have a need to split bios.
1523 */ 1579 */
1524static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) 1580static blk_qc_t __process_bio(struct mapped_device *md,
1581 struct dm_table *map, struct bio *bio)
1582{
1583 struct clone_info ci;
1584 blk_qc_t ret = BLK_QC_T_NONE;
1585 int error = 0;
1586
1587 if (unlikely(!map)) {
1588 bio_io_error(bio);
1589 return ret;
1590 }
1591
1592 init_clone_info(&ci, md, map, bio);
1593
1594 if (bio->bi_opf & REQ_PREFLUSH) {
1595 ci.bio = &ci.io->md->flush_bio;
1596 ci.sector_count = 0;
1597 error = __send_empty_flush(&ci);
1598 /* dec_pending submits any data associated with flush */
1599 } else {
1600 struct dm_target *ti = md->immutable_target;
1601 struct dm_target_io *tio;
1602
1603 /*
1604 * Defend against IO still getting in during teardown
1605 * - as was seen for a time with nvme-fcloop
1606 */
1607 if (unlikely(WARN_ON_ONCE(!ti || !dm_target_is_valid(ti)))) {
1608 error = -EIO;
1609 goto out;
1610 }
1611
1612 tio = alloc_tio(&ci, ti, 0, GFP_NOIO);
1613 ci.bio = bio;
1614 ci.sector_count = bio_sectors(bio);
1615 ret = __clone_and_map_simple_bio(&ci, tio, NULL);
1616 }
1617out:
1618 /* drop the extra reference count */
1619 dec_pending(ci.io, errno_to_blk_status(error));
1620 return ret;
1621}
1622
1623typedef blk_qc_t (process_bio_fn)(struct mapped_device *, struct dm_table *, struct bio *);
1624
1625static blk_qc_t __dm_make_request(struct request_queue *q, struct bio *bio,
1626 process_bio_fn process_bio)
1525{ 1627{
1526 int rw = bio_data_dir(bio);
1527 struct mapped_device *md = q->queuedata; 1628 struct mapped_device *md = q->queuedata;
1629 blk_qc_t ret = BLK_QC_T_NONE;
1528 int srcu_idx; 1630 int srcu_idx;
1529 struct dm_table *map; 1631 struct dm_table *map;
1530 1632
1531 map = dm_get_live_table(md, &srcu_idx); 1633 map = dm_get_live_table(md, &srcu_idx);
1532 1634
1533 generic_start_io_acct(q, rw, bio_sectors(bio), &dm_disk(md)->part0);
1534
1535 /* if we're suspended, we have to queue this io for later */ 1635 /* if we're suspended, we have to queue this io for later */
1536 if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { 1636 if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) {
1537 dm_put_live_table(md, srcu_idx); 1637 dm_put_live_table(md, srcu_idx);
@@ -1540,12 +1640,27 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
1540 queue_io(md, bio); 1640 queue_io(md, bio);
1541 else 1641 else
1542 bio_io_error(bio); 1642 bio_io_error(bio);
1543 return BLK_QC_T_NONE; 1643 return ret;
1544 } 1644 }
1545 1645
1546 __split_and_process_bio(md, map, bio); 1646 ret = process_bio(md, map, bio);
1647
1547 dm_put_live_table(md, srcu_idx); 1648 dm_put_live_table(md, srcu_idx);
1548 return BLK_QC_T_NONE; 1649 return ret;
1650}
1651
1652/*
1653 * The request function that remaps the bio to one target and
1654 * splits off any remainder.
1655 */
1656static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio)
1657{
1658 return __dm_make_request(q, bio, __split_and_process_bio);
1659}
1660
1661static blk_qc_t dm_make_request_nvme(struct request_queue *q, struct bio *bio)
1662{
1663 return __dm_make_request(q, bio, __process_bio);
1549} 1664}
1550 1665
1551static int dm_any_congested(void *congested_data, int bdi_bits) 1666static int dm_any_congested(void *congested_data, int bdi_bits)
@@ -1626,20 +1741,9 @@ static const struct dax_operations dm_dax_ops;
1626 1741
1627static void dm_wq_work(struct work_struct *work); 1742static void dm_wq_work(struct work_struct *work);
1628 1743
1629void dm_init_md_queue(struct mapped_device *md) 1744static void dm_init_normal_md_queue(struct mapped_device *md)
1630{
1631 /*
1632 * Initialize data that will only be used by a non-blk-mq DM queue
1633 * - must do so here (in alloc_dev callchain) before queue is used
1634 */
1635 md->queue->queuedata = md;
1636 md->queue->backing_dev_info->congested_data = md;
1637}
1638
1639void dm_init_normal_md_queue(struct mapped_device *md)
1640{ 1745{
1641 md->use_blk_mq = false; 1746 md->use_blk_mq = false;
1642 dm_init_md_queue(md);
1643 1747
1644 /* 1748 /*
1645 * Initialize aspects of queue that aren't relevant for blk-mq 1749 * Initialize aspects of queue that aren't relevant for blk-mq
@@ -1653,9 +1757,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
1653 destroy_workqueue(md->wq); 1757 destroy_workqueue(md->wq);
1654 if (md->kworker_task) 1758 if (md->kworker_task)
1655 kthread_stop(md->kworker_task); 1759 kthread_stop(md->kworker_task);
1656 mempool_destroy(md->io_pool);
1657 if (md->bs) 1760 if (md->bs)
1658 bioset_free(md->bs); 1761 bioset_free(md->bs);
1762 if (md->io_bs)
1763 bioset_free(md->io_bs);
1659 1764
1660 if (md->dax_dev) { 1765 if (md->dax_dev) {
1661 kill_dax(md->dax_dev); 1766 kill_dax(md->dax_dev);
@@ -1681,6 +1786,10 @@ static void cleanup_mapped_device(struct mapped_device *md)
1681 md->bdev = NULL; 1786 md->bdev = NULL;
1682 } 1787 }
1683 1788
1789 mutex_destroy(&md->suspend_lock);
1790 mutex_destroy(&md->type_lock);
1791 mutex_destroy(&md->table_devices_lock);
1792
1684 dm_mq_cleanup_mapped_device(md); 1793 dm_mq_cleanup_mapped_device(md);
1685} 1794}
1686 1795
@@ -1734,10 +1843,10 @@ static struct mapped_device *alloc_dev(int minor)
1734 md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id); 1843 md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id);
1735 if (!md->queue) 1844 if (!md->queue)
1736 goto bad; 1845 goto bad;
1846 md->queue->queuedata = md;
1847 md->queue->backing_dev_info->congested_data = md;
1737 1848
1738 dm_init_md_queue(md); 1849 md->disk = alloc_disk_node(1, md->numa_node_id);
1739
1740 md->disk = alloc_disk_node(1, numa_node_id);
1741 if (!md->disk) 1850 if (!md->disk)
1742 goto bad; 1851 goto bad;
1743 1852
@@ -1820,17 +1929,22 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
1820{ 1929{
1821 struct dm_md_mempools *p = dm_table_get_md_mempools(t); 1930 struct dm_md_mempools *p = dm_table_get_md_mempools(t);
1822 1931
1823 if (md->bs) { 1932 if (dm_table_bio_based(t)) {
1824 /* The md already has necessary mempools. */ 1933 /*
1825 if (dm_table_bio_based(t)) { 1934 * The md may already have mempools that need changing.
1826 /* 1935 * If so, reload bioset because front_pad may have changed
1827 * Reload bioset because front_pad may have changed 1936 * because a different table was loaded.
1828 * because a different table was loaded. 1937 */
1829 */ 1938 if (md->bs) {
1830 bioset_free(md->bs); 1939 bioset_free(md->bs);
1831 md->bs = p->bs; 1940 md->bs = NULL;
1832 p->bs = NULL; 1941 }
1942 if (md->io_bs) {
1943 bioset_free(md->io_bs);
1944 md->io_bs = NULL;
1833 } 1945 }
1946
1947 } else if (md->bs) {
1834 /* 1948 /*
1835 * There's no need to reload with request-based dm 1949 * There's no need to reload with request-based dm
1836 * because the size of front_pad doesn't change. 1950 * because the size of front_pad doesn't change.
@@ -1842,13 +1956,12 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t)
1842 goto out; 1956 goto out;
1843 } 1957 }
1844 1958
1845 BUG_ON(!p || md->io_pool || md->bs); 1959 BUG_ON(!p || md->bs || md->io_bs);
1846 1960
1847 md->io_pool = p->io_pool;
1848 p->io_pool = NULL;
1849 md->bs = p->bs; 1961 md->bs = p->bs;
1850 p->bs = NULL; 1962 p->bs = NULL;
1851 1963 md->io_bs = p->io_bs;
1964 p->io_bs = NULL;
1852out: 1965out:
1853 /* mempool bind completed, no longer need any mempools in the table */ 1966 /* mempool bind completed, no longer need any mempools in the table */
1854 dm_table_free_md_mempools(t); 1967 dm_table_free_md_mempools(t);
@@ -1894,6 +2007,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
1894{ 2007{
1895 struct dm_table *old_map; 2008 struct dm_table *old_map;
1896 struct request_queue *q = md->queue; 2009 struct request_queue *q = md->queue;
2010 bool request_based = dm_table_request_based(t);
1897 sector_t size; 2011 sector_t size;
1898 2012
1899 lockdep_assert_held(&md->suspend_lock); 2013 lockdep_assert_held(&md->suspend_lock);
@@ -1917,12 +2031,15 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
1917 * This must be done before setting the queue restrictions, 2031 * This must be done before setting the queue restrictions,
1918 * because request-based dm may be run just after the setting. 2032 * because request-based dm may be run just after the setting.
1919 */ 2033 */
1920 if (dm_table_request_based(t)) { 2034 if (request_based)
1921 dm_stop_queue(q); 2035 dm_stop_queue(q);
2036
2037 if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) {
1922 /* 2038 /*
1923 * Leverage the fact that request-based DM targets are 2039 * Leverage the fact that request-based DM targets and
1924 * immutable singletons and establish md->immutable_target 2040 * NVMe bio based targets are immutable singletons
1925 * - used to optimize both dm_request_fn and dm_mq_queue_rq 2041 * - used to optimize both dm_request_fn and dm_mq_queue_rq;
2042 * and __process_bio.
1926 */ 2043 */
1927 md->immutable_target = dm_table_get_immutable_target(t); 2044 md->immutable_target = dm_table_get_immutable_target(t);
1928 } 2045 }
@@ -1962,13 +2079,18 @@ static struct dm_table *__unbind(struct mapped_device *md)
1962 */ 2079 */
1963int dm_create(int minor, struct mapped_device **result) 2080int dm_create(int minor, struct mapped_device **result)
1964{ 2081{
2082 int r;
1965 struct mapped_device *md; 2083 struct mapped_device *md;
1966 2084
1967 md = alloc_dev(minor); 2085 md = alloc_dev(minor);
1968 if (!md) 2086 if (!md)
1969 return -ENXIO; 2087 return -ENXIO;
1970 2088
1971 dm_sysfs_init(md); 2089 r = dm_sysfs_init(md);
2090 if (r) {
2091 free_dev(md);
2092 return r;
2093 }
1972 2094
1973 *result = md; 2095 *result = md;
1974 return 0; 2096 return 0;
@@ -2026,6 +2148,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 
 	switch (type) {
 	case DM_TYPE_REQUEST_BASED:
+		dm_init_normal_md_queue(md);
 		r = dm_old_init_request_queue(md, t);
 		if (r) {
 			DMERR("Cannot initialize queue for request-based mapped device");
@@ -2043,15 +2166,10 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t)
 	case DM_TYPE_DAX_BIO_BASED:
 		dm_init_normal_md_queue(md);
 		blk_queue_make_request(md->queue, dm_make_request);
-		/*
-		 * DM handles splitting bios as needed. Free the bio_split bioset
-		 * since it won't be used (saves 1 process per bio-based DM device).
-		 */
-		bioset_free(md->queue->bio_split);
-		md->queue->bio_split = NULL;
-
-		if (type == DM_TYPE_DAX_BIO_BASED)
-			queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue);
+		break;
+	case DM_TYPE_NVME_BIO_BASED:
+		dm_init_normal_md_queue(md);
+		blk_queue_make_request(md->queue, dm_make_request_nvme);
 		break;
 	case DM_TYPE_NONE:
 		WARN_ON_ONCE(true);
@@ -2130,7 +2248,6 @@ EXPORT_SYMBOL_GPL(dm_device_name);
 
 static void __dm_destroy(struct mapped_device *md, bool wait)
 {
-	struct request_queue *q = dm_get_md_queue(md);
 	struct dm_table *map;
 	int srcu_idx;
 
@@ -2141,7 +2258,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait)
 	set_bit(DMF_FREEING, &md->flags);
 	spin_unlock(&_minor_lock);
 
-	blk_set_queue_dying(q);
+	blk_set_queue_dying(md->queue);
 
 	if (dm_request_based(md) && md->kworker_task)
 		kthread_flush_worker(&md->kworker);
@@ -2752,11 +2869,12 @@ int dm_noflush_suspending(struct dm_target *ti)
 EXPORT_SYMBOL_GPL(dm_noflush_suspending);
 
 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
-					    unsigned integrity, unsigned per_io_data_size)
+					    unsigned integrity, unsigned per_io_data_size,
+					    unsigned min_pool_size)
 {
 	struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id);
 	unsigned int pool_size = 0;
-	unsigned int front_pad;
+	unsigned int front_pad, io_front_pad;
 
 	if (!pools)
 		return NULL;
@@ -2764,16 +2882,19 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 	switch (type) {
 	case DM_TYPE_BIO_BASED:
 	case DM_TYPE_DAX_BIO_BASED:
-		pool_size = dm_get_reserved_bio_based_ios();
+	case DM_TYPE_NVME_BIO_BASED:
+		pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size);
 		front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone);
-
-		pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache);
-		if (!pools->io_pool)
+		io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio);
+		pools->io_bs = bioset_create(pool_size, io_front_pad, 0);
+		if (!pools->io_bs)
+			goto out;
+		if (integrity && bioset_integrity_create(pools->io_bs, pool_size))
 			goto out;
 		break;
 	case DM_TYPE_REQUEST_BASED:
 	case DM_TYPE_MQ_REQUEST_BASED:
-		pool_size = dm_get_reserved_rq_based_ios();
+		pool_size = max(dm_get_reserved_rq_based_ios(), min_pool_size);
 		front_pad = offsetof(struct dm_rq_clone_bio_info, clone);
 		/* per_io_data_size is used for blk-mq pdu at queue allocation */
 		break;
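
The two pad values computed above are what let bio-based DM keep its per-I/O bookkeeping in bioset front padding instead of a separate mempool: pools->bs reserves front_pad bytes (the target's per_io_data_size plus the dm_target_io header) ahead of each clone bio, and pools->io_bs reserves io_front_pad bytes (the same, plus the dm_io header) ahead of the bios it hands out. The standalone sketch below repeats only the arithmetic; the struct layouts and sizes are illustrative stand-ins, not the real kernel definitions.

/* Sketch of the front_pad/io_front_pad arithmetic from dm_alloc_md_mempools().
 * All structs below are hypothetical placeholders; only the math mirrors the
 * kernel code above. */
#include <stddef.h>
#include <stdio.h>

#define roundup(x, y) ((((x) + (y) - 1) / (y)) * (y))

struct fake_bio       { char opaque[104]; };	/* stand-in for struct bio */
struct fake_target_io { void *io; void *ti; unsigned nr; struct fake_bio clone; };
struct fake_io        { unsigned long magic; struct fake_target_io tio; };

int main(void)
{
	size_t per_io_data_size = 24;	/* whatever the target asked for in its ctr */

	/* front padding for pools->bs: per-bio data + dm_target_io header */
	size_t front_pad = roundup(per_io_data_size, __alignof__(struct fake_target_io)) +
			   offsetof(struct fake_target_io, clone);
	/* front padding for pools->io_bs: the above + dm_io header */
	size_t io_front_pad = roundup(front_pad, __alignof__(struct fake_io)) +
			      offsetof(struct fake_io, tio);

	printf("bioset_create(pool_size, %zu, 0)  /* pools->bs    */\n", front_pad);
	printf("bioset_create(pool_size, %zu, 0)  /* pools->io_bs */\n", io_front_pad);
	return 0;
}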
@@ -2781,7 +2902,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu
 		BUG();
 	}
 
-	pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER);
+	pools->bs = bioset_create(pool_size, front_pad, 0);
 	if (!pools->bs)
 		goto out;
 
@@ -2801,10 +2922,10 @@ void dm_free_md_mempools(struct dm_md_mempools *pools)
 	if (!pools)
 		return;
 
-	mempool_destroy(pools->io_pool);
-
 	if (pools->bs)
 		bioset_free(pools->bs);
+	if (pools->io_bs)
+		bioset_free(pools->io_bs);
 
 	kfree(pools);
 }
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 36399bb875dd..114a81b27c37 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -49,7 +49,6 @@ struct dm_md_mempools;
 /*-----------------------------------------------------------------
  * Internal table functions.
  *---------------------------------------------------------------*/
-void dm_table_destroy(struct dm_table *t);
 void dm_table_event_callback(struct dm_table *t,
 			     void (*fn)(void *), void *context);
 struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index);
@@ -206,7 +205,8 @@ void dm_kcopyd_exit(void);
  * Mempool operations
  */
 struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type,
-					    unsigned integrity, unsigned per_bio_data_size);
+					    unsigned integrity, unsigned per_bio_data_size,
+					    unsigned min_pool_size);
 void dm_free_md_mempools(struct dm_md_mempools *pools);
 
 /*
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index a5538433c927..da83f64952e7 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -28,6 +28,7 @@ enum dm_queue_mode {
 	DM_TYPE_REQUEST_BASED	 = 2,
 	DM_TYPE_MQ_REQUEST_BASED = 3,
 	DM_TYPE_DAX_BIO_BASED	 = 4,
+	DM_TYPE_NVME_BIO_BASED	 = 5,
 };
 
 typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t;
@@ -221,14 +222,6 @@ struct target_type {
 #define dm_target_is_wildcard(type)	((type)->features & DM_TARGET_WILDCARD)
 
 /*
- * Some targets need to be sent the same WRITE bio severals times so
- * that they can send copies of it to different devices. This function
- * examines any supplied bio and returns the number of copies of it the
- * target requires.
- */
-typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio);
-
-/*
  * A target implements own bio data integrity.
  */
 #define DM_TARGET_INTEGRITY	0x00000010
@@ -291,13 +284,6 @@ struct dm_target {
 	 */
 	unsigned per_io_data_size;
 
-	/*
-	 * If defined, this function is called to find out how many
-	 * duplicate bios should be sent to the target when writing
-	 * data.
-	 */
-	dm_num_write_bios_fn num_write_bios;
-
 	/* target specific data */
 	void *private;
 
@@ -329,35 +315,9 @@ struct dm_target_callbacks {
 	int (*congested_fn) (struct dm_target_callbacks *, int);
 };
 
-/*
- * For bio-based dm.
- * One of these is allocated for each bio.
- * This structure shouldn't be touched directly by target drivers.
- * It is here so that we can inline dm_per_bio_data and
- * dm_bio_from_per_bio_data
- */
-struct dm_target_io {
-	struct dm_io *io;
-	struct dm_target *ti;
-	unsigned target_bio_nr;
-	unsigned *len_ptr;
-	struct bio clone;
-};
-
-static inline void *dm_per_bio_data(struct bio *bio, size_t data_size)
-{
-	return (char *)bio - offsetof(struct dm_target_io, clone) - data_size;
-}
-
-static inline struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size)
-{
-	return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone));
-}
-
-static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio)
-{
-	return container_of(bio, struct dm_target_io, clone)->target_bio_nr;
-}
+void *dm_per_bio_data(struct bio *bio, size_t data_size);
+struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size);
+unsigned dm_bio_get_target_bio_nr(const struct bio *bio);
 
 int dm_register_target(struct target_type *t);
 void dm_unregister_target(struct target_type *t);
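
For target code the three accessors keep exactly the signatures shown above; what changes is that they are now out-of-line, so struct dm_target_io and the pointer arithmetic around it no longer leak into the public header. A minimal sketch of the usual call pattern in a hypothetical bio-based target's map method follows; my_ctx, my_target and the size involved are illustrative, not part of this patch.

#include <linux/device-mapper.h>

/* Hypothetical per-bio and per-target state, for illustration only. */
struct my_ctx {
	sector_t orig_sector;
};

struct my_target {
	struct dm_dev *dev;
};

static int my_map(struct dm_target *ti, struct bio *bio)
{
	struct my_target *mt = ti->private;
	/*
	 * Same call as before this series; dm_per_bio_data() is simply an
	 * exported function now.  The size passed here must match the
	 * ti->per_io_data_size the target declared in its ctr method.
	 */
	struct my_ctx *ctx = dm_per_bio_data(bio, sizeof(struct my_ctx));

	ctx->orig_sector = bio->bi_iter.bi_sector;
	bio_set_dev(bio, mt->dev->bdev);
	return DM_MAPIO_REMAPPED;	/* DM core submits the remapped bio */
}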
@@ -500,6 +460,11 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type);
 int dm_table_complete(struct dm_table *t);
 
 /*
+ * Destroy the table when finished.
+ */
+void dm_table_destroy(struct dm_table *t);
+
+/*
  * Target may require that it is never sent I/O larger than len.
  */
 int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len);
@@ -585,6 +550,7 @@ do { \
 #define DM_ENDIO_DONE		0
 #define DM_ENDIO_INCOMPLETE	1
 #define DM_ENDIO_REQUEUE	2
+#define DM_ENDIO_DELAY_REQUEUE	3
 
 /*
  * Definitions of return values from target map function.
@@ -592,7 +558,7 @@ do { \
 #define DM_MAPIO_SUBMITTED	0
 #define DM_MAPIO_REMAPPED	1
 #define DM_MAPIO_REQUEUE	DM_ENDIO_REQUEUE
-#define DM_MAPIO_DELAY_REQUEUE	3
+#define DM_MAPIO_DELAY_REQUEUE	DM_ENDIO_DELAY_REQUEUE
#define DM_MAPIO_KILL		4
 
 #define dm_sector_div64(x, y)( \
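
With DM_MAPIO_DELAY_REQUEUE now aliased to the new DM_ENDIO_DELAY_REQUEUE, a target reports "requeue this I/O, but only after a delay" with a single value whether it decides at map time or at completion time; the request-based path in this pull (the multipath "delay the retry of a request if the target responded as busy" change) is what consumes the end_io variant. Below is a hedged sketch of a hypothetical request-based rq_end_io hook; the error policy shown is illustrative, not the dm-mpath logic.

#include <linux/blk_types.h>
#include <linux/device-mapper.h>

/*
 * Hypothetical request-based completion hook.  Returning
 * DM_ENDIO_DELAY_REQUEUE asks DM core to requeue the request after a
 * short delay rather than immediately; everything else completes as-is.
 */
static int my_rq_end_io(struct dm_target *ti, struct request *clone,
			blk_status_t error, union map_info *map_context)
{
	if (error == BLK_STS_RESOURCE)
		return DM_ENDIO_DELAY_REQUEUE;	/* device was busy: back off, then retry */

	return DM_ENDIO_DONE;			/* pass the (possibly zero) error through */
}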