author    Linus Torvalds <torvalds@linux-foundation.org>  2018-01-31 14:05:47 -0500
committer Linus Torvalds <torvalds@linux-foundation.org>  2018-01-31 14:05:47 -0500
commit    0be600a5add76e8e8b9e1119f2a7426ff849aca8
tree      d5fcc2b119f03143f9bed1b9aa5cb85458c8bd03
parent    040639b7fcf73ee39c15d38257f652a2048e96f2
parent    9614e2ba9161c7f5419f4212fa6057d2a65f6ae6
Merge tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm
Pull device mapper updates from Mike Snitzer:
- DM core fixes to ensure that bio submission follows a depth-first
tree walk; this is critical to allow forward progress without the
need to use the bioset's BIOSET_NEED_RESCUER.
- Remove DM core's BIOSET_NEED_RESCUER based dm_offload infrastructure.
- DM core cleanups and improvements to make bio-based DM more efficient
(e.g. reduced memory footprint as well as leveraging per-bio-data more).
- Introduce new bio-based mode (DM_TYPE_NVME_BIO_BASED) that leverages
the more direct IO submission path in the block layer; this mode is
used by DM multipath and also optimizes targets like DM thin-pool
that stack directly on an NVMe data device.
- DM multipath improvements to factor out legacy SCSI-only (e.g.
scsi_dh) code paths to allow for more optimized support for NVMe
multipath.
- A fix for DM multipath path selectors (service-time and queue-length)
to select paths in a more balanced way; largely academic but doesn't
hurt.
- Numerous DM raid target fixes and improvements.
- Add a new DM "unstriped" target that enables Intel to work around
firmware limitations in some NVMe drives that are striped internally
(this target also works when stacked above the DM "striped" target;
a brief usage sketch follows this list).
- Various Documentation fixes and improvements.
- Misc cleanups and fixes across various DM infrastructure and targets
(e.g. bufio, flakey, log-writes, snapshot).
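
As a quick sketch of the new "unstriped" target's table syntax, taken from the
Documentation/device-mapper/unstriped.txt file added further down in this diff
(the device name, 512-sector length and 256-sector chunk size are the
illustrative values used there, not requirements):

    # Expose the two internal cores of a striped NVMe drive as separate devices.
    # Table format: <start> <len> unstriped <num stripes> <chunk> <stripe #> <dev> <offset>
    dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
    dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
    # The halves then appear as /dev/mapper/nvmset0 and /dev/mapper/nvmset1.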
* tag 'for-4.16/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (69 commits)
dm cache: Documentation: update default migration_throttling value
dm mpath selector: more evenly distribute ties
dm unstripe: fix target length versus number of stripes size check
dm thin: fix trailing semicolon in __remap_and_issue_shared_cell
dm table: fix NVMe bio-based dm_table_determine_type() validation
dm: various cleanups to md->queue initialization code
dm mpath: delay the retry of a request if the target responded as busy
dm mpath: return DM_MAPIO_DELAY_REQUEUE if QUEUE_IO or PG_INIT_REQUIRED
dm mpath: return DM_MAPIO_REQUEUE on blk-mq rq allocation failure
dm log writes: fix max length used for kstrndup
dm: backfill missing calls to mutex_destroy()
dm snapshot: use mutex instead of rw_semaphore
dm flakey: check for null arg_name in parse_features()
dm thin: extend thinpool status format string with omitted fields
dm thin: fixes in thin-provisioning.txt
dm thin: document representation of <highest mapped sector> when there is none
dm thin: fix documentation relative to low water mark threshold
dm cache: be consistent in specifying sectors and SI units in cache.txt
dm cache: delete obsoleted paragraph in cache.txt
dm cache: fix grammar in cache-policies.txt
...
31 files changed, 1409 insertions, 671 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt
index d3ca8af21a31..86786d87d9a8 100644
--- a/Documentation/device-mapper/cache-policies.txt
+++ b/Documentation/device-mapper/cache-policies.txt
@@ -60,7 +60,7 @@ Memory usage:
 The mq policy used a lot of memory; 88 bytes per cache block on a 64
 bit machine.
 
-smq uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement its data structures rather than
 pointers. It avoids storing an explicit hit count for each block. It
 has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
 the entries (each hotspot block covers a larger area than a single
@@ -84,7 +84,7 @@ resulting in better promotion/demotion decisions.
 
 Adaptability:
 The mq policy maintained a hit count for each cache block. For a
-different block to get promoted to the cache it's hit count has to
+different block to get promoted to the cache its hit count has to
 exceed the lowest currently in the cache. This meant it could take a
 long time for the cache to adapt between varying IO patterns.
 
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
index cdfd0feb294e..ff0841711fd5 100644
--- a/Documentation/device-mapper/cache.txt
+++ b/Documentation/device-mapper/cache.txt
@@ -59,7 +59,7 @@ Fixed block size
 The origin is divided up into blocks of a fixed size. This block size
 is configurable when you first create the cache. Typically we've been
 using block sizes of 256KB - 1024KB. The block size must be between 64
-(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
+sectors (32KB) and 2097152 sectors (1GB) and a multiple of 64 sectors (32KB).
 
 Having a fixed block size simplifies the target a lot. But it is
 something of a compromise. For instance, a small part of a block may be
@@ -119,7 +119,7 @@ doing here to avoid migrating during those peak io moments.
 
 For the time being, a message "migration_threshold <#sectors>"
 can be used to set the maximum number of sectors being migrated,
-the default being 204800 sectors (or 100MB).
+the default being 2048 sectors (1MB).
 
 Updating on-disk metadata
 -------------------------
@@ -143,11 +143,6 @@ the policy how big this chunk is, but it should be kept small. Like the
 dirty flags this data is lost if there's a crash so a safe fallback
 value should always be possible.
 
-For instance, the 'mq' policy, which is currently the default policy,
-uses this facility to store the hit count of the cache blocks. If
-there's a crash this information will be lost, which means the cache
-may be less efficient until those hit counts are regenerated.
-
 Policy hints affect performance, not correctness.
 
 Policy messaging
diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt
index 32df07e29f68..390c145f01d7 100644
--- a/Documentation/device-mapper/dm-raid.txt
+++ b/Documentation/device-mapper/dm-raid.txt
@@ -343,5 +343,8 @@ Version History
 1.11.0 Fix table line argument order
 (wrong raid10_copies/raid10_format sequence)
 1.11.1 Add raid4/5/6 journal write-back support via journal_mode option
-1.12.1 fix for MD deadlock between mddev_suspend() and md_write_start() available
+1.12.1 Fix for MD deadlock between mddev_suspend() and md_write_start() available
 1.13.0 Fix dev_health status at end of "recover" (was 'a', now 'A')
+1.13.1 Fix deadlock caused by early md_stop_writes(). Also fix size an
+state races.
+1.13.2 Fix raid redundancy validation and avoid keeping raid set frozen
diff --git a/Documentation/device-mapper/snapshot.txt b/Documentation/device-mapper/snapshot.txt
index ad6949bff2e3..b8bbb516f989 100644
--- a/Documentation/device-mapper/snapshot.txt
+++ b/Documentation/device-mapper/snapshot.txt
@@ -49,6 +49,10 @@ The difference between persistent and transient is with transient
 snapshots less metadata must be saved on disk - they can be kept in
 memory by the kernel.
 
+When loading or unloading the snapshot target, the corresponding
+snapshot-origin or snapshot-merge target must be suspended. A failure to
+suspend the origin target could result in data corruption.
+
 
 * snapshot-merge <origin> <COW device> <persistent> <chunksize>
 
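
The paragraph added above implies a specific ordering from userspace; a minimal
sketch of that ordering, assuming a hypothetical origin mapping named "origin",
an origin block device /dev/vg/base, a COW device /dev/vg/cow and a persistent
snapshot with an 8-sector chunk size (all of these names and values are
illustrative only):

    dmsetup suspend origin            # quiesce the snapshot-origin target first
    dmsetup create snap0 --table \
      "0 `blockdev --getsz /dev/vg/base` snapshot /dev/vg/base /dev/vg/cow P 8"
    dmsetup resume origin             # resume only after the snapshot target is loaded

The same suspend/resume bracketing applies when unloading the snapshot target
with "dmsetup remove".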
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
index 1699a55b7b70..4bcd4b7f79f9 100644
--- a/Documentation/device-mapper/thin-provisioning.txt
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -112,9 +112,11 @@ $low_water_mark is expressed in blocks of size $data_block_size. If
 free space on the data device drops below this level then a dm event
 will be triggered which a userspace daemon should catch allowing it to
 extend the pool device. Only one such event will be sent.
-Resuming a device with a new table itself triggers an event so the
-userspace daemon can use this to detect a situation where a new table
-already exceeds the threshold.
+
+No special event is triggered if a just resumed device's free space is below
+the low water mark. However, resuming a device always triggers an
+event; a userspace daemon should verify that free space exceeds the low
+water mark when handling this event.
 
 A low water mark for the metadata device is maintained in the kernel and
 will trigger a dm event if free space on the metadata device drops below
@@ -274,7 +276,8 @@ ii) Status
 
 <transaction id> <used metadata blocks>/<total metadata blocks>
 <used data blocks>/<total data blocks> <held metadata root>
-[no_]discard_passdown ro|rw
+ro|rw|out_of_data_space [no_]discard_passdown [error|queue]_if_no_space
+needs_check|-
 
 transaction id:
 A 64-bit number used by userspace to help synchronise with metadata
@@ -394,3 +397,6 @@ ii) Status
 If the pool has encountered device errors and failed, the status
 will just contain the string 'Fail'. The userspace recovery
 tools should then be used.
+
+In the case where <nr mapped sectors> is 0, there is no highest
+mapped sector and the value of <highest mapped sector> is unspecified.
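
For reference, the reordered status line documented above is what "dmsetup
status" reports for a pool. A hedged example of how it might look for a healthy
pool — every value below is illustrative, only the field order follows the
format string added in this patch:

    # dmsetup status pool
    # fields after "thin-pool": <transaction id> <used>/<total metadata blocks>
    #   <used>/<total data blocks> <held metadata root> ro|rw|out_of_data_space
    #   [no_]discard_passdown [error|queue]_if_no_space needs_check|-
    0 2097152 thin-pool 1 406/4096 1024/409600 - rw discard_passdown queue_if_no_space -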
diff --git a/Documentation/device-mapper/unstriped.txt b/Documentation/device-mapper/unstriped.txt
new file mode 100644
index 000000000000..0b2a306c54ee
--- /dev/null
+++ b/Documentation/device-mapper/unstriped.txt
@@ -0,0 +1,124 @@
+Introduction
+============
+
+The device-mapper "unstriped" target provides a transparent mechanism to
+unstripe a device-mapper "striped" target to access the underlying disks
+without having to touch the true backing block-device. It can also be
+used to unstripe a hardware RAID-0 to access backing disks.
+
+Parameters:
+<number of stripes> <chunk size> <stripe #> <dev_path> <offset>
+
+<number of stripes>
+The number of stripes in the RAID 0.
+
+<chunk size>
+The amount of 512B sectors in the chunk striping.
+
+<dev_path>
+The block device you wish to unstripe.
+
+<stripe #>
+The stripe number within the device that corresponds to physical
+drive you wish to unstripe. This must be 0 indexed.
+
+
+Why use this module?
+====================
+
+An example of undoing an existing dm-stripe
+-------------------------------------------
+
+This small bash script will setup 4 loop devices and use the existing
+striped target to combine the 4 devices into one. It then will use
+the unstriped target ontop of the striped device to access the
+individual backing loop devices. We write data to the newly exposed
+unstriped devices and verify the data written matches the correct
+underlying device on the striped array.
+
+#!/bin/bash
+
+MEMBER_SIZE=$((128 * 1024 * 1024))
+NUM=4
+SEQ_END=$((${NUM}-1))
+CHUNK=256
+BS=4096
+
+RAID_SIZE=$((${MEMBER_SIZE}*${NUM}/512))
+DM_PARMS="0 ${RAID_SIZE} striped ${NUM} ${CHUNK}"
+COUNT=$((${MEMBER_SIZE} / ${BS}))
+
+for i in $(seq 0 ${SEQ_END}); do
+dd if=/dev/zero of=member-${i} bs=${MEMBER_SIZE} count=1 oflag=direct
+losetup /dev/loop${i} member-${i}
+DM_PARMS+=" /dev/loop${i} 0"
+done
+
+echo $DM_PARMS | dmsetup create raid0
+for i in $(seq 0 ${SEQ_END}); do
+echo "0 1 unstriped ${NUM} ${CHUNK} ${i} /dev/mapper/raid0 0" | dmsetup create set-${i}
+done;
+
+for i in $(seq 0 ${SEQ_END}); do
+dd if=/dev/urandom of=/dev/mapper/set-${i} bs=${BS} count=${COUNT} oflag=direct
+diff /dev/mapper/set-${i} member-${i}
+done;
+
+for i in $(seq 0 ${SEQ_END}); do
+dmsetup remove set-${i}
+done
+
+dmsetup remove raid0
+
+for i in $(seq 0 ${SEQ_END}); do
+losetup -d /dev/loop${i}
+rm -f member-${i}
+done
+
+Another example
+---------------
+
+Intel NVMe drives contain two cores on the physical device.
+Each core of the drive has segregated access to its LBA range.
+The current LBA model has a RAID 0 128k chunk on each core, resulting
+in a 256k stripe across the two cores:
+
+Core 0: Core 1:
+__________ __________
+| LBA 512| | LBA 768|
+| LBA 0 | | LBA 256|
+---------- ----------
+
+The purpose of this unstriping is to provide better QoS in noisy
+neighbor environments. When two partitions are created on the
+aggregate drive without this unstriping, reads on one partition
+can affect writes on another partition. This is because the partitions
+are striped across the two cores. When we unstripe this hardware RAID 0
+and make partitions on each new exposed device the two partitions are now
+physically separated.
+
+With the dm-unstriped target we're able to segregate an fio script that
+has read and write jobs that are independent of each other. Compared to
+when we run the test on a combined drive with partitions, we were able
+to get a 92% reduction in read latency using this device mapper target.
+
+
+Example dmsetup usage
+=====================
+
+unstriped ontop of Intel NVMe device that has 2 cores
+-----------------------------------------------------
+dmsetup create nvmset0 --table '0 512 unstriped 2 256 0 /dev/nvme0n1 0'
+dmsetup create nvmset1 --table '0 512 unstriped 2 256 1 /dev/nvme0n1 0'
+
+There will now be two devices that expose Intel NVMe core 0 and 1
+respectively:
+/dev/mapper/nvmset0
+/dev/mapper/nvmset1
+
+unstriped ontop of striped with 4 drives using 128K chunk size
+--------------------------------------------------------------
+dmsetup create raid_disk0 --table '0 512 unstriped 4 256 0 /dev/mapper/striped 0'
+dmsetup create raid_disk1 --table '0 512 unstriped 4 256 1 /dev/mapper/striped 0'
+dmsetup create raid_disk2 --table '0 512 unstriped 4 256 2 /dev/mapper/striped 0'
+dmsetup create raid_disk3 --table '0 512 unstriped 4 256 3 /dev/mapper/striped 0'
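
To sanity-check mappings like those in the examples above, the loaded tables
can be read back; the device names here are the ones from the examples, and the
reported parameters should simply mirror the table strings that were loaded:

    dmsetup table raid_disk0          # prints the unstriped table for that device
    dmsetup ls --target unstriped     # lists every device backed by the unstriped target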
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index 83b9362be09c..2c8ac3688815 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -269,6 +269,13 @@ config DM_BIO_PRISON
 
 source "drivers/md/persistent-data/Kconfig"
 
+config DM_UNSTRIPED
+tristate "Unstriped target"
+depends on BLK_DEV_DM
+---help---
+Unstripes I/O so it is issued solely on a single drive in a HW
+RAID0 or dm-striped target.
+
 config DM_CRYPT
 tristate "Crypt target support"
 depends on BLK_DEV_DM
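
To build the new target from the option added above, it can be switched on like
any other DM module; a minimal sketch using the kernel's own config helper, run
from the top of an already configured source tree (the exact workflow is up to
the builder):

    ./scripts/config --module DM_UNSTRIPED   # results in CONFIG_DM_UNSTRIPED=m (needs BLK_DEV_DM)
    make olddefconfig                        # resolve any newly exposed defaults
    make modules                             # dm-unstripe.o is wired in via the Makefile hunk below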
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index f701bb211783..63255f3ebd97 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -43,6 +43,7 @@ obj-$(CONFIG_BCACHE) += bcache/
 obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
 obj-$(CONFIG_BLK_DEV_DM_BUILTIN) += dm-builtin.o
+obj-$(CONFIG_DM_UNSTRIPED) += dm-unstripe.o
 obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
 obj-$(CONFIG_DM_BIO_PRISON) += dm-bio-prison.o
 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
index c546b567f3b5..414c9af54ded 100644
--- a/drivers/md/dm-bufio.c
+++ b/drivers/md/dm-bufio.c
@@ -662,7 +662,7 @@ static void submit_io(struct dm_buffer *b, int rw, bio_end_io_t *end_io)
 
 sector = (b->block << b->c->sectors_per_block_bits) + b->c->start;
 
-if (rw != WRITE) {
+if (rw != REQ_OP_WRITE) {
 n_sectors = 1 << b->c->sectors_per_block_bits;
 offset = 0;
 } else {
@@ -740,7 +740,7 @@ static void __write_dirty_buffer(struct dm_buffer *b,
 b->write_end = b->dirty_end;
 
 if (!write_list)
-submit_io(b, WRITE, write_endio);
+submit_io(b, REQ_OP_WRITE, write_endio);
 else
 list_add_tail(&b->write_list, write_list);
 }
@@ -753,7 +753,7 @@ static void __flush_write_list(struct list_head *write_list)
 struct dm_buffer *b =
 list_entry(write_list->next, struct dm_buffer, write_list);
 list_del(&b->write_list);
-submit_io(b, WRITE, write_endio);
+submit_io(b, REQ_OP_WRITE, write_endio);
 cond_resched();
 }
 blk_finish_plug(&plug);
@@ -1123,7 +1123,7 @@ static void *new_read(struct dm_bufio_client *c, sector_t block,
 return NULL;
 
 if (need_submit)
-submit_io(b, READ, read_endio);
+submit_io(b, REQ_OP_READ, read_endio);
 
 wait_on_bit_io(&b->state, B_READING, TASK_UNINTERRUPTIBLE);
 
@@ -1193,7 +1193,7 @@ void dm_bufio_prefetch(struct dm_bufio_client *c,
 dm_bufio_unlock(c);
 
 if (need_submit)
-submit_io(b, READ, read_endio);
+submit_io(b, REQ_OP_READ, read_endio);
 dm_bufio_release(b);
 
 cond_resched();
@@ -1454,7 +1454,7 @@ retry:
 old_block = b->block;
 __unlink_buffer(b);
 __link_buffer(b, new_block, b->list_mode);
-submit_io(b, WRITE, write_endio);
+submit_io(b, REQ_OP_WRITE, write_endio);
 wait_on_bit_io(&b->state, B_WRITING,
 TASK_UNINTERRUPTIBLE);
 __unlink_buffer(b);
@@ -1716,7 +1716,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 if (!DM_BUFIO_CACHE_NAME(c)) {
 r = -ENOMEM;
 mutex_unlock(&dm_bufio_clients_lock);
-goto bad_cache;
+goto bad;
 }
 }
 
@@ -1727,7 +1727,7 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 if (!DM_BUFIO_CACHE(c)) {
 r = -ENOMEM;
 mutex_unlock(&dm_bufio_clients_lock);
-goto bad_cache;
+goto bad;
 }
 }
 }
@@ -1738,27 +1738,28 @@ struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsign
 
 if (!b) {
 r = -ENOMEM;
-goto bad_buffer;
+goto bad;
 }
 __free_buffer_wake(b);
 }
 
+c->shrinker.count_objects = dm_bufio_shrink_count;
+c->shrinker.scan_objects = dm_bufio_shrink_scan;
+c->shrinker.seeks = 1;
+c->shrinker.batch = 0;
+r = register_shrinker(&c->shrinker);
+if (r)
+goto bad;
+
 mutex_lock(&dm_bufio_clients_lock);
 dm_bufio_client_count++;
 list_add(&c->client_list, &dm_bufio_all_clients);
 __cache_size_refresh();
 mutex_unlock(&dm_bufio_clients_lock);
 
-c->shrinker.count_objects = dm_bufio_shrink_count;
-c->shrinker.scan_objects = dm_bufio_shrink_scan;
-c->shrinker.seeks = 1;
-c->shrinker.batch = 0;
-register_shrinker(&c->shrinker);
-
 return c;
 
-bad_buffer:
-bad_cache:
+bad:
 while (!list_empty(&c->reserved_buffers)) {
 struct dm_buffer *b = list_entry(c->reserved_buffers.next,
 struct dm_buffer, lru_list);
@@ -1767,6 +1768,7 @@ bad_cache:
 }
 dm_io_client_destroy(c->dm_io);
 bad_dm_io:
+mutex_destroy(&c->lock);
 kfree(c);
 bad_client:
 return ERR_PTR(r);
@@ -1811,6 +1813,7 @@ void dm_bufio_client_destroy(struct dm_bufio_client *c)
 BUG_ON(c->n_buffers[i]);
 
 dm_io_client_destroy(c->dm_io);
+mutex_destroy(&c->lock);
 kfree(c);
 }
 EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
diff --git a/drivers/md/dm-core.h b/drivers/md/dm-core.h
index 6a14f945783c..3222e21cbbf8 100644
--- a/drivers/md/dm-core.h
+++ b/drivers/md/dm-core.h
@@ -91,8 +91,7 @@ struct mapped_device {
 /*
 * io objects are allocated from here.
 */
-mempool_t *io_pool;
-
+struct bio_set *io_bs;
 struct bio_set *bs;
 
 /*
@@ -130,8 +129,6 @@ struct mapped_device {
 struct srcu_struct io_barrier;
 };
 
-void dm_init_md_queue(struct mapped_device *md);
-void dm_init_normal_md_queue(struct mapped_device *md);
 int md_in_flight(struct mapped_device *md);
 void disable_write_same(struct mapped_device *md);
 void disable_write_zeroes(struct mapped_device *md);
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 2ad429100d25..8168f737590e 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -2193,6 +2193,8 @@ static void crypt_dtr(struct dm_target *ti)
 kzfree(cc->cipher_auth);
 kzfree(cc->authenc_key);
 
+mutex_destroy(&cc->bio_alloc_lock);
+
 /* Must zero key material before freeing */
 kzfree(cc);
 }
@@ -2702,8 +2704,7 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 goto bad;
 }
 
-cc->bs = bioset_create(MIN_IOS, 0, (BIOSET_NEED_BVECS |
-BIOSET_NEED_RESCUER));
+cc->bs = bioset_create(MIN_IOS, 0, BIOSET_NEED_BVECS);
 if (!cc->bs) {
 ti->error = "Cannot allocate crypt bioset";
 goto bad;
diff --git a/drivers/md/dm-delay.c b/drivers/md/dm-delay.c
index 288386bfbfb5..1783d80c9cad 100644
--- a/drivers/md/dm-delay.c
+++ b/drivers/md/dm-delay.c
@@ -229,6 +229,8 @@ static void delay_dtr(struct dm_target *ti)
 if (dc->dev_write)
 dm_put_device(ti, dc->dev_write);
 
+mutex_destroy(&dc->timer_lock);
+
 kfree(dc);
 }
 
diff --git a/drivers/md/dm-flakey.c b/drivers/md/dm-flakey.c
index b82cb1ab1eaa..1b907b15f5c3 100644
--- a/drivers/md/dm-flakey.c
+++ b/drivers/md/dm-flakey.c
@@ -70,6 +70,11 @@ static int parse_features(struct dm_arg_set *as, struct flakey_c *fc,
 arg_name = dm_shift_arg(as);
 argc--;
 
+if (!arg_name) {
+ti->error = "Insufficient feature arguments";
+return -EINVAL;
+}
+
 /*
 * drop_writes
 */
diff --git a/drivers/md/dm-io.c b/drivers/md/dm-io.c
index b4357ed4d541..a8d914d5abbe 100644
--- a/drivers/md/dm-io.c
+++ b/drivers/md/dm-io.c
@@ -58,8 +58,7 @@ struct dm_io_client *dm_io_client_create(void)
 if (!client->pool)
 goto bad;
 
-client->bios = bioset_create(min_ios, 0, (BIOSET_NEED_BVECS |
-BIOSET_NEED_RESCUER));
+client->bios = bioset_create(min_ios, 0, BIOSET_NEED_BVECS);
 if (!client->bios)
 goto bad;
 
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index eb45cc3df31d..e6e7c686646d 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -477,8 +477,10 @@ static int run_complete_job(struct kcopyd_job *job)
 * If this is the master job, the sub jobs have already
 * completed so we can free everything.
 */
-if (job->master_job == job)
+if (job->master_job == job) {
+mutex_destroy(&job->lock);
 mempool_free(job, kc->job_pool);
+}
 fn(read_err, write_err, context);
 
 if (atomic_dec_and_test(&kc->nr_jobs))
@@ -750,6 +752,7 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
 * followed by SPLIT_COUNT sub jobs.
 */
 job = mempool_alloc(kc->job_pool, GFP_NOIO);
+mutex_init(&job->lock);
 
 /*
 * set up for the read.
@@ -811,7 +814,6 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
 if (job->source.count <= SUB_JOB_SIZE)
 dispatch_job(job);
 else {
-mutex_init(&job->lock);
 job->progress = 0;
 split_job(job);
 }
diff --git a/drivers/md/dm-log-writes.c b/drivers/md/dm-log-writes.c
index 189badbeddaf..3362d866793b 100644
--- a/drivers/md/dm-log-writes.c
+++ b/drivers/md/dm-log-writes.c
@@ -594,7 +594,7 @@ static int log_mark(struct log_writes_c *lc, char *data)
 return -ENOMEM;
 }
 
-block->data = kstrndup(data, maxsize, GFP_KERNEL);
+block->data = kstrndup(data, maxsize - 1, GFP_KERNEL);
 if (!block->data) {
 DMERR("Error copying mark data");
 kfree(block);
diff --git a/drivers/md/dm-mpath.c b/drivers/md/dm-mpath.c
index ef57c6d1c887..7d3e572072f5 100644
--- a/drivers/md/dm-mpath.c
+++ b/drivers/md/dm-mpath.c
@@ -64,36 +64,30 @@ struct priority_group {
 
 /* Multipath context */
 struct multipath {
-struct list_head list;
-struct dm_target *ti;
-
-const char *hw_handler_name;
-char *hw_handler_params;
+unsigned long flags; /* Multipath state flags */
 
 spinlock_t lock;
-
-unsigned nr_priority_groups;
-struct list_head priority_groups;
-
-wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
+enum dm_queue_mode queue_mode;
 
 struct pgpath *current_pgpath;
 struct priority_group *current_pg;
 struct priority_group *next_pg; /* Switch to this PG if set */
 
-unsigned long flags; /* Multipath state flags */
+atomic_t nr_valid_paths; /* Total number of usable paths */
+unsigned nr_priority_groups;
+struct list_head priority_groups;
 
+const char *hw_handler_name;
+char *hw_handler_params;
+wait_queue_head_t pg_init_wait; /* Wait for pg_init completion */
 unsigned pg_init_retries; /* Number of times to retry pg_init */
 unsigned pg_init_delay_msecs; /* Number of msecs before pg_init retry */
-
-atomic_t nr_valid_paths; /* Total number of usable paths */
 atomic_t pg_init_in_progress; /* Only one pg_init allowed at once */
 atomic_t pg_init_count; /* Number of times pg_init called */
 
-enum dm_queue_mode queue_mode;
-
 struct mutex work_mutex;
 struct work_struct trigger_event;
+struct dm_target *ti;
 
 struct work_struct process_queued_bios;
 struct bio_list queued_bios;
@@ -135,10 +129,10 @@ static struct pgpath *alloc_pgpath(void)
 {
 struct pgpath *pgpath = kzalloc(sizeof(*pgpath), GFP_KERNEL);
 
-if (pgpath) {
-pgpath->is_active = true;
-INIT_DELAYED_WORK(&pgpath->activate_path, activate_path_work);
-}
+if (!pgpath)
+return NULL;
+
+pgpath->is_active = true;
 
 return pgpath;
 }
@@ -193,13 +187,8 @@ static struct multipath *alloc_multipath(struct dm_target *ti)
 if (m) {
 INIT_LIST_HEAD(&m->priority_groups);
 spin_lock_init(&m->lock);
-set_bit(MPATHF_QUEUE_IO, &m->flags);
 atomic_set(&m->nr_valid_paths, 0);
-atomic_set(&m->pg_init_in_progress, 0);
-atomic_set(&m->pg_init_count, 0);
-m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
 INIT_WORK(&m->trigger_event, trigger_event);
-init_waitqueue_head(&m->pg_init_wait);
 mutex_init(&m->work_mutex);
 
 m->queue_mode = DM_TYPE_NONE;
@@ -221,13 +210,26 @@ static int alloc_multipath_stage2(struct dm_target *ti, struct multipath *m)
 m->queue_mode = DM_TYPE_MQ_REQUEST_BASED;
 else
 m->queue_mode = DM_TYPE_REQUEST_BASED;
-} else if (m->queue_mode == DM_TYPE_BIO_BASED) {
+
+} else if (m->queue_mode == DM_TYPE_BIO_BASED ||
+m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
 INIT_WORK(&m->process_queued_bios, process_queued_bios);
-/*
-* bio-based doesn't support any direct scsi_dh management;
-* it just discovers if a scsi_dh is attached.
-*/
-set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
+
+if (m->queue_mode == DM_TYPE_BIO_BASED) {
+/*
+* bio-based doesn't support any direct scsi_dh management;
+* it just discovers if a scsi_dh is attached.
+*/
+set_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags);
+}
+}
+
+if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+set_bit(MPATHF_QUEUE_IO, &m->flags);
+atomic_set(&m->pg_init_in_progress, 0);
+atomic_set(&m->pg_init_count, 0);
+m->pg_init_delay_msecs = DM_PG_INIT_DELAY_DEFAULT;
+init_waitqueue_head(&m->pg_init_wait);
 }
 
 dm_table_set_type(ti->table, m->queue_mode);
@@ -246,6 +248,7 @@ static void free_multipath(struct multipath *m)
 
 kfree(m->hw_handler_name);
 kfree(m->hw_handler_params);
+mutex_destroy(&m->work_mutex);
 kfree(m);
 }
 
@@ -264,29 +267,23 @@ static struct dm_mpath_io *get_mpio_from_bio(struct bio *bio)
 return dm_per_bio_data(bio, multipath_per_bio_data_size());
 }
 
-static struct dm_bio_details *get_bio_details_from_bio(struct bio *bio)
+static struct dm_bio_details *get_bio_details_from_mpio(struct dm_mpath_io *mpio)
 {
 /* dm_bio_details is immediately after the dm_mpath_io in bio's per-bio-data */
-struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
 void *bio_details = mpio + 1;
-
 return bio_details;
 }
 
-static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p,
-struct dm_bio_details **bio_details_p)
+static void multipath_init_per_bio_data(struct bio *bio, struct dm_mpath_io **mpio_p)
 {
 struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
-struct dm_bio_details *bio_details = get_bio_details_from_bio(bio);
+struct dm_bio_details *bio_details = get_bio_details_from_mpio(mpio);
 
-memset(mpio, 0, sizeof(*mpio));
-memset(bio_details, 0, sizeof(*bio_details));
-dm_bio_record(bio_details, bio);
+mpio->nr_bytes = bio->bi_iter.bi_size;
+mpio->pgpath = NULL;
+*mpio_p = mpio;
 
-if (mpio_p)
-*mpio_p = mpio;
-if (bio_details_p)
-*bio_details_p = bio_details;
+dm_bio_record(bio_details, bio);
 }
 
 /*-----------------------------------------------
@@ -340,6 +337,9 @@ static void __switch_pg(struct multipath *m, struct priority_group *pg)
 {
 m->current_pg = pg;
 
+if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+return;
+
 /* Must we initialise the PG first, and queue I/O till it's ready? */
 if (m->hw_handler_name) {
 set_bit(MPATHF_PG_INIT_REQUIRED, &m->flags);
@@ -385,7 +385,8 @@ static struct pgpath *choose_pgpath(struct multipath *m, size_t nr_bytes)
 unsigned bypassed = 1;
 
 if (!atomic_read(&m->nr_valid_paths)) {
-clear_bit(MPATHF_QUEUE_IO, &m->flags);
+if (m->queue_mode != DM_TYPE_NVME_BIO_BASED)
+clear_bit(MPATHF_QUEUE_IO, &m->flags);
 goto failed;
 }
 
@@ -516,12 +517,10 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 return DM_MAPIO_KILL;
 } else if (test_bit(MPATHF_QUEUE_IO, &m->flags) ||
 test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags)) {
-if (pg_init_all_paths(m))
+pg_init_all_paths(m);
 return DM_MAPIO_DELAY_REQUEUE;
-return DM_MAPIO_REQUEUE;
 }
 
-memset(mpio, 0, sizeof(*mpio));
 mpio->pgpath = pgpath;
 mpio->nr_bytes = nr_bytes;
 
@@ -530,12 +529,23 @@ static int multipath_clone_and_map(struct dm_target *ti, struct request *rq,
 clone = blk_get_request(q, rq->cmd_flags | REQ_NOMERGE, GFP_ATOMIC);
 if (IS_ERR(clone)) {
 /* EBUSY, ENODEV or EWOULDBLOCK: requeue */
-bool queue_dying = blk_queue_dying(q);
-if (queue_dying) {
+if (blk_queue_dying(q)) {
 atomic_inc(&m->pg_init_in_progress);
 activate_or_offline_path(pgpath);
+return DM_MAPIO_DELAY_REQUEUE;
 }
-return DM_MAPIO_DELAY_REQUEUE;
+
+/*
+* blk-mq's SCHED_RESTART can cover this requeue, so we
+* needn't deal with it by DELAY_REQUEUE. More importantly,
+* we have to return DM_MAPIO_REQUEUE so that blk-mq can
+* get the queue busy feedback (via BLK_STS_RESOURCE),
+* otherwise I/O merging can suffer.
+*/
+if (q->mq_ops)
+return DM_MAPIO_REQUEUE;
+else
+return DM_MAPIO_DELAY_REQUEUE;
 }
 clone->bio = clone->biotail = NULL;
 clone->rq_disk = bdev->bd_disk;
@@ -557,9 +567,9 @@ static void multipath_release_clone(struct request *clone)
 /*
 * Map cloned bios (bio-based multipath)
 */
-static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_mpath_io *mpio)
+
+static struct pgpath *__map_bio(struct multipath *m, struct bio *bio)
 {
-size_t nr_bytes = bio->bi_iter.bi_size;
 struct pgpath *pgpath;
 unsigned long flags;
 bool queue_io;
@@ -568,7 +578,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
 pgpath = READ_ONCE(m->current_pgpath);
 queue_io = test_bit(MPATHF_QUEUE_IO, &m->flags);
 if (!pgpath || !queue_io)
-pgpath = choose_pgpath(m, nr_bytes);
+pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
 
 if ((pgpath && queue_io) ||
 (!pgpath && test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags))) {
@@ -576,14 +586,62 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
 spin_lock_irqsave(&m->lock, flags);
 bio_list_add(&m->queued_bios, bio);
 spin_unlock_irqrestore(&m->lock, flags);
+
 /* PG_INIT_REQUIRED cannot be set without QUEUE_IO */
 if (queue_io || test_bit(MPATHF_PG_INIT_REQUIRED, &m->flags))
 pg_init_all_paths(m);
 else if (!queue_io)
 queue_work(kmultipathd, &m->process_queued_bios);
-return DM_MAPIO_SUBMITTED;
+
+return ERR_PTR(-EAGAIN);
 }
 
+return pgpath;
+}
+
+static struct pgpath *__map_bio_nvme(struct multipath *m, struct bio *bio)
+{
+struct pgpath *pgpath;
+unsigned long flags;
+
+/* Do we need to select a new pgpath? */
+/*
+* FIXME: currently only switching path if no path (due to failure, etc)
+* - which negates the point of using a path selector
+*/
+pgpath = READ_ONCE(m->current_pgpath);
+if (!pgpath)
+pgpath = choose_pgpath(m, bio->bi_iter.bi_size);
+
+if (!pgpath) {
+if (test_bit(MPATHF_QUEUE_IF_NO_PATH, &m->flags)) {
+/* Queue for the daemon to resubmit */
+spin_lock_irqsave(&m->lock, flags);
+bio_list_add(&m->queued_bios, bio);
+spin_unlock_irqrestore(&m->lock, flags);
+queue_work(kmultipathd, &m->process_queued_bios);
+
+return ERR_PTR(-EAGAIN);
+}
+return NULL;
+}
+
+return pgpath;
+}
+
+static int __multipath_map_bio(struct multipath *m, struct bio *bio,
+struct dm_mpath_io *mpio)
+{
+struct pgpath *pgpath;
+
+if (m->queue_mode == DM_TYPE_NVME_BIO_BASED)
+pgpath = __map_bio_nvme(m, bio);
+else
+pgpath = __map_bio(m, bio);
+
+if (IS_ERR(pgpath))
+return DM_MAPIO_SUBMITTED;
+
 if (!pgpath) {
 if (must_push_back_bio(m))
 return DM_MAPIO_REQUEUE;
@@ -592,7 +650,6 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
 }
 
 mpio->pgpath = pgpath;
-mpio->nr_bytes = nr_bytes;
 
 bio->bi_status = 0;
 bio_set_dev(bio, pgpath->path.dev->bdev);
@@ -601,7 +658,7 @@ static int __multipath_map_bio(struct multipath *m, struct bio *bio, struct dm_m
 if (pgpath->pg->ps.type->start_io)
 pgpath->pg->ps.type->start_io(&pgpath->pg->ps,
 &pgpath->path,
-nr_bytes);
+mpio->nr_bytes);
 return DM_MAPIO_REMAPPED;
 }
 
@@ -610,8 +667,7 @@ static int multipath_map_bio(struct dm_target *ti, struct bio *bio)
 struct multipath *m = ti->private;
 struct dm_mpath_io *mpio = NULL;
 
-multipath_init_per_bio_data(bio, &mpio, NULL);
-
+multipath_init_per_bio_data(bio, &mpio);
 return __multipath_map_bio(m, bio, mpio);
 }
 
@@ -619,7 +675,8 @@ static void process_queued_io_list(struct multipath *m)
 {
 if (m->queue_mode == DM_TYPE_MQ_REQUEST_BASED)
 dm_mq_kick_requeue_list(dm_table_get_md(m->ti->table));
-else if (m->queue_mode == DM_TYPE_BIO_BASED)
+else if (m->queue_mode == DM_TYPE_BIO_BASED ||
+m->queue_mode == DM_TYPE_NVME_BIO_BASED)
 queue_work(kmultipathd, &m->process_queued_bios);
 }
 
@@ -649,7 +706,9 @@ static void process_queued_bios(struct work_struct *work)
 
 blk_start_plug(&plug);
 while ((bio = bio_list_pop(&bios))) {
-r = __multipath_map_bio(m, bio, get_mpio_from_bio(bio));
+struct dm_mpath_io *mpio = get_mpio_from_bio(bio);
+dm_bio_restore(get_bio_details_from_mpio(mpio), bio);
+r = __multipath_map_bio(m, bio, mpio);
 switch (r) {
 case DM_MAPIO_KILL:
 bio->bi_status = BLK_STS_IOERR;
@@ -752,34 +811,11 @@ static int parse_path_selector(struct dm_arg_set *as, struct priority_group *pg,
 return 0;
 }
 
-static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
-struct dm_target *ti)
+static int setup_scsi_dh(struct block_device *bdev, struct multipath *m, char **error)
 {
-int r;
-struct pgpath *p;
-struct multipath *m = ti->private;
-struct request_queue *q = NULL;
+struct request_queue *q = bdev_get_queue(bdev);
 const char *attached_handler_name;
-
-/* we need at least a path arg */
-if (as->argc < 1) {
-ti->error = "no device given";
-return ERR_PTR(-EINVAL);
-}
-
-p = alloc_pgpath();
-if (!p)
-return ERR_PTR(-ENOMEM);
-
-r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
-&p->path.dev);
-if (r) {
-ti->error = "error getting device";
-goto bad;
-}
-
-if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags) || m->hw_handler_name)
-q = bdev_get_queue(p->path.dev->bdev);
+int r;
 
 if (test_bit(MPATHF_RETAIN_ATTACHED_HW_HANDLER, &m->flags)) {
 retain:
@@ -811,26 +847,59 @@ retain:
 char b[BDEVNAME_SIZE];
 
 printk(KERN_INFO "dm-mpath: retaining handler on device %s\n",
-bdevname(p->path.dev->bdev, b));
+bdevname(bdev, b));
 goto retain;
 }
 if (r < 0) {
-ti->error = "error attaching hardware handler";
-dm_put_device(ti, p->path.dev);
-goto bad;
+*error = "error attaching hardware handler";
+return r;
 }
 
 if (m->hw_handler_params) {
 r = scsi_dh_set_params(q, m->hw_handler_params);
 if (r < 0) {
-ti->error = "unable to set hardware "
-"handler parameters";
-dm_put_device(ti, p->path.dev);
-goto bad;
+*error = "unable to set hardware handler parameters";
+return r;
 }
 }
 }
 
+return 0;
+}
+
+static struct pgpath *parse_path(struct dm_arg_set *as, struct path_selector *ps,
+struct dm_target *ti)
+{
+int r;
+struct pgpath *p;
+struct multipath *m = ti->private;
+
+/* we need at least a path arg */
+if (as->argc < 1) {
+ti->error = "no device given";
+return ERR_PTR(-EINVAL);
+}
+
+p = alloc_pgpath();
+if (!p)
+return ERR_PTR(-ENOMEM);
+
+r = dm_get_device(ti, dm_shift_arg(as), dm_table_get_mode(ti->table),
+&p->path.dev);
+if (r) {
+ti->error = "error getting device";
+goto bad;
+}
+
+if (m->queue_mode != DM_TYPE_NVME_BIO_BASED) {
+INIT_DELAYED_WORK(&p->activate_path, activate_path_work);
+r = setup_scsi_dh(p->path.dev->bdev, m, &ti->error);
+if (r) {
+dm_put_device(ti, p->path.dev);
+goto bad;
+}
+}
+
 r = ps->type->add_path(ps, &p->path, as->argc, as->argv, &ti->error);
 if (r) {
 dm_put_device(ti, p->path.dev);
@@ -838,7 +907,6 @@ retain:
 }
 
 return p;
-
 bad:
 free_pgpath(p);
 return ERR_PTR(r);
@@ -933,7 +1001,8 @@ static int parse_hw_handler(struct dm_arg_set *as, struct multipath *m)
 if (!hw_argc)
 return 0;
 
-if (m->queue_mode == DM_TYPE_BIO_BASED) {
+if (m->queue_mode == DM_TYPE_BIO_BASED ||
+m->queue_mode == DM_TYPE_NVME_BIO_BASED) {
 dm_consume_args(as, hw_argc);
 DMERR("bio-based multipath doesn't allow hardware handler args");
 return 0;
@@ -1022,6 +1091,8 @@ static int parse_features(struct dm_arg_set *as, struct multipath *m)
 
 if (!strcasecmp(queue_mode_name, "bio"))
 m->queue_mode = DM_TYPE_BIO_BASED;
+else if (!strcasecmp(queue_mode_name, "nvme"))
+m->queue_mode = DM_TYPE_NVME_BIO_BASED;
 else if (!strcasecmp(queue_mode_name, "rq"))
 m->queue_mode = DM_TYPE_REQUEST_BASED;
 else if (!strcasecmp(queue_mode_name, "mq"))
@@ -1122,7 +1193,7 @@ static int multipath_ctr(struct dm_target *ti, unsigned argc, char **argv)
 ti->num_discard_bios = 1;
 ti->num_write_same_bios = 1;
 ti->num_write_zeroes_bios = 1;
-if (m->queue_mode == DM_TYPE_BIO_BASED)
+if (m->queue_mode == DM_TYPE_BIO_BASED || m->queue_mode == DM_TYPE_NVME_BIO_BASED)
 ti->per_io_data_size = multipath_per_bio_data_size();
 else
 ti->per_io_data_size = sizeof(struct dm_mpath_io);
@@ -1151,16 +1222,19 @@ static void multipath_wait_for_pg_init_completion(struct multipath *m)
 
 static void flush_multipath_work(struct multipath *m)
 {
-set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
-smp_mb__after_atomic();
+if (m->hw_handler_name) {
+set_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+smp_mb__after_atomic();
+
+flush_workqueue(kmpath_handlerd);
+multipath_wait_for_pg_init_completion(m);
+
+clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
+smp_mb__after_atomic();
+}
 
-flush_workqueue(kmpath_handlerd);
-multipath_wait_for_pg_init_completion(m);
 flush_workqueue(kmultipathd);
 flush_work(&m->trigger_event);
-
-clear_bit(MPATHF_PG_INIT_DISABLED, &m->flags);
-smp_mb__after_atomic();
 }
 
 static void multipath_dtr(struct dm_target *ti)
@@ -1496,7 +1570,10 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone, | |||
1496 | if (error && blk_path_error(error)) { | 1570 | if (error && blk_path_error(error)) { |
1497 | struct multipath *m = ti->private; | 1571 | struct multipath *m = ti->private; |
1498 | 1572 | ||
1499 | r = DM_ENDIO_REQUEUE; | 1573 | if (error == BLK_STS_RESOURCE) |
1574 | r = DM_ENDIO_DELAY_REQUEUE; | ||
1575 | else | ||
1576 | r = DM_ENDIO_REQUEUE; | ||
1500 | 1577 | ||
1501 | if (pgpath) | 1578 | if (pgpath) |
1502 | fail_path(pgpath); | 1579 | fail_path(pgpath); |
@@ -1521,7 +1598,7 @@ static int multipath_end_io(struct dm_target *ti, struct request *clone, | |||
1521 | } | 1598 | } |
1522 | 1599 | ||
1523 | static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, | 1600 | static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, |
1524 | blk_status_t *error) | 1601 | blk_status_t *error) |
1525 | { | 1602 | { |
1526 | struct multipath *m = ti->private; | 1603 | struct multipath *m = ti->private; |
1527 | struct dm_mpath_io *mpio = get_mpio_from_bio(clone); | 1604 | struct dm_mpath_io *mpio = get_mpio_from_bio(clone); |
@@ -1546,9 +1623,6 @@ static int multipath_end_io_bio(struct dm_target *ti, struct bio *clone, | |||
1546 | goto done; | 1623 | goto done; |
1547 | } | 1624 | } |
1548 | 1625 | ||
1549 | /* Queue for the daemon to resubmit */ | ||
1550 | dm_bio_restore(get_bio_details_from_bio(clone), clone); | ||
1551 | |||
1552 | spin_lock_irqsave(&m->lock, flags); | 1626 | spin_lock_irqsave(&m->lock, flags); |
1553 | bio_list_add(&m->queued_bios, clone); | 1627 | bio_list_add(&m->queued_bios, clone); |
1554 | spin_unlock_irqrestore(&m->lock, flags); | 1628 | spin_unlock_irqrestore(&m->lock, flags); |
@@ -1656,6 +1730,9 @@ static void multipath_status(struct dm_target *ti, status_type_t type, | |||
1656 | case DM_TYPE_BIO_BASED: | 1730 | case DM_TYPE_BIO_BASED: |
1657 | DMEMIT("queue_mode bio "); | 1731 | DMEMIT("queue_mode bio "); |
1658 | break; | 1732 | break; |
1733 | case DM_TYPE_NVME_BIO_BASED: | ||
1734 | DMEMIT("queue_mode nvme "); | ||
1735 | break; | ||
1659 | case DM_TYPE_MQ_REQUEST_BASED: | 1736 | case DM_TYPE_MQ_REQUEST_BASED: |
1660 | DMEMIT("queue_mode mq "); | 1737 | DMEMIT("queue_mode mq "); |
1661 | break; | 1738 | break; |
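Note on the dm-mpath.c hunks above: parse_features() now accepts "queue_mode nvme", mapped to the new DM_TYPE_NVME_BIO_BASED mode (reported back by multipath_status() and sized like the existing bio-based mode for per-bio data), and multipath_end_io() now separates transient resource exhaustion from other retryable path errors, requeueing BLK_STS_RESOURCE after a delay instead of immediately. A condensed single-column sketch of that requeue decision, restated from the hunk above with the surrounding queue_if_no_path handling omitted (not a standalone compile unit):

    /* Requeue decision for a failed clone request (sketch of the hunk above). */
    if (error && blk_path_error(error)) {
            if (error == BLK_STS_RESOURCE)
                    r = DM_ENDIO_DELAY_REQUEUE;     /* device/transport busy: retry after a delay */
            else
                    r = DM_ENDIO_REQUEUE;           /* other retryable path error: requeue now */

            if (pgpath)
                    fail_path(pgpath);              /* take the failing path out of rotation */
    }

DM core acts on DM_ENDIO_DELAY_REQUEUE in dm_done() (see the dm-rq.c hunk further down), so a busy target is retried with a back-off rather than hammered with immediate requeues.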
diff --git a/drivers/md/dm-queue-length.c b/drivers/md/dm-queue-length.c index 23f178641794..969c4f1a3633 100644 --- a/drivers/md/dm-queue-length.c +++ b/drivers/md/dm-queue-length.c | |||
@@ -195,9 +195,6 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes) | |||
195 | if (list_empty(&s->valid_paths)) | 195 | if (list_empty(&s->valid_paths)) |
196 | goto out; | 196 | goto out; |
197 | 197 | ||
198 | /* Change preferred (first in list) path to evenly balance. */ | ||
199 | list_move_tail(s->valid_paths.next, &s->valid_paths); | ||
200 | |||
201 | list_for_each_entry(pi, &s->valid_paths, list) { | 198 | list_for_each_entry(pi, &s->valid_paths, list) { |
202 | if (!best || | 199 | if (!best || |
203 | (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) | 200 | (atomic_read(&pi->qlen) < atomic_read(&best->qlen))) |
@@ -210,6 +207,9 @@ static struct dm_path *ql_select_path(struct path_selector *ps, size_t nr_bytes) | |||
210 | if (!best) | 207 | if (!best) |
211 | goto out; | 208 | goto out; |
212 | 209 | ||
210 | /* Move most recently used to least preferred to evenly balance. */ | ||
211 | list_move_tail(&best->list, &s->valid_paths); | ||
212 | |||
213 | ret = best->path; | 213 | ret = best->path; |
214 | out: | 214 | out: |
215 | spin_unlock_irqrestore(&s->lock, flags); | 215 | spin_unlock_irqrestore(&s->lock, flags); |
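The dm-queue-length.c change above is about tie-breaking: the selector used to rotate the head of valid_paths before scanning, which skewed selection whenever several paths reported the same queue length. It now scans the list untouched and only afterwards moves the chosen path to the tail, so equal-cost paths take turns. Condensed sketch of the resulting selection loop, restated from the hunk above with locking omitted (not a standalone compile unit):

    /* Pick the path with the shortest queue, then demote it (sketch). */
    list_for_each_entry(pi, &s->valid_paths, list)
            if (!best || atomic_read(&pi->qlen) < atomic_read(&best->qlen))
                    best = pi;

    if (best)
            /* Most recently used becomes least preferred, spreading ties evenly. */
            list_move_tail(&best->list, &s->valid_paths);

The service-time selector gets the identical treatment in the dm-service-time.c hunk further down, with st_compare_load() as the comparison instead of the raw queue length.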
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c index e5ef0757fe23..7ef469e902c6 100644 --- a/drivers/md/dm-raid.c +++ b/drivers/md/dm-raid.c | |||
@@ -29,6 +29,9 @@ | |||
29 | */ | 29 | */ |
30 | #define MIN_RAID456_JOURNAL_SPACE (4*2048) | 30 | #define MIN_RAID456_JOURNAL_SPACE (4*2048) |
31 | 31 | ||
32 | /* Global list of all raid sets */ | ||
33 | static LIST_HEAD(raid_sets); | ||
34 | |||
32 | static bool devices_handle_discard_safely = false; | 35 | static bool devices_handle_discard_safely = false; |
33 | 36 | ||
34 | /* | 37 | /* |
@@ -105,8 +108,6 @@ struct raid_dev { | |||
105 | #define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV) | 108 | #define CTR_FLAG_JOURNAL_DEV (1 << __CTR_FLAG_JOURNAL_DEV) |
106 | #define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE) | 109 | #define CTR_FLAG_JOURNAL_MODE (1 << __CTR_FLAG_JOURNAL_MODE) |
107 | 110 | ||
108 | #define RESUME_STAY_FROZEN_FLAGS (CTR_FLAG_DELTA_DISKS | CTR_FLAG_DATA_OFFSET) | ||
109 | |||
110 | /* | 111 | /* |
111 | * Definitions of various constructor flags to | 112 | * Definitions of various constructor flags to |
112 | * be used in checks of valid / invalid flags | 113 | * be used in checks of valid / invalid flags |
@@ -209,6 +210,8 @@ struct raid_dev { | |||
209 | #define RT_FLAG_UPDATE_SBS 3 | 210 | #define RT_FLAG_UPDATE_SBS 3 |
210 | #define RT_FLAG_RESHAPE_RS 4 | 211 | #define RT_FLAG_RESHAPE_RS 4 |
211 | #define RT_FLAG_RS_SUSPENDED 5 | 212 | #define RT_FLAG_RS_SUSPENDED 5 |
213 | #define RT_FLAG_RS_IN_SYNC 6 | ||
214 | #define RT_FLAG_RS_RESYNCING 7 | ||
212 | 215 | ||
213 | /* Array elements of 64 bit needed for rebuild/failed disk bits */ | 216 | /* Array elements of 64 bit needed for rebuild/failed disk bits */ |
214 | #define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8) | 217 | #define DISKS_ARRAY_ELEMS ((MAX_RAID_DEVICES + (sizeof(uint64_t) * 8 - 1)) / sizeof(uint64_t) / 8) |
@@ -224,8 +227,8 @@ struct rs_layout { | |||
224 | 227 | ||
225 | struct raid_set { | 228 | struct raid_set { |
226 | struct dm_target *ti; | 229 | struct dm_target *ti; |
230 | struct list_head list; | ||
227 | 231 | ||
228 | uint32_t bitmap_loaded; | ||
229 | uint32_t stripe_cache_entries; | 232 | uint32_t stripe_cache_entries; |
230 | unsigned long ctr_flags; | 233 | unsigned long ctr_flags; |
231 | unsigned long runtime_flags; | 234 | unsigned long runtime_flags; |
@@ -270,6 +273,19 @@ static void rs_config_restore(struct raid_set *rs, struct rs_layout *l) | |||
270 | mddev->new_chunk_sectors = l->new_chunk_sectors; | 273 | mddev->new_chunk_sectors = l->new_chunk_sectors; |
271 | } | 274 | } |
272 | 275 | ||
276 | /* Find any raid_set in active slot for @rs on global list */ | ||
277 | static struct raid_set *rs_find_active(struct raid_set *rs) | ||
278 | { | ||
279 | struct raid_set *r; | ||
280 | struct mapped_device *md = dm_table_get_md(rs->ti->table); | ||
281 | |||
282 | list_for_each_entry(r, &raid_sets, list) | ||
283 | if (r != rs && dm_table_get_md(r->ti->table) == md) | ||
284 | return r; | ||
285 | |||
286 | return NULL; | ||
287 | } | ||
288 | |||
273 | /* raid10 algorithms (i.e. formats) */ | 289 | /* raid10 algorithms (i.e. formats) */ |
274 | #define ALGORITHM_RAID10_DEFAULT 0 | 290 | #define ALGORITHM_RAID10_DEFAULT 0 |
275 | #define ALGORITHM_RAID10_NEAR 1 | 291 | #define ALGORITHM_RAID10_NEAR 1 |
@@ -572,7 +588,7 @@ static const char *raid10_md_layout_to_format(int layout) | |||
572 | } | 588 | } |
573 | 589 | ||
574 | /* Return md raid10 algorithm for @name */ | 590 | /* Return md raid10 algorithm for @name */ |
575 | static int raid10_name_to_format(const char *name) | 591 | static const int raid10_name_to_format(const char *name) |
576 | { | 592 | { |
577 | if (!strcasecmp(name, "near")) | 593 | if (!strcasecmp(name, "near")) |
578 | return ALGORITHM_RAID10_NEAR; | 594 | return ALGORITHM_RAID10_NEAR; |
@@ -675,15 +691,11 @@ static struct raid_type *get_raid_type_by_ll(const int level, const int layout) | |||
675 | return NULL; | 691 | return NULL; |
676 | } | 692 | } |
677 | 693 | ||
678 | /* | 694 | /* Adjust rdev sectors */ |
679 | * Conditionally change bdev capacity of @rs | 695 | static void rs_set_rdev_sectors(struct raid_set *rs) |
680 | * in case of a disk add/remove reshape | ||
681 | */ | ||
682 | static void rs_set_capacity(struct raid_set *rs) | ||
683 | { | 696 | { |
684 | struct mddev *mddev = &rs->md; | 697 | struct mddev *mddev = &rs->md; |
685 | struct md_rdev *rdev; | 698 | struct md_rdev *rdev; |
686 | struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table)); | ||
687 | 699 | ||
688 | /* | 700 | /* |
689 | * raid10 sets rdev->sector to the device size, which | 701 | * raid10 sets rdev->sector to the device size, which |
@@ -692,8 +704,16 @@ static void rs_set_capacity(struct raid_set *rs) | |||
692 | rdev_for_each(rdev, mddev) | 704 | rdev_for_each(rdev, mddev) |
693 | if (!test_bit(Journal, &rdev->flags)) | 705 | if (!test_bit(Journal, &rdev->flags)) |
694 | rdev->sectors = mddev->dev_sectors; | 706 | rdev->sectors = mddev->dev_sectors; |
707 | } | ||
695 | 708 | ||
696 | set_capacity(gendisk, mddev->array_sectors); | 709 | /* |
710 | * Change bdev capacity of @rs in case of a disk add/remove reshape | ||
711 | */ | ||
712 | static void rs_set_capacity(struct raid_set *rs) | ||
713 | { | ||
714 | struct gendisk *gendisk = dm_disk(dm_table_get_md(rs->ti->table)); | ||
715 | |||
716 | set_capacity(gendisk, rs->md.array_sectors); | ||
697 | revalidate_disk(gendisk); | 717 | revalidate_disk(gendisk); |
698 | } | 718 | } |
699 | 719 | ||
@@ -744,6 +764,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r | |||
744 | 764 | ||
745 | mddev_init(&rs->md); | 765 | mddev_init(&rs->md); |
746 | 766 | ||
767 | INIT_LIST_HEAD(&rs->list); | ||
747 | rs->raid_disks = raid_devs; | 768 | rs->raid_disks = raid_devs; |
748 | rs->delta_disks = 0; | 769 | rs->delta_disks = 0; |
749 | 770 | ||
@@ -761,6 +782,9 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r | |||
761 | for (i = 0; i < raid_devs; i++) | 782 | for (i = 0; i < raid_devs; i++) |
762 | md_rdev_init(&rs->dev[i].rdev); | 783 | md_rdev_init(&rs->dev[i].rdev); |
763 | 784 | ||
785 | /* Add @rs to global list. */ | ||
786 | list_add(&rs->list, &raid_sets); | ||
787 | |||
764 | /* | 788 | /* |
765 | * Remaining items to be initialized by further RAID params: | 789 | * Remaining items to be initialized by further RAID params: |
766 | * rs->md.persistent | 790 | * rs->md.persistent |
@@ -773,6 +797,7 @@ static struct raid_set *raid_set_alloc(struct dm_target *ti, struct raid_type *r | |||
773 | return rs; | 797 | return rs; |
774 | } | 798 | } |
775 | 799 | ||
800 | /* Free all @rs allocations and remove it from global list. */ | ||
776 | static void raid_set_free(struct raid_set *rs) | 801 | static void raid_set_free(struct raid_set *rs) |
777 | { | 802 | { |
778 | int i; | 803 | int i; |
@@ -790,6 +815,8 @@ static void raid_set_free(struct raid_set *rs) | |||
790 | dm_put_device(rs->ti, rs->dev[i].data_dev); | 815 | dm_put_device(rs->ti, rs->dev[i].data_dev); |
791 | } | 816 | } |
792 | 817 | ||
818 | list_del(&rs->list); | ||
819 | |||
793 | kfree(rs); | 820 | kfree(rs); |
794 | } | 821 | } |
795 | 822 | ||
@@ -1002,7 +1029,7 @@ static int validate_raid_redundancy(struct raid_set *rs) | |||
1002 | !rs->dev[i].rdev.sb_page) | 1029 | !rs->dev[i].rdev.sb_page) |
1003 | rebuild_cnt++; | 1030 | rebuild_cnt++; |
1004 | 1031 | ||
1005 | switch (rs->raid_type->level) { | 1032 | switch (rs->md.level) { |
1006 | case 0: | 1033 | case 0: |
1007 | break; | 1034 | break; |
1008 | case 1: | 1035 | case 1: |
@@ -1017,6 +1044,11 @@ static int validate_raid_redundancy(struct raid_set *rs) | |||
1017 | break; | 1044 | break; |
1018 | case 10: | 1045 | case 10: |
1019 | copies = raid10_md_layout_to_copies(rs->md.new_layout); | 1046 | copies = raid10_md_layout_to_copies(rs->md.new_layout); |
1047 | if (copies < 2) { | ||
1048 | DMERR("Bogus raid10 data copies < 2!"); | ||
1049 | return -EINVAL; | ||
1050 | } | ||
1051 | |||
1020 | if (rebuild_cnt < copies) | 1052 | if (rebuild_cnt < copies) |
1021 | break; | 1053 | break; |
1022 | 1054 | ||
@@ -1576,6 +1608,24 @@ static sector_t __rdev_sectors(struct raid_set *rs) | |||
1576 | return 0; | 1608 | return 0; |
1577 | } | 1609 | } |
1578 | 1610 | ||
1611 | /* Check that calculated dev_sectors fits all component devices. */ | ||
1612 | static int _check_data_dev_sectors(struct raid_set *rs) | ||
1613 | { | ||
1614 | sector_t ds = ~0; | ||
1615 | struct md_rdev *rdev; | ||
1616 | |||
1617 | rdev_for_each(rdev, &rs->md) | ||
1618 | if (!test_bit(Journal, &rdev->flags) && rdev->bdev) { | ||
1619 | ds = min(ds, to_sector(i_size_read(rdev->bdev->bd_inode))); | ||
1620 | if (ds < rs->md.dev_sectors) { | ||
1621 | rs->ti->error = "Component device(s) too small"; | ||
1622 | return -EINVAL; | ||
1623 | } | ||
1624 | } | ||
1625 | |||
1626 | return 0; | ||
1627 | } | ||
1628 | |||
1579 | /* Calculate the sectors per device and per array used for @rs */ | 1629 | /* Calculate the sectors per device and per array used for @rs */ |
1580 | static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) | 1630 | static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) |
1581 | { | 1631 | { |
@@ -1625,7 +1675,7 @@ static int rs_set_dev_and_array_sectors(struct raid_set *rs, bool use_mddev) | |||
1625 | mddev->array_sectors = array_sectors; | 1675 | mddev->array_sectors = array_sectors; |
1626 | mddev->dev_sectors = dev_sectors; | 1676 | mddev->dev_sectors = dev_sectors; |
1627 | 1677 | ||
1628 | return 0; | 1678 | return _check_data_dev_sectors(rs); |
1629 | bad: | 1679 | bad: |
1630 | rs->ti->error = "Target length not divisible by number of data devices"; | 1680 | rs->ti->error = "Target length not divisible by number of data devices"; |
1631 | return -EINVAL; | 1681 | return -EINVAL; |
@@ -1674,8 +1724,11 @@ static void do_table_event(struct work_struct *ws) | |||
1674 | struct raid_set *rs = container_of(ws, struct raid_set, md.event_work); | 1724 | struct raid_set *rs = container_of(ws, struct raid_set, md.event_work); |
1675 | 1725 | ||
1676 | smp_rmb(); /* Make sure we access most actual mddev properties */ | 1726 | smp_rmb(); /* Make sure we access most actual mddev properties */ |
1677 | if (!rs_is_reshaping(rs)) | 1727 | if (!rs_is_reshaping(rs)) { |
1728 | if (rs_is_raid10(rs)) | ||
1729 | rs_set_rdev_sectors(rs); | ||
1678 | rs_set_capacity(rs); | 1730 | rs_set_capacity(rs); |
1731 | } | ||
1679 | dm_table_event(rs->ti->table); | 1732 | dm_table_event(rs->ti->table); |
1680 | } | 1733 | } |
1681 | 1734 | ||
@@ -1860,7 +1913,7 @@ static bool rs_reshape_requested(struct raid_set *rs) | |||
1860 | if (rs_takeover_requested(rs)) | 1913 | if (rs_takeover_requested(rs)) |
1861 | return false; | 1914 | return false; |
1862 | 1915 | ||
1863 | if (!mddev->level) | 1916 | if (rs_is_raid0(rs)) |
1864 | return false; | 1917 | return false; |
1865 | 1918 | ||
1866 | change = mddev->new_layout != mddev->layout || | 1919 | change = mddev->new_layout != mddev->layout || |
@@ -1868,7 +1921,7 @@ static bool rs_reshape_requested(struct raid_set *rs) | |||
1868 | rs->delta_disks; | 1921 | rs->delta_disks; |
1869 | 1922 | ||
1870 | /* Historical case to support raid1 reshape without delta disks */ | 1923 | /* Historical case to support raid1 reshape without delta disks */ |
1871 | if (mddev->level == 1) { | 1924 | if (rs_is_raid1(rs)) { |
1872 | if (rs->delta_disks) | 1925 | if (rs->delta_disks) |
1873 | return !!rs->delta_disks; | 1926 | return !!rs->delta_disks; |
1874 | 1927 | ||
@@ -1876,7 +1929,7 @@ static bool rs_reshape_requested(struct raid_set *rs) | |||
1876 | mddev->raid_disks != rs->raid_disks; | 1929 | mddev->raid_disks != rs->raid_disks; |
1877 | } | 1930 | } |
1878 | 1931 | ||
1879 | if (mddev->level == 10) | 1932 | if (rs_is_raid10(rs)) |
1880 | return change && | 1933 | return change && |
1881 | !__is_raid10_far(mddev->new_layout) && | 1934 | !__is_raid10_far(mddev->new_layout) && |
1882 | rs->delta_disks >= 0; | 1935 | rs->delta_disks >= 0; |
@@ -2340,7 +2393,7 @@ static int super_init_validation(struct raid_set *rs, struct md_rdev *rdev) | |||
2340 | DMERR("new device%s provided without 'rebuild'", | 2393 | DMERR("new device%s provided without 'rebuild'", |
2341 | new_devs > 1 ? "s" : ""); | 2394 | new_devs > 1 ? "s" : ""); |
2342 | return -EINVAL; | 2395 | return -EINVAL; |
2343 | } else if (rs_is_recovering(rs)) { | 2396 | } else if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags) && rs_is_recovering(rs)) { |
2344 | DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)", | 2397 | DMERR("'rebuild' specified while raid set is not in-sync (recovery_cp=%llu)", |
2345 | (unsigned long long) mddev->recovery_cp); | 2398 | (unsigned long long) mddev->recovery_cp); |
2346 | return -EINVAL; | 2399 | return -EINVAL; |
@@ -2640,12 +2693,19 @@ static int rs_adjust_data_offsets(struct raid_set *rs) | |||
2640 | * Make sure we got a minimum amount of free sectors per device | 2693 | * Make sure we got a minimum amount of free sectors per device |
2641 | */ | 2694 | */ |
2642 | if (rs->data_offset && | 2695 | if (rs->data_offset && |
2643 | to_sector(i_size_read(rdev->bdev->bd_inode)) - rdev->sectors < MIN_FREE_RESHAPE_SPACE) { | 2696 | to_sector(i_size_read(rdev->bdev->bd_inode)) - rs->md.dev_sectors < MIN_FREE_RESHAPE_SPACE) { |
2644 | rs->ti->error = data_offset ? "No space for forward reshape" : | 2697 | rs->ti->error = data_offset ? "No space for forward reshape" : |
2645 | "No space for backward reshape"; | 2698 | "No space for backward reshape"; |
2646 | return -ENOSPC; | 2699 | return -ENOSPC; |
2647 | } | 2700 | } |
2648 | out: | 2701 | out: |
2702 | /* | ||
2703 | * Raise recovery_cp in case data_offset != 0 to | ||
2704 | * avoid false recovery positives in the constructor. | ||
2705 | */ | ||
2706 | if (rs->md.recovery_cp < rs->md.dev_sectors) | ||
2707 | rs->md.recovery_cp += rs->dev[0].rdev.data_offset; | ||
2708 | |||
2649 | /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */ | 2709 | /* Adjust data offsets on all rdevs but on any raid4/5/6 journal device */ |
2650 | rdev_for_each(rdev, &rs->md) { | 2710 | rdev_for_each(rdev, &rs->md) { |
2651 | if (!test_bit(Journal, &rdev->flags)) { | 2711 | if (!test_bit(Journal, &rdev->flags)) { |
@@ -2682,14 +2742,14 @@ static int rs_setup_takeover(struct raid_set *rs) | |||
2682 | sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset; | 2742 | sector_t new_data_offset = rs->dev[0].rdev.data_offset ? 0 : rs->data_offset; |
2683 | 2743 | ||
2684 | if (rt_is_raid10(rs->raid_type)) { | 2744 | if (rt_is_raid10(rs->raid_type)) { |
2685 | if (mddev->level == 0) { | 2745 | if (rs_is_raid0(rs)) { |
2686 | /* Userspace reordered disks -> adjust raid_disk indexes */ | 2746 | /* Userspace reordered disks -> adjust raid_disk indexes */ |
2687 | __reorder_raid_disk_indexes(rs); | 2747 | __reorder_raid_disk_indexes(rs); |
2688 | 2748 | ||
2689 | /* raid0 -> raid10_far layout */ | 2749 | /* raid0 -> raid10_far layout */ |
2690 | mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR, | 2750 | mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_FAR, |
2691 | rs->raid10_copies); | 2751 | rs->raid10_copies); |
2692 | } else if (mddev->level == 1) | 2752 | } else if (rs_is_raid1(rs)) |
2693 | /* raid1 -> raid10_near layout */ | 2753 | /* raid1 -> raid10_near layout */ |
2694 | mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR, | 2754 | mddev->layout = raid10_format_to_md_layout(rs, ALGORITHM_RAID10_NEAR, |
2695 | rs->raid_disks); | 2755 | rs->raid_disks); |
@@ -2777,6 +2837,23 @@ static int rs_prepare_reshape(struct raid_set *rs) | |||
2777 | return 0; | 2837 | return 0; |
2778 | } | 2838 | } |
2779 | 2839 | ||
2840 | /* Get reshape sectors from data_offsets or raid set */ | ||
2841 | static sector_t _get_reshape_sectors(struct raid_set *rs) | ||
2842 | { | ||
2843 | struct md_rdev *rdev; | ||
2844 | sector_t reshape_sectors = 0; | ||
2845 | |||
2846 | rdev_for_each(rdev, &rs->md) | ||
2847 | if (!test_bit(Journal, &rdev->flags)) { | ||
2848 | reshape_sectors = (rdev->data_offset > rdev->new_data_offset) ? | ||
2849 | rdev->data_offset - rdev->new_data_offset : | ||
2850 | rdev->new_data_offset - rdev->data_offset; | ||
2851 | break; | ||
2852 | } | ||
2853 | |||
2854 | return max(reshape_sectors, (sector_t) rs->data_offset); | ||
2855 | } | ||
2856 | |||
2780 | /* | 2857 | /* |
2781 | * | 2858 | * |
2782 | * - change raid layout | 2859 | * - change raid layout |
@@ -2788,6 +2865,7 @@ static int rs_setup_reshape(struct raid_set *rs) | |||
2788 | { | 2865 | { |
2789 | int r = 0; | 2866 | int r = 0; |
2790 | unsigned int cur_raid_devs, d; | 2867 | unsigned int cur_raid_devs, d; |
2868 | sector_t reshape_sectors = _get_reshape_sectors(rs); | ||
2791 | struct mddev *mddev = &rs->md; | 2869 | struct mddev *mddev = &rs->md; |
2792 | struct md_rdev *rdev; | 2870 | struct md_rdev *rdev; |
2793 | 2871 | ||
@@ -2804,13 +2882,13 @@ static int rs_setup_reshape(struct raid_set *rs) | |||
2804 | /* | 2882 | /* |
2805 | * Adjust array size: | 2883 | * Adjust array size: |
2806 | * | 2884 | * |
2807 | * - in case of adding disks, array size has | 2885 | * - in case of adding disk(s), array size has |
2808 | * to grow after the disk adding reshape, | 2886 | * to grow after the disk adding reshape, |
2809 | * which'll happen in the event handler; | 2887 | * which'll happen in the event handler; |
2810 | * reshape will happen forward, so space has to | 2888 | * reshape will happen forward, so space has to |
2811 | * be available at the beginning of each disk | 2889 | * be available at the beginning of each disk |
2812 | * | 2890 | * |
2813 | * - in case of removing disks, array size | 2891 | * - in case of removing disk(s), array size |
2814 | * has to shrink before starting the reshape, | 2892 | * has to shrink before starting the reshape, |
2815 | * which'll happen here; | 2893 | * which'll happen here; |
2816 | * reshape will happen backward, so space has to | 2894 | * reshape will happen backward, so space has to |
@@ -2841,7 +2919,7 @@ static int rs_setup_reshape(struct raid_set *rs) | |||
2841 | rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector; | 2919 | rdev->recovery_offset = rs_is_raid1(rs) ? 0 : MaxSector; |
2842 | } | 2920 | } |
2843 | 2921 | ||
2844 | mddev->reshape_backwards = 0; /* adding disks -> forward reshape */ | 2922 | mddev->reshape_backwards = 0; /* adding disk(s) -> forward reshape */ |
2845 | 2923 | ||
2846 | /* Remove disk(s) */ | 2924 | /* Remove disk(s) */ |
2847 | } else if (rs->delta_disks < 0) { | 2925 | } else if (rs->delta_disks < 0) { |
@@ -2874,6 +2952,15 @@ static int rs_setup_reshape(struct raid_set *rs) | |||
2874 | mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1; | 2952 | mddev->reshape_backwards = rs->dev[0].rdev.data_offset ? 0 : 1; |
2875 | } | 2953 | } |
2876 | 2954 | ||
2955 | /* | ||
2956 | * Adjust device size for forward reshape | ||
2957 | * because md_finish_reshape() reduces it. | ||
2958 | */ | ||
2959 | if (!mddev->reshape_backwards) | ||
2960 | rdev_for_each(rdev, &rs->md) | ||
2961 | if (!test_bit(Journal, &rdev->flags)) | ||
2962 | rdev->sectors += reshape_sectors; | ||
2963 | |||
2877 | return r; | 2964 | return r; |
2878 | } | 2965 | } |
2879 | 2966 | ||
@@ -2890,7 +2977,7 @@ static void configure_discard_support(struct raid_set *rs) | |||
2890 | /* | 2977 | /* |
2891 | * XXX: RAID level 4,5,6 require zeroing for safety. | 2978 | * XXX: RAID level 4,5,6 require zeroing for safety. |
2892 | */ | 2979 | */ |
2893 | raid456 = (rs->md.level == 4 || rs->md.level == 5 || rs->md.level == 6); | 2980 | raid456 = rs_is_raid456(rs); |
2894 | 2981 | ||
2895 | for (i = 0; i < rs->raid_disks; i++) { | 2982 | for (i = 0; i < rs->raid_disks; i++) { |
2896 | struct request_queue *q; | 2983 | struct request_queue *q; |
@@ -2915,7 +3002,7 @@ static void configure_discard_support(struct raid_set *rs) | |||
2915 | * RAID1 and RAID10 personalities require bio splitting, | 3002 | * RAID1 and RAID10 personalities require bio splitting, |
2916 | * RAID0/4/5/6 don't and process large discard bios properly. | 3003 | * RAID0/4/5/6 don't and process large discard bios properly. |
2917 | */ | 3004 | */ |
2918 | ti->split_discard_bios = !!(rs->md.level == 1 || rs->md.level == 10); | 3005 | ti->split_discard_bios = !!(rs_is_raid1(rs) || rs_is_raid10(rs)); |
2919 | ti->num_discard_bios = 1; | 3006 | ti->num_discard_bios = 1; |
2920 | } | 3007 | } |
2921 | 3008 | ||
@@ -2935,10 +3022,10 @@ static void configure_discard_support(struct raid_set *rs) | |||
2935 | static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) | 3022 | static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) |
2936 | { | 3023 | { |
2937 | int r; | 3024 | int r; |
2938 | bool resize; | 3025 | bool resize = false; |
2939 | struct raid_type *rt; | 3026 | struct raid_type *rt; |
2940 | unsigned int num_raid_params, num_raid_devs; | 3027 | unsigned int num_raid_params, num_raid_devs; |
2941 | sector_t calculated_dev_sectors, rdev_sectors; | 3028 | sector_t calculated_dev_sectors, rdev_sectors, reshape_sectors; |
2942 | struct raid_set *rs = NULL; | 3029 | struct raid_set *rs = NULL; |
2943 | const char *arg; | 3030 | const char *arg; |
2944 | struct rs_layout rs_layout; | 3031 | struct rs_layout rs_layout; |
@@ -3021,7 +3108,10 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) | |||
3021 | goto bad; | 3108 | goto bad; |
3022 | } | 3109 | } |
3023 | 3110 | ||
3024 | resize = calculated_dev_sectors != rdev_sectors; | 3111 | |
3112 | reshape_sectors = _get_reshape_sectors(rs); | ||
3113 | if (calculated_dev_sectors != rdev_sectors) | ||
3114 | resize = calculated_dev_sectors != (reshape_sectors ? rdev_sectors - reshape_sectors : rdev_sectors); | ||
3025 | 3115 | ||
3026 | INIT_WORK(&rs->md.event_work, do_table_event); | 3116 | INIT_WORK(&rs->md.event_work, do_table_event); |
3027 | ti->private = rs; | 3117 | ti->private = rs; |
@@ -3105,19 +3195,22 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) | |||
3105 | goto bad; | 3195 | goto bad; |
3106 | } | 3196 | } |
3107 | 3197 | ||
3108 | /* | 3198 | /* Out-of-place space has to be available to allow for a reshape unless raid1! */ |
3109 | * We can only prepare for a reshape here, because the | 3199 | if (reshape_sectors || rs_is_raid1(rs)) { |
3110 | * raid set needs to run to provide the respective reshape | 3200 | /* |
3111 | * check functions via its MD personality instance. | 3201 | * We can only prepare for a reshape here, because the |
3112 | * | 3202 | * raid set needs to run to provide the respective reshape |
3113 | * So do the reshape check after md_run() succeeded. | 3203 | * check functions via its MD personality instance. |
3114 | */ | 3204 | * |
3115 | r = rs_prepare_reshape(rs); | 3205 | * So do the reshape check after md_run() succeeded. |
3116 | if (r) | 3206 | */ |
3117 | return r; | 3207 | r = rs_prepare_reshape(rs); |
3208 | if (r) | ||
3209 | return r; | ||
3118 | 3210 | ||
3119 | /* Reshaping ain't recovery, so disable recovery */ | 3211 | /* Reshaping ain't recovery, so disable recovery */ |
3120 | rs_setup_recovery(rs, MaxSector); | 3212 | rs_setup_recovery(rs, MaxSector); |
3213 | } | ||
3121 | rs_set_cur(rs); | 3214 | rs_set_cur(rs); |
3122 | } else { | 3215 | } else { |
3123 | /* May not set recovery when a device rebuild is requested */ | 3216 | /* May not set recovery when a device rebuild is requested */ |
@@ -3144,7 +3237,6 @@ static int raid_ctr(struct dm_target *ti, unsigned int argc, char **argv) | |||
3144 | mddev_lock_nointr(&rs->md); | 3237 | mddev_lock_nointr(&rs->md); |
3145 | r = md_run(&rs->md); | 3238 | r = md_run(&rs->md); |
3146 | rs->md.in_sync = 0; /* Assume already marked dirty */ | 3239 | rs->md.in_sync = 0; /* Assume already marked dirty */ |
3147 | |||
3148 | if (r) { | 3240 | if (r) { |
3149 | ti->error = "Failed to run raid array"; | 3241 | ti->error = "Failed to run raid array"; |
3150 | mddev_unlock(&rs->md); | 3242 | mddev_unlock(&rs->md); |
@@ -3248,25 +3340,27 @@ static int raid_map(struct dm_target *ti, struct bio *bio) | |||
3248 | } | 3340 | } |
3249 | 3341 | ||
3250 | /* Return string describing the current sync action of @mddev */ | 3342 | /* Return string describing the current sync action of @mddev */ |
3251 | static const char *decipher_sync_action(struct mddev *mddev) | 3343 | static const char *decipher_sync_action(struct mddev *mddev, unsigned long recovery) |
3252 | { | 3344 | { |
3253 | if (test_bit(MD_RECOVERY_FROZEN, &mddev->recovery)) | 3345 | if (test_bit(MD_RECOVERY_FROZEN, &recovery)) |
3254 | return "frozen"; | 3346 | return "frozen"; |
3255 | 3347 | ||
3256 | if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery) || | 3348 | /* The MD sync thread can be done with io but still be running */ |
3257 | (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &mddev->recovery))) { | 3349 | if (!test_bit(MD_RECOVERY_DONE, &recovery) && |
3258 | if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery)) | 3350 | (test_bit(MD_RECOVERY_RUNNING, &recovery) || |
3351 | (!mddev->ro && test_bit(MD_RECOVERY_NEEDED, &recovery)))) { | ||
3352 | if (test_bit(MD_RECOVERY_RESHAPE, &recovery)) | ||
3259 | return "reshape"; | 3353 | return "reshape"; |
3260 | 3354 | ||
3261 | if (test_bit(MD_RECOVERY_SYNC, &mddev->recovery)) { | 3355 | if (test_bit(MD_RECOVERY_SYNC, &recovery)) { |
3262 | if (!test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) | 3356 | if (!test_bit(MD_RECOVERY_REQUESTED, &recovery)) |
3263 | return "resync"; | 3357 | return "resync"; |
3264 | else if (test_bit(MD_RECOVERY_CHECK, &mddev->recovery)) | 3358 | else if (test_bit(MD_RECOVERY_CHECK, &recovery)) |
3265 | return "check"; | 3359 | return "check"; |
3266 | return "repair"; | 3360 | return "repair"; |
3267 | } | 3361 | } |
3268 | 3362 | ||
3269 | if (test_bit(MD_RECOVERY_RECOVER, &mddev->recovery)) | 3363 | if (test_bit(MD_RECOVERY_RECOVER, &recovery)) |
3270 | return "recover"; | 3364 | return "recover"; |
3271 | } | 3365 | } |
3272 | 3366 | ||
@@ -3283,7 +3377,7 @@ static const char *decipher_sync_action(struct mddev *mddev) | |||
3283 | * 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device | 3377 | * 'A' = Alive and in-sync raid set component _or_ alive raid4/5/6 'write_through' journal device |
3284 | * '-' = Non-existing device (i.e. uspace passed '- -' into the ctr) | 3378 | * '-' = Non-existing device (i.e. uspace passed '- -' into the ctr) |
3285 | */ | 3379 | */ |
3286 | static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, bool array_in_sync) | 3380 | static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev) |
3287 | { | 3381 | { |
3288 | if (!rdev->bdev) | 3382 | if (!rdev->bdev) |
3289 | return "-"; | 3383 | return "-"; |
@@ -3291,85 +3385,108 @@ static const char *__raid_dev_status(struct raid_set *rs, struct md_rdev *rdev, | |||
3291 | return "D"; | 3385 | return "D"; |
3292 | else if (test_bit(Journal, &rdev->flags)) | 3386 | else if (test_bit(Journal, &rdev->flags)) |
3293 | return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a"; | 3387 | return (rs->journal_dev.mode == R5C_JOURNAL_MODE_WRITE_THROUGH) ? "A" : "a"; |
3294 | else if (!array_in_sync || !test_bit(In_sync, &rdev->flags)) | 3388 | else if (test_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags) || |
3389 | (!test_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags) && | ||
3390 | !test_bit(In_sync, &rdev->flags))) | ||
3295 | return "a"; | 3391 | return "a"; |
3296 | else | 3392 | else |
3297 | return "A"; | 3393 | return "A"; |
3298 | } | 3394 | } |
3299 | 3395 | ||
3300 | /* Helper to return resync/reshape progress for @rs and @array_in_sync */ | 3396 | /* Helper to return resync/reshape progress for @rs and runtime flags for raid set in sync / resynching */ |
3301 | static sector_t rs_get_progress(struct raid_set *rs, | 3397 | static sector_t rs_get_progress(struct raid_set *rs, unsigned long recovery, |
3302 | sector_t resync_max_sectors, bool *array_in_sync) | 3398 | sector_t resync_max_sectors) |
3303 | { | 3399 | { |
3304 | sector_t r, curr_resync_completed; | 3400 | sector_t r; |
3305 | struct mddev *mddev = &rs->md; | 3401 | struct mddev *mddev = &rs->md; |
3306 | 3402 | ||
3307 | curr_resync_completed = mddev->curr_resync_completed ?: mddev->recovery_cp; | 3403 | clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); |
3308 | *array_in_sync = false; | 3404 | clear_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); |
3309 | 3405 | ||
3310 | if (rs_is_raid0(rs)) { | 3406 | if (rs_is_raid0(rs)) { |
3311 | r = resync_max_sectors; | 3407 | r = resync_max_sectors; |
3312 | *array_in_sync = true; | 3408 | set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); |
3313 | 3409 | ||
3314 | } else { | 3410 | } else { |
3315 | r = mddev->reshape_position; | 3411 | if (test_bit(MD_RECOVERY_NEEDED, &recovery) || |
3316 | 3412 | test_bit(MD_RECOVERY_RESHAPE, &recovery) || | |
3317 | /* Reshape is relative to the array size */ | 3413 | test_bit(MD_RECOVERY_RUNNING, &recovery)) |
3318 | if (test_bit(MD_RECOVERY_RESHAPE, &mddev->recovery) || | 3414 | r = mddev->curr_resync_completed; |
3319 | r != MaxSector) { | ||
3320 | if (r == MaxSector) { | ||
3321 | *array_in_sync = true; | ||
3322 | r = resync_max_sectors; | ||
3323 | } else { | ||
3324 | /* Got to reverse on backward reshape */ | ||
3325 | if (mddev->reshape_backwards) | ||
3326 | r = mddev->array_sectors - r; | ||
3327 | |||
3328 | /* Divide by # of data stripes */ | ||
3329 | sector_div(r, mddev_data_stripes(rs)); | ||
3330 | } | ||
3331 | |||
3332 | /* Sync is relative to the component device size */ | ||
3333 | } else if (test_bit(MD_RECOVERY_RUNNING, &mddev->recovery)) | ||
3334 | r = curr_resync_completed; | ||
3335 | else | 3415 | else |
3336 | r = mddev->recovery_cp; | 3416 | r = mddev->recovery_cp; |
3337 | 3417 | ||
3338 | if ((r == MaxSector) || | 3418 | if (r >= resync_max_sectors && |
3339 | (test_bit(MD_RECOVERY_DONE, &mddev->recovery) && | 3419 | (!test_bit(MD_RECOVERY_REQUESTED, &recovery) || |
3340 | (mddev->curr_resync_completed == resync_max_sectors))) { | 3420 | (!test_bit(MD_RECOVERY_FROZEN, &recovery) && |
3421 | !test_bit(MD_RECOVERY_NEEDED, &recovery) && | ||
3422 | !test_bit(MD_RECOVERY_RUNNING, &recovery)))) { | ||
3341 | /* | 3423 | /* |
3342 | * Sync complete. | 3424 | * Sync complete. |
3343 | */ | 3425 | */ |
3344 | *array_in_sync = true; | 3426 | /* In case we have finished recovering, the array is in sync. */ |
3345 | r = resync_max_sectors; | 3427 | if (test_bit(MD_RECOVERY_RECOVER, &recovery)) |
3346 | } else if (test_bit(MD_RECOVERY_REQUESTED, &mddev->recovery)) { | 3428 | set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); |
3429 | |||
3430 | } else if (test_bit(MD_RECOVERY_RECOVER, &recovery)) { | ||
3431 | /* | ||
3432 | * In case we are recovering, the array is not in sync | ||
3433 | * and health chars should show the recovering legs. | ||
3434 | */ | ||
3435 | ; | ||
3436 | |||
3437 | } else if (test_bit(MD_RECOVERY_SYNC, &recovery) && | ||
3438 | !test_bit(MD_RECOVERY_REQUESTED, &recovery)) { | ||
3439 | /* | ||
3440 | * If "resync" is occurring, the raid set | ||
3441 | * is or may be out of sync hence the health | ||
3442 | * characters shall be 'a'. | ||
3443 | */ | ||
3444 | set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); | ||
3445 | |||
3446 | } else if (test_bit(MD_RECOVERY_RESHAPE, &recovery) && | ||
3447 | !test_bit(MD_RECOVERY_REQUESTED, &recovery)) { | ||
3448 | /* | ||
3449 | * If "reshape" is occurring, the raid set | ||
3450 | * is or may be out of sync hence the health | ||
3451 | * characters shall be 'a'. | ||
3452 | */ | ||
3453 | set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); | ||
3454 | |||
3455 | } else if (test_bit(MD_RECOVERY_REQUESTED, &recovery)) { | ||
3347 | /* | 3456 | /* |
3348 | * If "check" or "repair" is occurring, the raid set has | 3457 | * If "check" or "repair" is occurring, the raid set has |
3349 | * undergone an initial sync and the health characters | 3458 | * undergone an initial sync and the health characters |
3350 | * should not be 'a' anymore. | 3459 | * should not be 'a' anymore. |
3351 | */ | 3460 | */ |
3352 | *array_in_sync = true; | 3461 | set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); |
3462 | |||
3353 | } else { | 3463 | } else { |
3354 | struct md_rdev *rdev; | 3464 | struct md_rdev *rdev; |
3355 | 3465 | ||
3356 | /* | 3466 | /* |
3467 | * We are idle and recovery is needed, prevent 'A' chars race | ||
3468 | * caused by components still set to in-sync by constructor. | ||
3469 | */ | ||
3470 | if (test_bit(MD_RECOVERY_NEEDED, &recovery)) | ||
3471 | set_bit(RT_FLAG_RS_RESYNCING, &rs->runtime_flags); | ||
3472 | |||
3473 | /* | ||
3357 | * The raid set may be doing an initial sync, or it may | 3474 | * The raid set may be doing an initial sync, or it may |
3358 | * be rebuilding individual components. If all the | 3475 | * be rebuilding individual components. If all the |
3359 | * devices are In_sync, then it is the raid set that is | 3476 | * devices are In_sync, then it is the raid set that is |
3360 | * being initialized. | 3477 | * being initialized. |
3361 | */ | 3478 | */ |
3479 | set_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); | ||
3362 | rdev_for_each(rdev, mddev) | 3480 | rdev_for_each(rdev, mddev) |
3363 | if (!test_bit(Journal, &rdev->flags) && | 3481 | if (!test_bit(Journal, &rdev->flags) && |
3364 | !test_bit(In_sync, &rdev->flags)) | 3482 | !test_bit(In_sync, &rdev->flags)) { |
3365 | *array_in_sync = true; | 3483 | clear_bit(RT_FLAG_RS_IN_SYNC, &rs->runtime_flags); |
3366 | #if 0 | 3484 | break; |
3367 | r = 0; /* HM FIXME: TESTME: https://bugzilla.redhat.com/show_bug.cgi?id=1210637 ? */ | 3485 | } |
3368 | #endif | ||
3369 | } | 3486 | } |
3370 | } | 3487 | } |
3371 | 3488 | ||
3372 | return r; | 3489 | return min(r, resync_max_sectors); |
3373 | } | 3490 | } |
3374 | 3491 | ||
3375 | /* Helper to return @dev name or "-" if !@dev */ | 3492 | /* Helper to return @dev name or "-" if !@dev */ |
@@ -3385,7 +3502,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, | |||
3385 | struct mddev *mddev = &rs->md; | 3502 | struct mddev *mddev = &rs->md; |
3386 | struct r5conf *conf = mddev->private; | 3503 | struct r5conf *conf = mddev->private; |
3387 | int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0; | 3504 | int i, max_nr_stripes = conf ? conf->max_nr_stripes : 0; |
3388 | bool array_in_sync; | 3505 | unsigned long recovery; |
3389 | unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */ | 3506 | unsigned int raid_param_cnt = 1; /* at least 1 for chunksize */ |
3390 | unsigned int sz = 0; | 3507 | unsigned int sz = 0; |
3391 | unsigned int rebuild_disks; | 3508 | unsigned int rebuild_disks; |
@@ -3405,17 +3522,18 @@ static void raid_status(struct dm_target *ti, status_type_t type, | |||
3405 | 3522 | ||
3406 | /* Access most recent mddev properties for status output */ | 3523 | /* Access most recent mddev properties for status output */ |
3407 | smp_rmb(); | 3524 | smp_rmb(); |
3525 | recovery = rs->md.recovery; | ||
3408 | /* Get sensible max sectors even if raid set not yet started */ | 3526 | /* Get sensible max sectors even if raid set not yet started */ |
3409 | resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ? | 3527 | resync_max_sectors = test_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags) ? |
3410 | mddev->resync_max_sectors : mddev->dev_sectors; | 3528 | mddev->resync_max_sectors : mddev->dev_sectors; |
3411 | progress = rs_get_progress(rs, resync_max_sectors, &array_in_sync); | 3529 | progress = rs_get_progress(rs, recovery, resync_max_sectors); |
3412 | resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? | 3530 | resync_mismatches = (mddev->last_sync_action && !strcasecmp(mddev->last_sync_action, "check")) ? |
3413 | atomic64_read(&mddev->resync_mismatches) : 0; | 3531 | atomic64_read(&mddev->resync_mismatches) : 0; |
3414 | sync_action = decipher_sync_action(&rs->md); | 3532 | sync_action = decipher_sync_action(&rs->md, recovery); |
3415 | 3533 | ||
3416 | /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */ | 3534 | /* HM FIXME: do we want another state char for raid0? It shows 'D'/'A'/'-' now */ |
3417 | for (i = 0; i < rs->raid_disks; i++) | 3535 | for (i = 0; i < rs->raid_disks; i++) |
3418 | DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev, array_in_sync)); | 3536 | DMEMIT(__raid_dev_status(rs, &rs->dev[i].rdev)); |
3419 | 3537 | ||
3420 | /* | 3538 | /* |
3421 | * In-sync/Reshape ratio: | 3539 | * In-sync/Reshape ratio: |
@@ -3466,7 +3584,7 @@ static void raid_status(struct dm_target *ti, status_type_t type, | |||
3466 | * v1.10.0+: | 3584 | * v1.10.0+: |
3467 | */ | 3585 | */ |
3468 | DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? | 3586 | DMEMIT(" %s", test_bit(__CTR_FLAG_JOURNAL_DEV, &rs->ctr_flags) ? |
3469 | __raid_dev_status(rs, &rs->journal_dev.rdev, 0) : "-"); | 3587 | __raid_dev_status(rs, &rs->journal_dev.rdev) : "-"); |
3470 | break; | 3588 | break; |
3471 | 3589 | ||
3472 | case STATUSTYPE_TABLE: | 3590 | case STATUSTYPE_TABLE: |
@@ -3622,24 +3740,19 @@ static void raid_io_hints(struct dm_target *ti, struct queue_limits *limits) | |||
3622 | blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs)); | 3740 | blk_limits_io_opt(limits, chunk_size * mddev_data_stripes(rs)); |
3623 | } | 3741 | } |
3624 | 3742 | ||
3625 | static void raid_presuspend(struct dm_target *ti) | ||
3626 | { | ||
3627 | struct raid_set *rs = ti->private; | ||
3628 | |||
3629 | md_stop_writes(&rs->md); | ||
3630 | } | ||
3631 | |||
3632 | static void raid_postsuspend(struct dm_target *ti) | 3743 | static void raid_postsuspend(struct dm_target *ti) |
3633 | { | 3744 | { |
3634 | struct raid_set *rs = ti->private; | 3745 | struct raid_set *rs = ti->private; |
3635 | 3746 | ||
3636 | if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { | 3747 | if (!test_and_set_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { |
3748 | /* Writes have to be stopped before suspending to avoid deadlocks. */ | ||
3749 | if (!test_bit(MD_RECOVERY_FROZEN, &rs->md.recovery)) | ||
3750 | md_stop_writes(&rs->md); | ||
3751 | |||
3637 | mddev_lock_nointr(&rs->md); | 3752 | mddev_lock_nointr(&rs->md); |
3638 | mddev_suspend(&rs->md); | 3753 | mddev_suspend(&rs->md); |
3639 | mddev_unlock(&rs->md); | 3754 | mddev_unlock(&rs->md); |
3640 | } | 3755 | } |
3641 | |||
3642 | rs->md.ro = 1; | ||
3643 | } | 3756 | } |
3644 | 3757 | ||
3645 | static void attempt_restore_of_faulty_devices(struct raid_set *rs) | 3758 | static void attempt_restore_of_faulty_devices(struct raid_set *rs) |
@@ -3816,10 +3929,33 @@ static int raid_preresume(struct dm_target *ti) | |||
3816 | struct raid_set *rs = ti->private; | 3929 | struct raid_set *rs = ti->private; |
3817 | struct mddev *mddev = &rs->md; | 3930 | struct mddev *mddev = &rs->md; |
3818 | 3931 | ||
3819 | /* This is a resume after a suspend of the set -> it's already started */ | 3932 | /* This is a resume after a suspend of the set -> it's already started. */ |
3820 | if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags)) | 3933 | if (test_and_set_bit(RT_FLAG_RS_PRERESUMED, &rs->runtime_flags)) |
3821 | return 0; | 3934 | return 0; |
3822 | 3935 | ||
3936 | if (!test_bit(__CTR_FLAG_REBUILD, &rs->ctr_flags)) { | ||
3937 | struct raid_set *rs_active = rs_find_active(rs); | ||
3938 | |||
3939 | if (rs_active) { | ||
3940 | /* | ||
3941 | * In case no rebuilds have been requested | ||
3942 | * and an active table slot exists, copy | ||
3943 | * current resynchronization completed and | ||
3944 | * reshape position pointers across from | ||
3945 | * suspended raid set in the active slot. | ||
3946 | * | ||
3947 | * This resumes the new mapping at current | ||
3948 | * offsets to continue recover/reshape without | ||
3949 | * necessarily redoing a raid set partially or | ||
3950 | * causing data corruption in case of a reshape. | ||
3951 | */ | ||
3952 | if (rs_active->md.curr_resync_completed != MaxSector) | ||
3953 | mddev->curr_resync_completed = rs_active->md.curr_resync_completed; | ||
3954 | if (rs_active->md.reshape_position != MaxSector) | ||
3955 | mddev->reshape_position = rs_active->md.reshape_position; | ||
3956 | } | ||
3957 | } | ||
3958 | |||
3823 | /* | 3959 | /* |
3824 | * The superblocks need to be updated on disk if the | 3960 | * The superblocks need to be updated on disk if the |
3825 | * array is new or new devices got added (thus zeroed | 3961 | * array is new or new devices got added (thus zeroed |
@@ -3851,11 +3987,10 @@ static int raid_preresume(struct dm_target *ti) | |||
3851 | mddev->resync_min = mddev->recovery_cp; | 3987 | mddev->resync_min = mddev->recovery_cp; |
3852 | } | 3988 | } |
3853 | 3989 | ||
3854 | rs_set_capacity(rs); | ||
3855 | |||
3856 | /* Check for any reshape request unless new raid set */ | 3990 | /* Check for any reshape request unless new raid set */ |
3857 | if (test_and_clear_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) { | 3991 | if (test_bit(RT_FLAG_RESHAPE_RS, &rs->runtime_flags)) { |
3858 | /* Initiate a reshape. */ | 3992 | /* Initiate a reshape. */ |
3993 | rs_set_rdev_sectors(rs); | ||
3859 | mddev_lock_nointr(mddev); | 3994 | mddev_lock_nointr(mddev); |
3860 | r = rs_start_reshape(rs); | 3995 | r = rs_start_reshape(rs); |
3861 | mddev_unlock(mddev); | 3996 | mddev_unlock(mddev); |
@@ -3881,21 +4016,15 @@ static void raid_resume(struct dm_target *ti) | |||
3881 | attempt_restore_of_faulty_devices(rs); | 4016 | attempt_restore_of_faulty_devices(rs); |
3882 | } | 4017 | } |
3883 | 4018 | ||
3884 | mddev->ro = 0; | ||
3885 | mddev->in_sync = 0; | ||
3886 | |||
3887 | /* | ||
3888 | * Keep the RAID set frozen if reshape/rebuild flags are set. | ||
3889 | * The RAID set is unfrozen once the next table load/resume, | ||
3890 | * which clears the reshape/rebuild flags, occurs. | ||
3891 | * This ensures that the constructor for the inactive table | ||
3892 | * retrieves an up-to-date reshape_position. | ||
3893 | */ | ||
3894 | if (!(rs->ctr_flags & RESUME_STAY_FROZEN_FLAGS)) | ||
3895 | clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); | ||
3896 | |||
3897 | if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { | 4019 | if (test_and_clear_bit(RT_FLAG_RS_SUSPENDED, &rs->runtime_flags)) { |
4020 | /* Only reduce raid set size before running a disk removing reshape. */ | ||
4021 | if (mddev->delta_disks < 0) | ||
4022 | rs_set_capacity(rs); | ||
4023 | |||
3898 | mddev_lock_nointr(mddev); | 4024 | mddev_lock_nointr(mddev); |
4025 | clear_bit(MD_RECOVERY_FROZEN, &mddev->recovery); | ||
4026 | mddev->ro = 0; | ||
4027 | mddev->in_sync = 0; | ||
3899 | mddev_resume(mddev); | 4028 | mddev_resume(mddev); |
3900 | mddev_unlock(mddev); | 4029 | mddev_unlock(mddev); |
3901 | } | 4030 | } |
@@ -3903,7 +4032,7 @@ static void raid_resume(struct dm_target *ti) | |||
3903 | 4032 | ||
3904 | static struct target_type raid_target = { | 4033 | static struct target_type raid_target = { |
3905 | .name = "raid", | 4034 | .name = "raid", |
3906 | .version = {1, 13, 0}, | 4035 | .version = {1, 13, 2}, |
3907 | .module = THIS_MODULE, | 4036 | .module = THIS_MODULE, |
3908 | .ctr = raid_ctr, | 4037 | .ctr = raid_ctr, |
3909 | .dtr = raid_dtr, | 4038 | .dtr = raid_dtr, |
@@ -3912,7 +4041,6 @@ static struct target_type raid_target = { | |||
3912 | .message = raid_message, | 4041 | .message = raid_message, |
3913 | .iterate_devices = raid_iterate_devices, | 4042 | .iterate_devices = raid_iterate_devices, |
3914 | .io_hints = raid_io_hints, | 4043 | .io_hints = raid_io_hints, |
3915 | .presuspend = raid_presuspend, | ||
3916 | .postsuspend = raid_postsuspend, | 4044 | .postsuspend = raid_postsuspend, |
3917 | .preresume = raid_preresume, | 4045 | .preresume = raid_preresume, |
3918 | .resume = raid_resume, | 4046 | .resume = raid_resume, |
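The dm-raid.c changes above fall into two groups. The first is continuity across table reloads: raid sets are now kept on a global raid_sets list so raid_preresume() can locate the still-active slot via rs_find_active() and, when the new table was loaded without 'rebuild', carry the in-flight recovery and reshape positions over instead of restarting them. Condensed sketch of that handover, restated from the raid_preresume() hunk (not a standalone compile unit):

    /* Continue recovery/reshape where the suspended active table left off (sketch). */
    struct raid_set *rs_active = rs_find_active(rs);

    if (rs_active) {
            if (rs_active->md.curr_resync_completed != MaxSector)
                    mddev->curr_resync_completed = rs_active->md.curr_resync_completed;
            if (rs_active->md.reshape_position != MaxSector)
                    mddev->reshape_position = rs_active->md.reshape_position;
    }

The second group is status reporting: decipher_sync_action() and rs_get_progress() now work from a snapshot of mddev->recovery taken after the smp_rmb(), and the per-device health characters are derived from the new RT_FLAG_RS_IN_SYNC / RT_FLAG_RS_RESYNCING runtime flags instead of a local array_in_sync boolean. The target version is bumped to 1.13.2 accordingly.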
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c index b7d175e94a02..aeaaaef43eff 100644 --- a/drivers/md/dm-rq.c +++ b/drivers/md/dm-rq.c | |||
@@ -315,6 +315,10 @@ static void dm_done(struct request *clone, blk_status_t error, bool mapped) | |||
315 | /* The target wants to requeue the I/O */ | 315 | /* The target wants to requeue the I/O */ |
316 | dm_requeue_original_request(tio, false); | 316 | dm_requeue_original_request(tio, false); |
317 | break; | 317 | break; |
318 | case DM_ENDIO_DELAY_REQUEUE: | ||
319 | /* The target wants to requeue the I/O after a delay */ | ||
320 | dm_requeue_original_request(tio, true); | ||
321 | break; | ||
318 | default: | 322 | default: |
319 | DMWARN("unimplemented target endio return value: %d", r); | 323 | DMWARN("unimplemented target endio return value: %d", r); |
320 | BUG(); | 324 | BUG(); |
@@ -713,7 +717,6 @@ int dm_old_init_request_queue(struct mapped_device *md, struct dm_table *t) | |||
713 | /* disable dm_old_request_fn's merge heuristic by default */ | 717 | /* disable dm_old_request_fn's merge heuristic by default */ |
714 | md->seq_rq_merge_deadline_usecs = 0; | 718 | md->seq_rq_merge_deadline_usecs = 0; |
715 | 719 | ||
716 | dm_init_normal_md_queue(md); | ||
717 | blk_queue_softirq_done(md->queue, dm_softirq_done); | 720 | blk_queue_softirq_done(md->queue, dm_softirq_done); |
718 | 721 | ||
719 | /* Initialize the request-based DM worker thread */ | 722 | /* Initialize the request-based DM worker thread */ |
@@ -821,7 +824,6 @@ int dm_mq_init_request_queue(struct mapped_device *md, struct dm_table *t) | |||
821 | err = PTR_ERR(q); | 824 | err = PTR_ERR(q); |
822 | goto out_tag_set; | 825 | goto out_tag_set; |
823 | } | 826 | } |
824 | dm_init_md_queue(md); | ||
825 | 827 | ||
826 | return 0; | 828 | return 0; |
827 | 829 | ||
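The dm-rq.c hunk above is the consumer side of the new DM_ENDIO_DELAY_REQUEUE return value used by dm-mpath: it differs from DM_ENDIO_REQUEUE only in the delay flag passed to dm_requeue_original_request(), giving targets a way to back off from a busy device. Sketch of the dispatch as added above:

    case DM_ENDIO_REQUEUE:
            dm_requeue_original_request(tio, false);        /* requeue immediately */
            break;
    case DM_ENDIO_DELAY_REQUEUE:
            dm_requeue_original_request(tio, true);         /* requeue after a delay */
            break;

The dm_init_normal_md_queue()/dm_init_md_queue() calls removed here are presumably folded into the device's queue setup elsewhere in this series; that code is not part of this excerpt.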
diff --git a/drivers/md/dm-service-time.c b/drivers/md/dm-service-time.c index 7b8642045c55..f006a9005593 100644 --- a/drivers/md/dm-service-time.c +++ b/drivers/md/dm-service-time.c | |||
@@ -282,9 +282,6 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes) | |||
282 | if (list_empty(&s->valid_paths)) | 282 | if (list_empty(&s->valid_paths)) |
283 | goto out; | 283 | goto out; |
284 | 284 | ||
285 | /* Change preferred (first in list) path to evenly balance. */ | ||
286 | list_move_tail(s->valid_paths.next, &s->valid_paths); | ||
287 | |||
288 | list_for_each_entry(pi, &s->valid_paths, list) | 285 | list_for_each_entry(pi, &s->valid_paths, list) |
289 | if (!best || (st_compare_load(pi, best, nr_bytes) < 0)) | 286 | if (!best || (st_compare_load(pi, best, nr_bytes) < 0)) |
290 | best = pi; | 287 | best = pi; |
@@ -292,6 +289,9 @@ static struct dm_path *st_select_path(struct path_selector *ps, size_t nr_bytes) | |||
292 | if (!best) | 289 | if (!best) |
293 | goto out; | 290 | goto out; |
294 | 291 | ||
292 | /* Move most recently used to least preferred to evenly balance. */ | ||
293 | list_move_tail(&best->list, &s->valid_paths); | ||
294 | |||
295 | ret = best->path; | 295 | ret = best->path; |
296 | out: | 296 | out: |
297 | spin_unlock_irqrestore(&s->lock, flags); | 297 | spin_unlock_irqrestore(&s->lock, flags); |
diff --git a/drivers/md/dm-snap.c b/drivers/md/dm-snap.c index a0613bd8ed00..216035be5661 100644 --- a/drivers/md/dm-snap.c +++ b/drivers/md/dm-snap.c | |||
@@ -47,7 +47,7 @@ struct dm_exception_table { | |||
47 | }; | 47 | }; |
48 | 48 | ||
49 | struct dm_snapshot { | 49 | struct dm_snapshot { |
50 | struct rw_semaphore lock; | 50 | struct mutex lock; |
51 | 51 | ||
52 | struct dm_dev *origin; | 52 | struct dm_dev *origin; |
53 | struct dm_dev *cow; | 53 | struct dm_dev *cow; |
@@ -439,9 +439,9 @@ static int __find_snapshots_sharing_cow(struct dm_snapshot *snap, | |||
439 | if (!bdev_equal(s->cow->bdev, snap->cow->bdev)) | 439 | if (!bdev_equal(s->cow->bdev, snap->cow->bdev)) |
440 | continue; | 440 | continue; |
441 | 441 | ||
442 | down_read(&s->lock); | 442 | mutex_lock(&s->lock); |
443 | active = s->active; | 443 | active = s->active; |
444 | up_read(&s->lock); | 444 | mutex_unlock(&s->lock); |
445 | 445 | ||
446 | if (active) { | 446 | if (active) { |
447 | if (snap_src) | 447 | if (snap_src) |
@@ -909,7 +909,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s) | |||
909 | int r; | 909 | int r; |
910 | chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1; | 910 | chunk_t old_chunk = s->first_merging_chunk + s->num_merging_chunks - 1; |
911 | 911 | ||
912 | down_write(&s->lock); | 912 | mutex_lock(&s->lock); |
913 | 913 | ||
914 | /* | 914 | /* |
915 | * Process chunks (and associated exceptions) in reverse order | 915 | * Process chunks (and associated exceptions) in reverse order |
@@ -924,7 +924,7 @@ static int remove_single_exception_chunk(struct dm_snapshot *s) | |||
924 | b = __release_queued_bios_after_merge(s); | 924 | b = __release_queued_bios_after_merge(s); |
925 | 925 | ||
926 | out: | 926 | out: |
927 | up_write(&s->lock); | 927 | mutex_unlock(&s->lock); |
928 | if (b) | 928 | if (b) |
929 | flush_bios(b); | 929 | flush_bios(b); |
930 | 930 | ||
@@ -983,9 +983,9 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s) | |||
983 | if (linear_chunks < 0) { | 983 | if (linear_chunks < 0) { |
984 | DMERR("Read error in exception store: " | 984 | DMERR("Read error in exception store: " |
985 | "shutting down merge"); | 985 | "shutting down merge"); |
986 | down_write(&s->lock); | 986 | mutex_lock(&s->lock); |
987 | s->merge_failed = 1; | 987 | s->merge_failed = 1; |
988 | up_write(&s->lock); | 988 | mutex_unlock(&s->lock); |
989 | } | 989 | } |
990 | goto shut; | 990 | goto shut; |
991 | } | 991 | } |
@@ -1026,10 +1026,10 @@ static void snapshot_merge_next_chunks(struct dm_snapshot *s) | |||
1026 | previous_count = read_pending_exceptions_done_count(); | 1026 | previous_count = read_pending_exceptions_done_count(); |
1027 | } | 1027 | } |
1028 | 1028 | ||
1029 | down_write(&s->lock); | 1029 | mutex_lock(&s->lock); |
1030 | s->first_merging_chunk = old_chunk; | 1030 | s->first_merging_chunk = old_chunk; |
1031 | s->num_merging_chunks = linear_chunks; | 1031 | s->num_merging_chunks = linear_chunks; |
1032 | up_write(&s->lock); | 1032 | mutex_unlock(&s->lock); |
1033 | 1033 | ||
1034 | /* Wait until writes to all 'linear_chunks' drain */ | 1034 | /* Wait until writes to all 'linear_chunks' drain */ |
1035 | for (i = 0; i < linear_chunks; i++) | 1035 | for (i = 0; i < linear_chunks; i++) |
@@ -1071,10 +1071,10 @@ static void merge_callback(int read_err, unsigned long write_err, void *context) | |||
1071 | return; | 1071 | return; |
1072 | 1072 | ||
1073 | shut: | 1073 | shut: |
1074 | down_write(&s->lock); | 1074 | mutex_lock(&s->lock); |
1075 | s->merge_failed = 1; | 1075 | s->merge_failed = 1; |
1076 | b = __release_queued_bios_after_merge(s); | 1076 | b = __release_queued_bios_after_merge(s); |
1077 | up_write(&s->lock); | 1077 | mutex_unlock(&s->lock); |
1078 | error_bios(b); | 1078 | error_bios(b); |
1079 | 1079 | ||
1080 | merge_shutdown(s); | 1080 | merge_shutdown(s); |
@@ -1173,7 +1173,7 @@ static int snapshot_ctr(struct dm_target *ti, unsigned int argc, char **argv) | |||
1173 | s->exception_start_sequence = 0; | 1173 | s->exception_start_sequence = 0; |
1174 | s->exception_complete_sequence = 0; | 1174 | s->exception_complete_sequence = 0; |
1175 | INIT_LIST_HEAD(&s->out_of_order_list); | 1175 | INIT_LIST_HEAD(&s->out_of_order_list); |
1176 | init_rwsem(&s->lock); | 1176 | mutex_init(&s->lock); |
1177 | INIT_LIST_HEAD(&s->list); | 1177 | INIT_LIST_HEAD(&s->list); |
1178 | spin_lock_init(&s->pe_lock); | 1178 | spin_lock_init(&s->pe_lock); |
1179 | s->state_bits = 0; | 1179 | s->state_bits = 0; |
@@ -1338,9 +1338,9 @@ static void snapshot_dtr(struct dm_target *ti) | |||
1338 | /* Check whether exception handover must be cancelled */ | 1338 | /* Check whether exception handover must be cancelled */ |
1339 | (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); | 1339 | (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); |
1340 | if (snap_src && snap_dest && (s == snap_src)) { | 1340 | if (snap_src && snap_dest && (s == snap_src)) { |
1341 | down_write(&snap_dest->lock); | 1341 | mutex_lock(&snap_dest->lock); |
1342 | snap_dest->valid = 0; | 1342 | snap_dest->valid = 0; |
1343 | up_write(&snap_dest->lock); | 1343 | mutex_unlock(&snap_dest->lock); |
1344 | DMERR("Cancelling snapshot handover."); | 1344 | DMERR("Cancelling snapshot handover."); |
1345 | } | 1345 | } |
1346 | up_read(&_origins_lock); | 1346 | up_read(&_origins_lock); |
@@ -1371,6 +1371,8 @@ static void snapshot_dtr(struct dm_target *ti) | |||
1371 | 1371 | ||
1372 | dm_exception_store_destroy(s->store); | 1372 | dm_exception_store_destroy(s->store); |
1373 | 1373 | ||
1374 | mutex_destroy(&s->lock); | ||
1375 | |||
1374 | dm_put_device(ti, s->cow); | 1376 | dm_put_device(ti, s->cow); |
1375 | 1377 | ||
1376 | dm_put_device(ti, s->origin); | 1378 | dm_put_device(ti, s->origin); |
@@ -1458,7 +1460,7 @@ static void pending_complete(void *context, int success) | |||
1458 | 1460 | ||
1459 | if (!success) { | 1461 | if (!success) { |
1460 | /* Read/write error - snapshot is unusable */ | 1462 | /* Read/write error - snapshot is unusable */ |
1461 | down_write(&s->lock); | 1463 | mutex_lock(&s->lock); |
1462 | __invalidate_snapshot(s, -EIO); | 1464 | __invalidate_snapshot(s, -EIO); |
1463 | error = 1; | 1465 | error = 1; |
1464 | goto out; | 1466 | goto out; |
@@ -1466,14 +1468,14 @@ static void pending_complete(void *context, int success) | |||
1466 | 1468 | ||
1467 | e = alloc_completed_exception(GFP_NOIO); | 1469 | e = alloc_completed_exception(GFP_NOIO); |
1468 | if (!e) { | 1470 | if (!e) { |
1469 | down_write(&s->lock); | 1471 | mutex_lock(&s->lock); |
1470 | __invalidate_snapshot(s, -ENOMEM); | 1472 | __invalidate_snapshot(s, -ENOMEM); |
1471 | error = 1; | 1473 | error = 1; |
1472 | goto out; | 1474 | goto out; |
1473 | } | 1475 | } |
1474 | *e = pe->e; | 1476 | *e = pe->e; |
1475 | 1477 | ||
1476 | down_write(&s->lock); | 1478 | mutex_lock(&s->lock); |
1477 | if (!s->valid) { | 1479 | if (!s->valid) { |
1478 | free_completed_exception(e); | 1480 | free_completed_exception(e); |
1479 | error = 1; | 1481 | error = 1; |
@@ -1498,7 +1500,7 @@ out: | |||
1498 | full_bio->bi_end_io = pe->full_bio_end_io; | 1500 | full_bio->bi_end_io = pe->full_bio_end_io; |
1499 | increment_pending_exceptions_done_count(); | 1501 | increment_pending_exceptions_done_count(); |
1500 | 1502 | ||
1501 | up_write(&s->lock); | 1503 | mutex_unlock(&s->lock); |
1502 | 1504 | ||
1503 | /* Submit any pending write bios */ | 1505 | /* Submit any pending write bios */ |
1504 | if (error) { | 1506 | if (error) { |
@@ -1694,7 +1696,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) | |||
1694 | 1696 | ||
1695 | /* FIXME: should only take write lock if we need | 1697 | /* FIXME: should only take write lock if we need |
1696 | * to copy an exception */ | 1698 | * to copy an exception */ |
1697 | down_write(&s->lock); | 1699 | mutex_lock(&s->lock); |
1698 | 1700 | ||
1699 | if (!s->valid || (unlikely(s->snapshot_overflowed) && | 1701 | if (!s->valid || (unlikely(s->snapshot_overflowed) && |
1700 | bio_data_dir(bio) == WRITE)) { | 1702 | bio_data_dir(bio) == WRITE)) { |
@@ -1717,9 +1719,9 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) | |||
1717 | if (bio_data_dir(bio) == WRITE) { | 1719 | if (bio_data_dir(bio) == WRITE) { |
1718 | pe = __lookup_pending_exception(s, chunk); | 1720 | pe = __lookup_pending_exception(s, chunk); |
1719 | if (!pe) { | 1721 | if (!pe) { |
1720 | up_write(&s->lock); | 1722 | mutex_unlock(&s->lock); |
1721 | pe = alloc_pending_exception(s); | 1723 | pe = alloc_pending_exception(s); |
1722 | down_write(&s->lock); | 1724 | mutex_lock(&s->lock); |
1723 | 1725 | ||
1724 | if (!s->valid || s->snapshot_overflowed) { | 1726 | if (!s->valid || s->snapshot_overflowed) { |
1725 | free_pending_exception(pe); | 1727 | free_pending_exception(pe); |
@@ -1754,7 +1756,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) | |||
1754 | bio->bi_iter.bi_size == | 1756 | bio->bi_iter.bi_size == |
1755 | (s->store->chunk_size << SECTOR_SHIFT)) { | 1757 | (s->store->chunk_size << SECTOR_SHIFT)) { |
1756 | pe->started = 1; | 1758 | pe->started = 1; |
1757 | up_write(&s->lock); | 1759 | mutex_unlock(&s->lock); |
1758 | start_full_bio(pe, bio); | 1760 | start_full_bio(pe, bio); |
1759 | goto out; | 1761 | goto out; |
1760 | } | 1762 | } |
@@ -1764,7 +1766,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) | |||
1764 | if (!pe->started) { | 1766 | if (!pe->started) { |
1765 | /* this is protected by snap->lock */ | 1767 | /* this is protected by snap->lock */ |
1766 | pe->started = 1; | 1768 | pe->started = 1; |
1767 | up_write(&s->lock); | 1769 | mutex_unlock(&s->lock); |
1768 | start_copy(pe); | 1770 | start_copy(pe); |
1769 | goto out; | 1771 | goto out; |
1770 | } | 1772 | } |
@@ -1774,7 +1776,7 @@ static int snapshot_map(struct dm_target *ti, struct bio *bio) | |||
1774 | } | 1776 | } |
1775 | 1777 | ||
1776 | out_unlock: | 1778 | out_unlock: |
1777 | up_write(&s->lock); | 1779 | mutex_unlock(&s->lock); |
1778 | out: | 1780 | out: |
1779 | return r; | 1781 | return r; |
1780 | } | 1782 | } |
@@ -1810,7 +1812,7 @@ static int snapshot_merge_map(struct dm_target *ti, struct bio *bio) | |||
1810 | 1812 | ||
1811 | chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector); | 1813 | chunk = sector_to_chunk(s->store, bio->bi_iter.bi_sector); |
1812 | 1814 | ||
1813 | down_write(&s->lock); | 1815 | mutex_lock(&s->lock); |
1814 | 1816 | ||
1815 | /* Full merging snapshots are redirected to the origin */ | 1817 | /* Full merging snapshots are redirected to the origin */ |
1816 | if (!s->valid) | 1818 | if (!s->valid) |
@@ -1841,12 +1843,12 @@ redirect_to_origin: | |||
1841 | bio_set_dev(bio, s->origin->bdev); | 1843 | bio_set_dev(bio, s->origin->bdev); |
1842 | 1844 | ||
1843 | if (bio_data_dir(bio) == WRITE) { | 1845 | if (bio_data_dir(bio) == WRITE) { |
1844 | up_write(&s->lock); | 1846 | mutex_unlock(&s->lock); |
1845 | return do_origin(s->origin, bio); | 1847 | return do_origin(s->origin, bio); |
1846 | } | 1848 | } |
1847 | 1849 | ||
1848 | out_unlock: | 1850 | out_unlock: |
1849 | up_write(&s->lock); | 1851 | mutex_unlock(&s->lock); |
1850 | 1852 | ||
1851 | return r; | 1853 | return r; |
1852 | } | 1854 | } |
@@ -1878,7 +1880,7 @@ static int snapshot_preresume(struct dm_target *ti) | |||
1878 | down_read(&_origins_lock); | 1880 | down_read(&_origins_lock); |
1879 | (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); | 1881 | (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); |
1880 | if (snap_src && snap_dest) { | 1882 | if (snap_src && snap_dest) { |
1881 | down_read(&snap_src->lock); | 1883 | mutex_lock(&snap_src->lock); |
1882 | if (s == snap_src) { | 1884 | if (s == snap_src) { |
1883 | DMERR("Unable to resume snapshot source until " | 1885 | DMERR("Unable to resume snapshot source until " |
1884 | "handover completes."); | 1886 | "handover completes."); |
@@ -1888,7 +1890,7 @@ static int snapshot_preresume(struct dm_target *ti) | |||
1888 | "source is suspended."); | 1890 | "source is suspended."); |
1889 | r = -EINVAL; | 1891 | r = -EINVAL; |
1890 | } | 1892 | } |
1891 | up_read(&snap_src->lock); | 1893 | mutex_unlock(&snap_src->lock); |
1892 | } | 1894 | } |
1893 | up_read(&_origins_lock); | 1895 | up_read(&_origins_lock); |
1894 | 1896 | ||
@@ -1934,11 +1936,11 @@ static void snapshot_resume(struct dm_target *ti) | |||
1934 | 1936 | ||
1935 | (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); | 1937 | (void) __find_snapshots_sharing_cow(s, &snap_src, &snap_dest, NULL); |
1936 | if (snap_src && snap_dest) { | 1938 | if (snap_src && snap_dest) { |
1937 | down_write(&snap_src->lock); | 1939 | mutex_lock(&snap_src->lock); |
1938 | down_write_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING); | 1940 | mutex_lock_nested(&snap_dest->lock, SINGLE_DEPTH_NESTING); |
1939 | __handover_exceptions(snap_src, snap_dest); | 1941 | __handover_exceptions(snap_src, snap_dest); |
1940 | up_write(&snap_dest->lock); | 1942 | mutex_unlock(&snap_dest->lock); |
1941 | up_write(&snap_src->lock); | 1943 | mutex_unlock(&snap_src->lock); |
1942 | } | 1944 | } |
1943 | 1945 | ||
1944 | up_read(&_origins_lock); | 1946 | up_read(&_origins_lock); |
@@ -1953,9 +1955,9 @@ static void snapshot_resume(struct dm_target *ti) | |||
1953 | /* Now we have correct chunk size, reregister */ | 1955 | /* Now we have correct chunk size, reregister */ |
1954 | reregister_snapshot(s); | 1956 | reregister_snapshot(s); |
1955 | 1957 | ||
1956 | down_write(&s->lock); | 1958 | mutex_lock(&s->lock); |
1957 | s->active = 1; | 1959 | s->active = 1; |
1958 | up_write(&s->lock); | 1960 | mutex_unlock(&s->lock); |
1959 | } | 1961 | } |
1960 | 1962 | ||
1961 | static uint32_t get_origin_minimum_chunksize(struct block_device *bdev) | 1963 | static uint32_t get_origin_minimum_chunksize(struct block_device *bdev) |
@@ -1995,7 +1997,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type, | |||
1995 | switch (type) { | 1997 | switch (type) { |
1996 | case STATUSTYPE_INFO: | 1998 | case STATUSTYPE_INFO: |
1997 | 1999 | ||
1998 | down_write(&snap->lock); | 2000 | mutex_lock(&snap->lock); |
1999 | 2001 | ||
2000 | if (!snap->valid) | 2002 | if (!snap->valid) |
2001 | DMEMIT("Invalid"); | 2003 | DMEMIT("Invalid"); |
@@ -2020,7 +2022,7 @@ static void snapshot_status(struct dm_target *ti, status_type_t type, | |||
2020 | DMEMIT("Unknown"); | 2022 | DMEMIT("Unknown"); |
2021 | } | 2023 | } |
2022 | 2024 | ||
2023 | up_write(&snap->lock); | 2025 | mutex_unlock(&snap->lock); |
2024 | 2026 | ||
2025 | break; | 2027 | break; |
2026 | 2028 | ||
@@ -2086,7 +2088,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector, | |||
2086 | if (dm_target_is_snapshot_merge(snap->ti)) | 2088 | if (dm_target_is_snapshot_merge(snap->ti)) |
2087 | continue; | 2089 | continue; |
2088 | 2090 | ||
2089 | down_write(&snap->lock); | 2091 | mutex_lock(&snap->lock); |
2090 | 2092 | ||
2091 | /* Only deal with valid and active snapshots */ | 2093 | /* Only deal with valid and active snapshots */ |
2092 | if (!snap->valid || !snap->active) | 2094 | if (!snap->valid || !snap->active) |
@@ -2113,9 +2115,9 @@ static int __origin_write(struct list_head *snapshots, sector_t sector, | |||
2113 | 2115 | ||
2114 | pe = __lookup_pending_exception(snap, chunk); | 2116 | pe = __lookup_pending_exception(snap, chunk); |
2115 | if (!pe) { | 2117 | if (!pe) { |
2116 | up_write(&snap->lock); | 2118 | mutex_unlock(&snap->lock); |
2117 | pe = alloc_pending_exception(snap); | 2119 | pe = alloc_pending_exception(snap); |
2118 | down_write(&snap->lock); | 2120 | mutex_lock(&snap->lock); |
2119 | 2121 | ||
2120 | if (!snap->valid) { | 2122 | if (!snap->valid) { |
2121 | free_pending_exception(pe); | 2123 | free_pending_exception(pe); |
@@ -2158,7 +2160,7 @@ static int __origin_write(struct list_head *snapshots, sector_t sector, | |||
2158 | } | 2160 | } |
2159 | 2161 | ||
2160 | next_snapshot: | 2162 | next_snapshot: |
2161 | up_write(&snap->lock); | 2163 | mutex_unlock(&snap->lock); |
2162 | 2164 | ||
2163 | if (pe_to_start_now) { | 2165 | if (pe_to_start_now) { |
2164 | start_copy(pe_to_start_now); | 2166 | start_copy(pe_to_start_now); |
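The dm-snap.c hunks above convert the per-snapshot rw_semaphore into a mutex and pair the constructor's mutex_init() with a new mutex_destroy() in the destructor. A minimal sketch of the resulting lock lifecycle, with the snapshot-specific fields stripped out (illustrative only, not code from the patch):

#include <linux/mutex.h>

struct example_snap {                   /* stand-in for struct dm_snapshot */
        struct mutex lock;              /* was: struct rw_semaphore lock */
        int valid;
};

static void example_snap_ctr(struct example_snap *s)
{
        mutex_init(&s->lock);           /* was: init_rwsem(&s->lock) */
}

static void example_snap_invalidate(struct example_snap *s)
{
        mutex_lock(&s->lock);           /* was: down_read()/down_write() */
        s->valid = 0;
        mutex_unlock(&s->lock);         /* was: up_read()/up_write() */
}

static void example_snap_dtr(struct example_snap *s)
{
        /*
         * mutex_destroy() is a no-op unless mutex debugging is enabled,
         * in which case it poisons the lock so any later use is reported.
         */
        mutex_destroy(&s->lock);
}

Note that both the former read-side (down_read/up_read) and write-side (down_write/up_write) callers now take the same mutex, so the conversion trades reader concurrency for a simpler, debuggable lock.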
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c index 29bc51084c82..56059fb56e2d 100644 --- a/drivers/md/dm-stats.c +++ b/drivers/md/dm-stats.c | |||
@@ -228,6 +228,7 @@ void dm_stats_cleanup(struct dm_stats *stats) | |||
228 | dm_stat_free(&s->rcu_head); | 228 | dm_stat_free(&s->rcu_head); |
229 | } | 229 | } |
230 | free_percpu(stats->last); | 230 | free_percpu(stats->last); |
231 | mutex_destroy(&stats->mutex); | ||
231 | } | 232 | } |
232 | 233 | ||
233 | static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end, | 234 | static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end, |
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c index aaffd0c0ee9a..5fe7ec356c33 100644 --- a/drivers/md/dm-table.c +++ b/drivers/md/dm-table.c | |||
@@ -866,7 +866,8 @@ EXPORT_SYMBOL(dm_consume_args); | |||
866 | static bool __table_type_bio_based(enum dm_queue_mode table_type) | 866 | static bool __table_type_bio_based(enum dm_queue_mode table_type) |
867 | { | 867 | { |
868 | return (table_type == DM_TYPE_BIO_BASED || | 868 | return (table_type == DM_TYPE_BIO_BASED || |
869 | table_type == DM_TYPE_DAX_BIO_BASED); | 869 | table_type == DM_TYPE_DAX_BIO_BASED || |
870 | table_type == DM_TYPE_NVME_BIO_BASED); | ||
870 | } | 871 | } |
871 | 872 | ||
872 | static bool __table_type_request_based(enum dm_queue_mode table_type) | 873 | static bool __table_type_request_based(enum dm_queue_mode table_type) |
@@ -909,13 +910,33 @@ static bool dm_table_supports_dax(struct dm_table *t) | |||
909 | return true; | 910 | return true; |
910 | } | 911 | } |
911 | 912 | ||
913 | static bool dm_table_does_not_support_partial_completion(struct dm_table *t); | ||
914 | |||
915 | struct verify_rq_based_data { | ||
916 | unsigned sq_count; | ||
917 | unsigned mq_count; | ||
918 | }; | ||
919 | |||
920 | static int device_is_rq_based(struct dm_target *ti, struct dm_dev *dev, | ||
921 | sector_t start, sector_t len, void *data) | ||
922 | { | ||
923 | struct request_queue *q = bdev_get_queue(dev->bdev); | ||
924 | struct verify_rq_based_data *v = data; | ||
925 | |||
926 | if (q->mq_ops) | ||
927 | v->mq_count++; | ||
928 | else | ||
929 | v->sq_count++; | ||
930 | |||
931 | return queue_is_rq_based(q); | ||
932 | } | ||
933 | |||
912 | static int dm_table_determine_type(struct dm_table *t) | 934 | static int dm_table_determine_type(struct dm_table *t) |
913 | { | 935 | { |
914 | unsigned i; | 936 | unsigned i; |
915 | unsigned bio_based = 0, request_based = 0, hybrid = 0; | 937 | unsigned bio_based = 0, request_based = 0, hybrid = 0; |
916 | unsigned sq_count = 0, mq_count = 0; | 938 | struct verify_rq_based_data v = {.sq_count = 0, .mq_count = 0}; |
917 | struct dm_target *tgt; | 939 | struct dm_target *tgt; |
918 | struct dm_dev_internal *dd; | ||
919 | struct list_head *devices = dm_table_get_devices(t); | 940 | struct list_head *devices = dm_table_get_devices(t); |
920 | enum dm_queue_mode live_md_type = dm_get_md_type(t->md); | 941 | enum dm_queue_mode live_md_type = dm_get_md_type(t->md); |
921 | 942 | ||
@@ -923,6 +944,14 @@ static int dm_table_determine_type(struct dm_table *t) | |||
923 | /* target already set the table's type */ | 944 | /* target already set the table's type */ |
924 | if (t->type == DM_TYPE_BIO_BASED) | 945 | if (t->type == DM_TYPE_BIO_BASED) |
925 | return 0; | 946 | return 0; |
947 | else if (t->type == DM_TYPE_NVME_BIO_BASED) { | ||
948 | if (!dm_table_does_not_support_partial_completion(t)) { | ||
949 | DMERR("nvme bio-based is only possible with devices" | ||
950 | " that don't support partial completion"); | ||
951 | return -EINVAL; | ||
952 | } | ||
953 | /* Fallthru, also verify all devices are blk-mq */ | ||
954 | } | ||
926 | BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED); | 955 | BUG_ON(t->type == DM_TYPE_DAX_BIO_BASED); |
927 | goto verify_rq_based; | 956 | goto verify_rq_based; |
928 | } | 957 | } |
@@ -937,8 +966,8 @@ static int dm_table_determine_type(struct dm_table *t) | |||
937 | bio_based = 1; | 966 | bio_based = 1; |
938 | 967 | ||
939 | if (bio_based && request_based) { | 968 | if (bio_based && request_based) { |
940 | DMWARN("Inconsistent table: different target types" | 969 | DMERR("Inconsistent table: different target types" |
941 | " can't be mixed up"); | 970 | " can't be mixed up"); |
942 | return -EINVAL; | 971 | return -EINVAL; |
943 | } | 972 | } |
944 | } | 973 | } |
@@ -959,8 +988,18 @@ static int dm_table_determine_type(struct dm_table *t) | |||
959 | /* We must use this table as bio-based */ | 988 | /* We must use this table as bio-based */ |
960 | t->type = DM_TYPE_BIO_BASED; | 989 | t->type = DM_TYPE_BIO_BASED; |
961 | if (dm_table_supports_dax(t) || | 990 | if (dm_table_supports_dax(t) || |
962 | (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) | 991 | (list_empty(devices) && live_md_type == DM_TYPE_DAX_BIO_BASED)) { |
963 | t->type = DM_TYPE_DAX_BIO_BASED; | 992 | t->type = DM_TYPE_DAX_BIO_BASED; |
993 | } else { | ||
994 | /* Check if upgrading to NVMe bio-based is valid or required */ | ||
995 | tgt = dm_table_get_immutable_target(t); | ||
996 | if (tgt && !tgt->max_io_len && dm_table_does_not_support_partial_completion(t)) { | ||
997 | t->type = DM_TYPE_NVME_BIO_BASED; | ||
998 | goto verify_rq_based; /* must be stacked directly on NVMe (blk-mq) */ | ||
999 | } else if (list_empty(devices) && live_md_type == DM_TYPE_NVME_BIO_BASED) { | ||
1000 | t->type = DM_TYPE_NVME_BIO_BASED; | ||
1001 | } | ||
1002 | } | ||
964 | return 0; | 1003 | return 0; |
965 | } | 1004 | } |
966 | 1005 | ||
@@ -980,7 +1019,8 @@ verify_rq_based: | |||
980 | * (e.g. request completion process for partial completion.) | 1019 | * (e.g. request completion process for partial completion.) |
981 | */ | 1020 | */ |
982 | if (t->num_targets > 1) { | 1021 | if (t->num_targets > 1) { |
983 | DMWARN("Request-based dm doesn't support multiple targets yet"); | 1022 | DMERR("%s DM doesn't support multiple targets", |
1023 | t->type == DM_TYPE_NVME_BIO_BASED ? "nvme bio-based" : "request-based"); | ||
984 | return -EINVAL; | 1024 | return -EINVAL; |
985 | } | 1025 | } |
986 | 1026 | ||
@@ -997,28 +1037,29 @@ verify_rq_based: | |||
997 | return 0; | 1037 | return 0; |
998 | } | 1038 | } |
999 | 1039 | ||
1000 | /* Non-request-stackable devices can't be used for request-based dm */ | 1040 | tgt = dm_table_get_immutable_target(t); |
1001 | list_for_each_entry(dd, devices, list) { | 1041 | if (!tgt) { |
1002 | struct request_queue *q = bdev_get_queue(dd->dm_dev->bdev); | 1042 | DMERR("table load rejected: immutable target is required"); |
1003 | 1043 | return -EINVAL; | |
1004 | if (!queue_is_rq_based(q)) { | 1044 | } else if (tgt->max_io_len) { |
1005 | DMERR("table load rejected: including" | 1045 | DMERR("table load rejected: immutable target that splits IO is not supported"); |
1006 | " non-request-stackable devices"); | 1046 | return -EINVAL; |
1007 | return -EINVAL; | 1047 | } |
1008 | } | ||
1009 | 1048 | ||
1010 | if (q->mq_ops) | 1049 | /* Non-request-stackable devices can't be used for request-based dm */ |
1011 | mq_count++; | 1050 | if (!tgt->type->iterate_devices || |
1012 | else | 1051 | !tgt->type->iterate_devices(tgt, device_is_rq_based, &v)) { |
1013 | sq_count++; | 1052 | DMERR("table load rejected: including non-request-stackable devices"); |
1053 | return -EINVAL; | ||
1014 | } | 1054 | } |
1015 | if (sq_count && mq_count) { | 1055 | if (v.sq_count && v.mq_count) { |
1016 | DMERR("table load rejected: not all devices are blk-mq request-stackable"); | 1056 | DMERR("table load rejected: not all devices are blk-mq request-stackable"); |
1017 | return -EINVAL; | 1057 | return -EINVAL; |
1018 | } | 1058 | } |
1019 | t->all_blk_mq = mq_count > 0; | 1059 | t->all_blk_mq = v.mq_count > 0; |
1020 | 1060 | ||
1021 | if (t->type == DM_TYPE_MQ_REQUEST_BASED && !t->all_blk_mq) { | 1061 | if (!t->all_blk_mq && |
1062 | (t->type == DM_TYPE_MQ_REQUEST_BASED || t->type == DM_TYPE_NVME_BIO_BASED)) { | ||
1022 | DMERR("table load rejected: all devices are not blk-mq request-stackable"); | 1063 | DMERR("table load rejected: all devices are not blk-mq request-stackable"); |
1023 | return -EINVAL; | 1064 | return -EINVAL; |
1024 | } | 1065 | } |
@@ -1079,7 +1120,8 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device * | |||
1079 | { | 1120 | { |
1080 | enum dm_queue_mode type = dm_table_get_type(t); | 1121 | enum dm_queue_mode type = dm_table_get_type(t); |
1081 | unsigned per_io_data_size = 0; | 1122 | unsigned per_io_data_size = 0; |
1082 | struct dm_target *tgt; | 1123 | unsigned min_pool_size = 0; |
1124 | struct dm_target *ti; | ||
1083 | unsigned i; | 1125 | unsigned i; |
1084 | 1126 | ||
1085 | if (unlikely(type == DM_TYPE_NONE)) { | 1127 | if (unlikely(type == DM_TYPE_NONE)) { |
@@ -1089,11 +1131,13 @@ static int dm_table_alloc_md_mempools(struct dm_table *t, struct mapped_device * | |||
1089 | 1131 | ||
1090 | if (__table_type_bio_based(type)) | 1132 | if (__table_type_bio_based(type)) |
1091 | for (i = 0; i < t->num_targets; i++) { | 1133 | for (i = 0; i < t->num_targets; i++) { |
1092 | tgt = t->targets + i; | 1134 | ti = t->targets + i; |
1093 | per_io_data_size = max(per_io_data_size, tgt->per_io_data_size); | 1135 | per_io_data_size = max(per_io_data_size, ti->per_io_data_size); |
1136 | min_pool_size = max(min_pool_size, ti->num_flush_bios); | ||
1094 | } | 1137 | } |
1095 | 1138 | ||
1096 | t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, per_io_data_size); | 1139 | t->mempools = dm_alloc_md_mempools(md, type, t->integrity_supported, |
1140 | per_io_data_size, min_pool_size); | ||
1097 | if (!t->mempools) | 1141 | if (!t->mempools) |
1098 | return -ENOMEM; | 1142 | return -ENOMEM; |
1099 | 1143 | ||
@@ -1705,6 +1749,20 @@ static bool dm_table_all_devices_attribute(struct dm_table *t, | |||
1705 | return true; | 1749 | return true; |
1706 | } | 1750 | } |
1707 | 1751 | ||
1752 | static int device_no_partial_completion(struct dm_target *ti, struct dm_dev *dev, | ||
1753 | sector_t start, sector_t len, void *data) | ||
1754 | { | ||
1755 | char b[BDEVNAME_SIZE]; | ||
1756 | |||
1757 | /* For now, NVMe devices are the only devices of this class */ | ||
1758 | return (strncmp(bdevname(dev->bdev, b), "nvme", 4) == 0); | ||
1759 | } | ||
1760 | |||
1761 | static bool dm_table_does_not_support_partial_completion(struct dm_table *t) | ||
1762 | { | ||
1763 | return dm_table_all_devices_attribute(t, device_no_partial_completion); | ||
1764 | } | ||
1765 | |||
1708 | static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, | 1766 | static int device_not_write_same_capable(struct dm_target *ti, struct dm_dev *dev, |
1709 | sector_t start, sector_t len, void *data) | 1767 | sector_t start, sector_t len, void *data) |
1710 | { | 1768 | { |
@@ -1820,6 +1878,8 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, | |||
1820 | } | 1878 | } |
1821 | blk_queue_write_cache(q, wc, fua); | 1879 | blk_queue_write_cache(q, wc, fua); |
1822 | 1880 | ||
1881 | if (dm_table_supports_dax(t)) | ||
1882 | queue_flag_set_unlocked(QUEUE_FLAG_DAX, q); | ||
1823 | if (dm_table_supports_dax_write_cache(t)) | 1883 | if (dm_table_supports_dax_write_cache(t)) |
1824 | dax_write_cache(t->md->dax_dev, true); | 1884 | dax_write_cache(t->md->dax_dev, true); |
1825 | 1885 | ||
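The dm-table.c changes above push the per-device checks behind iterate_devices callbacks (device_is_rq_based, device_no_partial_completion) instead of open-coding a walk over the table's device list, and they thread a small counter struct through the void *data cookie (verify_rq_based_data) so the caller can tell "all blk-mq" from "mixed" afterwards. A sketch of that callback shape for a made-up attribute; the rotational check and both function names are hypothetical, chosen only to show the pattern:

#include <linux/blkdev.h>
#include <linux/device-mapper.h>

struct rot_counts {
        unsigned int rot;
        unsigned int nonrot;
};

/* Signature matches iterate_devices_callout_fn. */
static int device_count_rotational(struct dm_target *ti, struct dm_dev *dev,
                                   sector_t start, sector_t len, void *data)
{
        struct request_queue *q = bdev_get_queue(dev->bdev);
        struct rot_counts *c = data;

        if (q && blk_queue_nonrot(q))
                c->nonrot++;
        else
                c->rot++;

        return 0;       /* returning 0 lets multi-device targets keep iterating */
}

static bool target_mixes_rotational(struct dm_target *ti)
{
        struct rot_counts c = { 0, 0 };

        if (ti->type->iterate_devices)
                ti->type->iterate_devices(ti, device_count_rotational, &c);

        return c.rot && c.nonrot;
}

Keeping the per-device test in a callback means single-device and multi-device targets reuse the same check without the table code needing to know how many devices each target stacks on.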
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c index f91d771fff4b..629c555890c1 100644 --- a/drivers/md/dm-thin.c +++ b/drivers/md/dm-thin.c | |||
@@ -492,6 +492,11 @@ static void pool_table_init(void) | |||
492 | INIT_LIST_HEAD(&dm_thin_pool_table.pools); | 492 | INIT_LIST_HEAD(&dm_thin_pool_table.pools); |
493 | } | 493 | } |
494 | 494 | ||
495 | static void pool_table_exit(void) | ||
496 | { | ||
497 | mutex_destroy(&dm_thin_pool_table.mutex); | ||
498 | } | ||
499 | |||
495 | static void __pool_table_insert(struct pool *pool) | 500 | static void __pool_table_insert(struct pool *pool) |
496 | { | 501 | { |
497 | BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex)); | 502 | BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex)); |
@@ -1717,7 +1722,7 @@ static void __remap_and_issue_shared_cell(void *context, | |||
1717 | bio_op(bio) == REQ_OP_DISCARD) | 1722 | bio_op(bio) == REQ_OP_DISCARD) |
1718 | bio_list_add(&info->defer_bios, bio); | 1723 | bio_list_add(&info->defer_bios, bio); |
1719 | else { | 1724 | else { |
1720 | struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook));; | 1725 | struct dm_thin_endio_hook *h = dm_per_bio_data(bio, sizeof(struct dm_thin_endio_hook)); |
1721 | 1726 | ||
1722 | h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds); | 1727 | h->shared_read_entry = dm_deferred_entry_inc(info->tc->pool->shared_read_ds); |
1723 | inc_all_io_entry(info->tc->pool, bio); | 1728 | inc_all_io_entry(info->tc->pool, bio); |
@@ -4387,6 +4392,8 @@ static void dm_thin_exit(void) | |||
4387 | dm_unregister_target(&pool_target); | 4392 | dm_unregister_target(&pool_target); |
4388 | 4393 | ||
4389 | kmem_cache_destroy(_new_mapping_cache); | 4394 | kmem_cache_destroy(_new_mapping_cache); |
4395 | |||
4396 | pool_table_exit(); | ||
4390 | } | 4397 | } |
4391 | 4398 | ||
4392 | module_init(dm_thin_init); | 4399 | module_init(dm_thin_init); |
diff --git a/drivers/md/dm-unstripe.c b/drivers/md/dm-unstripe.c new file mode 100644 index 000000000000..65f838fa2e99 --- /dev/null +++ b/drivers/md/dm-unstripe.c | |||
@@ -0,0 +1,219 @@ | |||
1 | /* | ||
2 | * Copyright (C) 2017 Intel Corporation. | ||
3 | * | ||
4 | * This file is released under the GPL. | ||
5 | */ | ||
6 | |||
7 | #include "dm.h" | ||
8 | |||
9 | #include <linux/module.h> | ||
10 | #include <linux/init.h> | ||
11 | #include <linux/blkdev.h> | ||
12 | #include <linux/bio.h> | ||
13 | #include <linux/slab.h> | ||
14 | #include <linux/bitops.h> | ||
15 | #include <linux/device-mapper.h> | ||
16 | |||
17 | struct unstripe_c { | ||
18 | struct dm_dev *dev; | ||
19 | sector_t physical_start; | ||
20 | |||
21 | uint32_t stripes; | ||
22 | |||
23 | uint32_t unstripe; | ||
24 | sector_t unstripe_width; | ||
25 | sector_t unstripe_offset; | ||
26 | |||
27 | uint32_t chunk_size; | ||
28 | u8 chunk_shift; | ||
29 | }; | ||
30 | |||
31 | #define DM_MSG_PREFIX "unstriped" | ||
32 | |||
33 | static void cleanup_unstripe(struct unstripe_c *uc, struct dm_target *ti) | ||
34 | { | ||
35 | if (uc->dev) | ||
36 | dm_put_device(ti, uc->dev); | ||
37 | kfree(uc); | ||
38 | } | ||
39 | |||
40 | /* | ||
41 | * Construct an unstriped mapping. | ||
42 | * <number of stripes> <chunk size> <stripe #> <dev_path> <offset> | ||
43 | */ | ||
44 | static int unstripe_ctr(struct dm_target *ti, unsigned int argc, char **argv) | ||
45 | { | ||
46 | struct unstripe_c *uc; | ||
47 | sector_t tmp_len; | ||
48 | unsigned long long start; | ||
49 | char dummy; | ||
50 | |||
51 | if (argc != 5) { | ||
52 | ti->error = "Invalid number of arguments"; | ||
53 | return -EINVAL; | ||
54 | } | ||
55 | |||
56 | uc = kzalloc(sizeof(*uc), GFP_KERNEL); | ||
57 | if (!uc) { | ||
58 | ti->error = "Memory allocation for unstriped context failed"; | ||
59 | return -ENOMEM; | ||
60 | } | ||
61 | |||
62 | if (kstrtouint(argv[0], 10, &uc->stripes) || !uc->stripes) { | ||
63 | ti->error = "Invalid stripe count"; | ||
64 | goto err; | ||
65 | } | ||
66 | |||
67 | if (kstrtouint(argv[1], 10, &uc->chunk_size) || !uc->chunk_size) { | ||
68 | ti->error = "Invalid chunk_size"; | ||
69 | goto err; | ||
70 | } | ||
71 | |||
72 | // FIXME: must support non power of 2 chunk_size, dm-stripe.c does | ||
73 | if (!is_power_of_2(uc->chunk_size)) { | ||
74 | ti->error = "Non power of 2 chunk_size is not supported yet"; | ||
75 | goto err; | ||
76 | } | ||
77 | |||
78 | if (kstrtouint(argv[2], 10, &uc->unstripe)) { | ||
79 | ti->error = "Invalid stripe number"; | ||
80 | goto err; | ||
81 | } | ||
82 | |||
83 | if (uc->unstripe > uc->stripes && uc->stripes > 1) { | ||
84 | ti->error = "Please provide stripe between [0, # of stripes]"; | ||
85 | goto err; | ||
86 | } | ||
87 | |||
88 | if (dm_get_device(ti, argv[3], dm_table_get_mode(ti->table), &uc->dev)) { | ||
89 | ti->error = "Couldn't get striped device"; | ||
90 | goto err; | ||
91 | } | ||
92 | |||
93 | if (sscanf(argv[4], "%llu%c", &start, &dummy) != 1) { | ||
94 | ti->error = "Invalid striped device offset"; | ||
95 | goto err; | ||
96 | } | ||
97 | uc->physical_start = start; | ||
98 | |||
99 | uc->unstripe_offset = uc->unstripe * uc->chunk_size; | ||
100 | uc->unstripe_width = (uc->stripes - 1) * uc->chunk_size; | ||
101 | uc->chunk_shift = fls(uc->chunk_size) - 1; | ||
102 | |||
103 | tmp_len = ti->len; | ||
104 | if (sector_div(tmp_len, uc->chunk_size)) { | ||
105 | ti->error = "Target length not divisible by chunk size"; | ||
106 | goto err; | ||
107 | } | ||
108 | |||
109 | if (dm_set_target_max_io_len(ti, uc->chunk_size)) { | ||
110 | ti->error = "Failed to set max io len"; | ||
111 | goto err; | ||
112 | } | ||
113 | |||
114 | ti->private = uc; | ||
115 | return 0; | ||
116 | err: | ||
117 | cleanup_unstripe(uc, ti); | ||
118 | return -EINVAL; | ||
119 | } | ||
120 | |||
121 | static void unstripe_dtr(struct dm_target *ti) | ||
122 | { | ||
123 | struct unstripe_c *uc = ti->private; | ||
124 | |||
125 | cleanup_unstripe(uc, ti); | ||
126 | } | ||
127 | |||
128 | static sector_t map_to_core(struct dm_target *ti, struct bio *bio) | ||
129 | { | ||
130 | struct unstripe_c *uc = ti->private; | ||
131 | sector_t sector = bio->bi_iter.bi_sector; | ||
132 | |||
133 | /* Shift us up to the right "row" on the stripe */ | ||
134 | sector += uc->unstripe_width * (sector >> uc->chunk_shift); | ||
135 | |||
136 | /* Account for what stripe we're operating on */ | ||
137 | sector += uc->unstripe_offset; | ||
138 | |||
139 | return sector; | ||
140 | } | ||
141 | |||
142 | static int unstripe_map(struct dm_target *ti, struct bio *bio) | ||
143 | { | ||
144 | struct unstripe_c *uc = ti->private; | ||
145 | |||
146 | bio_set_dev(bio, uc->dev->bdev); | ||
147 | bio->bi_iter.bi_sector = map_to_core(ti, bio) + uc->physical_start; | ||
148 | |||
149 | return DM_MAPIO_REMAPPED; | ||
150 | } | ||
151 | |||
152 | static void unstripe_status(struct dm_target *ti, status_type_t type, | ||
153 | unsigned int status_flags, char *result, unsigned int maxlen) | ||
154 | { | ||
155 | struct unstripe_c *uc = ti->private; | ||
156 | unsigned int sz = 0; | ||
157 | |||
158 | switch (type) { | ||
159 | case STATUSTYPE_INFO: | ||
160 | break; | ||
161 | |||
162 | case STATUSTYPE_TABLE: | ||
163 | DMEMIT("%d %llu %d %s %llu", | ||
164 | uc->stripes, (unsigned long long)uc->chunk_size, uc->unstripe, | ||
165 | uc->dev->name, (unsigned long long)uc->physical_start); | ||
166 | break; | ||
167 | } | ||
168 | } | ||
169 | |||
170 | static int unstripe_iterate_devices(struct dm_target *ti, | ||
171 | iterate_devices_callout_fn fn, void *data) | ||
172 | { | ||
173 | struct unstripe_c *uc = ti->private; | ||
174 | |||
175 | return fn(ti, uc->dev, uc->physical_start, ti->len, data); | ||
176 | } | ||
177 | |||
178 | static void unstripe_io_hints(struct dm_target *ti, | ||
179 | struct queue_limits *limits) | ||
180 | { | ||
181 | struct unstripe_c *uc = ti->private; | ||
182 | |||
183 | limits->chunk_sectors = uc->chunk_size; | ||
184 | } | ||
185 | |||
186 | static struct target_type unstripe_target = { | ||
187 | .name = "unstriped", | ||
188 | .version = {1, 0, 0}, | ||
189 | .module = THIS_MODULE, | ||
190 | .ctr = unstripe_ctr, | ||
191 | .dtr = unstripe_dtr, | ||
192 | .map = unstripe_map, | ||
193 | .status = unstripe_status, | ||
194 | .iterate_devices = unstripe_iterate_devices, | ||
195 | .io_hints = unstripe_io_hints, | ||
196 | }; | ||
197 | |||
198 | static int __init dm_unstripe_init(void) | ||
199 | { | ||
200 | int r; | ||
201 | |||
202 | r = dm_register_target(&unstripe_target); | ||
203 | if (r < 0) | ||
204 | DMERR("target registration failed"); | ||
205 | |||
206 | return r; | ||
207 | } | ||
208 | |||
209 | static void __exit dm_unstripe_exit(void) | ||
210 | { | ||
211 | dm_unregister_target(&unstripe_target); | ||
212 | } | ||
213 | |||
214 | module_init(dm_unstripe_init); | ||
215 | module_exit(dm_unstripe_exit); | ||
216 | |||
217 | MODULE_DESCRIPTION(DM_NAME " unstriped target"); | ||
218 | MODULE_AUTHOR("Scott Bauer <scott.bauer@intel.com>"); | ||
219 | MODULE_LICENSE("GPL"); | ||
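The heart of the new target is the sector arithmetic in map_to_core(). A worked example under assumed parameters (4 stripes, 256-sector chunks, extracting member #1) makes the two additions easier to follow; the helper below only restates the target's math with those constants folded in:

#include <linux/types.h>

/*
 * Assumed parameters: stripes = 4, chunk_size = 256 sectors, unstripe = 1,
 * so unstripe_width = (4 - 1) * 256 = 768, unstripe_offset = 1 * 256 = 256
 * and chunk_shift = 8.
 */
static sector_t example_unstripe_map(sector_t sector)
{
        const sector_t unstripe_width = 768;
        const sector_t unstripe_offset = 256;
        const unsigned int chunk_shift = 8;

        /* Which chunk of the member device this sector falls in. */
        sector_t row = sector >> chunk_shift;

        /*
         * Skip the (stripes - 1) foreign chunks in every earlier parent
         * row, then skip to member #1's slot within the current row.
         */
        return sector + unstripe_width * row + unstripe_offset;
}

/*
 * example_unstripe_map(300):
 *      row = 300 >> 8 = 1, offset 44 into the member's chunk 1
 *      300 + 768 * 1 + 256 = 1324
 * On the striped parent (4 members, 256-sector chunks), member #1's chunk 1
 * is parent chunk 1 * 4 + 1 = 5, i.e. sectors 1280..1535, and 1280 + 44 = 1324,
 * matching what unstripe_map() computes before it adds physical_start.
 */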
diff --git a/drivers/md/dm-zoned-metadata.c b/drivers/md/dm-zoned-metadata.c index 70485de37b66..969954915566 100644 --- a/drivers/md/dm-zoned-metadata.c +++ b/drivers/md/dm-zoned-metadata.c | |||
@@ -2333,6 +2333,9 @@ static void dmz_cleanup_metadata(struct dmz_metadata *zmd) | |||
2333 | 2333 | ||
2334 | /* Free the zone descriptors */ | 2334 | /* Free the zone descriptors */ |
2335 | dmz_drop_zones(zmd); | 2335 | dmz_drop_zones(zmd); |
2336 | |||
2337 | mutex_destroy(&zmd->mblk_flush_lock); | ||
2338 | mutex_destroy(&zmd->map_lock); | ||
2336 | } | 2339 | } |
2337 | 2340 | ||
2338 | /* | 2341 | /* |
diff --git a/drivers/md/dm-zoned-target.c b/drivers/md/dm-zoned-target.c index 6d7bda6f8190..caff02caf083 100644 --- a/drivers/md/dm-zoned-target.c +++ b/drivers/md/dm-zoned-target.c | |||
@@ -827,6 +827,7 @@ err_fwq: | |||
827 | err_cwq: | 827 | err_cwq: |
828 | destroy_workqueue(dmz->chunk_wq); | 828 | destroy_workqueue(dmz->chunk_wq); |
829 | err_bio: | 829 | err_bio: |
830 | mutex_destroy(&dmz->chunk_lock); | ||
830 | bioset_free(dmz->bio_set); | 831 | bioset_free(dmz->bio_set); |
831 | err_meta: | 832 | err_meta: |
832 | dmz_dtr_metadata(dmz->metadata); | 833 | dmz_dtr_metadata(dmz->metadata); |
@@ -861,6 +862,8 @@ static void dmz_dtr(struct dm_target *ti) | |||
861 | 862 | ||
862 | dmz_put_zoned_device(ti); | 863 | dmz_put_zoned_device(ti); |
863 | 864 | ||
865 | mutex_destroy(&dmz->chunk_lock); | ||
866 | |||
864 | kfree(dmz); | 867 | kfree(dmz); |
865 | } | 868 | } |
866 | 869 | ||
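The dm-zoned hunks add the new mutex_destroy() in two places: on the constructor's error-unwind ladder (the err_bio label) and in the destructor. That is the usual shape for kernel constructors: each error label undoes exactly the steps that succeeded before the failure, and the destructor repeats the full list in the same reverse order. A generic sketch with hypothetical resources (the structure, the names and the PAGE_SIZE buffer are made up):

#include <linux/mutex.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct example {
        struct mutex lock;
        void *buf;
        struct workqueue_struct *wq;
};

static int example_ctr(struct example *ex)
{
        mutex_init(&ex->lock);

        ex->buf = kzalloc(PAGE_SIZE, GFP_KERNEL);
        if (!ex->buf)
                goto err_mutex;

        ex->wq = alloc_workqueue("example", 0, 0);
        if (!ex->wq)
                goto err_buf;

        return 0;

err_buf:
        kfree(ex->buf);
err_mutex:
        mutex_destroy(&ex->lock);
        return -ENOMEM;
}

static void example_dtr(struct example *ex)
{
        destroy_workqueue(ex->wq);
        kfree(ex->buf);
        mutex_destroy(&ex->lock);
}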
diff --git a/drivers/md/dm.c b/drivers/md/dm.c index 8c26bfc35335..d6de00f367ef 100644 --- a/drivers/md/dm.c +++ b/drivers/md/dm.c | |||
@@ -60,18 +60,73 @@ void dm_issue_global_event(void) | |||
60 | } | 60 | } |
61 | 61 | ||
62 | /* | 62 | /* |
63 | * One of these is allocated per bio. | 63 | * One of these is allocated (on-stack) per original bio. |
64 | */ | 64 | */ |
65 | struct clone_info { | ||
66 | struct dm_table *map; | ||
67 | struct bio *bio; | ||
68 | struct dm_io *io; | ||
69 | sector_t sector; | ||
70 | unsigned sector_count; | ||
71 | }; | ||
72 | |||
73 | /* | ||
74 | * One of these is allocated per clone bio. | ||
75 | */ | ||
76 | #define DM_TIO_MAGIC 7282014 | ||
77 | struct dm_target_io { | ||
78 | unsigned magic; | ||
79 | struct dm_io *io; | ||
80 | struct dm_target *ti; | ||
81 | unsigned target_bio_nr; | ||
82 | unsigned *len_ptr; | ||
83 | bool inside_dm_io; | ||
84 | struct bio clone; | ||
85 | }; | ||
86 | |||
87 | /* | ||
88 | * One of these is allocated per original bio. | ||
89 | * It contains the first clone used for that original. | ||
90 | */ | ||
91 | #define DM_IO_MAGIC 5191977 | ||
65 | struct dm_io { | 92 | struct dm_io { |
93 | unsigned magic; | ||
66 | struct mapped_device *md; | 94 | struct mapped_device *md; |
67 | blk_status_t status; | 95 | blk_status_t status; |
68 | atomic_t io_count; | 96 | atomic_t io_count; |
69 | struct bio *bio; | 97 | struct bio *orig_bio; |
70 | unsigned long start_time; | 98 | unsigned long start_time; |
71 | spinlock_t endio_lock; | 99 | spinlock_t endio_lock; |
72 | struct dm_stats_aux stats_aux; | 100 | struct dm_stats_aux stats_aux; |
101 | /* last member of dm_target_io is 'struct bio' */ | ||
102 | struct dm_target_io tio; | ||
73 | }; | 103 | }; |
74 | 104 | ||
105 | void *dm_per_bio_data(struct bio *bio, size_t data_size) | ||
106 | { | ||
107 | struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); | ||
108 | if (!tio->inside_dm_io) | ||
109 | return (char *)bio - offsetof(struct dm_target_io, clone) - data_size; | ||
110 | return (char *)bio - offsetof(struct dm_target_io, clone) - offsetof(struct dm_io, tio) - data_size; | ||
111 | } | ||
112 | EXPORT_SYMBOL_GPL(dm_per_bio_data); | ||
113 | |||
114 | struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size) | ||
115 | { | ||
116 | struct dm_io *io = (struct dm_io *)((char *)data + data_size); | ||
117 | if (io->magic == DM_IO_MAGIC) | ||
118 | return (struct bio *)((char *)io + offsetof(struct dm_io, tio) + offsetof(struct dm_target_io, clone)); | ||
119 | BUG_ON(io->magic != DM_TIO_MAGIC); | ||
120 | return (struct bio *)((char *)io + offsetof(struct dm_target_io, clone)); | ||
121 | } | ||
122 | EXPORT_SYMBOL_GPL(dm_bio_from_per_bio_data); | ||
123 | |||
124 | unsigned dm_bio_get_target_bio_nr(const struct bio *bio) | ||
125 | { | ||
126 | return container_of(bio, struct dm_target_io, clone)->target_bio_nr; | ||
127 | } | ||
128 | EXPORT_SYMBOL_GPL(dm_bio_get_target_bio_nr); | ||
129 | |||
75 | #define MINOR_ALLOCED ((void *)-1) | 130 | #define MINOR_ALLOCED ((void *)-1) |
76 | 131 | ||
77 | /* | 132 | /* |
@@ -93,8 +148,8 @@ static int dm_numa_node = DM_NUMA_NODE; | |||
93 | * For mempools pre-allocation at the table loading time. | 148 | * For mempools pre-allocation at the table loading time. |
94 | */ | 149 | */ |
95 | struct dm_md_mempools { | 150 | struct dm_md_mempools { |
96 | mempool_t *io_pool; | ||
97 | struct bio_set *bs; | 151 | struct bio_set *bs; |
152 | struct bio_set *io_bs; | ||
98 | }; | 153 | }; |
99 | 154 | ||
100 | struct table_device { | 155 | struct table_device { |
@@ -103,7 +158,6 @@ struct table_device { | |||
103 | struct dm_dev dm_dev; | 158 | struct dm_dev dm_dev; |
104 | }; | 159 | }; |
105 | 160 | ||
106 | static struct kmem_cache *_io_cache; | ||
107 | static struct kmem_cache *_rq_tio_cache; | 161 | static struct kmem_cache *_rq_tio_cache; |
108 | static struct kmem_cache *_rq_cache; | 162 | static struct kmem_cache *_rq_cache; |
109 | 163 | ||
@@ -170,14 +224,9 @@ static int __init local_init(void) | |||
170 | { | 224 | { |
171 | int r = -ENOMEM; | 225 | int r = -ENOMEM; |
172 | 226 | ||
173 | /* allocate a slab for the dm_ios */ | ||
174 | _io_cache = KMEM_CACHE(dm_io, 0); | ||
175 | if (!_io_cache) | ||
176 | return r; | ||
177 | |||
178 | _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0); | 227 | _rq_tio_cache = KMEM_CACHE(dm_rq_target_io, 0); |
179 | if (!_rq_tio_cache) | 228 | if (!_rq_tio_cache) |
180 | goto out_free_io_cache; | 229 | return r; |
181 | 230 | ||
182 | _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request), | 231 | _rq_cache = kmem_cache_create("dm_old_clone_request", sizeof(struct request), |
183 | __alignof__(struct request), 0, NULL); | 232 | __alignof__(struct request), 0, NULL); |
@@ -212,8 +261,6 @@ out_free_rq_cache: | |||
212 | kmem_cache_destroy(_rq_cache); | 261 | kmem_cache_destroy(_rq_cache); |
213 | out_free_rq_tio_cache: | 262 | out_free_rq_tio_cache: |
214 | kmem_cache_destroy(_rq_tio_cache); | 263 | kmem_cache_destroy(_rq_tio_cache); |
215 | out_free_io_cache: | ||
216 | kmem_cache_destroy(_io_cache); | ||
217 | 264 | ||
218 | return r; | 265 | return r; |
219 | } | 266 | } |
@@ -225,7 +272,6 @@ static void local_exit(void) | |||
225 | 272 | ||
226 | kmem_cache_destroy(_rq_cache); | 273 | kmem_cache_destroy(_rq_cache); |
227 | kmem_cache_destroy(_rq_tio_cache); | 274 | kmem_cache_destroy(_rq_tio_cache); |
228 | kmem_cache_destroy(_io_cache); | ||
229 | unregister_blkdev(_major, _name); | 275 | unregister_blkdev(_major, _name); |
230 | dm_uevent_exit(); | 276 | dm_uevent_exit(); |
231 | 277 | ||
@@ -486,18 +532,69 @@ out: | |||
486 | return r; | 532 | return r; |
487 | } | 533 | } |
488 | 534 | ||
489 | static struct dm_io *alloc_io(struct mapped_device *md) | 535 | static void start_io_acct(struct dm_io *io); |
536 | |||
537 | static struct dm_io *alloc_io(struct mapped_device *md, struct bio *bio) | ||
490 | { | 538 | { |
491 | return mempool_alloc(md->io_pool, GFP_NOIO); | 539 | struct dm_io *io; |
540 | struct dm_target_io *tio; | ||
541 | struct bio *clone; | ||
542 | |||
543 | clone = bio_alloc_bioset(GFP_NOIO, 0, md->io_bs); | ||
544 | if (!clone) | ||
545 | return NULL; | ||
546 | |||
547 | tio = container_of(clone, struct dm_target_io, clone); | ||
548 | tio->inside_dm_io = true; | ||
549 | tio->io = NULL; | ||
550 | |||
551 | io = container_of(tio, struct dm_io, tio); | ||
552 | io->magic = DM_IO_MAGIC; | ||
553 | io->status = 0; | ||
554 | atomic_set(&io->io_count, 1); | ||
555 | io->orig_bio = bio; | ||
556 | io->md = md; | ||
557 | spin_lock_init(&io->endio_lock); | ||
558 | |||
559 | start_io_acct(io); | ||
560 | |||
561 | return io; | ||
492 | } | 562 | } |
493 | 563 | ||
494 | static void free_io(struct mapped_device *md, struct dm_io *io) | 564 | static void free_io(struct mapped_device *md, struct dm_io *io) |
495 | { | 565 | { |
496 | mempool_free(io, md->io_pool); | 566 | bio_put(&io->tio.clone); |
567 | } | ||
568 | |||
569 | static struct dm_target_io *alloc_tio(struct clone_info *ci, struct dm_target *ti, | ||
570 | unsigned target_bio_nr, gfp_t gfp_mask) | ||
571 | { | ||
572 | struct dm_target_io *tio; | ||
573 | |||
574 | if (!ci->io->tio.io) { | ||
575 | /* the dm_target_io embedded in ci->io is available */ | ||
576 | tio = &ci->io->tio; | ||
577 | } else { | ||
578 | struct bio *clone = bio_alloc_bioset(gfp_mask, 0, ci->io->md->bs); | ||
579 | if (!clone) | ||
580 | return NULL; | ||
581 | |||
582 | tio = container_of(clone, struct dm_target_io, clone); | ||
583 | tio->inside_dm_io = false; | ||
584 | } | ||
585 | |||
586 | tio->magic = DM_TIO_MAGIC; | ||
587 | tio->io = ci->io; | ||
588 | tio->ti = ti; | ||
589 | tio->target_bio_nr = target_bio_nr; | ||
590 | |||
591 | return tio; | ||
497 | } | 592 | } |
498 | 593 | ||
499 | static void free_tio(struct dm_target_io *tio) | 594 | static void free_tio(struct dm_target_io *tio) |
500 | { | 595 | { |
596 | if (tio->inside_dm_io) | ||
597 | return; | ||
501 | bio_put(&tio->clone); | 598 | bio_put(&tio->clone); |
502 | } | 599 | } |
503 | 600 | ||
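dm_per_bio_data() and dm_bio_from_per_bio_data() earlier in this file, together with alloc_io()/alloc_tio() above, rely on a fixed front-padded layout rather than a stored pointer: the per-target data sits immediately in front of either a bare dm_target_io (extra clones from md->bs) or a dm_io whose tail embeds the first dm_target_io (clones from md->io_bs), and the clone bio is always the last member of dm_target_io. The sketch below spells out that layout and a typical target-side use; the example target, its fields and the ctr behaviour noted in the comment are assumptions for illustration, not code from this series:

#include <linux/bio.h>
#include <linux/device-mapper.h>

/*
 * Layout the offsetof() arithmetic relies on (illustrative, not to scale):
 *
 *   extra clone  (inside_dm_io == false, allocated from md->bs):
 *       [ per-bio data | struct dm_target_io ... struct bio clone ]
 *   first clone  (inside_dm_io == true, allocated from md->io_bs):
 *       [ per-bio data | struct dm_io ... tio { ... struct bio clone } ]
 *
 * The DM_IO_MAGIC/DM_TIO_MAGIC values let dm_bio_from_per_bio_data() tell
 * the two cases apart when walking from the data back to the clone bio.
 */

struct example_target {                 /* hypothetical target context */
        struct dm_dev *dev;
};

struct example_per_bio_data {           /* the front pad described above */
        struct example_target *et;
};

/*
 * Assumes the target's ctr did:
 *      ti->per_io_data_size = sizeof(struct example_per_bio_data);
 *      ti->private = et;               (et->dev taken via dm_get_device())
 */
static int example_map(struct dm_target *ti, struct bio *bio)
{
        struct example_target *et = ti->private;
        struct example_per_bio_data *h =
                dm_per_bio_data(bio, sizeof(struct example_per_bio_data));

        h->et = et;                     /* recovered again at end_io time */
        bio_set_dev(bio, et->dev->bdev);
        return DM_MAPIO_REMAPPED;
}

static int example_end_io(struct dm_target *ti, struct bio *bio,
                          blk_status_t *error)
{
        struct example_per_bio_data *h =
                dm_per_bio_data(bio, sizeof(struct example_per_bio_data));
        struct bio *orig = dm_bio_from_per_bio_data(h, sizeof(*h));

        pr_debug("completed a clone of a %u-sector original bio\n",
                 bio_sectors(orig));
        return DM_ENDIO_DONE;
}

Because the first clone's dm_target_io lives inside dm_io, the common single-clone case costs one bio_alloc_bioset() from io_bs per original bio, with the per-bio data, dm_io and embedded dm_target_io all carved from that bio's front pad.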
@@ -510,17 +607,15 @@ int md_in_flight(struct mapped_device *md) | |||
510 | static void start_io_acct(struct dm_io *io) | 607 | static void start_io_acct(struct dm_io *io) |
511 | { | 608 | { |
512 | struct mapped_device *md = io->md; | 609 | struct mapped_device *md = io->md; |
513 | struct bio *bio = io->bio; | 610 | struct bio *bio = io->orig_bio; |
514 | int cpu; | ||
515 | int rw = bio_data_dir(bio); | 611 | int rw = bio_data_dir(bio); |
516 | 612 | ||
517 | io->start_time = jiffies; | 613 | io->start_time = jiffies; |
518 | 614 | ||
519 | cpu = part_stat_lock(); | 615 | generic_start_io_acct(md->queue, rw, bio_sectors(bio), &dm_disk(md)->part0); |
520 | part_round_stats(md->queue, cpu, &dm_disk(md)->part0); | 616 | |
521 | part_stat_unlock(); | ||
522 | atomic_set(&dm_disk(md)->part0.in_flight[rw], | 617 | atomic_set(&dm_disk(md)->part0.in_flight[rw], |
523 | atomic_inc_return(&md->pending[rw])); | 618 | atomic_inc_return(&md->pending[rw])); |
524 | 619 | ||
525 | if (unlikely(dm_stats_used(&md->stats))) | 620 | if (unlikely(dm_stats_used(&md->stats))) |
526 | dm_stats_account_io(&md->stats, bio_data_dir(bio), | 621 | dm_stats_account_io(&md->stats, bio_data_dir(bio), |
@@ -531,7 +626,7 @@ static void start_io_acct(struct dm_io *io) | |||
531 | static void end_io_acct(struct dm_io *io) | 626 | static void end_io_acct(struct dm_io *io) |
532 | { | 627 | { |
533 | struct mapped_device *md = io->md; | 628 | struct mapped_device *md = io->md; |
534 | struct bio *bio = io->bio; | 629 | struct bio *bio = io->orig_bio; |
535 | unsigned long duration = jiffies - io->start_time; | 630 | unsigned long duration = jiffies - io->start_time; |
536 | int pending; | 631 | int pending; |
537 | int rw = bio_data_dir(bio); | 632 | int rw = bio_data_dir(bio); |
@@ -752,15 +847,6 @@ int dm_set_geometry(struct mapped_device *md, struct hd_geometry *geo) | |||
752 | return 0; | 847 | return 0; |
753 | } | 848 | } |
754 | 849 | ||
755 | /*----------------------------------------------------------------- | ||
756 | * CRUD START: | ||
757 | * A more elegant soln is in the works that uses the queue | ||
758 | * merge fn, unfortunately there are a couple of changes to | ||
759 | * the block layer that I want to make for this. So in the | ||
760 | * interests of getting something for people to use I give | ||
761 | * you this clearly demarcated crap. | ||
762 | *---------------------------------------------------------------*/ | ||
763 | |||
764 | static int __noflush_suspending(struct mapped_device *md) | 850 | static int __noflush_suspending(struct mapped_device *md) |
765 | { | 851 | { |
766 | return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); | 852 | return test_bit(DMF_NOFLUSH_SUSPENDING, &md->flags); |
@@ -780,8 +866,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error) | |||
780 | /* Push-back supersedes any I/O errors */ | 866 | /* Push-back supersedes any I/O errors */ |
781 | if (unlikely(error)) { | 867 | if (unlikely(error)) { |
782 | spin_lock_irqsave(&io->endio_lock, flags); | 868 | spin_lock_irqsave(&io->endio_lock, flags); |
783 | if (!(io->status == BLK_STS_DM_REQUEUE && | 869 | if (!(io->status == BLK_STS_DM_REQUEUE && __noflush_suspending(md))) |
784 | __noflush_suspending(md))) | ||
785 | io->status = error; | 870 | io->status = error; |
786 | spin_unlock_irqrestore(&io->endio_lock, flags); | 871 | spin_unlock_irqrestore(&io->endio_lock, flags); |
787 | } | 872 | } |
@@ -793,7 +878,8 @@ static void dec_pending(struct dm_io *io, blk_status_t error) | |||
793 | */ | 878 | */ |
794 | spin_lock_irqsave(&md->deferred_lock, flags); | 879 | spin_lock_irqsave(&md->deferred_lock, flags); |
795 | if (__noflush_suspending(md)) | 880 | if (__noflush_suspending(md)) |
796 | bio_list_add_head(&md->deferred, io->bio); | 881 | /* NOTE early return due to BLK_STS_DM_REQUEUE below */ |
882 | bio_list_add_head(&md->deferred, io->orig_bio); | ||
797 | else | 883 | else |
798 | /* noflush suspend was interrupted. */ | 884 | /* noflush suspend was interrupted. */ |
799 | io->status = BLK_STS_IOERR; | 885 | io->status = BLK_STS_IOERR; |
@@ -801,7 +887,7 @@ static void dec_pending(struct dm_io *io, blk_status_t error) | |||
801 | } | 887 | } |
802 | 888 | ||
803 | io_error = io->status; | 889 | io_error = io->status; |
804 | bio = io->bio; | 890 | bio = io->orig_bio; |
805 | end_io_acct(io); | 891 | end_io_acct(io); |
806 | free_io(md, io); | 892 | free_io(md, io); |
807 | 893 | ||
@@ -847,7 +933,7 @@ static void clone_endio(struct bio *bio) | |||
847 | struct mapped_device *md = tio->io->md; | 933 | struct mapped_device *md = tio->io->md; |
848 | dm_endio_fn endio = tio->ti->type->end_io; | 934 | dm_endio_fn endio = tio->ti->type->end_io; |
849 | 935 | ||
850 | if (unlikely(error == BLK_STS_TARGET)) { | 936 | if (unlikely(error == BLK_STS_TARGET) && md->type != DM_TYPE_NVME_BIO_BASED) { |
851 | if (bio_op(bio) == REQ_OP_WRITE_SAME && | 937 | if (bio_op(bio) == REQ_OP_WRITE_SAME && |
852 | !bio->bi_disk->queue->limits.max_write_same_sectors) | 938 | !bio->bi_disk->queue->limits.max_write_same_sectors) |
853 | disable_write_same(md); | 939 | disable_write_same(md); |
@@ -1005,7 +1091,7 @@ static size_t dm_dax_copy_from_iter(struct dax_device *dax_dev, pgoff_t pgoff, | |||
1005 | 1091 | ||
1006 | /* | 1092 | /* |
1007 | * A target may call dm_accept_partial_bio only from the map routine. It is | 1093 | * A target may call dm_accept_partial_bio only from the map routine. It is |
1008 | * allowed for all bio types except REQ_PREFLUSH. | 1094 | * allowed for all bio types except REQ_PREFLUSH and REQ_OP_ZONE_RESET. |
1009 | * | 1095 | * |
1010 | * dm_accept_partial_bio informs the dm that the target only wants to process | 1096 | * dm_accept_partial_bio informs the dm that the target only wants to process |
1011 | * additional n_sectors sectors of the bio and the rest of the data should be | 1097 | * additional n_sectors sectors of the bio and the rest of the data should be |
@@ -1055,7 +1141,7 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start) | |||
1055 | { | 1141 | { |
1056 | #ifdef CONFIG_BLK_DEV_ZONED | 1142 | #ifdef CONFIG_BLK_DEV_ZONED |
1057 | struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); | 1143 | struct dm_target_io *tio = container_of(bio, struct dm_target_io, clone); |
1058 | struct bio *report_bio = tio->io->bio; | 1144 | struct bio *report_bio = tio->io->orig_bio; |
1059 | struct blk_zone_report_hdr *hdr = NULL; | 1145 | struct blk_zone_report_hdr *hdr = NULL; |
1060 | struct blk_zone *zone; | 1146 | struct blk_zone *zone; |
1061 | unsigned int nr_rep = 0; | 1147 | unsigned int nr_rep = 0; |
@@ -1122,67 +1208,15 @@ void dm_remap_zone_report(struct dm_target *ti, struct bio *bio, sector_t start) | |||
1122 | } | 1208 | } |
1123 | EXPORT_SYMBOL_GPL(dm_remap_zone_report); | 1209 | EXPORT_SYMBOL_GPL(dm_remap_zone_report); |
1124 | 1210 | ||
1125 | /* | 1211 | static blk_qc_t __map_bio(struct dm_target_io *tio) |
1126 | * Flush current->bio_list when the target map method blocks. | ||
1127 | * This fixes deadlocks in snapshot and possibly in other targets. | ||
1128 | */ | ||
1129 | struct dm_offload { | ||
1130 | struct blk_plug plug; | ||
1131 | struct blk_plug_cb cb; | ||
1132 | }; | ||
1133 | |||
1134 | static void flush_current_bio_list(struct blk_plug_cb *cb, bool from_schedule) | ||
1135 | { | ||
1136 | struct dm_offload *o = container_of(cb, struct dm_offload, cb); | ||
1137 | struct bio_list list; | ||
1138 | struct bio *bio; | ||
1139 | int i; | ||
1140 | |||
1141 | INIT_LIST_HEAD(&o->cb.list); | ||
1142 | |||
1143 | if (unlikely(!current->bio_list)) | ||
1144 | return; | ||
1145 | |||
1146 | for (i = 0; i < 2; i++) { | ||
1147 | list = current->bio_list[i]; | ||
1148 | bio_list_init(¤t->bio_list[i]); | ||
1149 | |||
1150 | while ((bio = bio_list_pop(&list))) { | ||
1151 | struct bio_set *bs = bio->bi_pool; | ||
1152 | if (unlikely(!bs) || bs == fs_bio_set || | ||
1153 | !bs->rescue_workqueue) { | ||
1154 | bio_list_add(¤t->bio_list[i], bio); | ||
1155 | continue; | ||
1156 | } | ||
1157 | |||
1158 | spin_lock(&bs->rescue_lock); | ||
1159 | bio_list_add(&bs->rescue_list, bio); | ||
1160 | queue_work(bs->rescue_workqueue, &bs->rescue_work); | ||
1161 | spin_unlock(&bs->rescue_lock); | ||
1162 | } | ||
1163 | } | ||
1164 | } | ||
1165 | |||
1166 | static void dm_offload_start(struct dm_offload *o) | ||
1167 | { | ||
1168 | blk_start_plug(&o->plug); | ||
1169 | o->cb.callback = flush_current_bio_list; | ||
1170 | list_add(&o->cb.list, ¤t->plug->cb_list); | ||
1171 | } | ||
1172 | |||
1173 | static void dm_offload_end(struct dm_offload *o) | ||
1174 | { | ||
1175 | list_del(&o->cb.list); | ||
1176 | blk_finish_plug(&o->plug); | ||
1177 | } | ||
1178 | |||
1179 | static void __map_bio(struct dm_target_io *tio) | ||
1180 | { | 1212 | { |
1181 | int r; | 1213 | int r; |
1182 | sector_t sector; | 1214 | sector_t sector; |
1183 | struct dm_offload o; | ||
1184 | struct bio *clone = &tio->clone; | 1215 | struct bio *clone = &tio->clone; |
1216 | struct dm_io *io = tio->io; | ||
1217 | struct mapped_device *md = io->md; | ||
1185 | struct dm_target *ti = tio->ti; | 1218 | struct dm_target *ti = tio->ti; |
1219 | blk_qc_t ret = BLK_QC_T_NONE; | ||
1186 | 1220 | ||
1187 | clone->bi_end_io = clone_endio; | 1221 | clone->bi_end_io = clone_endio; |
1188 | 1222 | ||
@@ -1191,44 +1225,37 @@ static void __map_bio(struct dm_target_io *tio) | |||
1191 | * anything, the target has assumed ownership of | 1225 | * anything, the target has assumed ownership of |
1192 | * this io. | 1226 | * this io. |
1193 | */ | 1227 | */ |
1194 | atomic_inc(&tio->io->io_count); | 1228 | atomic_inc(&io->io_count); |
1195 | sector = clone->bi_iter.bi_sector; | 1229 | sector = clone->bi_iter.bi_sector; |
1196 | 1230 | ||
1197 | dm_offload_start(&o); | ||
1198 | r = ti->type->map(ti, clone); | 1231 | r = ti->type->map(ti, clone); |
1199 | dm_offload_end(&o); | ||
1200 | |||
1201 | switch (r) { | 1232 | switch (r) { |
1202 | case DM_MAPIO_SUBMITTED: | 1233 | case DM_MAPIO_SUBMITTED: |
1203 | break; | 1234 | break; |
1204 | case DM_MAPIO_REMAPPED: | 1235 | case DM_MAPIO_REMAPPED: |
1205 | /* the bio has been remapped so dispatch it */ | 1236 | /* the bio has been remapped so dispatch it */ |
1206 | trace_block_bio_remap(clone->bi_disk->queue, clone, | 1237 | trace_block_bio_remap(clone->bi_disk->queue, clone, |
1207 | bio_dev(tio->io->bio), sector); | 1238 | bio_dev(io->orig_bio), sector); |
1208 | generic_make_request(clone); | 1239 | if (md->type == DM_TYPE_NVME_BIO_BASED) |
1240 | ret = direct_make_request(clone); | ||
1241 | else | ||
1242 | ret = generic_make_request(clone); | ||
1209 | break; | 1243 | break; |
1210 | case DM_MAPIO_KILL: | 1244 | case DM_MAPIO_KILL: |
1211 | dec_pending(tio->io, BLK_STS_IOERR); | ||
1212 | free_tio(tio); | 1245 | free_tio(tio); |
1246 | dec_pending(io, BLK_STS_IOERR); | ||
1213 | break; | 1247 | break; |
1214 | case DM_MAPIO_REQUEUE: | 1248 | case DM_MAPIO_REQUEUE: |
1215 | dec_pending(tio->io, BLK_STS_DM_REQUEUE); | ||
1216 | free_tio(tio); | 1249 | free_tio(tio); |
1250 | dec_pending(io, BLK_STS_DM_REQUEUE); | ||
1217 | break; | 1251 | break; |
1218 | default: | 1252 | default: |
1219 | DMWARN("unimplemented target map return value: %d", r); | 1253 | DMWARN("unimplemented target map return value: %d", r); |
1220 | BUG(); | 1254 | BUG(); |
1221 | } | 1255 | } |
1222 | } | ||
1223 | 1256 | ||
1224 | struct clone_info { | 1257 | return ret; |
1225 | struct mapped_device *md; | 1258 | } |
1226 | struct dm_table *map; | ||
1227 | struct bio *bio; | ||
1228 | struct dm_io *io; | ||
1229 | sector_t sector; | ||
1230 | unsigned sector_count; | ||
1231 | }; | ||
1232 | 1259 | ||
1233 | static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len) | 1260 | static void bio_setup_sector(struct bio *bio, sector_t sector, unsigned len) |
1234 | { | 1261 | { |
@@ -1272,28 +1299,49 @@ static int clone_bio(struct dm_target_io *tio, struct bio *bio, | |||
1272 | return 0; | 1299 | return 0; |
1273 | } | 1300 | } |
1274 | 1301 | ||
1275 | static struct dm_target_io *alloc_tio(struct clone_info *ci, | 1302 | static void alloc_multiple_bios(struct bio_list *blist, struct clone_info *ci, |
1276 | struct dm_target *ti, | 1303 | struct dm_target *ti, unsigned num_bios) |
1277 | unsigned target_bio_nr) | ||
1278 | { | 1304 | { |
1279 | struct dm_target_io *tio; | 1305 | struct dm_target_io *tio; |
1280 | struct bio *clone; | 1306 | int try; |
1281 | 1307 | ||
1282 | clone = bio_alloc_bioset(GFP_NOIO, 0, ci->md->bs); | 1308 | if (!num_bios) |
1283 | tio = container_of(clone, struct dm_target_io, clone); | 1309 | return; |
1284 | 1310 | ||
1285 | tio->io = ci->io; | 1311 | if (num_bios == 1) { |
1286 | tio->ti = ti; | 1312 | tio = alloc_tio(ci, ti, 0, GFP_NOIO); |
1287 | tio->target_bio_nr = target_bio_nr; | 1313 | bio_list_add(blist, &tio->clone); |
1314 | return; | ||
1315 | } | ||
1288 | 1316 | ||
1289 | return tio; | 1317 | for (try = 0; try < 2; try++) { |
1318 | int bio_nr; | ||
1319 | struct bio *bio; | ||
1320 | |||
1321 | if (try) | ||
1322 | mutex_lock(&ci->io->md->table_devices_lock); | ||
1323 | for (bio_nr = 0; bio_nr < num_bios; bio_nr++) { | ||
1324 | tio = alloc_tio(ci, ti, bio_nr, try ? GFP_NOIO : GFP_NOWAIT); | ||
1325 | if (!tio) | ||
1326 | break; | ||
1327 | |||
1328 | bio_list_add(blist, &tio->clone); | ||
1329 | } | ||
1330 | if (try) | ||
1331 | mutex_unlock(&ci->io->md->table_devices_lock); | ||
1332 | if (bio_nr == num_bios) | ||
1333 | return; | ||
1334 | |||
1335 | while ((bio = bio_list_pop(blist))) { | ||
1336 | tio = container_of(bio, struct dm_target_io, clone); | ||
1337 | free_tio(tio); | ||
1338 | } | ||
1339 | } | ||
1290 | } | 1340 | } |
1291 | 1341 | ||
1292 | static void __clone_and_map_simple_bio(struct clone_info *ci, | 1342 | static blk_qc_t __clone_and_map_simple_bio(struct clone_info *ci, |
1293 | struct dm_target *ti, | 1343 | struct dm_target_io *tio, unsigned *len) |
1294 | unsigned target_bio_nr, unsigned *len) | ||
1295 | { | 1344 | { |
1296 | struct dm_target_io *tio = alloc_tio(ci, ti, target_bio_nr); | ||
1297 | struct bio *clone = &tio->clone; | 1345 | struct bio *clone = &tio->clone; |
1298 | 1346 | ||
1299 | tio->len_ptr = len; | 1347 | tio->len_ptr = len; |
@@ -1302,16 +1350,22 @@ static void __clone_and_map_simple_bio(struct clone_info *ci, | |||
1302 | if (len) | 1350 | if (len) |
1303 | bio_setup_sector(clone, ci->sector, *len); | 1351 | bio_setup_sector(clone, ci->sector, *len); |
1304 | 1352 | ||
1305 | __map_bio(tio); | 1353 | return __map_bio(tio); |
1306 | } | 1354 | } |
1307 | 1355 | ||
1308 | static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, | 1356 | static void __send_duplicate_bios(struct clone_info *ci, struct dm_target *ti, |
1309 | unsigned num_bios, unsigned *len) | 1357 | unsigned num_bios, unsigned *len) |
1310 | { | 1358 | { |
1311 | unsigned target_bio_nr; | 1359 | struct bio_list blist = BIO_EMPTY_LIST; |
1360 | struct bio *bio; | ||
1361 | struct dm_target_io *tio; | ||
1362 | |||
1363 | alloc_multiple_bios(&blist, ci, ti, num_bios); | ||
1312 | 1364 | ||
1313 | for (target_bio_nr = 0; target_bio_nr < num_bios; target_bio_nr++) | 1365 | while ((bio = bio_list_pop(&blist))) { |
1314 | __clone_and_map_simple_bio(ci, ti, target_bio_nr, len); | 1366 | tio = container_of(bio, struct dm_target_io, clone); |
1367 | (void) __clone_and_map_simple_bio(ci, tio, len); | ||
1368 | } | ||
1315 | } | 1369 | } |
1316 | 1370 | ||
1317 | static int __send_empty_flush(struct clone_info *ci) | 1371 | static int __send_empty_flush(struct clone_info *ci) |
@@ -1327,32 +1381,22 @@ static int __send_empty_flush(struct clone_info *ci) | |||
1327 | } | 1381 | } |
1328 | 1382 | ||
1329 | static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, | 1383 | static int __clone_and_map_data_bio(struct clone_info *ci, struct dm_target *ti, |
1330 | sector_t sector, unsigned *len) | 1384 | sector_t sector, unsigned *len) |
1331 | { | 1385 | { |
1332 | struct bio *bio = ci->bio; | 1386 | struct bio *bio = ci->bio; |
1333 | struct dm_target_io *tio; | 1387 | struct dm_target_io *tio; |
1334 | unsigned target_bio_nr; | 1388 | int r; |
1335 | unsigned num_target_bios = 1; | ||
1336 | int r = 0; | ||
1337 | 1389 | ||
1338 | /* | 1390 | tio = alloc_tio(ci, ti, 0, GFP_NOIO); |
1339 | * Does the target want to receive duplicate copies of the bio? | 1391 | tio->len_ptr = len; |
1340 | */ | 1392 | r = clone_bio(tio, bio, sector, *len); |
1341 | if (bio_data_dir(bio) == WRITE && ti->num_write_bios) | 1393 | if (r < 0) { |
1342 | num_target_bios = ti->num_write_bios(ti, bio); | 1394 | free_tio(tio); |
1343 | 1395 | return r; | |
1344 | for (target_bio_nr = 0; target_bio_nr < num_target_bios; target_bio_nr++) { | ||
1345 | tio = alloc_tio(ci, ti, target_bio_nr); | ||
1346 | tio->len_ptr = len; | ||
1347 | r = clone_bio(tio, bio, sector, *len); | ||
1348 | if (r < 0) { | ||
1349 | free_tio(tio); | ||
1350 | break; | ||
1351 | } | ||
1352 | __map_bio(tio); | ||
1353 | } | 1396 | } |
1397 | (void) __map_bio(tio); | ||
1354 | 1398 | ||
1355 | return r; | 1399 | return 0; |
1356 | } | 1400 | } |
1357 | 1401 | ||
1358 | typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); | 1402 | typedef unsigned (*get_num_bios_fn)(struct dm_target *ti); |
@@ -1379,56 +1423,50 @@ static bool is_split_required_for_discard(struct dm_target *ti) | |||
1379 | return ti->split_discard_bios; | 1423 | return ti->split_discard_bios; |
1380 | } | 1424 | } |
1381 | 1425 | ||
1382 | static int __send_changing_extent_only(struct clone_info *ci, | 1426 | static int __send_changing_extent_only(struct clone_info *ci, struct dm_target *ti, |
1383 | get_num_bios_fn get_num_bios, | 1427 | get_num_bios_fn get_num_bios, |
1384 | is_split_required_fn is_split_required) | 1428 | is_split_required_fn is_split_required) |
1385 | { | 1429 | { |
1386 | struct dm_target *ti; | ||
1387 | unsigned len; | 1430 | unsigned len; |
1388 | unsigned num_bios; | 1431 | unsigned num_bios; |
1389 | 1432 | ||
1390 | do { | 1433 | /* |
1391 | ti = dm_table_find_target(ci->map, ci->sector); | 1434 | * Even though the device advertised support for this type of |
1392 | if (!dm_target_is_valid(ti)) | 1435 | * request, that does not mean every target supports it, and |
1393 | return -EIO; | 1436 | * reconfiguration might also have changed that since the |
1394 | 1437 | * check was performed. | |
1395 | /* | 1438 | */ |
1396 | * Even though the device advertised support for this type of | 1439 | num_bios = get_num_bios ? get_num_bios(ti) : 0; |
1397 | * request, that does not mean every target supports it, and | 1440 | if (!num_bios) |
1398 | * reconfiguration might also have changed that since the | 1441 | return -EOPNOTSUPP; |
1399 | * check was performed. | ||
1400 | */ | ||
1401 | num_bios = get_num_bios ? get_num_bios(ti) : 0; | ||
1402 | if (!num_bios) | ||
1403 | return -EOPNOTSUPP; | ||
1404 | 1442 | ||
1405 | if (is_split_required && !is_split_required(ti)) | 1443 | if (is_split_required && !is_split_required(ti)) |
1406 | len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti)); | 1444 | len = min((sector_t)ci->sector_count, max_io_len_target_boundary(ci->sector, ti)); |
1407 | else | 1445 | else |
1408 | len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti)); | 1446 | len = min((sector_t)ci->sector_count, max_io_len(ci->sector, ti)); |
1409 | 1447 | ||
1410 | __send_duplicate_bios(ci, ti, num_bios, &len); | 1448 | __send_duplicate_bios(ci, ti, num_bios, &len); |
1411 | 1449 | ||
1412 | ci->sector += len; | 1450 | ci->sector += len; |
1413 | } while (ci->sector_count -= len); | 1451 | ci->sector_count -= len; |
1414 | 1452 | ||
1415 | return 0; | 1453 | return 0; |
1416 | } | 1454 | } |
1417 | 1455 | ||
1418 | static int __send_discard(struct clone_info *ci) | 1456 | static int __send_discard(struct clone_info *ci, struct dm_target *ti) |
1419 | { | 1457 | { |
1420 | return __send_changing_extent_only(ci, get_num_discard_bios, | 1458 | return __send_changing_extent_only(ci, ti, get_num_discard_bios, |
1421 | is_split_required_for_discard); | 1459 | is_split_required_for_discard); |
1422 | } | 1460 | } |
1423 | 1461 | ||
1424 | static int __send_write_same(struct clone_info *ci) | 1462 | static int __send_write_same(struct clone_info *ci, struct dm_target *ti) |
1425 | { | 1463 | { |
1426 | return __send_changing_extent_only(ci, get_num_write_same_bios, NULL); | 1464 | return __send_changing_extent_only(ci, ti, get_num_write_same_bios, NULL); |
1427 | } | 1465 | } |
1428 | 1466 | ||
1429 | static int __send_write_zeroes(struct clone_info *ci) | 1467 | static int __send_write_zeroes(struct clone_info *ci, struct dm_target *ti) |
1430 | { | 1468 | { |
1431 | return __send_changing_extent_only(ci, get_num_write_zeroes_bios, NULL); | 1469 | return __send_changing_extent_only(ci, ti, get_num_write_zeroes_bios, NULL); |
1432 | } | 1470 | } |
1433 | 1471 | ||
1434 | /* | 1472 | /* |
@@ -1441,17 +1479,17 @@ static int __split_and_process_non_flush(struct clone_info *ci) | |||
1441 | unsigned len; | 1479 | unsigned len; |
1442 | int r; | 1480 | int r; |
1443 | 1481 | ||
1444 | if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) | ||
1445 | return __send_discard(ci); | ||
1446 | else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) | ||
1447 | return __send_write_same(ci); | ||
1448 | else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES)) | ||
1449 | return __send_write_zeroes(ci); | ||
1450 | |||
1451 | ti = dm_table_find_target(ci->map, ci->sector); | 1482 | ti = dm_table_find_target(ci->map, ci->sector); |
1452 | if (!dm_target_is_valid(ti)) | 1483 | if (!dm_target_is_valid(ti)) |
1453 | return -EIO; | 1484 | return -EIO; |
1454 | 1485 | ||
1486 | if (unlikely(bio_op(bio) == REQ_OP_DISCARD)) | ||
1487 | return __send_discard(ci, ti); | ||
1488 | else if (unlikely(bio_op(bio) == REQ_OP_WRITE_SAME)) | ||
1489 | return __send_write_same(ci, ti); | ||
1490 | else if (unlikely(bio_op(bio) == REQ_OP_WRITE_ZEROES)) | ||
1491 | return __send_write_zeroes(ci, ti); | ||
1492 | |||
1455 | if (bio_op(bio) == REQ_OP_ZONE_REPORT) | 1493 | if (bio_op(bio) == REQ_OP_ZONE_REPORT) |
1456 | len = ci->sector_count; | 1494 | len = ci->sector_count; |
1457 | else | 1495 | else |
@@ -1468,34 +1506,33 @@ static int __split_and_process_non_flush(struct clone_info *ci) | |||
1468 | return 0; | 1506 | return 0; |
1469 | } | 1507 | } |
1470 | 1508 | ||
1509 | static void init_clone_info(struct clone_info *ci, struct mapped_device *md, | ||
1510 | struct dm_table *map, struct bio *bio) | ||
1511 | { | ||
1512 | ci->map = map; | ||
1513 | ci->io = alloc_io(md, bio); | ||
1514 | ci->sector = bio->bi_iter.bi_sector; | ||
1515 | } | ||
1516 | |||
1471 | /* | 1517 | /* |
1472 | * Entry point to split a bio into clones and submit them to the targets. | 1518 | * Entry point to split a bio into clones and submit them to the targets. |
1473 | */ | 1519 | */ |
1474 | static void __split_and_process_bio(struct mapped_device *md, | 1520 | static blk_qc_t __split_and_process_bio(struct mapped_device *md, |
1475 | struct dm_table *map, struct bio *bio) | 1521 | struct dm_table *map, struct bio *bio) |
1476 | { | 1522 | { |
1477 | struct clone_info ci; | 1523 | struct clone_info ci; |
1524 | blk_qc_t ret = BLK_QC_T_NONE; | ||
1478 | int error = 0; | 1525 | int error = 0; |
1479 | 1526 | ||
1480 | if (unlikely(!map)) { | 1527 | if (unlikely(!map)) { |
1481 | bio_io_error(bio); | 1528 | bio_io_error(bio); |
1482 | return; | 1529 | return ret; |
1483 | } | 1530 | } |
1484 | 1531 | ||
1485 | ci.map = map; | 1532 | init_clone_info(&ci, md, map, bio); |
1486 | ci.md = md; | ||
1487 | ci.io = alloc_io(md); | ||
1488 | ci.io->status = 0; | ||
1489 | atomic_set(&ci.io->io_count, 1); | ||
1490 | ci.io->bio = bio; | ||
1491 | ci.io->md = md; | ||
1492 | spin_lock_init(&ci.io->endio_lock); | ||
1493 | ci.sector = bio->bi_iter.bi_sector; | ||
1494 | |||
1495 | start_io_acct(ci.io); | ||
1496 | 1533 | ||
1497 | if (bio->bi_opf & REQ_PREFLUSH) { | 1534 | if (bio->bi_opf & REQ_PREFLUSH) { |
1498 | ci.bio = &ci.md->flush_bio; | 1535 | ci.bio = &ci.io->md->flush_bio; |
1499 | ci.sector_count = 0; | 1536 | ci.sector_count = 0; |
1500 | error = __send_empty_flush(&ci); | 1537 | error = __send_empty_flush(&ci); |
1501 | /* dec_pending submits any data associated with flush */ | 1538 | /* dec_pending submits any data associated with flush */ |
@@ -1506,32 +1543,95 @@ static void __split_and_process_bio(struct mapped_device *md, | |||
1506 | } else { | 1543 | } else { |
1507 | ci.bio = bio; | 1544 | ci.bio = bio; |
1508 | ci.sector_count = bio_sectors(bio); | 1545 | ci.sector_count = bio_sectors(bio); |
1509 | while (ci.sector_count && !error) | 1546 | while (ci.sector_count && !error) { |
1510 | error = __split_and_process_non_flush(&ci); | 1547 | error = __split_and_process_non_flush(&ci); |
1548 | if (current->bio_list && ci.sector_count && !error) { | ||
1549 | /* | ||
1550 | * Remainder must be passed to generic_make_request() | ||
1551 | * so that it gets handled *after* bios already submitted | ||
1552 | * have been completely processed. | ||
1553 | * We take a clone of the original to store in | ||
1554 | * ci.io->orig_bio to be used by end_io_acct() and | ||
1555 | * for dec_pending to use for completion handling. | ||
1556 | * As this path is not used for REQ_OP_ZONE_REPORT, | ||
1557 | * the usage of io->orig_bio in dm_remap_zone_report() | ||
1558 | * won't be affected by this reassignment. | ||
1559 | */ | ||
1560 | struct bio *b = bio_clone_bioset(bio, GFP_NOIO, | ||
1561 | md->queue->bio_split); | ||
1562 | ci.io->orig_bio = b; | ||
1563 | bio_advance(bio, (bio_sectors(bio) - ci.sector_count) << 9); | ||
1564 | bio_chain(b, bio); | ||
1565 | ret = generic_make_request(bio); | ||
1566 | break; | ||
1567 | } | ||
1568 | } | ||
1511 | } | 1569 | } |
1512 | 1570 | ||
1513 | /* drop the extra reference count */ | 1571 | /* drop the extra reference count */ |
1514 | dec_pending(ci.io, errno_to_blk_status(error)); | 1572 | dec_pending(ci.io, errno_to_blk_status(error)); |
1573 | return ret; | ||
1515 | } | 1574 | } |
1516 | /*----------------------------------------------------------------- | ||
1517 | * CRUD END | ||
1518 | *---------------------------------------------------------------*/ | ||
1519 | 1575 | ||
1520 | /* | 1576 | /* |
1521 | * The request function that just remaps the bio built up by | 1577 | * Optimized variant of __split_and_process_bio that leverages the |
1522 | * dm_merge_bvec. | 1578 | * fact that targets that use it do _not_ have a need to split bios. |
1523 | */ | 1579 | */ |
1524 | static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) | 1580 | static blk_qc_t __process_bio(struct mapped_device *md, |
1581 | struct dm_table *map, struct bio *bio) | ||
1582 | { | ||
1583 | struct clone_info ci; | ||
1584 | blk_qc_t ret = BLK_QC_T_NONE; | ||
1585 | int error = 0; | ||
1586 | |||
1587 | if (unlikely(!map)) { | ||
1588 | bio_io_error(bio); | ||
1589 | return ret; | ||
1590 | } | ||
1591 | |||
1592 | init_clone_info(&ci, md, map, bio); | ||
1593 | |||
1594 | if (bio->bi_opf & REQ_PREFLUSH) { | ||
1595 | ci.bio = &ci.io->md->flush_bio; | ||
1596 | ci.sector_count = 0; | ||
1597 | error = __send_empty_flush(&ci); | ||
1598 | /* dec_pending submits any data associated with flush */ | ||
1599 | } else { | ||
1600 | struct dm_target *ti = md->immutable_target; | ||
1601 | struct dm_target_io *tio; | ||
1602 | |||
1603 | /* | ||
1604 | * Defend against IO still getting in during teardown | ||
1605 | * - as was seen for a time with nvme-fcloop | ||
1606 | */ | ||
1607 | if (unlikely(WARN_ON_ONCE(!ti || !dm_target_is_valid(ti)))) { | ||
1608 | error = -EIO; | ||
1609 | goto out; | ||
1610 | } | ||
1611 | |||
1612 | tio = alloc_tio(&ci, ti, 0, GFP_NOIO); | ||
1613 | ci.bio = bio; | ||
1614 | ci.sector_count = bio_sectors(bio); | ||
1615 | ret = __clone_and_map_simple_bio(&ci, tio, NULL); | ||
1616 | } | ||
1617 | out: | ||
1618 | /* drop the extra reference count */ | ||
1619 | dec_pending(ci.io, errno_to_blk_status(error)); | ||
1620 | return ret; | ||
1621 | } | ||
1622 | |||
1623 | typedef blk_qc_t (process_bio_fn)(struct mapped_device *, struct dm_table *, struct bio *); | ||
1624 | |||
1625 | static blk_qc_t __dm_make_request(struct request_queue *q, struct bio *bio, | ||
1626 | process_bio_fn process_bio) | ||
1525 | { | 1627 | { |
1526 | int rw = bio_data_dir(bio); | ||
1527 | struct mapped_device *md = q->queuedata; | 1628 | struct mapped_device *md = q->queuedata; |
1629 | blk_qc_t ret = BLK_QC_T_NONE; | ||
1528 | int srcu_idx; | 1630 | int srcu_idx; |
1529 | struct dm_table *map; | 1631 | struct dm_table *map; |
1530 | 1632 | ||
1531 | map = dm_get_live_table(md, &srcu_idx); | 1633 | map = dm_get_live_table(md, &srcu_idx); |
1532 | 1634 | ||
1533 | generic_start_io_acct(q, rw, bio_sectors(bio), &dm_disk(md)->part0); | ||
1534 | |||
1535 | /* if we're suspended, we have to queue this io for later */ | 1635 | /* if we're suspended, we have to queue this io for later */ |
1536 | if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { | 1636 | if (unlikely(test_bit(DMF_BLOCK_IO_FOR_SUSPEND, &md->flags))) { |
1537 | dm_put_live_table(md, srcu_idx); | 1637 | dm_put_live_table(md, srcu_idx); |
@@ -1540,12 +1640,27 @@ static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) | |||
1540 | queue_io(md, bio); | 1640 | queue_io(md, bio); |
1541 | else | 1641 | else |
1542 | bio_io_error(bio); | 1642 | bio_io_error(bio); |
1543 | return BLK_QC_T_NONE; | 1643 | return ret; |
1544 | } | 1644 | } |
1545 | 1645 | ||
1546 | __split_and_process_bio(md, map, bio); | 1646 | ret = process_bio(md, map, bio); |
1647 | |||
1547 | dm_put_live_table(md, srcu_idx); | 1648 | dm_put_live_table(md, srcu_idx); |
1548 | return BLK_QC_T_NONE; | 1649 | return ret; |
1650 | } | ||
1651 | |||
1652 | /* | ||
1653 | * The request function that remaps the bio to one target and | ||
1654 | * splits off any remainder. | ||
1655 | */ | ||
1656 | static blk_qc_t dm_make_request(struct request_queue *q, struct bio *bio) | ||
1657 | { | ||
1658 | return __dm_make_request(q, bio, __split_and_process_bio); | ||
1659 | } | ||
1660 | |||
1661 | static blk_qc_t dm_make_request_nvme(struct request_queue *q, struct bio *bio) | ||
1662 | { | ||
1663 | return __dm_make_request(q, bio, __process_bio); | ||
1549 | } | 1664 | } |
1550 | 1665 | ||
1551 | static int dm_any_congested(void *congested_data, int bdi_bits) | 1666 | static int dm_any_congested(void *congested_data, int bdi_bits) |
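The two hunks above do related things: __split_and_process_bio() now, when re-entered from inside generic_make_request() (current->bio_list set), clones the original bio for completion accounting, advances the original past the sectors already mapped, chains the two with bio_chain() and resubmits the remainder so it is handled after the bios already issued; __process_bio() and dm_make_request_nvme() add a non-splitting fast path for immutable single-target tables. The sketch below imitates only the first idea, "map a prefix, hand the remainder back to the caller instead of recursing", in plain standalone C; struct req, target_max() and submit() are invented for illustration and are not the kernel code.

	#include <stdio.h>

	struct req { unsigned start, count; };		/* a request, in sectors */

	static unsigned target_max(unsigned start)
	{
		(void)start;				/* pretend every target boundary is 8 sectors away */
		return 8;
	}

	static void submit(struct req *r, struct req *requeue)
	{
		unsigned max = target_max(r->start);
		unsigned len = r->count < max ? r->count : max;

		printf("map     %u..%u\n", r->start, r->start + len - 1);
		r->start += len;
		r->count -= len;

		if (r->count) {				/* remainder goes back to the caller */
			*requeue = *r;
			printf("requeue %u sectors at %u\n", r->count, r->start);
		}
	}

	int main(void)
	{
		struct req r = { .start = 0, .count = 20 }, rest = { 0, 0 };

		do {
			submit(&r, &rest);
			r = rest;
			rest.count = 0;
		} while (r.count);
		return 0;
	}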
@@ -1626,20 +1741,9 @@ static const struct dax_operations dm_dax_ops; | |||
1626 | 1741 | ||
1627 | static void dm_wq_work(struct work_struct *work); | 1742 | static void dm_wq_work(struct work_struct *work); |
1628 | 1743 | ||
1629 | void dm_init_md_queue(struct mapped_device *md) | 1744 | static void dm_init_normal_md_queue(struct mapped_device *md) |
1630 | { | ||
1631 | /* | ||
1632 | * Initialize data that will only be used by a non-blk-mq DM queue | ||
1633 | * - must do so here (in alloc_dev callchain) before queue is used | ||
1634 | */ | ||
1635 | md->queue->queuedata = md; | ||
1636 | md->queue->backing_dev_info->congested_data = md; | ||
1637 | } | ||
1638 | |||
1639 | void dm_init_normal_md_queue(struct mapped_device *md) | ||
1640 | { | 1745 | { |
1641 | md->use_blk_mq = false; | 1746 | md->use_blk_mq = false; |
1642 | dm_init_md_queue(md); | ||
1643 | 1747 | ||
1644 | /* | 1748 | /* |
1645 | * Initialize aspects of queue that aren't relevant for blk-mq | 1749 | * Initialize aspects of queue that aren't relevant for blk-mq |
@@ -1653,9 +1757,10 @@ static void cleanup_mapped_device(struct mapped_device *md) | |||
1653 | destroy_workqueue(md->wq); | 1757 | destroy_workqueue(md->wq); |
1654 | if (md->kworker_task) | 1758 | if (md->kworker_task) |
1655 | kthread_stop(md->kworker_task); | 1759 | kthread_stop(md->kworker_task); |
1656 | mempool_destroy(md->io_pool); | ||
1657 | if (md->bs) | 1760 | if (md->bs) |
1658 | bioset_free(md->bs); | 1761 | bioset_free(md->bs); |
1762 | if (md->io_bs) | ||
1763 | bioset_free(md->io_bs); | ||
1659 | 1764 | ||
1660 | if (md->dax_dev) { | 1765 | if (md->dax_dev) { |
1661 | kill_dax(md->dax_dev); | 1766 | kill_dax(md->dax_dev); |
@@ -1681,6 +1786,10 @@ static void cleanup_mapped_device(struct mapped_device *md) | |||
1681 | md->bdev = NULL; | 1786 | md->bdev = NULL; |
1682 | } | 1787 | } |
1683 | 1788 | ||
1789 | mutex_destroy(&md->suspend_lock); | ||
1790 | mutex_destroy(&md->type_lock); | ||
1791 | mutex_destroy(&md->table_devices_lock); | ||
1792 | |||
1684 | dm_mq_cleanup_mapped_device(md); | 1793 | dm_mq_cleanup_mapped_device(md); |
1685 | } | 1794 | } |
1686 | 1795 | ||
@@ -1734,10 +1843,10 @@ static struct mapped_device *alloc_dev(int minor) | |||
1734 | md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id); | 1843 | md->queue = blk_alloc_queue_node(GFP_KERNEL, numa_node_id); |
1735 | if (!md->queue) | 1844 | if (!md->queue) |
1736 | goto bad; | 1845 | goto bad; |
1846 | md->queue->queuedata = md; | ||
1847 | md->queue->backing_dev_info->congested_data = md; | ||
1737 | 1848 | ||
1738 | dm_init_md_queue(md); | 1849 | md->disk = alloc_disk_node(1, md->numa_node_id); |
1739 | |||
1740 | md->disk = alloc_disk_node(1, numa_node_id); | ||
1741 | if (!md->disk) | 1850 | if (!md->disk) |
1742 | goto bad; | 1851 | goto bad; |
1743 | 1852 | ||
@@ -1820,17 +1929,22 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t) | |||
1820 | { | 1929 | { |
1821 | struct dm_md_mempools *p = dm_table_get_md_mempools(t); | 1930 | struct dm_md_mempools *p = dm_table_get_md_mempools(t); |
1822 | 1931 | ||
1823 | if (md->bs) { | 1932 | if (dm_table_bio_based(t)) { |
1824 | /* The md already has necessary mempools. */ | 1933 | /* |
1825 | if (dm_table_bio_based(t)) { | 1934 | * The md may already have mempools that need changing. |
1826 | /* | 1935 | * If so, reload bioset because front_pad may have changed |
1827 | * Reload bioset because front_pad may have changed | 1936 | * because a different table was loaded. |
1828 | * because a different table was loaded. | 1937 | */ |
1829 | */ | 1938 | if (md->bs) { |
1830 | bioset_free(md->bs); | 1939 | bioset_free(md->bs); |
1831 | md->bs = p->bs; | 1940 | md->bs = NULL; |
1832 | p->bs = NULL; | 1941 | } |
1942 | if (md->io_bs) { | ||
1943 | bioset_free(md->io_bs); | ||
1944 | md->io_bs = NULL; | ||
1833 | } | 1945 | } |
1946 | |||
1947 | } else if (md->bs) { | ||
1834 | /* | 1948 | /* |
1835 | * There's no need to reload with request-based dm | 1949 | * There's no need to reload with request-based dm |
1836 | * because the size of front_pad doesn't change. | 1950 | * because the size of front_pad doesn't change. |
@@ -1842,13 +1956,12 @@ static void __bind_mempools(struct mapped_device *md, struct dm_table *t) | |||
1842 | goto out; | 1956 | goto out; |
1843 | } | 1957 | } |
1844 | 1958 | ||
1845 | BUG_ON(!p || md->io_pool || md->bs); | 1959 | BUG_ON(!p || md->bs || md->io_bs); |
1846 | 1960 | ||
1847 | md->io_pool = p->io_pool; | ||
1848 | p->io_pool = NULL; | ||
1849 | md->bs = p->bs; | 1961 | md->bs = p->bs; |
1850 | p->bs = NULL; | 1962 | p->bs = NULL; |
1851 | 1963 | md->io_bs = p->io_bs; | |
1964 | p->io_bs = NULL; | ||
1852 | out: | 1965 | out: |
1853 | /* mempool bind completed, no longer need any mempools in the table */ | 1966 | /* mempool bind completed, no longer need any mempools in the table */ |
1854 | dm_table_free_md_mempools(t); | 1967 | dm_table_free_md_mempools(t); |
@@ -1894,6 +2007,7 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t, | |||
1894 | { | 2007 | { |
1895 | struct dm_table *old_map; | 2008 | struct dm_table *old_map; |
1896 | struct request_queue *q = md->queue; | 2009 | struct request_queue *q = md->queue; |
2010 | bool request_based = dm_table_request_based(t); | ||
1897 | sector_t size; | 2011 | sector_t size; |
1898 | 2012 | ||
1899 | lockdep_assert_held(&md->suspend_lock); | 2013 | lockdep_assert_held(&md->suspend_lock); |
@@ -1917,12 +2031,15 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t, | |||
1917 | * This must be done before setting the queue restrictions, | 2031 | * This must be done before setting the queue restrictions, |
1918 | * because request-based dm may be run just after the setting. | 2032 | * because request-based dm may be run just after the setting. |
1919 | */ | 2033 | */ |
1920 | if (dm_table_request_based(t)) { | 2034 | if (request_based) |
1921 | dm_stop_queue(q); | 2035 | dm_stop_queue(q); |
2036 | |||
2037 | if (request_based || md->type == DM_TYPE_NVME_BIO_BASED) { | ||
1922 | /* | 2038 | /* |
1923 | * Leverage the fact that request-based DM targets are | 2039 | * Leverage the fact that request-based DM targets and |
1924 | * immutable singletons and establish md->immutable_target | 2040 | * NVMe bio based targets are immutable singletons |
1925 | * - used to optimize both dm_request_fn and dm_mq_queue_rq | 2041 | * - used to optimize both dm_request_fn and dm_mq_queue_rq; |
2042 | * and __process_bio. | ||
1926 | */ | 2043 | */ |
1927 | md->immutable_target = dm_table_get_immutable_target(t); | 2044 | md->immutable_target = dm_table_get_immutable_target(t); |
1928 | } | 2045 | } |
@@ -1962,13 +2079,18 @@ static struct dm_table *__unbind(struct mapped_device *md) | |||
1962 | */ | 2079 | */ |
1963 | int dm_create(int minor, struct mapped_device **result) | 2080 | int dm_create(int minor, struct mapped_device **result) |
1964 | { | 2081 | { |
2082 | int r; | ||
1965 | struct mapped_device *md; | 2083 | struct mapped_device *md; |
1966 | 2084 | ||
1967 | md = alloc_dev(minor); | 2085 | md = alloc_dev(minor); |
1968 | if (!md) | 2086 | if (!md) |
1969 | return -ENXIO; | 2087 | return -ENXIO; |
1970 | 2088 | ||
1971 | dm_sysfs_init(md); | 2089 | r = dm_sysfs_init(md); |
2090 | if (r) { | ||
2091 | free_dev(md); | ||
2092 | return r; | ||
2093 | } | ||
1972 | 2094 | ||
1973 | *result = md; | 2095 | *result = md; |
1974 | return 0; | 2096 | return 0; |
@@ -2026,6 +2148,7 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) | |||
2026 | 2148 | ||
2027 | switch (type) { | 2149 | switch (type) { |
2028 | case DM_TYPE_REQUEST_BASED: | 2150 | case DM_TYPE_REQUEST_BASED: |
2151 | dm_init_normal_md_queue(md); | ||
2029 | r = dm_old_init_request_queue(md, t); | 2152 | r = dm_old_init_request_queue(md, t); |
2030 | if (r) { | 2153 | if (r) { |
2031 | DMERR("Cannot initialize queue for request-based mapped device"); | 2154 | DMERR("Cannot initialize queue for request-based mapped device"); |
@@ -2043,15 +2166,10 @@ int dm_setup_md_queue(struct mapped_device *md, struct dm_table *t) | |||
2043 | case DM_TYPE_DAX_BIO_BASED: | 2166 | case DM_TYPE_DAX_BIO_BASED: |
2044 | dm_init_normal_md_queue(md); | 2167 | dm_init_normal_md_queue(md); |
2045 | blk_queue_make_request(md->queue, dm_make_request); | 2168 | blk_queue_make_request(md->queue, dm_make_request); |
2046 | /* | 2169 | break; |
2047 | * DM handles splitting bios as needed. Free the bio_split bioset | 2170 | case DM_TYPE_NVME_BIO_BASED: |
2048 | * since it won't be used (saves 1 process per bio-based DM device). | 2171 | dm_init_normal_md_queue(md); |
2049 | */ | 2172 | blk_queue_make_request(md->queue, dm_make_request_nvme); |
2050 | bioset_free(md->queue->bio_split); | ||
2051 | md->queue->bio_split = NULL; | ||
2052 | |||
2053 | if (type == DM_TYPE_DAX_BIO_BASED) | ||
2054 | queue_flag_set_unlocked(QUEUE_FLAG_DAX, md->queue); | ||
2055 | break; | 2173 | break; |
2056 | case DM_TYPE_NONE: | 2174 | case DM_TYPE_NONE: |
2057 | WARN_ON_ONCE(true); | 2175 | WARN_ON_ONCE(true); |
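dm_setup_md_queue() now picks the make_request function by table type: DM_TYPE_BIO_BASED and DM_TYPE_DAX_BIO_BASED keep the splitting dm_make_request(), while DM_TYPE_NVME_BIO_BASED gets dm_make_request_nvme(). A small standalone C sketch of that dispatch follows; the enum values mirror the device-mapper.h hunk later in this diff, but pick_make_request() and the handler functions are stand-ins, not the kernel code.

	#include <stdio.h>

	enum dm_queue_mode {
		DM_TYPE_NONE		 = 0,
		DM_TYPE_BIO_BASED	 = 1,
		DM_TYPE_REQUEST_BASED	 = 2,
		DM_TYPE_MQ_REQUEST_BASED = 3,
		DM_TYPE_DAX_BIO_BASED	 = 4,
		DM_TYPE_NVME_BIO_BASED	 = 5,
	};

	typedef void (*make_request_fn)(const char *bio);

	static void make_request_split(const char *bio)   { printf("split and map %s\n", bio); }
	static void make_request_nosplit(const char *bio) { printf("map whole %s\n", bio); }

	static make_request_fn pick_make_request(enum dm_queue_mode type)
	{
		switch (type) {
		case DM_TYPE_BIO_BASED:
		case DM_TYPE_DAX_BIO_BASED:
			return make_request_split;
		case DM_TYPE_NVME_BIO_BASED:
			return make_request_nosplit;
		default:
			return NULL;	/* request-based types take a different path */
		}
	}

	int main(void)
	{
		pick_make_request(DM_TYPE_NVME_BIO_BASED)("bio0");
		return 0;
	}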
@@ -2130,7 +2248,6 @@ EXPORT_SYMBOL_GPL(dm_device_name); | |||
2130 | 2248 | ||
2131 | static void __dm_destroy(struct mapped_device *md, bool wait) | 2249 | static void __dm_destroy(struct mapped_device *md, bool wait) |
2132 | { | 2250 | { |
2133 | struct request_queue *q = dm_get_md_queue(md); | ||
2134 | struct dm_table *map; | 2251 | struct dm_table *map; |
2135 | int srcu_idx; | 2252 | int srcu_idx; |
2136 | 2253 | ||
@@ -2141,7 +2258,7 @@ static void __dm_destroy(struct mapped_device *md, bool wait) | |||
2141 | set_bit(DMF_FREEING, &md->flags); | 2258 | set_bit(DMF_FREEING, &md->flags); |
2142 | spin_unlock(&_minor_lock); | 2259 | spin_unlock(&_minor_lock); |
2143 | 2260 | ||
2144 | blk_set_queue_dying(q); | 2261 | blk_set_queue_dying(md->queue); |
2145 | 2262 | ||
2146 | if (dm_request_based(md) && md->kworker_task) | 2263 | if (dm_request_based(md) && md->kworker_task) |
2147 | kthread_flush_worker(&md->kworker); | 2264 | kthread_flush_worker(&md->kworker); |
@@ -2752,11 +2869,12 @@ int dm_noflush_suspending(struct dm_target *ti) | |||
2752 | EXPORT_SYMBOL_GPL(dm_noflush_suspending); | 2869 | EXPORT_SYMBOL_GPL(dm_noflush_suspending); |
2753 | 2870 | ||
2754 | struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, | 2871 | struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, |
2755 | unsigned integrity, unsigned per_io_data_size) | 2872 | unsigned integrity, unsigned per_io_data_size, |
2873 | unsigned min_pool_size) | ||
2756 | { | 2874 | { |
2757 | struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id); | 2875 | struct dm_md_mempools *pools = kzalloc_node(sizeof(*pools), GFP_KERNEL, md->numa_node_id); |
2758 | unsigned int pool_size = 0; | 2876 | unsigned int pool_size = 0; |
2759 | unsigned int front_pad; | 2877 | unsigned int front_pad, io_front_pad; |
2760 | 2878 | ||
2761 | if (!pools) | 2879 | if (!pools) |
2762 | return NULL; | 2880 | return NULL; |
@@ -2764,16 +2882,19 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu | |||
2764 | switch (type) { | 2882 | switch (type) { |
2765 | case DM_TYPE_BIO_BASED: | 2883 | case DM_TYPE_BIO_BASED: |
2766 | case DM_TYPE_DAX_BIO_BASED: | 2884 | case DM_TYPE_DAX_BIO_BASED: |
2767 | pool_size = dm_get_reserved_bio_based_ios(); | 2885 | case DM_TYPE_NVME_BIO_BASED: |
2886 | pool_size = max(dm_get_reserved_bio_based_ios(), min_pool_size); | ||
2768 | front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone); | 2887 | front_pad = roundup(per_io_data_size, __alignof__(struct dm_target_io)) + offsetof(struct dm_target_io, clone); |
2769 | 2888 | io_front_pad = roundup(front_pad, __alignof__(struct dm_io)) + offsetof(struct dm_io, tio); | |
2770 | pools->io_pool = mempool_create_slab_pool(pool_size, _io_cache); | 2889 | pools->io_bs = bioset_create(pool_size, io_front_pad, 0); |
2771 | if (!pools->io_pool) | 2890 | if (!pools->io_bs) |
2891 | goto out; | ||
2892 | if (integrity && bioset_integrity_create(pools->io_bs, pool_size)) | ||
2772 | goto out; | 2893 | goto out; |
2773 | break; | 2894 | break; |
2774 | case DM_TYPE_REQUEST_BASED: | 2895 | case DM_TYPE_REQUEST_BASED: |
2775 | case DM_TYPE_MQ_REQUEST_BASED: | 2896 | case DM_TYPE_MQ_REQUEST_BASED: |
2776 | pool_size = dm_get_reserved_rq_based_ios(); | 2897 | pool_size = max(dm_get_reserved_rq_based_ios(), min_pool_size); |
2777 | front_pad = offsetof(struct dm_rq_clone_bio_info, clone); | 2898 | front_pad = offsetof(struct dm_rq_clone_bio_info, clone); |
2778 | /* per_io_data_size is used for blk-mq pdu at queue allocation */ | 2899 | /* per_io_data_size is used for blk-mq pdu at queue allocation */ |
2779 | break; | 2900 | break; |
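The front-pad arithmetic in the hunk above is what reserves space in front of each bio allocated from the two biosets: front_pad holds the target's per_io_data_size plus the dm_target_io header ahead of the embedded clone bio, and io_front_pad rounds that up and additionally reserves the dm_io header ahead of its embedded tio. The standalone C program below reproduces only that arithmetic; the struct member lists and the per_io_data_size value are mocked placeholders.

	#include <stddef.h>
	#include <stdio.h>

	#define ROUNDUP(x, a)	(((x) + (a) - 1) / (a) * (a))

	struct mock_target_io {			/* stands in for struct dm_target_io */
		void *io, *ti;
		unsigned target_bio_nr, *len_ptr;
		char clone[128];		/* stands in for the embedded struct bio */
	};

	struct mock_io {			/* stands in for struct dm_io */
		int status;
		void *md, *orig_bio;
		struct mock_target_io tio;	/* embedded, like dm_io's tio member */
	};

	int main(void)
	{
		size_t per_io_data_size = 24;	/* arbitrary per-bio data size */
		size_t front_pad = ROUNDUP(per_io_data_size, _Alignof(struct mock_target_io))
				 + offsetof(struct mock_target_io, clone);
		size_t io_front_pad = ROUNDUP(front_pad, _Alignof(struct mock_io))
				    + offsetof(struct mock_io, tio);

		printf("bs front_pad    = %zu\n", front_pad);
		printf("io_bs front_pad = %zu\n", io_front_pad);
		return 0;
	}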
@@ -2781,7 +2902,7 @@ struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_qu | |||
2781 | BUG(); | 2902 | BUG(); |
2782 | } | 2903 | } |
2783 | 2904 | ||
2784 | pools->bs = bioset_create(pool_size, front_pad, BIOSET_NEED_RESCUER); | 2905 | pools->bs = bioset_create(pool_size, front_pad, 0); |
2785 | if (!pools->bs) | 2906 | if (!pools->bs) |
2786 | goto out; | 2907 | goto out; |
2787 | 2908 | ||
@@ -2801,10 +2922,10 @@ void dm_free_md_mempools(struct dm_md_mempools *pools) | |||
2801 | if (!pools) | 2922 | if (!pools) |
2802 | return; | 2923 | return; |
2803 | 2924 | ||
2804 | mempool_destroy(pools->io_pool); | ||
2805 | |||
2806 | if (pools->bs) | 2925 | if (pools->bs) |
2807 | bioset_free(pools->bs); | 2926 | bioset_free(pools->bs); |
2927 | if (pools->io_bs) | ||
2928 | bioset_free(pools->io_bs); | ||
2808 | 2929 | ||
2809 | kfree(pools); | 2930 | kfree(pools); |
2810 | } | 2931 | } |
diff --git a/drivers/md/dm.h b/drivers/md/dm.h index 36399bb875dd..114a81b27c37 100644 --- a/drivers/md/dm.h +++ b/drivers/md/dm.h | |||
@@ -49,7 +49,6 @@ struct dm_md_mempools; | |||
49 | /*----------------------------------------------------------------- | 49 | /*----------------------------------------------------------------- |
50 | * Internal table functions. | 50 | * Internal table functions. |
51 | *---------------------------------------------------------------*/ | 51 | *---------------------------------------------------------------*/ |
52 | void dm_table_destroy(struct dm_table *t); | ||
53 | void dm_table_event_callback(struct dm_table *t, | 52 | void dm_table_event_callback(struct dm_table *t, |
54 | void (*fn)(void *), void *context); | 53 | void (*fn)(void *), void *context); |
55 | struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index); | 54 | struct dm_target *dm_table_get_target(struct dm_table *t, unsigned int index); |
@@ -206,7 +205,8 @@ void dm_kcopyd_exit(void); | |||
206 | * Mempool operations | 205 | * Mempool operations |
207 | */ | 206 | */ |
208 | struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, | 207 | struct dm_md_mempools *dm_alloc_md_mempools(struct mapped_device *md, enum dm_queue_mode type, |
209 | unsigned integrity, unsigned per_bio_data_size); | 208 | unsigned integrity, unsigned per_bio_data_size, |
209 | unsigned min_pool_size); | ||
210 | void dm_free_md_mempools(struct dm_md_mempools *pools); | 210 | void dm_free_md_mempools(struct dm_md_mempools *pools); |
211 | 211 | ||
212 | /* | 212 | /* |
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h index a5538433c927..da83f64952e7 100644 --- a/include/linux/device-mapper.h +++ b/include/linux/device-mapper.h | |||
@@ -28,6 +28,7 @@ enum dm_queue_mode { | |||
28 | DM_TYPE_REQUEST_BASED = 2, | 28 | DM_TYPE_REQUEST_BASED = 2, |
29 | DM_TYPE_MQ_REQUEST_BASED = 3, | 29 | DM_TYPE_MQ_REQUEST_BASED = 3, |
30 | DM_TYPE_DAX_BIO_BASED = 4, | 30 | DM_TYPE_DAX_BIO_BASED = 4, |
31 | DM_TYPE_NVME_BIO_BASED = 5, | ||
31 | }; | 32 | }; |
32 | 33 | ||
33 | typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t; | 34 | typedef enum { STATUSTYPE_INFO, STATUSTYPE_TABLE } status_type_t; |
@@ -221,14 +222,6 @@ struct target_type { | |||
221 | #define dm_target_is_wildcard(type) ((type)->features & DM_TARGET_WILDCARD) | 222 | #define dm_target_is_wildcard(type) ((type)->features & DM_TARGET_WILDCARD) |
222 | 223 | ||
223 | /* | 224 | /* |
224 | * Some targets need to be sent the same WRITE bio severals times so | ||
225 | * that they can send copies of it to different devices. This function | ||
226 | * examines any supplied bio and returns the number of copies of it the | ||
227 | * target requires. | ||
228 | */ | ||
229 | typedef unsigned (*dm_num_write_bios_fn) (struct dm_target *ti, struct bio *bio); | ||
230 | |||
231 | /* | ||
232 | * A target implements own bio data integrity. | 225 | * A target implements own bio data integrity. |
233 | */ | 226 | */ |
234 | #define DM_TARGET_INTEGRITY 0x00000010 | 227 | #define DM_TARGET_INTEGRITY 0x00000010 |
@@ -291,13 +284,6 @@ struct dm_target { | |||
291 | */ | 284 | */ |
292 | unsigned per_io_data_size; | 285 | unsigned per_io_data_size; |
293 | 286 | ||
294 | /* | ||
295 | * If defined, this function is called to find out how many | ||
296 | * duplicate bios should be sent to the target when writing | ||
297 | * data. | ||
298 | */ | ||
299 | dm_num_write_bios_fn num_write_bios; | ||
300 | |||
301 | /* target specific data */ | 287 | /* target specific data */ |
302 | void *private; | 288 | void *private; |
303 | 289 | ||
@@ -329,35 +315,9 @@ struct dm_target_callbacks { | |||
329 | int (*congested_fn) (struct dm_target_callbacks *, int); | 315 | int (*congested_fn) (struct dm_target_callbacks *, int); |
330 | }; | 316 | }; |
331 | 317 | ||
332 | /* | 318 | void *dm_per_bio_data(struct bio *bio, size_t data_size); |
333 | * For bio-based dm. | 319 | struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size); |
334 | * One of these is allocated for each bio. | 320 | unsigned dm_bio_get_target_bio_nr(const struct bio *bio); |
335 | * This structure shouldn't be touched directly by target drivers. | ||
336 | * It is here so that we can inline dm_per_bio_data and | ||
337 | * dm_bio_from_per_bio_data | ||
338 | */ | ||
339 | struct dm_target_io { | ||
340 | struct dm_io *io; | ||
341 | struct dm_target *ti; | ||
342 | unsigned target_bio_nr; | ||
343 | unsigned *len_ptr; | ||
344 | struct bio clone; | ||
345 | }; | ||
346 | |||
347 | static inline void *dm_per_bio_data(struct bio *bio, size_t data_size) | ||
348 | { | ||
349 | return (char *)bio - offsetof(struct dm_target_io, clone) - data_size; | ||
350 | } | ||
351 | |||
352 | static inline struct bio *dm_bio_from_per_bio_data(void *data, size_t data_size) | ||
353 | { | ||
354 | return (struct bio *)((char *)data + data_size + offsetof(struct dm_target_io, clone)); | ||
355 | } | ||
356 | |||
357 | static inline unsigned dm_bio_get_target_bio_nr(const struct bio *bio) | ||
358 | { | ||
359 | return container_of(bio, struct dm_target_io, clone)->target_bio_nr; | ||
360 | } | ||
361 | 321 | ||
362 | int dm_register_target(struct target_type *t); | 322 | int dm_register_target(struct target_type *t); |
363 | void dm_unregister_target(struct target_type *t); | 323 | void dm_unregister_target(struct target_type *t); |
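The dm_per_bio_data(), dm_bio_from_per_bio_data() and dm_bio_get_target_bio_nr() helpers removed from the header above (now out-of-line in dm.c) recover a target's per-bio data by walking backwards from the clone bio embedded in struct dm_target_io. The standalone program below replays that pointer round-trip with a mocked layout; struct mock_bio and struct mock_target_io are placeholders, and only the offsetof arithmetic mirrors the removed inlines.

	#include <stddef.h>
	#include <stdio.h>
	#include <stdlib.h>

	struct mock_bio { int dummy; };		/* stands in for struct bio */

	struct mock_target_io {			/* stands in for struct dm_target_io */
		void *io, *ti;
		unsigned target_bio_nr, *len_ptr;
		struct mock_bio clone;		/* per-bio data sits in front of this */
	};

	static void *per_bio_data(struct mock_bio *bio, size_t data_size)
	{
		return (char *)bio - offsetof(struct mock_target_io, clone) - data_size;
	}

	static struct mock_bio *bio_from_per_bio_data(void *data, size_t data_size)
	{
		return (struct mock_bio *)((char *)data + data_size +
					   offsetof(struct mock_target_io, clone));
	}

	int main(void)
	{
		size_t data_size = 32;
		char *buf = calloc(1, data_size + sizeof(struct mock_target_io));
		struct mock_target_io *tio;
		struct mock_bio *bio;

		if (!buf)
			return 1;
		tio = (struct mock_target_io *)(buf + data_size);	/* data precedes the tio */
		bio = &tio->clone;

		printf("round trip ok: %d\n",
		       bio_from_per_bio_data(per_bio_data(bio, data_size), data_size) == bio);
		free(buf);
		return 0;
	}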
@@ -500,6 +460,11 @@ void dm_table_set_type(struct dm_table *t, enum dm_queue_mode type); | |||
500 | int dm_table_complete(struct dm_table *t); | 460 | int dm_table_complete(struct dm_table *t); |
501 | 461 | ||
502 | /* | 462 | /* |
463 | * Destroy the table when finished. | ||
464 | */ | ||
465 | void dm_table_destroy(struct dm_table *t); | ||
466 | |||
467 | /* | ||
503 | * Target may require that it is never sent I/O larger than len. | 468 | * Target may require that it is never sent I/O larger than len. |
504 | */ | 469 | */ |
505 | int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len); | 470 | int __must_check dm_set_target_max_io_len(struct dm_target *ti, sector_t len); |
@@ -585,6 +550,7 @@ do { \ | |||
585 | #define DM_ENDIO_DONE 0 | 550 | #define DM_ENDIO_DONE 0 |
586 | #define DM_ENDIO_INCOMPLETE 1 | 551 | #define DM_ENDIO_INCOMPLETE 1 |
587 | #define DM_ENDIO_REQUEUE 2 | 552 | #define DM_ENDIO_REQUEUE 2 |
553 | #define DM_ENDIO_DELAY_REQUEUE 3 | ||
588 | 554 | ||
589 | /* | 555 | /* |
590 | * Definitions of return values from target map function. | 556 | * Definitions of return values from target map function. |
@@ -592,7 +558,7 @@ do { \ | |||
592 | #define DM_MAPIO_SUBMITTED 0 | 558 | #define DM_MAPIO_SUBMITTED 0 |
593 | #define DM_MAPIO_REMAPPED 1 | 559 | #define DM_MAPIO_REMAPPED 1 |
594 | #define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE | 560 | #define DM_MAPIO_REQUEUE DM_ENDIO_REQUEUE |
595 | #define DM_MAPIO_DELAY_REQUEUE 3 | 561 | #define DM_MAPIO_DELAY_REQUEUE DM_ENDIO_DELAY_REQUEUE |
596 | #define DM_MAPIO_KILL 4 | 562 | #define DM_MAPIO_KILL 4 |
597 | 563 | ||
598 | #define dm_sector_div64(x, y)( \ | 564 | #define dm_sector_div64(x, y)( \ |