author    Linus Torvalds <torvalds@linux-foundation.org>    2011-11-02 20:02:37 -0400
committer Linus Torvalds <torvalds@linux-foundation.org>    2011-11-02 20:02:37 -0400
commit    43672a0784707d795556b1f93925da8b8e797d03 (patch)
tree      5c92aabd211281300f89fc2e69e9ee7e58bcc449
parent    2380078cdb7e6d520e33dcf834e0be979d542e48 (diff)
parent    2e727c3ca1beff05f27b6207a795790f222bf8d8 (diff)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm
* git://git.kernel.org/pub/scm/linux/kernel/git/steve/linux-dm:
  dm: raid fix device status indicator when array initializing
  dm log userspace: add log device dependency
  dm log userspace: fix comment hyphens
  dm: add thin provisioning target
  dm: add persistent data library
  dm: add bufio
  dm: export dm get md
  dm table: add immutable feature
  dm table: add always writeable feature
  dm table: add singleton feature
  dm kcopyd: add dm_kcopyd_zero to zero an area
  dm: remove superfluous smp_mb
  dm: use local printk ratelimit
  dm table: propagate non rotational flag
-rw-r--r-- Documentation/device-mapper/dm-log.txt | 2
-rw-r--r-- Documentation/device-mapper/persistent-data.txt | 84
-rw-r--r-- Documentation/device-mapper/thin-provisioning.txt | 285
-rw-r--r-- drivers/md/Kconfig | 36
-rw-r--r-- drivers/md/Makefile | 4
-rw-r--r-- drivers/md/dm-bufio.c | 1699
-rw-r--r-- drivers/md/dm-bufio.h | 112
-rw-r--r-- drivers/md/dm-ioctl.c | 11
-rw-r--r-- drivers/md/dm-kcopyd.c | 31
-rw-r--r-- drivers/md/dm-log-userspace-base.c | 37
-rw-r--r-- drivers/md/dm-raid.c | 48
-rw-r--r-- drivers/md/dm-table.c | 73
-rw-r--r-- drivers/md/dm-thin-metadata.c | 1391
-rw-r--r-- drivers/md/dm-thin-metadata.h | 156
-rw-r--r-- drivers/md/dm-thin.c | 2428
-rw-r--r-- drivers/md/dm.c | 21
-rw-r--r-- drivers/md/dm.h | 2
-rw-r--r-- drivers/md/persistent-data/Kconfig | 8
-rw-r--r-- drivers/md/persistent-data/Makefile | 11
-rw-r--r-- drivers/md/persistent-data/dm-block-manager.c | 620
-rw-r--r-- drivers/md/persistent-data/dm-block-manager.h | 123
-rw-r--r-- drivers/md/persistent-data/dm-btree-internal.h | 137
-rw-r--r-- drivers/md/persistent-data/dm-btree-remove.c | 566
-rw-r--r-- drivers/md/persistent-data/dm-btree-spine.c | 244
-rw-r--r-- drivers/md/persistent-data/dm-btree.c | 805
-rw-r--r-- drivers/md/persistent-data/dm-btree.h | 145
-rw-r--r-- drivers/md/persistent-data/dm-persistent-data-internal.h | 19
-rw-r--r-- drivers/md/persistent-data/dm-space-map-checker.c | 437
-rw-r--r-- drivers/md/persistent-data/dm-space-map-checker.h | 26
-rw-r--r-- drivers/md/persistent-data/dm-space-map-common.c | 705
-rw-r--r-- drivers/md/persistent-data/dm-space-map-common.h | 126
-rw-r--r-- drivers/md/persistent-data/dm-space-map-disk.c | 335
-rw-r--r-- drivers/md/persistent-data/dm-space-map-disk.h | 25
-rw-r--r-- drivers/md/persistent-data/dm-space-map-metadata.c | 596
-rw-r--r-- drivers/md/persistent-data/dm-space-map-metadata.h | 33
-rw-r--r-- drivers/md/persistent-data/dm-space-map.h | 134
-rw-r--r-- drivers/md/persistent-data/dm-transaction-manager.c | 400
-rw-r--r-- drivers/md/persistent-data/dm-transaction-manager.h | 130
-rw-r--r-- include/linux/device-mapper.h | 45
-rw-r--r-- include/linux/dm-ioctl.h | 4
-rw-r--r-- include/linux/dm-kcopyd.h | 4
-rw-r--r-- include/linux/dm-log-userspace.h | 18
42 files changed, 12079 insertions, 37 deletions
diff --git a/Documentation/device-mapper/dm-log.txt b/Documentation/device-mapper/dm-log.txt
index 994dd75475a..c155ac569c4 100644
--- a/Documentation/device-mapper/dm-log.txt
+++ b/Documentation/device-mapper/dm-log.txt
@@ -48,7 +48,7 @@ kernel and userspace, 'connector' is used as the interface for
 communication.
 
 There are currently two userspace log implementations that leverage this
-framework - "clustered_disk" and "clustered_core". These implementations
+framework - "clustered-disk" and "clustered-core". These implementations
 provide a cluster-coherent log for shared-storage. Device-mapper mirroring
 can be used in a shared-storage environment when the cluster log implementations
 are employed.
diff --git a/Documentation/device-mapper/persistent-data.txt b/Documentation/device-mapper/persistent-data.txt
new file mode 100644
index 00000000000..0e5df9b04ad
--- /dev/null
+++ b/Documentation/device-mapper/persistent-data.txt
@@ -0,0 +1,84 @@
1Introduction
2============
3
4The more-sophisticated device-mapper targets require complex metadata
5that is managed in kernel. In late 2010 we were seeing that various
6different targets were rolling their own data structures, for example:
7
8- Mikulas Patocka's multisnap implementation
9- Heinz Mauelshagen's thin provisioning target
10- Another btree-based caching target posted to dm-devel
11- Another multi-snapshot target based on a design of Daniel Phillips
12
13Maintaining these data structures takes a lot of work, so if possible
14we'd like to reduce the number.
15
16The persistent-data library is an attempt to provide a re-usable
17framework for people who want to store metadata in device-mapper
18targets. It's currently used by the thin-provisioning target and an
19upcoming hierarchical storage target.
20
21Overview
22========
23
24The main documentation is in the header files which can all be found
25under drivers/md/persistent-data.
26
27The block manager
28-----------------
29
30dm-block-manager.[hc]
31
32This provides access to the data on disk in fixed-sized blocks. There
33is a read/write locking interface to prevent concurrent accesses and to
34keep data that is being used in the cache.
35
36Clients of persistent-data are unlikely to use this directly.
37
38The transaction manager
39-----------------------
40
41dm-transaction-manager.[hc]
42
43This restricts access to blocks and enforces copy-on-write semantics.
44The only way you can get hold of a writable block through the
45transaction manager is by shadowing an existing block (ie. doing
46copy-on-write) or allocating a fresh one. Shadowing is elided within
47the same transaction so performance is reasonable. The commit method
48ensures that all data is flushed before it writes the superblock.
49On power failure your metadata will be as it was when last committed.
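
As an aside, the shadow/commit pattern described above can be modelled in a
few lines of userspace C. The sketch below is purely illustrative and is not
the dm-transaction-manager API: struct blk_store, tm_begin, tm_shadow and
tm_commit are hypothetical names standing in for the real calls declared in
dm-transaction-manager.h. Note how the second shadow of block 3 reuses the
first copy, which is the elision of repeated shadowing mentioned above.

  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  #define NBLOCKS  8
  #define BLK_SIZE 64

  struct blk_store {
          uint8_t data[NBLOCKS][BLK_SIZE];   /* committed, "on-disk" blocks  */
          uint8_t shadow[NBLOCKS][BLK_SIZE]; /* writable copies for this txn */
          int shadowed[NBLOCKS];             /* already shadowed this txn?   */
          unsigned superblock_generation;    /* stands in for the superblock */
  };

  static void tm_begin(struct blk_store *s)
  {
          memset(s->shadowed, 0, sizeof(s->shadowed));
  }

  /*
   * Copy-on-write: the first write to a block within a transaction copies
   * it to a shadow; later writes in the same transaction reuse that shadow.
   */
  static uint8_t *tm_shadow(struct blk_store *s, unsigned b)
  {
          if (!s->shadowed[b]) {
                  memcpy(s->shadow[b], s->data[b], BLK_SIZE);
                  s->shadowed[b] = 1;
          }
          return s->shadow[b];
  }

  /*
   * Commit: flush all shadowed blocks first, then write the superblock.
   * A crash before the superblock update leaves the previous state intact.
   */
  static void tm_commit(struct blk_store *s)
  {
          unsigned b;

          for (b = 0; b < NBLOCKS; b++)
                  if (s->shadowed[b])
                          memcpy(s->data[b], s->shadow[b], BLK_SIZE);
          s->superblock_generation++;
  }

  int main(void)
  {
          static struct blk_store s;

          tm_begin(&s);
          strcpy((char *)tm_shadow(&s, 3), "new metadata");
          strcpy((char *)tm_shadow(&s, 3), "newer metadata"); /* reuses shadow */
          tm_commit(&s);

          printf("generation %u, block 3: %s\n",
                 s.superblock_generation, (char *)s.data[3]);
          return 0;
  }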
50
51The Space Maps
52--------------
53
54dm-space-map.h
55dm-space-map-metadata.[hc]
56dm-space-map-disk.[hc]
57
58On-disk data structures that keep track of reference counts of blocks.
59Also acts as the allocator of new blocks. Currently two
60implementations: a simpler one for managing blocks on a different
61device (eg. thinly-provisioned data blocks); and one for managing
62the metadata space. The latter is complicated by the need to store
63its own data within the space it's managing.
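
Conceptually a space map is a persistent table of per-block reference counts
that also hands out free blocks. The toy below sketches only that contract;
sm_new_block, sm_inc and sm_dec are hypothetical names for illustration, not
the interface defined in dm-space-map.h.

  #include <stdint.h>
  #include <stdio.h>

  #define SM_NR_BLOCKS 16

  struct space_map {
          uint32_t count[SM_NR_BLOCKS];   /* reference count per block */
  };

  /* Allocate: find a block whose count is zero and take a reference. */
  static int sm_new_block(struct space_map *sm, uint64_t *result)
  {
          uint64_t b;

          for (b = 0; b < SM_NR_BLOCKS; b++) {
                  if (!sm->count[b]) {
                          sm->count[b] = 1;
                          *result = b;
                          return 0;
                  }
          }
          return -1;      /* no free blocks */
  }

  /* A block shared by a snapshot simply gains another reference. */
  static void sm_inc(struct space_map *sm, uint64_t b)
  {
          sm->count[b]++;
  }

  /* When the count drops back to zero the block is free to reallocate. */
  static void sm_dec(struct space_map *sm, uint64_t b)
  {
          if (sm->count[b])
                  sm->count[b]--;
  }

  int main(void)
  {
          struct space_map sm = { { 0 } };
          uint64_t b;

          if (sm_new_block(&sm, &b) == 0) {
                  sm_inc(&sm, b);         /* block becomes shared */
                  sm_dec(&sm, b);
                  sm_dec(&sm, b);         /* ...and is eventually freed */
                  printf("block %llu, final count %u\n",
                         (unsigned long long)b, sm.count[b]);
          }
          return 0;
  }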
64
65The data structures
66-------------------
67
68dm-btree.[hc]
69dm-btree-remove.c
70dm-btree-spine.c
71dm-btree-internal.h
72
73Currently there is only one data structure, a hierarchical btree.
74There are plans to add more. For example, something with an
75array-like interface would see a lot of use.
76
77The btree is 'hierarchical' in that you can define it to be composed
78of nested btrees, and take multiple keys. For example, the
79thin-provisioning target uses a btree with two levels of nesting.
80The first maps a device id to a mapping tree, and that in turn maps a
81virtual block to a physical block.
82
83Values stored in the btrees can have arbitrary size. Keys are always
84 64-bit, although nesting allows you to use multiple keys.
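
To make the two-level example concrete, the sketch below models the lookup
path with flat arrays. btree_lookup and the mapping table are hypothetical
stand-ins, not the dm-btree calls; they show only how key[0] (the device id)
selects the per-device mapping tree and key[1] (the virtual block) selects
the physical block.

  #include <stdint.h>
  #include <stdio.h>

  #define NR_DEVS    4
  #define NR_VBLOCKS 8
  #define UNMAPPED   UINT64_MAX

  /* dev id -> virtual block -> physical block */
  static uint64_t mapping[NR_DEVS][NR_VBLOCKS];

  static int btree_lookup(const uint64_t key[2], uint64_t *pblock)
  {
          uint64_t dev = key[0], vblock = key[1];

          if (dev >= NR_DEVS || vblock >= NR_VBLOCKS ||
              mapping[dev][vblock] == UNMAPPED)
                  return -1;              /* not provisioned yet */

          *pblock = mapping[dev][vblock];
          return 0;
  }

  int main(void)
  {
          uint64_t key[2] = { 1, 3 }, pblock;
          unsigned d, v;

          for (d = 0; d < NR_DEVS; d++)
                  for (v = 0; v < NR_VBLOCKS; v++)
                          mapping[d][v] = UNMAPPED;

          mapping[1][3] = 42;             /* provision one block */

          if (!btree_lookup(key, &pblock))
                  printf("dev %llu vblock %llu -> pblock %llu\n",
                         (unsigned long long)key[0],
                         (unsigned long long)key[1],
                         (unsigned long long)pblock);
          return 0;
  }
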
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
new file mode 100644
index 00000000000..801d9d1cf82
--- /dev/null
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -0,0 +1,285 @@
1Introduction
2============
3
4This document describes a collection of device-mapper targets that
5between them implement thin-provisioning and snapshots.
6
7The main highlight of this implementation, compared to the previous
8implementation of snapshots, is that it allows many virtual devices to
9be stored on the same data volume. This simplifies administration and
10allows the sharing of data between volumes, thus reducing disk usage.
11
12Another significant feature is support for an arbitrary depth of
13recursive snapshots (snapshots of snapshots of snapshots ...). The
14previous implementation of snapshots did this by chaining together
15lookup tables, and so performance was O(depth). This new
16implementation uses a single data structure to avoid this degradation
17with depth. Fragmentation may still be an issue, however, in some
18scenarios.
19
20Metadata is stored on a separate device from data, giving the
21administrator some freedom, for example to:
22
23- Improve metadata resilience by storing metadata on a mirrored volume
24 but data on a non-mirrored one.
25
26- Improve performance by storing the metadata on SSD.
27
28Status
29======
30
31These targets are very much still in the EXPERIMENTAL state. Please
32do not yet rely on them in production. But do experiment and offer us
33feedback. Different use cases will have different performance
34characteristics, for example due to fragmentation of the data volume.
35
36If you find this software is not performing as expected please mail
37dm-devel@redhat.com with details and we'll try our best to improve
38things for you.
39
40Userspace tools for checking and repairing the metadata are under
41development.
42
43Cookbook
44========
45
46This section describes some quick recipes for using thin provisioning.
47They use the dmsetup program to control the device-mapper driver
48directly. End users will be advised to use a higher-level volume
49manager such as LVM2 once support has been added.
50
51Pool device
52-----------
53
54The pool device ties together the metadata volume and the data volume.
55It maps I/O linearly to the data volume and updates the metadata via
56two mechanisms:
57
58- Function calls from the thin targets
59
60- Device-mapper 'messages' from userspace which control the creation of new
61 virtual devices amongst other things.
62
63Setting up a fresh pool device
64------------------------------
65
66Setting up a pool device requires a valid metadata device, and a
67data device. If you do not have an existing metadata device you can
68make one by zeroing the first 4k to indicate empty metadata.
69
70 dd if=/dev/zero of=$metadata_dev bs=4096 count=1
71
72The amount of metadata you need will vary according to how many blocks
73are shared between thin devices (i.e. through snapshots). If you have
74less sharing than average you'll need a larger-than-average metadata device.
75
76As a guide, we suggest you calculate the number of bytes to use in the
77metadata device as 48 * $data_dev_size / $data_block_size but round it up
78to 2MB if the answer is smaller. The largest size supported is 16GB.
79
80If you're creating large numbers of snapshots which are recording large
81amounts of change, you may find you need to increase this.
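
A sketch of that sizing rule as a small helper follows, assuming that
$data_dev_size and $data_block_size are given in the same unit (sectors in
this example) and that the suggested metadata size is wanted in bytes.

  #include <stdint.h>
  #include <stdio.h>

  static uint64_t thin_metadata_suggested_bytes(uint64_t data_dev_size,
                                                uint64_t data_block_size)
  {
          const uint64_t min_bytes = 2ULL * 1024 * 1024;          /* 2MB floor    */
          const uint64_t max_bytes = 16ULL * 1024 * 1024 * 1024;  /* 16GB ceiling */
          uint64_t bytes = 48 * data_dev_size / data_block_size;

          if (bytes < min_bytes)
                  bytes = min_bytes;
          if (bytes > max_bytes)
                  bytes = max_bytes;
          return bytes;
  }

  int main(void)
  {
          /* e.g. a 1TiB data device (in 512-byte sectors) with 64KB blocks */
          uint64_t data_dev_size = 2147483648ULL;  /* 1TiB / 512 */
          uint64_t data_block_size = 128;          /* 64KB / 512 */

          printf("suggested metadata size: %llu bytes\n",
                 (unsigned long long)thin_metadata_suggested_bytes(
                          data_dev_size, data_block_size));
          return 0;
  }

For this 1TiB/64KB example the rule suggests 805306368 bytes (768MB), which
sits comfortably inside the 2MB to 16GB window.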
82
83Reloading a pool table
84----------------------
85
86You may reload a pool's table; indeed, this is how the pool is resized
87if it runs out of space. (N.B. While specifying a different metadata
88device when reloading is not forbidden at the moment, things will go
89wrong if it does not route I/O to exactly the same on-disk location as
90previously.)
91
92Using an existing pool device
93-----------------------------
94
95 dmsetup create pool \
96 --table "0 20971520 thin-pool $metadata_dev $data_dev \
97 $data_block_size $low_water_mark"
98
99$data_block_size gives the smallest unit of disk space that can be
100allocated at a time expressed in units of 512-byte sectors. People
101primarily interested in thin provisioning may want to use a value such
102as 1024 (512KB). People doing lots of snapshotting may want a smaller value
103such as 128 (64KB). If you are not zeroing newly-allocated data,
104a larger $data_block_size in the region of 256000 (128MB) is suggested.
105$data_block_size must be the same for the lifetime of the
106metadata device.
107
108$low_water_mark is expressed in blocks of size $data_block_size. If
109free space on the data device drops below this level then a dm event
110will be triggered which a userspace daemon should catch allowing it to
111extend the pool device. Only one such event will be sent.
112Resuming a device with a new table itself triggers an event so the
113userspace daemon can use this to detect a situation where a new table
114already exceeds the threshold.
115
116Thin provisioning
117-----------------
118
119i) Creating a new thinly-provisioned volume.
120
121 To create a new thinly-provisioned volume you must send a message to an
122 active pool device, /dev/mapper/pool in this example.
123
124 dmsetup message /dev/mapper/pool 0 "create_thin 0"
125
126 Here '0' is an identifier for the volume, a 24-bit number. It's up
127 to the caller to allocate and manage these identifiers. If the
128 identifier is already in use, the message will fail with -EEXIST.
129
130ii) Using a thinly-provisioned volume.
131
132 Thinly-provisioned volumes are activated using the 'thin' target:
133
134 dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"
135
136 The last parameter is the identifier for the thinp device.
137
138Internal snapshots
139------------------
140
141i) Creating an internal snapshot.
142
143 Snapshots are created with another message to the pool.
144
145 N.B. If the origin device that you wish to snapshot is active, you
146 must suspend it before creating the snapshot to avoid corruption.
147 This is NOT enforced at the moment, so please be careful!
148
149 dmsetup suspend /dev/mapper/thin
150 dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
151 dmsetup resume /dev/mapper/thin
152
153 Here '1' is the identifier for the volume, a 24-bit number. '0' is the
154 identifier for the origin device.
155
156ii) Using an internal snapshot.
157
158 Once created, the user doesn't have to worry about any connection
159 between the origin and the snapshot. Indeed the snapshot is no
160 different from any other thinly-provisioned device and can be
161 snapshotted itself via the same method. It's perfectly legal to
162 have only one of them active, and there's no ordering requirement on
163 activating or removing them both. (This differs from conventional
164 device-mapper snapshots.)
165
166 Activate it exactly the same way as any other thinly-provisioned volume:
167
168 dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
169
170Deactivation
171------------
172
173All devices using a pool must be deactivated before the pool itself
174can be.
175
176 dmsetup remove thin
177 dmsetup remove snap
178 dmsetup remove pool
179
180Reference
181=========
182
183'thin-pool' target
184------------------
185
186i) Constructor
187
188 thin-pool <metadata dev> <data dev> <data block size (sectors)> \
189 <low water mark (blocks)> [<number of feature args> [<arg>]*]
190
191 Optional feature arguments:
192 - 'skip_block_zeroing': skips the zeroing of newly-provisioned blocks.
193
194 Data block size must be between 64KB (128 sectors) and 1GB
195 (2097152 sectors) inclusive.
196
197
198ii) Status
199
200 <transaction id> <used metadata blocks>/<total metadata blocks>
201 <used data blocks>/<total data blocks> <held metadata root>
202
203
204 transaction id:
205 A 64-bit number used by userspace to help synchronise with metadata
206 from volume managers.
207
208 used data blocks / total data blocks
209 If the number of free blocks drops below the pool's low water mark a
210 dm event will be sent to userspace. This event is edge-triggered and
211 it will occur only once after each resume so volume manager writers
212 should register for the event and then check the target's status.
213
214 held metadata root:
215 The location, in sectors, of the metadata root that has been
216 'held' for userspace read access. '-' indicates there is no
217 held root. This feature is not yet implemented so '-' is
218 always returned.
219
220iii) Messages
221
222 create_thin <dev id>
223
224 Create a new thinly-provisioned device.
225 <dev id> is an arbitrary unique 24-bit identifier chosen by
226 the caller.
227
228 create_snap <dev id> <origin id>
229
230 Create a new snapshot of another thinly-provisioned device.
231 <dev id> is an arbitrary unique 24-bit identifier chosen by
232 the caller.
233 <origin id> is the identifier of the thinly-provisioned device
234 of which the new device will be a snapshot.
235
236 delete <dev id>
237
238 Deletes a thin device. Irreversible.
239
240 trim <dev id> <new size in sectors>
241
242 Delete mappings from the end of a thin device. Irreversible.
243 You might want to use this if you're reducing the size of
244 your thinly-provisioned device. In many cases, due to the
245 sharing of blocks between devices, it is not possible to
246 determine in advance how much space 'trim' will release. (In
247 future a userspace tool might be able to perform this
248 calculation.)
249
250 set_transaction_id <current id> <new id>
251
252 Userland volume managers, such as LVM, need a way to
253 synchronise their external metadata with the internal metadata of the
254 pool target. The thin-pool target offers to store an
255 arbitrary 64-bit transaction id and return it on the target's
256 status line. To avoid races you must provide what you think
257 the current transaction id is when you change it with this
258 compare-and-swap message.
259
260'thin' target
261-------------
262
263i) Constructor
264
265 thin <pool dev> <dev id>
266
267 pool dev:
268 the thin-pool device, e.g. /dev/mapper/my_pool or 253:0
269
270 dev id:
271 the internal device identifier of the device to be
272 activated.
273
274The pool doesn't store any size against the thin devices. If you
275load a thin target that is smaller than you've been using previously,
276then you'll have no access to blocks mapped beyond the end. If you
277load a target that is bigger than before, then extra blocks will be
278provisioned as and when needed.
279
280If you wish to reduce the size of your thin device and potentially
281regain some space then send the 'trim' message to the pool.
282
283ii) Status
284
285 <nr mapped sectors> <highest mapped sector>
diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig
index f75a66e7d31..faa4741df6d 100644
--- a/drivers/md/Kconfig
+++ b/drivers/md/Kconfig
@@ -208,6 +208,16 @@ config DM_DEBUG
 
 	  If unsure, say N.
 
+config DM_BUFIO
+	tristate
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	---help---
+	  This interface allows you to do buffered I/O on a device and acts
+	  as a cache, holding recently-read blocks in memory and performing
+	  delayed writes.
+
+source "drivers/md/persistent-data/Kconfig"
+
 config DM_CRYPT
 	tristate "Crypt target support"
 	depends on BLK_DEV_DM
@@ -233,6 +243,32 @@ config DM_SNAPSHOT
 	---help---
 	  Allow volume managers to take writable snapshots of a device.
 
+config DM_THIN_PROVISIONING
+	tristate "Thin provisioning target (EXPERIMENTAL)"
+	depends on BLK_DEV_DM && EXPERIMENTAL
+	select DM_PERSISTENT_DATA
+	---help---
+	  Provides thin provisioning and snapshots that share a data store.
+
+config DM_DEBUG_BLOCK_STACK_TRACING
+	boolean "Keep stack trace of thin provisioning block lock holders"
+	depends on STACKTRACE_SUPPORT && DM_THIN_PROVISIONING
+	select STACKTRACE
+	---help---
+	  Enable this for messages that may help debug problems with the
+	  block manager locking used by thin provisioning.
+
+	  If unsure, say N.
+
+config DM_DEBUG_SPACE_MAPS
+	boolean "Extra validation for thin provisioning space maps"
+	depends on DM_THIN_PROVISIONING
+	---help---
+	  Enable this for messages that may help debug problems with the
+	  space maps used by thin provisioning.
+
+	  If unsure, say N.
+
 config DM_MIRROR
 	tristate "Mirror target"
 	depends on BLK_DEV_DM
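
The DM_BUFIO help text above only hints at how the cache is used. Below is a
sketch of a client of the interface this merge adds (dm-bufio.[hc], further
down in this diff), reading one block, modifying it and writing it back. The
4096-byte block size and the 8 bytes being cleared are arbitrary choices for
illustration, and error handling is abbreviated; dm_bufio_client_destroy()
comes from the new dm-bufio.h.

  #include <linux/device-mapper.h>
  #include <linux/err.h>
  #include <linux/string.h>

  #include "dm-bufio.h"

  /* Read one block of 'bdev' through dm-bufio, zero its first 8 bytes and
   * write it back.  Illustrative only. */
  static int update_one_block(struct block_device *bdev, sector_t block)
  {
          struct dm_bufio_client *c;
          struct dm_buffer *b;
          void *data;
          int r;

          /* 4k blocks, one reserved buffer, no aux data, no callbacks */
          c = dm_bufio_client_create(bdev, 4096, 1, 0, NULL, NULL);
          if (IS_ERR(c))
                  return PTR_ERR(c);

          data = dm_bufio_read(c, block, &b);     /* reads, or finds a cached copy */
          if (IS_ERR(data)) {
                  r = PTR_ERR(data);
                  goto out;
          }

          memset(data, 0, 8);                     /* modify the cached block      */
          dm_bufio_mark_buffer_dirty(b);          /* queue it for delayed write   */
          dm_bufio_release(b);                    /* drop our hold on the buffer  */

          r = dm_bufio_write_dirty_buffers(c);    /* write back and flush         */
  out:
          dm_bufio_client_destroy(c);
          return r;
  }
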
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 448838b1f92..046860c7a16 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -10,6 +10,7 @@ dm-snapshot-y += dm-snap.o dm-exception-store.o dm-snap-transient.o \
 dm-mirror-y += dm-raid1.o
 dm-log-userspace-y \
 		+= dm-log-userspace-base.o dm-log-userspace-transfer.o
+dm-thin-pool-y += dm-thin.o dm-thin-metadata.o
 md-mod-y += md.o bitmap.o
 raid456-y += raid5.o
 
@@ -27,6 +28,7 @@ obj-$(CONFIG_MD_MULTIPATH) += multipath.o
 obj-$(CONFIG_MD_FAULTY) += faulty.o
 obj-$(CONFIG_BLK_DEV_MD) += md-mod.o
 obj-$(CONFIG_BLK_DEV_DM) += dm-mod.o
+obj-$(CONFIG_DM_BUFIO) += dm-bufio.o
 obj-$(CONFIG_DM_CRYPT) += dm-crypt.o
 obj-$(CONFIG_DM_DELAY) += dm-delay.o
 obj-$(CONFIG_DM_FLAKEY) += dm-flakey.o
@@ -34,10 +36,12 @@ obj-$(CONFIG_DM_MULTIPATH) += dm-multipath.o dm-round-robin.o
 obj-$(CONFIG_DM_MULTIPATH_QL) += dm-queue-length.o
 obj-$(CONFIG_DM_MULTIPATH_ST) += dm-service-time.o
 obj-$(CONFIG_DM_SNAPSHOT) += dm-snapshot.o
+obj-$(CONFIG_DM_PERSISTENT_DATA) += persistent-data/
 obj-$(CONFIG_DM_MIRROR) += dm-mirror.o dm-log.o dm-region-hash.o
 obj-$(CONFIG_DM_LOG_USERSPACE) += dm-log-userspace.o
 obj-$(CONFIG_DM_ZERO) += dm-zero.o
 obj-$(CONFIG_DM_RAID) += dm-raid.o
+obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o
 
 ifeq ($(CONFIG_DM_UEVENT),y)
 dm-mod-objs += dm-uevent.o
diff --git a/drivers/md/dm-bufio.c b/drivers/md/dm-bufio.c
new file mode 100644
index 00000000000..cb246667dd5
--- /dev/null
+++ b/drivers/md/dm-bufio.c
@@ -0,0 +1,1699 @@
1/*
2 * Copyright (C) 2009-2011 Red Hat, Inc.
3 *
4 * Author: Mikulas Patocka <mpatocka@redhat.com>
5 *
6 * This file is released under the GPL.
7 */
8
9#include "dm-bufio.h"
10
11#include <linux/device-mapper.h>
12#include <linux/dm-io.h>
13#include <linux/slab.h>
14#include <linux/vmalloc.h>
15#include <linux/version.h>
16#include <linux/shrinker.h>
17
18#define DM_MSG_PREFIX "bufio"
19
20/*
21 * Memory management policy:
22 * Limit the number of buffers to DM_BUFIO_MEMORY_PERCENT of main memory
23 * or DM_BUFIO_VMALLOC_PERCENT of vmalloc memory (whichever is lower).
24 * Always allocate at least DM_BUFIO_MIN_BUFFERS buffers.
25 * Start background writeback when there are DM_BUFIO_WRITEBACK_PERCENT
26 * dirty buffers.
27 */
28#define DM_BUFIO_MIN_BUFFERS 8
29
30#define DM_BUFIO_MEMORY_PERCENT 2
31#define DM_BUFIO_VMALLOC_PERCENT 25
32#define DM_BUFIO_WRITEBACK_PERCENT 75
33
34/*
35 * Check buffer ages in this interval (seconds)
36 */
37#define DM_BUFIO_WORK_TIMER_SECS 10
38
39/*
40 * Free buffers when they are older than this (seconds)
41 */
42#define DM_BUFIO_DEFAULT_AGE_SECS 60
43
44/*
45 * The number of bvec entries that are embedded directly in the buffer.
46 * If the chunk size is larger, dm-io is used to do the io.
47 */
48#define DM_BUFIO_INLINE_VECS 16
49
50/*
51 * Buffer hash
52 */
53#define DM_BUFIO_HASH_BITS 20
54#define DM_BUFIO_HASH(block) \
55 ((((block) >> DM_BUFIO_HASH_BITS) ^ (block)) & \
56 ((1 << DM_BUFIO_HASH_BITS) - 1))
57
58/*
59 * Don't try to use kmem_cache_alloc for blocks larger than this.
60 * For explanation, see alloc_buffer_data below.
61 */
62#define DM_BUFIO_BLOCK_SIZE_SLAB_LIMIT (PAGE_SIZE >> 1)
63#define DM_BUFIO_BLOCK_SIZE_GFP_LIMIT (PAGE_SIZE << (MAX_ORDER - 1))
64
65/*
66 * dm_buffer->list_mode
67 */
68#define LIST_CLEAN 0
69#define LIST_DIRTY 1
70#define LIST_SIZE 2
71
72/*
73 * Linking of buffers:
74 * All buffers are linked to cache_hash with their hash_list field.
75 *
76 * Clean buffers that are not being written (B_WRITING not set)
77 * are linked to lru[LIST_CLEAN] with their lru_list field.
78 *
79 * Dirty and clean buffers that are being written are linked to
80 * lru[LIST_DIRTY] with their lru_list field. When the write
81 * finishes, the buffer cannot be relinked immediately (because we
82 * are in an interrupt context and relinking requires process
83 * context), so some clean-not-writing buffers can be held on
84 * dirty_lru too. They are later added to lru in the process
85 * context.
86 */
87struct dm_bufio_client {
88 struct mutex lock;
89
90 struct list_head lru[LIST_SIZE];
91 unsigned long n_buffers[LIST_SIZE];
92
93 struct block_device *bdev;
94 unsigned block_size;
95 unsigned char sectors_per_block_bits;
96 unsigned char pages_per_block_bits;
97 unsigned char blocks_per_page_bits;
98 unsigned aux_size;
99 void (*alloc_callback)(struct dm_buffer *);
100 void (*write_callback)(struct dm_buffer *);
101
102 struct dm_io_client *dm_io;
103
104 struct list_head reserved_buffers;
105 unsigned need_reserved_buffers;
106
107 struct hlist_head *cache_hash;
108 wait_queue_head_t free_buffer_wait;
109
110 int async_write_error;
111
112 struct list_head client_list;
113 struct shrinker shrinker;
114};
115
116/*
117 * Buffer state bits.
118 */
119#define B_READING 0
120#define B_WRITING 1
121#define B_DIRTY 2
122
123/*
124 * Describes how the block was allocated:
125 * kmem_cache_alloc(), __get_free_pages() or vmalloc().
126 * See the comment at alloc_buffer_data.
127 */
128enum data_mode {
129 DATA_MODE_SLAB = 0,
130 DATA_MODE_GET_FREE_PAGES = 1,
131 DATA_MODE_VMALLOC = 2,
132 DATA_MODE_LIMIT = 3
133};
134
135struct dm_buffer {
136 struct hlist_node hash_list;
137 struct list_head lru_list;
138 sector_t block;
139 void *data;
140 enum data_mode data_mode;
141 unsigned char list_mode; /* LIST_* */
142 unsigned hold_count;
143 int read_error;
144 int write_error;
145 unsigned long state;
146 unsigned long last_accessed;
147 struct dm_bufio_client *c;
148 struct bio bio;
149 struct bio_vec bio_vec[DM_BUFIO_INLINE_VECS];
150};
151
152/*----------------------------------------------------------------*/
153
154static struct kmem_cache *dm_bufio_caches[PAGE_SHIFT - SECTOR_SHIFT];
155static char *dm_bufio_cache_names[PAGE_SHIFT - SECTOR_SHIFT];
156
157static inline int dm_bufio_cache_index(struct dm_bufio_client *c)
158{
159 unsigned ret = c->blocks_per_page_bits - 1;
160
161 BUG_ON(ret >= ARRAY_SIZE(dm_bufio_caches));
162
163 return ret;
164}
165
166#define DM_BUFIO_CACHE(c) (dm_bufio_caches[dm_bufio_cache_index(c)])
167#define DM_BUFIO_CACHE_NAME(c) (dm_bufio_cache_names[dm_bufio_cache_index(c)])
168
169#define dm_bufio_in_request() (!!current->bio_list)
170
171static void dm_bufio_lock(struct dm_bufio_client *c)
172{
173 mutex_lock_nested(&c->lock, dm_bufio_in_request());
174}
175
176static int dm_bufio_trylock(struct dm_bufio_client *c)
177{
178 return mutex_trylock(&c->lock);
179}
180
181static void dm_bufio_unlock(struct dm_bufio_client *c)
182{
183 mutex_unlock(&c->lock);
184}
185
186/*
187 * FIXME Move to sched.h?
188 */
189#ifdef CONFIG_PREEMPT_VOLUNTARY
190# define dm_bufio_cond_resched() \
191do { \
192 if (unlikely(need_resched())) \
193 _cond_resched(); \
194} while (0)
195#else
196# define dm_bufio_cond_resched() do { } while (0)
197#endif
198
199/*----------------------------------------------------------------*/
200
201/*
202 * Default cache size: available memory divided by the ratio.
203 */
204static unsigned long dm_bufio_default_cache_size;
205
206/*
207 * Total cache size set by the user.
208 */
209static unsigned long dm_bufio_cache_size;
210
211/*
212 * A copy of dm_bufio_cache_size because dm_bufio_cache_size can change
213 * at any time. If it disagrees, the user has changed cache size.
214 */
215static unsigned long dm_bufio_cache_size_latch;
216
217static DEFINE_SPINLOCK(param_spinlock);
218
219/*
220 * Buffers are freed after this timeout
221 */
222static unsigned dm_bufio_max_age = DM_BUFIO_DEFAULT_AGE_SECS;
223
224static unsigned long dm_bufio_peak_allocated;
225static unsigned long dm_bufio_allocated_kmem_cache;
226static unsigned long dm_bufio_allocated_get_free_pages;
227static unsigned long dm_bufio_allocated_vmalloc;
228static unsigned long dm_bufio_current_allocated;
229
230/*----------------------------------------------------------------*/
231
232/*
233 * Per-client cache: dm_bufio_cache_size / dm_bufio_client_count
234 */
235static unsigned long dm_bufio_cache_size_per_client;
236
237/*
238 * The current number of clients.
239 */
240static int dm_bufio_client_count;
241
242/*
243 * The list of all clients.
244 */
245static LIST_HEAD(dm_bufio_all_clients);
246
247/*
248 * This mutex protects dm_bufio_cache_size_latch,
249 * dm_bufio_cache_size_per_client and dm_bufio_client_count
250 */
251static DEFINE_MUTEX(dm_bufio_clients_lock);
252
253/*----------------------------------------------------------------*/
254
255static void adjust_total_allocated(enum data_mode data_mode, long diff)
256{
257 static unsigned long * const class_ptr[DATA_MODE_LIMIT] = {
258 &dm_bufio_allocated_kmem_cache,
259 &dm_bufio_allocated_get_free_pages,
260 &dm_bufio_allocated_vmalloc,
261 };
262
263 spin_lock(&param_spinlock);
264
265 *class_ptr[data_mode] += diff;
266
267 dm_bufio_current_allocated += diff;
268
269 if (dm_bufio_current_allocated > dm_bufio_peak_allocated)
270 dm_bufio_peak_allocated = dm_bufio_current_allocated;
271
272 spin_unlock(&param_spinlock);
273}
274
275/*
276 * Change the number of clients and recalculate per-client limit.
277 */
278static void __cache_size_refresh(void)
279{
280 BUG_ON(!mutex_is_locked(&dm_bufio_clients_lock));
281 BUG_ON(dm_bufio_client_count < 0);
282
283 dm_bufio_cache_size_latch = dm_bufio_cache_size;
284
285 barrier();
286
287 /*
288 * Use default if set to 0 and report the actual cache size used.
289 */
290 if (!dm_bufio_cache_size_latch) {
291 (void)cmpxchg(&dm_bufio_cache_size, 0,
292 dm_bufio_default_cache_size);
293 dm_bufio_cache_size_latch = dm_bufio_default_cache_size;
294 }
295
296 dm_bufio_cache_size_per_client = dm_bufio_cache_size_latch /
297 (dm_bufio_client_count ? : 1);
298}
299
300/*
301 * Allocating buffer data.
302 *
303 * Small buffers are allocated with kmem_cache, to use space optimally.
304 *
305 * For large buffers, we choose between get_free_pages and vmalloc.
306 * Each has advantages and disadvantages.
307 *
308 * __get_free_pages can randomly fail if the memory is fragmented.
309 * __vmalloc won't randomly fail, but vmalloc space is limited (it may be
310 * as low as 128M) so using it for caching is not appropriate.
311 *
312 * If the allocation may fail we use __get_free_pages. Memory fragmentation
313 * won't have a fatal effect here, but it just causes flushes of some other
314 * buffers and more I/O will be performed. Don't use __get_free_pages if it
315 * always fails (i.e. order >= MAX_ORDER).
316 *
317 * If the allocation shouldn't fail we use __vmalloc. This is only for the
318 * initial reserve allocation, so there's no risk of wasting all vmalloc
319 * space.
320 */
321static void *alloc_buffer_data(struct dm_bufio_client *c, gfp_t gfp_mask,
322 enum data_mode *data_mode)
323{
324 if (c->block_size <= DM_BUFIO_BLOCK_SIZE_SLAB_LIMIT) {
325 *data_mode = DATA_MODE_SLAB;
326 return kmem_cache_alloc(DM_BUFIO_CACHE(c), gfp_mask);
327 }
328
329 if (c->block_size <= DM_BUFIO_BLOCK_SIZE_GFP_LIMIT &&
330 gfp_mask & __GFP_NORETRY) {
331 *data_mode = DATA_MODE_GET_FREE_PAGES;
332 return (void *)__get_free_pages(gfp_mask,
333 c->pages_per_block_bits);
334 }
335
336 *data_mode = DATA_MODE_VMALLOC;
337 return __vmalloc(c->block_size, gfp_mask, PAGE_KERNEL);
338}
339
340/*
341 * Free buffer's data.
342 */
343static void free_buffer_data(struct dm_bufio_client *c,
344 void *data, enum data_mode data_mode)
345{
346 switch (data_mode) {
347 case DATA_MODE_SLAB:
348 kmem_cache_free(DM_BUFIO_CACHE(c), data);
349 break;
350
351 case DATA_MODE_GET_FREE_PAGES:
352 free_pages((unsigned long)data, c->pages_per_block_bits);
353 break;
354
355 case DATA_MODE_VMALLOC:
356 vfree(data);
357 break;
358
359 default:
360 DMCRIT("dm_bufio_free_buffer_data: bad data mode: %d",
361 data_mode);
362 BUG();
363 }
364}
365
366/*
367 * Allocate buffer and its data.
368 */
369static struct dm_buffer *alloc_buffer(struct dm_bufio_client *c, gfp_t gfp_mask)
370{
371 struct dm_buffer *b = kmalloc(sizeof(struct dm_buffer) + c->aux_size,
372 gfp_mask);
373
374 if (!b)
375 return NULL;
376
377 b->c = c;
378
379 b->data = alloc_buffer_data(c, gfp_mask, &b->data_mode);
380 if (!b->data) {
381 kfree(b);
382 return NULL;
383 }
384
385 adjust_total_allocated(b->data_mode, (long)c->block_size);
386
387 return b;
388}
389
390/*
391 * Free buffer and its data.
392 */
393static void free_buffer(struct dm_buffer *b)
394{
395 struct dm_bufio_client *c = b->c;
396
397 adjust_total_allocated(b->data_mode, -(long)c->block_size);
398
399 free_buffer_data(c, b->data, b->data_mode);
400 kfree(b);
401}
402
403/*
404 * Link buffer to the hash list and clean or dirty queue.
405 */
406static void __link_buffer(struct dm_buffer *b, sector_t block, int dirty)
407{
408 struct dm_bufio_client *c = b->c;
409
410 c->n_buffers[dirty]++;
411 b->block = block;
412 b->list_mode = dirty;
413 list_add(&b->lru_list, &c->lru[dirty]);
414 hlist_add_head(&b->hash_list, &c->cache_hash[DM_BUFIO_HASH(block)]);
415 b->last_accessed = jiffies;
416}
417
418/*
419 * Unlink buffer from the hash list and dirty or clean queue.
420 */
421static void __unlink_buffer(struct dm_buffer *b)
422{
423 struct dm_bufio_client *c = b->c;
424
425 BUG_ON(!c->n_buffers[b->list_mode]);
426
427 c->n_buffers[b->list_mode]--;
428 hlist_del(&b->hash_list);
429 list_del(&b->lru_list);
430}
431
432/*
433 * Place the buffer to the head of dirty or clean LRU queue.
434 */
435static void __relink_lru(struct dm_buffer *b, int dirty)
436{
437 struct dm_bufio_client *c = b->c;
438
439 BUG_ON(!c->n_buffers[b->list_mode]);
440
441 c->n_buffers[b->list_mode]--;
442 c->n_buffers[dirty]++;
443 b->list_mode = dirty;
444 list_del(&b->lru_list);
445 list_add(&b->lru_list, &c->lru[dirty]);
446}
447
448/*----------------------------------------------------------------
449 * Submit I/O on the buffer.
450 *
451 * Bio interface is faster but it has some problems:
452 * the vector list is limited (increasing this limit increases
453 * memory-consumption per buffer, so it is not viable);
454 *
455 * the memory must be direct-mapped, not vmalloced;
456 *
457 * the I/O driver can reject requests spuriously if it thinks that
458 * the requests are too big for the device or if they cross a
459 * controller-defined memory boundary.
460 *
461 * If the buffer is small enough (up to DM_BUFIO_INLINE_VECS pages) and
462 * it is not vmalloced, try using the bio interface.
463 *
464 * If the buffer is big, if it is vmalloced or if the underlying device
465 * rejects the bio because it is too large, use dm-io layer to do the I/O.
466 * The dm-io layer splits the I/O into multiple requests, avoiding the above
467 * shortcomings.
468 *--------------------------------------------------------------*/
469
470/*
471 * dm-io completion routine. It just calls b->bio.bi_end_io, pretending
472 * that the request was handled directly with the bio interface.
473 */
474static void dmio_complete(unsigned long error, void *context)
475{
476 struct dm_buffer *b = context;
477
478 b->bio.bi_end_io(&b->bio, error ? -EIO : 0);
479}
480
481static void use_dmio(struct dm_buffer *b, int rw, sector_t block,
482 bio_end_io_t *end_io)
483{
484 int r;
485 struct dm_io_request io_req = {
486 .bi_rw = rw,
487 .notify.fn = dmio_complete,
488 .notify.context = b,
489 .client = b->c->dm_io,
490 };
491 struct dm_io_region region = {
492 .bdev = b->c->bdev,
493 .sector = block << b->c->sectors_per_block_bits,
494 .count = b->c->block_size >> SECTOR_SHIFT,
495 };
496
497 if (b->data_mode != DATA_MODE_VMALLOC) {
498 io_req.mem.type = DM_IO_KMEM;
499 io_req.mem.ptr.addr = b->data;
500 } else {
501 io_req.mem.type = DM_IO_VMA;
502 io_req.mem.ptr.vma = b->data;
503 }
504
505 b->bio.bi_end_io = end_io;
506
507 r = dm_io(&io_req, 1, &region, NULL);
508 if (r)
509 end_io(&b->bio, r);
510}
511
512static void use_inline_bio(struct dm_buffer *b, int rw, sector_t block,
513 bio_end_io_t *end_io)
514{
515 char *ptr;
516 int len;
517
518 bio_init(&b->bio);
519 b->bio.bi_io_vec = b->bio_vec;
520 b->bio.bi_max_vecs = DM_BUFIO_INLINE_VECS;
521 b->bio.bi_sector = block << b->c->sectors_per_block_bits;
522 b->bio.bi_bdev = b->c->bdev;
523 b->bio.bi_end_io = end_io;
524
525 /*
526 * We assume that if len >= PAGE_SIZE ptr is page-aligned.
527 * If len < PAGE_SIZE the buffer doesn't cross page boundary.
528 */
529 ptr = b->data;
530 len = b->c->block_size;
531
532 if (len >= PAGE_SIZE)
533 BUG_ON((unsigned long)ptr & (PAGE_SIZE - 1));
534 else
535 BUG_ON((unsigned long)ptr & (len - 1));
536
537 do {
538 if (!bio_add_page(&b->bio, virt_to_page(ptr),
539 len < PAGE_SIZE ? len : PAGE_SIZE,
540 virt_to_phys(ptr) & (PAGE_SIZE - 1))) {
541 BUG_ON(b->c->block_size <= PAGE_SIZE);
542 use_dmio(b, rw, block, end_io);
543 return;
544 }
545
546 len -= PAGE_SIZE;
547 ptr += PAGE_SIZE;
548 } while (len > 0);
549
550 submit_bio(rw, &b->bio);
551}
552
553static void submit_io(struct dm_buffer *b, int rw, sector_t block,
554 bio_end_io_t *end_io)
555{
556 if (rw == WRITE && b->c->write_callback)
557 b->c->write_callback(b);
558
559 if (b->c->block_size <= DM_BUFIO_INLINE_VECS * PAGE_SIZE &&
560 b->data_mode != DATA_MODE_VMALLOC)
561 use_inline_bio(b, rw, block, end_io);
562 else
563 use_dmio(b, rw, block, end_io);
564}
565
566/*----------------------------------------------------------------
567 * Writing dirty buffers
568 *--------------------------------------------------------------*/
569
570/*
571 * The endio routine for write.
572 *
573 * Set the error, clear B_WRITING bit and wake anyone who was waiting on
574 * it.
575 */
576static void write_endio(struct bio *bio, int error)
577{
578 struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);
579
580 b->write_error = error;
581 if (error) {
582 struct dm_bufio_client *c = b->c;
583 (void)cmpxchg(&c->async_write_error, 0, error);
584 }
585
586 BUG_ON(!test_bit(B_WRITING, &b->state));
587
588 smp_mb__before_clear_bit();
589 clear_bit(B_WRITING, &b->state);
590 smp_mb__after_clear_bit();
591
592 wake_up_bit(&b->state, B_WRITING);
593}
594
595/*
596 * This function is called when wait_on_bit is actually waiting.
597 */
598static int do_io_schedule(void *word)
599{
600 io_schedule();
601
602 return 0;
603}
604
605/*
606 * Initiate a write on a dirty buffer, but don't wait for it.
607 *
608 * - If the buffer is not dirty, exit.
609 * - If there is some previous write going on, wait for it to finish (we can't
610 * have two writes on the same buffer simultaneously).
611 * - Submit our write and don't wait on it. We set B_WRITING indicating
612 * that there is a write in progress.
613 */
614static void __write_dirty_buffer(struct dm_buffer *b)
615{
616 if (!test_bit(B_DIRTY, &b->state))
617 return;
618
619 clear_bit(B_DIRTY, &b->state);
620 wait_on_bit_lock(&b->state, B_WRITING,
621 do_io_schedule, TASK_UNINTERRUPTIBLE);
622
623 submit_io(b, WRITE, b->block, write_endio);
624}
625
626/*
627 * Wait until any activity on the buffer finishes. Possibly write the
628 * buffer if it is dirty. When this function finishes, there is no I/O
629 * running on the buffer and the buffer is not dirty.
630 */
631static void __make_buffer_clean(struct dm_buffer *b)
632{
633 BUG_ON(b->hold_count);
634
635 if (!b->state) /* fast case */
636 return;
637
638 wait_on_bit(&b->state, B_READING, do_io_schedule, TASK_UNINTERRUPTIBLE);
639 __write_dirty_buffer(b);
640 wait_on_bit(&b->state, B_WRITING, do_io_schedule, TASK_UNINTERRUPTIBLE);
641}
642
643/*
644 * Find some buffer that is not held by anybody, clean it, unlink it and
645 * return it.
646 */
647static struct dm_buffer *__get_unclaimed_buffer(struct dm_bufio_client *c)
648{
649 struct dm_buffer *b;
650
651 list_for_each_entry_reverse(b, &c->lru[LIST_CLEAN], lru_list) {
652 BUG_ON(test_bit(B_WRITING, &b->state));
653 BUG_ON(test_bit(B_DIRTY, &b->state));
654
655 if (!b->hold_count) {
656 __make_buffer_clean(b);
657 __unlink_buffer(b);
658 return b;
659 }
660 dm_bufio_cond_resched();
661 }
662
663 list_for_each_entry_reverse(b, &c->lru[LIST_DIRTY], lru_list) {
664 BUG_ON(test_bit(B_READING, &b->state));
665
666 if (!b->hold_count) {
667 __make_buffer_clean(b);
668 __unlink_buffer(b);
669 return b;
670 }
671 dm_bufio_cond_resched();
672 }
673
674 return NULL;
675}
676
677/*
678 * Wait until some other threads free some buffer or release hold count on
679 * some buffer.
680 *
681 * This function is entered with c->lock held, drops it and regains it
682 * before exiting.
683 */
684static void __wait_for_free_buffer(struct dm_bufio_client *c)
685{
686 DECLARE_WAITQUEUE(wait, current);
687
688 add_wait_queue(&c->free_buffer_wait, &wait);
689 set_task_state(current, TASK_UNINTERRUPTIBLE);
690 dm_bufio_unlock(c);
691
692 io_schedule();
693
694 set_task_state(current, TASK_RUNNING);
695 remove_wait_queue(&c->free_buffer_wait, &wait);
696
697 dm_bufio_lock(c);
698}
699
700/*
701 * Allocate a new buffer. If the allocation is not possible, wait until
702 * some other thread frees a buffer.
703 *
704 * May drop the lock and regain it.
705 */
706static struct dm_buffer *__alloc_buffer_wait_no_callback(struct dm_bufio_client *c)
707{
708 struct dm_buffer *b;
709
710 /*
711 * dm-bufio is resistant to allocation failures (it just keeps
712 * one buffer reserved in case all the allocations fail).
713 * So set flags to not try too hard:
714 * GFP_NOIO: don't recurse into the I/O layer
715 * __GFP_NORETRY: don't retry and rather return failure
716 * __GFP_NOMEMALLOC: don't use emergency reserves
717 * __GFP_NOWARN: don't print a warning in case of failure
718 *
719 * For debugging, if we set the cache size to 1, no new buffers will
720 * be allocated.
721 */
722 while (1) {
723 if (dm_bufio_cache_size_latch != 1) {
724 b = alloc_buffer(c, GFP_NOIO | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN);
725 if (b)
726 return b;
727 }
728
729 if (!list_empty(&c->reserved_buffers)) {
730 b = list_entry(c->reserved_buffers.next,
731 struct dm_buffer, lru_list);
732 list_del(&b->lru_list);
733 c->need_reserved_buffers++;
734
735 return b;
736 }
737
738 b = __get_unclaimed_buffer(c);
739 if (b)
740 return b;
741
742 __wait_for_free_buffer(c);
743 }
744}
745
746static struct dm_buffer *__alloc_buffer_wait(struct dm_bufio_client *c)
747{
748 struct dm_buffer *b = __alloc_buffer_wait_no_callback(c);
749
750 if (c->alloc_callback)
751 c->alloc_callback(b);
752
753 return b;
754}
755
756/*
757 * Free a buffer and wake other threads waiting for free buffers.
758 */
759static void __free_buffer_wake(struct dm_buffer *b)
760{
761 struct dm_bufio_client *c = b->c;
762
763 if (!c->need_reserved_buffers)
764 free_buffer(b);
765 else {
766 list_add(&b->lru_list, &c->reserved_buffers);
767 c->need_reserved_buffers--;
768 }
769
770 wake_up(&c->free_buffer_wait);
771}
772
773static void __write_dirty_buffers_async(struct dm_bufio_client *c, int no_wait)
774{
775 struct dm_buffer *b, *tmp;
776
777 list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_DIRTY], lru_list) {
778 BUG_ON(test_bit(B_READING, &b->state));
779
780 if (!test_bit(B_DIRTY, &b->state) &&
781 !test_bit(B_WRITING, &b->state)) {
782 __relink_lru(b, LIST_CLEAN);
783 continue;
784 }
785
786 if (no_wait && test_bit(B_WRITING, &b->state))
787 return;
788
789 __write_dirty_buffer(b);
790 dm_bufio_cond_resched();
791 }
792}
793
794/*
795 * Get writeback threshold and buffer limit for a given client.
796 */
797static void __get_memory_limit(struct dm_bufio_client *c,
798 unsigned long *threshold_buffers,
799 unsigned long *limit_buffers)
800{
801 unsigned long buffers;
802
803 if (dm_bufio_cache_size != dm_bufio_cache_size_latch) {
804 mutex_lock(&dm_bufio_clients_lock);
805 __cache_size_refresh();
806 mutex_unlock(&dm_bufio_clients_lock);
807 }
808
809 buffers = dm_bufio_cache_size_per_client >>
810 (c->sectors_per_block_bits + SECTOR_SHIFT);
811
812 if (buffers < DM_BUFIO_MIN_BUFFERS)
813 buffers = DM_BUFIO_MIN_BUFFERS;
814
815 *limit_buffers = buffers;
816 *threshold_buffers = buffers * DM_BUFIO_WRITEBACK_PERCENT / 100;
817}
818
819/*
820 * Check if we're over watermark.
821 * If we are over threshold_buffers, start freeing buffers.
822 * If we're over "limit_buffers", block until we get under the limit.
823 */
824static void __check_watermark(struct dm_bufio_client *c)
825{
826 unsigned long threshold_buffers, limit_buffers;
827
828 __get_memory_limit(c, &threshold_buffers, &limit_buffers);
829
830 while (c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY] >
831 limit_buffers) {
832
833 struct dm_buffer *b = __get_unclaimed_buffer(c);
834
835 if (!b)
836 return;
837
838 __free_buffer_wake(b);
839 dm_bufio_cond_resched();
840 }
841
842 if (c->n_buffers[LIST_DIRTY] > threshold_buffers)
843 __write_dirty_buffers_async(c, 1);
844}
845
846/*
847 * Find a buffer in the hash.
848 */
849static struct dm_buffer *__find(struct dm_bufio_client *c, sector_t block)
850{
851 struct dm_buffer *b;
852 struct hlist_node *hn;
853
854 hlist_for_each_entry(b, hn, &c->cache_hash[DM_BUFIO_HASH(block)],
855 hash_list) {
856 dm_bufio_cond_resched();
857 if (b->block == block)
858 return b;
859 }
860
861 return NULL;
862}
863
864/*----------------------------------------------------------------
865 * Getting a buffer
866 *--------------------------------------------------------------*/
867
868enum new_flag {
869 NF_FRESH = 0,
870 NF_READ = 1,
871 NF_GET = 2
872};
873
874static struct dm_buffer *__bufio_new(struct dm_bufio_client *c, sector_t block,
875 enum new_flag nf, struct dm_buffer **bp,
876 int *need_submit)
877{
878 struct dm_buffer *b, *new_b = NULL;
879
880 *need_submit = 0;
881
882 b = __find(c, block);
883 if (b) {
884 b->hold_count++;
885 __relink_lru(b, test_bit(B_DIRTY, &b->state) ||
886 test_bit(B_WRITING, &b->state));
887 return b;
888 }
889
890 if (nf == NF_GET)
891 return NULL;
892
893 new_b = __alloc_buffer_wait(c);
894
895 /*
896 * We've had a period where the mutex was unlocked, so need to
897 * recheck the hash table.
898 */
899 b = __find(c, block);
900 if (b) {
901 __free_buffer_wake(new_b);
902 b->hold_count++;
903 __relink_lru(b, test_bit(B_DIRTY, &b->state) ||
904 test_bit(B_WRITING, &b->state));
905 return b;
906 }
907
908 __check_watermark(c);
909
910 b = new_b;
911 b->hold_count = 1;
912 b->read_error = 0;
913 b->write_error = 0;
914 __link_buffer(b, block, LIST_CLEAN);
915
916 if (nf == NF_FRESH) {
917 b->state = 0;
918 return b;
919 }
920
921 b->state = 1 << B_READING;
922 *need_submit = 1;
923
924 return b;
925}
926
927/*
928 * The endio routine for reading: set the error, clear the bit and wake up
929 * anyone waiting on the buffer.
930 */
931static void read_endio(struct bio *bio, int error)
932{
933 struct dm_buffer *b = container_of(bio, struct dm_buffer, bio);
934
935 b->read_error = error;
936
937 BUG_ON(!test_bit(B_READING, &b->state));
938
939 smp_mb__before_clear_bit();
940 clear_bit(B_READING, &b->state);
941 smp_mb__after_clear_bit();
942
943 wake_up_bit(&b->state, B_READING);
944}
945
946/*
947 * A common routine for dm_bufio_new and dm_bufio_read. Operation of these
948 * functions is similar except that dm_bufio_new doesn't read the
949 * buffer from the disk (assuming that the caller overwrites all the data
950 * and uses dm_bufio_mark_buffer_dirty to write new data back).
951 */
952static void *new_read(struct dm_bufio_client *c, sector_t block,
953 enum new_flag nf, struct dm_buffer **bp)
954{
955 int need_submit;
956 struct dm_buffer *b;
957
958 dm_bufio_lock(c);
959 b = __bufio_new(c, block, nf, bp, &need_submit);
960 dm_bufio_unlock(c);
961
962 if (!b || IS_ERR(b))
963 return b;
964
965 if (need_submit)
966 submit_io(b, READ, b->block, read_endio);
967
968 wait_on_bit(&b->state, B_READING, do_io_schedule, TASK_UNINTERRUPTIBLE);
969
970 if (b->read_error) {
971 int error = b->read_error;
972
973 dm_bufio_release(b);
974
975 return ERR_PTR(error);
976 }
977
978 *bp = b;
979
980 return b->data;
981}
982
983void *dm_bufio_get(struct dm_bufio_client *c, sector_t block,
984 struct dm_buffer **bp)
985{
986 return new_read(c, block, NF_GET, bp);
987}
988EXPORT_SYMBOL_GPL(dm_bufio_get);
989
990void *dm_bufio_read(struct dm_bufio_client *c, sector_t block,
991 struct dm_buffer **bp)
992{
993 BUG_ON(dm_bufio_in_request());
994
995 return new_read(c, block, NF_READ, bp);
996}
997EXPORT_SYMBOL_GPL(dm_bufio_read);
998
999void *dm_bufio_new(struct dm_bufio_client *c, sector_t block,
1000 struct dm_buffer **bp)
1001{
1002 BUG_ON(dm_bufio_in_request());
1003
1004 return new_read(c, block, NF_FRESH, bp);
1005}
1006EXPORT_SYMBOL_GPL(dm_bufio_new);
1007
1008void dm_bufio_release(struct dm_buffer *b)
1009{
1010 struct dm_bufio_client *c = b->c;
1011
1012 dm_bufio_lock(c);
1013
1014 BUG_ON(test_bit(B_READING, &b->state));
1015 BUG_ON(!b->hold_count);
1016
1017 b->hold_count--;
1018 if (!b->hold_count) {
1019 wake_up(&c->free_buffer_wait);
1020
1021 /*
1022 * If there were errors on the buffer, and the buffer is not
1023 * to be written, free the buffer. There is no point in caching
1024 * an invalid buffer.
1025 */
1026 if ((b->read_error || b->write_error) &&
1027 !test_bit(B_WRITING, &b->state) &&
1028 !test_bit(B_DIRTY, &b->state)) {
1029 __unlink_buffer(b);
1030 __free_buffer_wake(b);
1031 }
1032 }
1033
1034 dm_bufio_unlock(c);
1035}
1036EXPORT_SYMBOL_GPL(dm_bufio_release);
1037
1038void dm_bufio_mark_buffer_dirty(struct dm_buffer *b)
1039{
1040 struct dm_bufio_client *c = b->c;
1041
1042 dm_bufio_lock(c);
1043
1044 if (!test_and_set_bit(B_DIRTY, &b->state))
1045 __relink_lru(b, LIST_DIRTY);
1046
1047 dm_bufio_unlock(c);
1048}
1049EXPORT_SYMBOL_GPL(dm_bufio_mark_buffer_dirty);
1050
1051void dm_bufio_write_dirty_buffers_async(struct dm_bufio_client *c)
1052{
1053 BUG_ON(dm_bufio_in_request());
1054
1055 dm_bufio_lock(c);
1056 __write_dirty_buffers_async(c, 0);
1057 dm_bufio_unlock(c);
1058}
1059EXPORT_SYMBOL_GPL(dm_bufio_write_dirty_buffers_async);
1060
1061/*
1062 * For performance, it is essential that the buffers are written asynchronously
1063 * and simultaneously (so that the block layer can merge the writes) and then
1064 * waited upon.
1065 *
1066 * Finally, we flush hardware disk cache.
1067 */
1068int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c)
1069{
1070 int a, f;
1071 unsigned long buffers_processed = 0;
1072 struct dm_buffer *b, *tmp;
1073
1074 dm_bufio_lock(c);
1075 __write_dirty_buffers_async(c, 0);
1076
1077again:
1078 list_for_each_entry_safe_reverse(b, tmp, &c->lru[LIST_DIRTY], lru_list) {
1079 int dropped_lock = 0;
1080
1081 if (buffers_processed < c->n_buffers[LIST_DIRTY])
1082 buffers_processed++;
1083
1084 BUG_ON(test_bit(B_READING, &b->state));
1085
1086 if (test_bit(B_WRITING, &b->state)) {
1087 if (buffers_processed < c->n_buffers[LIST_DIRTY]) {
1088 dropped_lock = 1;
1089 b->hold_count++;
1090 dm_bufio_unlock(c);
1091 wait_on_bit(&b->state, B_WRITING,
1092 do_io_schedule,
1093 TASK_UNINTERRUPTIBLE);
1094 dm_bufio_lock(c);
1095 b->hold_count--;
1096 } else
1097 wait_on_bit(&b->state, B_WRITING,
1098 do_io_schedule,
1099 TASK_UNINTERRUPTIBLE);
1100 }
1101
1102 if (!test_bit(B_DIRTY, &b->state) &&
1103 !test_bit(B_WRITING, &b->state))
1104 __relink_lru(b, LIST_CLEAN);
1105
1106 dm_bufio_cond_resched();
1107
1108 /*
1109 * If we dropped the lock, the list is no longer consistent,
1110 * so we must restart the search.
1111 *
1112 * In the most common case, the buffer just processed is
1113 * relinked to the clean list, so we won't loop scanning the
1114 * same buffer again and again.
1115 *
1116 * This may livelock if there is another thread simultaneously
1117 * dirtying buffers, so we count the number of buffers walked
1118 * and if it exceeds the total number of buffers, it means that
1119 * someone is doing some writes simultaneously with us. In
1120 * this case, stop, dropping the lock.
1121 */
1122 if (dropped_lock)
1123 goto again;
1124 }
1125 wake_up(&c->free_buffer_wait);
1126 dm_bufio_unlock(c);
1127
1128 a = xchg(&c->async_write_error, 0);
1129 f = dm_bufio_issue_flush(c);
1130 if (a)
1131 return a;
1132
1133 return f;
1134}
1135EXPORT_SYMBOL_GPL(dm_bufio_write_dirty_buffers);
1136
1137/*
1138 * Use dm-io to send an empty barrier and flush the device.
1139 */
1140int dm_bufio_issue_flush(struct dm_bufio_client *c)
1141{
1142 struct dm_io_request io_req = {
1143 .bi_rw = REQ_FLUSH,
1144 .mem.type = DM_IO_KMEM,
1145 .mem.ptr.addr = NULL,
1146 .client = c->dm_io,
1147 };
1148 struct dm_io_region io_reg = {
1149 .bdev = c->bdev,
1150 .sector = 0,
1151 .count = 0,
1152 };
1153
1154 BUG_ON(dm_bufio_in_request());
1155
1156 return dm_io(&io_req, 1, &io_reg, NULL);
1157}
1158EXPORT_SYMBOL_GPL(dm_bufio_issue_flush);
1159
1160/*
1161 * We first delete any other buffer that may be at that new location.
1162 *
1163 * Then, we write the buffer to the original location if it was dirty.
1164 *
1165 * Then, if we are the only one who is holding the buffer, relink the buffer
1166 * in the hash queue for the new location.
1167 *
1168 * If there was someone else holding the buffer, we write it to the new
1169 * location but not relink it, because that other user needs to have the buffer
1170 * at the same place.
1171 */
1172void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block)
1173{
1174 struct dm_bufio_client *c = b->c;
1175 struct dm_buffer *new;
1176
1177 BUG_ON(dm_bufio_in_request());
1178
1179 dm_bufio_lock(c);
1180
1181retry:
1182 new = __find(c, new_block);
1183 if (new) {
1184 if (new->hold_count) {
1185 __wait_for_free_buffer(c);
1186 goto retry;
1187 }
1188
1189 /*
1190 * FIXME: Is there any point waiting for a write that's going
1191 * to be overwritten in a bit?
1192 */
1193 __make_buffer_clean(new);
1194 __unlink_buffer(new);
1195 __free_buffer_wake(new);
1196 }
1197
1198 BUG_ON(!b->hold_count);
1199 BUG_ON(test_bit(B_READING, &b->state));
1200
1201 __write_dirty_buffer(b);
1202 if (b->hold_count == 1) {
1203 wait_on_bit(&b->state, B_WRITING,
1204 do_io_schedule, TASK_UNINTERRUPTIBLE);
1205 set_bit(B_DIRTY, &b->state);
1206 __unlink_buffer(b);
1207 __link_buffer(b, new_block, LIST_DIRTY);
1208 } else {
1209 sector_t old_block;
1210 wait_on_bit_lock(&b->state, B_WRITING,
1211 do_io_schedule, TASK_UNINTERRUPTIBLE);
1212 /*
1213 * Relink buffer to "new_block" so that write_callback
1214 * sees "new_block" as a block number.
1215 * After the write, link the buffer back to old_block.
1216 * All this must be done in bufio lock, so that block number
1217 * change isn't visible to other threads.
1218 */
1219 old_block = b->block;
1220 __unlink_buffer(b);
1221 __link_buffer(b, new_block, b->list_mode);
1222 submit_io(b, WRITE, new_block, write_endio);
1223 wait_on_bit(&b->state, B_WRITING,
1224 do_io_schedule, TASK_UNINTERRUPTIBLE);
1225 __unlink_buffer(b);
1226 __link_buffer(b, old_block, b->list_mode);
1227 }
1228
1229 dm_bufio_unlock(c);
1230 dm_bufio_release(b);
1231}
1232EXPORT_SYMBOL_GPL(dm_bufio_release_move);
1233
1234unsigned dm_bufio_get_block_size(struct dm_bufio_client *c)
1235{
1236 return c->block_size;
1237}
1238EXPORT_SYMBOL_GPL(dm_bufio_get_block_size);
1239
1240sector_t dm_bufio_get_device_size(struct dm_bufio_client *c)
1241{
1242 return i_size_read(c->bdev->bd_inode) >>
1243 (SECTOR_SHIFT + c->sectors_per_block_bits);
1244}
1245EXPORT_SYMBOL_GPL(dm_bufio_get_device_size);
1246
1247sector_t dm_bufio_get_block_number(struct dm_buffer *b)
1248{
1249 return b->block;
1250}
1251EXPORT_SYMBOL_GPL(dm_bufio_get_block_number);
1252
1253void *dm_bufio_get_block_data(struct dm_buffer *b)
1254{
1255 return b->data;
1256}
1257EXPORT_SYMBOL_GPL(dm_bufio_get_block_data);
1258
1259void *dm_bufio_get_aux_data(struct dm_buffer *b)
1260{
1261 return b + 1;
1262}
1263EXPORT_SYMBOL_GPL(dm_bufio_get_aux_data);
1264
1265struct dm_bufio_client *dm_bufio_get_client(struct dm_buffer *b)
1266{
1267 return b->c;
1268}
1269EXPORT_SYMBOL_GPL(dm_bufio_get_client);
1270
1271static void drop_buffers(struct dm_bufio_client *c)
1272{
1273 struct dm_buffer *b;
1274 int i;
1275
1276 BUG_ON(dm_bufio_in_request());
1277
1278 /*
1279 * An optimization so that the buffers are not written one-by-one.
1280 */
1281 dm_bufio_write_dirty_buffers_async(c);
1282
1283 dm_bufio_lock(c);
1284
1285 while ((b = __get_unclaimed_buffer(c)))
1286 __free_buffer_wake(b);
1287
1288 for (i = 0; i < LIST_SIZE; i++)
1289 list_for_each_entry(b, &c->lru[i], lru_list)
1290 DMERR("leaked buffer %llx, hold count %u, list %d",
1291 (unsigned long long)b->block, b->hold_count, i);
1292
1293 for (i = 0; i < LIST_SIZE; i++)
1294 BUG_ON(!list_empty(&c->lru[i]));
1295
1296 dm_bufio_unlock(c);
1297}
1298
1299/*
1300 * Test if the buffer is unused and too old, and if so, clean and free it.
1301 * If __GFP_IO is not set, we must not do any I/O because we hold
1302 * dm_bufio_clients_lock and we would risk deadlock if the I/O gets rerouted to
1303 * a different bufio client.
1304 */
1305static int __cleanup_old_buffer(struct dm_buffer *b, gfp_t gfp,
1306 unsigned long max_jiffies)
1307{
1308 if (jiffies - b->last_accessed < max_jiffies)
1309 return 1;
1310
1311 if (!(gfp & __GFP_IO)) {
1312 if (test_bit(B_READING, &b->state) ||
1313 test_bit(B_WRITING, &b->state) ||
1314 test_bit(B_DIRTY, &b->state))
1315 return 1;
1316 }
1317
1318 if (b->hold_count)
1319 return 1;
1320
1321 __make_buffer_clean(b);
1322 __unlink_buffer(b);
1323 __free_buffer_wake(b);
1324
1325 return 0;
1326}
1327
1328static void __scan(struct dm_bufio_client *c, unsigned long nr_to_scan,
1329 struct shrink_control *sc)
1330{
1331 int l;
1332 struct dm_buffer *b, *tmp;
1333
1334 for (l = 0; l < LIST_SIZE; l++) {
1335 list_for_each_entry_safe_reverse(b, tmp, &c->lru[l], lru_list)
1336 if (!__cleanup_old_buffer(b, sc->gfp_mask, 0) &&
1337 !--nr_to_scan)
1338 return;
1339 dm_bufio_cond_resched();
1340 }
1341}
1342
1343static int shrink(struct shrinker *shrinker, struct shrink_control *sc)
1344{
1345 struct dm_bufio_client *c =
1346 container_of(shrinker, struct dm_bufio_client, shrinker);
1347 unsigned long r;
1348 unsigned long nr_to_scan = sc->nr_to_scan;
1349
1350 if (sc->gfp_mask & __GFP_IO)
1351 dm_bufio_lock(c);
1352 else if (!dm_bufio_trylock(c))
1353 return !nr_to_scan ? 0 : -1;
1354
1355 if (nr_to_scan)
1356 __scan(c, nr_to_scan, sc);
1357
1358 r = c->n_buffers[LIST_CLEAN] + c->n_buffers[LIST_DIRTY];
1359 if (r > INT_MAX)
1360 r = INT_MAX;
1361
1362 dm_bufio_unlock(c);
1363
1364 return r;
1365}
1366
1367/*
1368 * Create the buffering interface
1369 */
1370struct dm_bufio_client *dm_bufio_client_create(struct block_device *bdev, unsigned block_size,
1371 unsigned reserved_buffers, unsigned aux_size,
1372 void (*alloc_callback)(struct dm_buffer *),
1373 void (*write_callback)(struct dm_buffer *))
1374{
1375 int r;
1376 struct dm_bufio_client *c;
1377 unsigned i;
1378
1379 BUG_ON(block_size < 1 << SECTOR_SHIFT ||
1380 (block_size & (block_size - 1)));
1381
1382 c = kmalloc(sizeof(*c), GFP_KERNEL);
1383 if (!c) {
1384 r = -ENOMEM;
1385 goto bad_client;
1386 }
1387 c->cache_hash = vmalloc(sizeof(struct hlist_head) << DM_BUFIO_HASH_BITS);
1388 if (!c->cache_hash) {
1389 r = -ENOMEM;
1390 goto bad_hash;
1391 }
1392
1393 c->bdev = bdev;
1394 c->block_size = block_size;
1395 c->sectors_per_block_bits = ffs(block_size) - 1 - SECTOR_SHIFT;
1396 c->pages_per_block_bits = (ffs(block_size) - 1 >= PAGE_SHIFT) ?
1397 ffs(block_size) - 1 - PAGE_SHIFT : 0;
1398 c->blocks_per_page_bits = (ffs(block_size) - 1 < PAGE_SHIFT ?
1399 PAGE_SHIFT - (ffs(block_size) - 1) : 0);
1400
1401 c->aux_size = aux_size;
1402 c->alloc_callback = alloc_callback;
1403 c->write_callback = write_callback;
1404
1405 for (i = 0; i < LIST_SIZE; i++) {
1406 INIT_LIST_HEAD(&c->lru[i]);
1407 c->n_buffers[i] = 0;
1408 }
1409
1410 for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
1411 INIT_HLIST_HEAD(&c->cache_hash[i]);
1412
1413 mutex_init(&c->lock);
1414 INIT_LIST_HEAD(&c->reserved_buffers);
1415 c->need_reserved_buffers = reserved_buffers;
1416
1417 init_waitqueue_head(&c->free_buffer_wait);
1418 c->async_write_error = 0;
1419
1420 c->dm_io = dm_io_client_create();
1421 if (IS_ERR(c->dm_io)) {
1422 r = PTR_ERR(c->dm_io);
1423 goto bad_dm_io;
1424 }
1425
1426 mutex_lock(&dm_bufio_clients_lock);
1427 if (c->blocks_per_page_bits) {
1428 if (!DM_BUFIO_CACHE_NAME(c)) {
1429 DM_BUFIO_CACHE_NAME(c) = kasprintf(GFP_KERNEL, "dm_bufio_cache-%u", c->block_size);
1430 if (!DM_BUFIO_CACHE_NAME(c)) {
1431 r = -ENOMEM;
1432 mutex_unlock(&dm_bufio_clients_lock);
1433 goto bad_cache;
1434 }
1435 }
1436
1437 if (!DM_BUFIO_CACHE(c)) {
1438 DM_BUFIO_CACHE(c) = kmem_cache_create(DM_BUFIO_CACHE_NAME(c),
1439 c->block_size,
1440 c->block_size, 0, NULL);
1441 if (!DM_BUFIO_CACHE(c)) {
1442 r = -ENOMEM;
1443 mutex_unlock(&dm_bufio_clients_lock);
1444 goto bad_cache;
1445 }
1446 }
1447 }
1448 mutex_unlock(&dm_bufio_clients_lock);
1449
1450 while (c->need_reserved_buffers) {
1451 struct dm_buffer *b = alloc_buffer(c, GFP_KERNEL);
1452
1453 if (!b) {
1454 r = -ENOMEM;
1455 goto bad_buffer;
1456 }
1457 __free_buffer_wake(b);
1458 }
1459
1460 mutex_lock(&dm_bufio_clients_lock);
1461 dm_bufio_client_count++;
1462 list_add(&c->client_list, &dm_bufio_all_clients);
1463 __cache_size_refresh();
1464 mutex_unlock(&dm_bufio_clients_lock);
1465
1466 c->shrinker.shrink = shrink;
1467 c->shrinker.seeks = 1;
1468 c->shrinker.batch = 0;
1469 register_shrinker(&c->shrinker);
1470
1471 return c;
1472
1473bad_buffer:
1474bad_cache:
1475 while (!list_empty(&c->reserved_buffers)) {
1476 struct dm_buffer *b = list_entry(c->reserved_buffers.next,
1477 struct dm_buffer, lru_list);
1478 list_del(&b->lru_list);
1479 free_buffer(b);
1480 }
1481 dm_io_client_destroy(c->dm_io);
1482bad_dm_io:
1483 vfree(c->cache_hash);
1484bad_hash:
1485 kfree(c);
1486bad_client:
1487 return ERR_PTR(r);
1488}
1489EXPORT_SYMBOL_GPL(dm_bufio_client_create);
1490
1491/*
1492 * Free the buffering interface.
1493 * It is required that there are no outstanding references to any buffers.
1494 */
1495void dm_bufio_client_destroy(struct dm_bufio_client *c)
1496{
1497 unsigned i;
1498
1499 drop_buffers(c);
1500
1501 unregister_shrinker(&c->shrinker);
1502
1503 mutex_lock(&dm_bufio_clients_lock);
1504
1505 list_del(&c->client_list);
1506 dm_bufio_client_count--;
1507 __cache_size_refresh();
1508
1509 mutex_unlock(&dm_bufio_clients_lock);
1510
1511 for (i = 0; i < 1 << DM_BUFIO_HASH_BITS; i++)
1512 BUG_ON(!hlist_empty(&c->cache_hash[i]));
1513
1514 BUG_ON(c->need_reserved_buffers);
1515
1516 while (!list_empty(&c->reserved_buffers)) {
1517 struct dm_buffer *b = list_entry(c->reserved_buffers.next,
1518 struct dm_buffer, lru_list);
1519 list_del(&b->lru_list);
1520 free_buffer(b);
1521 }
1522
1523 for (i = 0; i < LIST_SIZE; i++)
1524 if (c->n_buffers[i])
1525 DMERR("leaked buffer count %d: %ld", i, c->n_buffers[i]);
1526
1527 for (i = 0; i < LIST_SIZE; i++)
1528 BUG_ON(c->n_buffers[i]);
1529
1530 dm_io_client_destroy(c->dm_io);
1531 vfree(c->cache_hash);
1532 kfree(c);
1533}
1534EXPORT_SYMBOL_GPL(dm_bufio_client_destroy);
1535
1536static void cleanup_old_buffers(void)
1537{
1538 unsigned long max_age = dm_bufio_max_age;
1539 struct dm_bufio_client *c;
1540
1541 barrier();
1542
1543 if (max_age > ULONG_MAX / HZ)
1544 max_age = ULONG_MAX / HZ;
1545
1546 mutex_lock(&dm_bufio_clients_lock);
1547 list_for_each_entry(c, &dm_bufio_all_clients, client_list) {
1548 if (!dm_bufio_trylock(c))
1549 continue;
1550
1551 while (!list_empty(&c->lru[LIST_CLEAN])) {
1552 struct dm_buffer *b;
1553 b = list_entry(c->lru[LIST_CLEAN].prev,
1554 struct dm_buffer, lru_list);
1555 if (__cleanup_old_buffer(b, 0, max_age * HZ))
1556 break;
1557 dm_bufio_cond_resched();
1558 }
1559
1560 dm_bufio_unlock(c);
1561 dm_bufio_cond_resched();
1562 }
1563 mutex_unlock(&dm_bufio_clients_lock);
1564}
1565
1566static struct workqueue_struct *dm_bufio_wq;
1567static struct delayed_work dm_bufio_work;
1568
1569static void work_fn(struct work_struct *w)
1570{
1571 cleanup_old_buffers();
1572
1573 queue_delayed_work(dm_bufio_wq, &dm_bufio_work,
1574 DM_BUFIO_WORK_TIMER_SECS * HZ);
1575}
1576
1577/*----------------------------------------------------------------
1578 * Module setup
1579 *--------------------------------------------------------------*/
1580
1581/*
1582 * This is called only once for the whole dm_bufio module.
1583 * It initializes the memory limit.
1584 */
1585static int __init dm_bufio_init(void)
1586{
1587 __u64 mem;
1588
1589 memset(&dm_bufio_caches, 0, sizeof dm_bufio_caches);
1590 memset(&dm_bufio_cache_names, 0, sizeof dm_bufio_cache_names);
1591
1592 mem = (__u64)((totalram_pages - totalhigh_pages) *
1593 DM_BUFIO_MEMORY_PERCENT / 100) << PAGE_SHIFT;
1594
1595 if (mem > ULONG_MAX)
1596 mem = ULONG_MAX;
1597
1598#ifdef CONFIG_MMU
1599 /*
1600 * Get the size of vmalloc space the same way as VMALLOC_TOTAL
1601 * in fs/proc/internal.h
1602 */
1603 if (mem > (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100)
1604 mem = (VMALLOC_END - VMALLOC_START) * DM_BUFIO_VMALLOC_PERCENT / 100;
1605#endif
1606
1607 dm_bufio_default_cache_size = mem;
1608
1609 mutex_lock(&dm_bufio_clients_lock);
1610 __cache_size_refresh();
1611 mutex_unlock(&dm_bufio_clients_lock);
1612
1613 dm_bufio_wq = create_singlethread_workqueue("dm_bufio_cache");
1614 if (!dm_bufio_wq)
1615 return -ENOMEM;
1616
1617 INIT_DELAYED_WORK(&dm_bufio_work, work_fn);
1618 queue_delayed_work(dm_bufio_wq, &dm_bufio_work,
1619 DM_BUFIO_WORK_TIMER_SECS * HZ);
1620
1621 return 0;
1622}
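
As a rough worked example of the limit computed above (DM_BUFIO_MEMORY_PERCENT and DM_BUFIO_VMALLOC_PERCENT are defined earlier in this file; the figures below are purely illustrative): on a machine with 4 GiB of low memory and a memory percentage of 2,

    mem = (totalram_pages - totalhigh_pages) * 2 / 100 << PAGE_SHIFT
        = 1048576 pages * 2 / 100 * 4096 bytes  ~= 82 MiB

On 32-bit kernels with CONFIG_MMU the value is further capped at the vmalloc percentage of the vmalloc arena, and it can always be overridden at runtime via the max_cache_size_bytes module parameter declared below.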
1623
1624/*
1625 * This is called once when unloading the dm_bufio module.
1626 */
1627static void __exit dm_bufio_exit(void)
1628{
1629 int bug = 0;
1630 int i;
1631
1632 cancel_delayed_work_sync(&dm_bufio_work);
1633 destroy_workqueue(dm_bufio_wq);
1634
1635 for (i = 0; i < ARRAY_SIZE(dm_bufio_caches); i++) {
1636 struct kmem_cache *kc = dm_bufio_caches[i];
1637
1638 if (kc)
1639 kmem_cache_destroy(kc);
1640 }
1641
1642 for (i = 0; i < ARRAY_SIZE(dm_bufio_cache_names); i++)
1643 kfree(dm_bufio_cache_names[i]);
1644
1645 if (dm_bufio_client_count) {
1646 DMCRIT("%s: dm_bufio_client_count leaked: %d",
1647 __func__, dm_bufio_client_count);
1648 bug = 1;
1649 }
1650
1651 if (dm_bufio_current_allocated) {
1652 DMCRIT("%s: dm_bufio_current_allocated leaked: %lu",
1653 __func__, dm_bufio_current_allocated);
1654 bug = 1;
1655 }
1656
1657 if (dm_bufio_allocated_get_free_pages) {
1658 DMCRIT("%s: dm_bufio_allocated_get_free_pages leaked: %lu",
1659 __func__, dm_bufio_allocated_get_free_pages);
1660 bug = 1;
1661 }
1662
1663 if (dm_bufio_allocated_vmalloc) {
1664 DMCRIT("%s: dm_bufio_allocated_vmalloc leaked: %lu",
1665 __func__, dm_bufio_allocated_vmalloc);
1666 bug = 1;
1667 }
1668
1669 if (bug)
1670 BUG();
1671}
1672
1673module_init(dm_bufio_init)
1674module_exit(dm_bufio_exit)
1675
1676module_param_named(max_cache_size_bytes, dm_bufio_cache_size, ulong, S_IRUGO | S_IWUSR);
1677MODULE_PARM_DESC(max_cache_size_bytes, "Size of metadata cache");
1678
1679module_param_named(max_age_seconds, dm_bufio_max_age, uint, S_IRUGO | S_IWUSR);
1680MODULE_PARM_DESC(max_age_seconds, "Max age of a buffer in seconds");
1681
1682module_param_named(peak_allocated_bytes, dm_bufio_peak_allocated, ulong, S_IRUGO | S_IWUSR);
1683MODULE_PARM_DESC(peak_allocated_bytes, "Tracks the maximum allocated memory");
1684
1685module_param_named(allocated_kmem_cache_bytes, dm_bufio_allocated_kmem_cache, ulong, S_IRUGO);
1686MODULE_PARM_DESC(allocated_kmem_cache_bytes, "Memory allocated with kmem_cache_alloc");
1687
1688module_param_named(allocated_get_free_pages_bytes, dm_bufio_allocated_get_free_pages, ulong, S_IRUGO);
1689MODULE_PARM_DESC(allocated_get_free_pages_bytes, "Memory allocated with get_free_pages");
1690
1691module_param_named(allocated_vmalloc_bytes, dm_bufio_allocated_vmalloc, ulong, S_IRUGO);
1692MODULE_PARM_DESC(allocated_vmalloc_bytes, "Memory allocated with vmalloc");
1693
1694module_param_named(current_allocated_bytes, dm_bufio_current_allocated, ulong, S_IRUGO);
1695MODULE_PARM_DESC(current_allocated_bytes, "Memory currently used by the cache");
1696
1697MODULE_AUTHOR("Mikulas Patocka <dm-devel@redhat.com>");
1698MODULE_DESCRIPTION(DM_NAME " buffered I/O library");
1699MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm-bufio.h b/drivers/md/dm-bufio.h
new file mode 100644
index 00000000000..5c4c3a04e38
--- /dev/null
+++ b/drivers/md/dm-bufio.h
@@ -0,0 +1,112 @@
1/*
2 * Copyright (C) 2009-2011 Red Hat, Inc.
3 *
4 * Author: Mikulas Patocka <mpatocka@redhat.com>
5 *
6 * This file is released under the GPL.
7 */
8
9#ifndef DM_BUFIO_H
10#define DM_BUFIO_H
11
12#include <linux/blkdev.h>
13#include <linux/types.h>
14
15/*----------------------------------------------------------------*/
16
17struct dm_bufio_client;
18struct dm_buffer;
19
20/*
21 * Create a buffered IO cache on a given device
22 */
23struct dm_bufio_client *
24dm_bufio_client_create(struct block_device *bdev, unsigned block_size,
25 unsigned reserved_buffers, unsigned aux_size,
26 void (*alloc_callback)(struct dm_buffer *),
27 void (*write_callback)(struct dm_buffer *));
28
29/*
30 * Release a buffered IO cache.
31 */
32void dm_bufio_client_destroy(struct dm_bufio_client *c);
33
34/*
35 * WARNING: to avoid deadlocks, these conditions must be observed:
36 *
37 * - At most one thread can hold at most "reserved_buffers" buffers at once.
38 * - Each other thread can hold at most one buffer.
39 * - Threads which call only dm_bufio_get can hold an unlimited number of
40 * buffers.
41 */
42
43/*
44 * Read a given block from disk. Returns a pointer to the data and, via
45 * *bp, a pointer to the dm_buffer that can be used to release the buffer
46 * or to make it dirty.
47 */
48void *dm_bufio_read(struct dm_bufio_client *c, sector_t block,
49 struct dm_buffer **bp);
50
51/*
52 * Like dm_bufio_read, but return the buffer only if it is already in the
53 * cache; don't read it from disk. If the buffer is not cached, return NULL.
54 */
55void *dm_bufio_get(struct dm_bufio_client *c, sector_t block,
56 struct dm_buffer **bp);
57
58/*
59 * Like dm_bufio_read, but don't read anything from the disk. It is
60 * expected that the caller initializes the buffer and marks it dirty.
61 */
62void *dm_bufio_new(struct dm_bufio_client *c, sector_t block,
63 struct dm_buffer **bp);
64
65/*
66 * Release a reference obtained with dm_bufio_{read,get,new}. The data
67 * pointer and dm_buffer pointer are no longer valid after this call.
68 */
69void dm_bufio_release(struct dm_buffer *b);
70
71/*
72 * Mark a buffer dirty. It should be called after the buffer is modified.
73 *
74 * Under memory pressure, the buffer may be written back at any point
75 * after dm_bufio_mark_buffer_dirty, even before dm_bufio_write_dirty_buffers
76 * is called. So dm_bufio_write_dirty_buffers guarantees that the buffer is
77 * on-disk, but the actual writing may occur earlier.
78 */
79void dm_bufio_mark_buffer_dirty(struct dm_buffer *b);
80
81/*
82 * Initiate writing of dirty buffers, without waiting for completion.
83 */
84void dm_bufio_write_dirty_buffers_async(struct dm_bufio_client *c);
85
86/*
87 * Write all dirty buffers. Guarantees that all dirty buffers created prior
88 * to this call are on disk when this call exits.
89 */
90int dm_bufio_write_dirty_buffers(struct dm_bufio_client *c);
91
92/*
93 * Send an empty write barrier to the device to flush hardware disk cache.
94 */
95int dm_bufio_issue_flush(struct dm_bufio_client *c);
96
97/*
98 * Like dm_bufio_release but also move the buffer to the new
99 * block. dm_bufio_write_dirty_buffers is needed to commit the new block.
100 */
101void dm_bufio_release_move(struct dm_buffer *b, sector_t new_block);
102
103unsigned dm_bufio_get_block_size(struct dm_bufio_client *c);
104sector_t dm_bufio_get_device_size(struct dm_bufio_client *c);
105sector_t dm_bufio_get_block_number(struct dm_buffer *b);
106void *dm_bufio_get_block_data(struct dm_buffer *b);
107void *dm_bufio_get_aux_data(struct dm_buffer *b);
108struct dm_bufio_client *dm_bufio_get_client(struct dm_buffer *b);
109
110/*----------------------------------------------------------------*/
111
112#endif
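
A minimal usage sketch of the interface declared above, assuming a client created with dm_bufio_client_create(); the helper name example_touch_block0, the 4096-byte block size, and the "no aux data, no callbacks" choices are illustrative only, not mandated by this header.

#include <linux/err.h>
#include "dm-bufio.h"

/* Illustrative only: read block 0, stamp the first byte, and flush it out. */
static int example_touch_block0(struct block_device *bdev)
{
	struct dm_bufio_client *c;
	struct dm_buffer *b;
	u8 *data;
	int r;

	c = dm_bufio_client_create(bdev, 4096, 1, 0, NULL, NULL);
	if (IS_ERR(c))
		return PTR_ERR(c);

	data = dm_bufio_read(c, 0, &b);
	if (IS_ERR(data)) {
		r = PTR_ERR(data);
		goto out;
	}

	data[0] = 0xaa;				/* modify the cached block */
	dm_bufio_mark_buffer_dirty(b);		/* schedule it for write-back */
	dm_bufio_release(b);			/* drop the reference */

	r = dm_bufio_write_dirty_buffers(c);	/* wait until it is on disk */
	if (!r)
		r = dm_bufio_issue_flush(c);	/* flush the device cache */
out:
	dm_bufio_client_destroy(c);
	return r;
}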
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index 2e9a3ca37bd..31c2dc25886 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -1215,6 +1215,7 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
1215 struct hash_cell *hc; 1215 struct hash_cell *hc;
1216 struct dm_table *t; 1216 struct dm_table *t;
1217 struct mapped_device *md; 1217 struct mapped_device *md;
1218 struct target_type *immutable_target_type;
1218 1219
1219 md = find_device(param); 1220 md = find_device(param);
1220 if (!md) 1221 if (!md)
@@ -1230,6 +1231,16 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
1230 goto out; 1231 goto out;
1231 } 1232 }
1232 1233
1234 immutable_target_type = dm_get_immutable_target_type(md);
1235 if (immutable_target_type &&
1236 (immutable_target_type != dm_table_get_immutable_target_type(t))) {
1237 DMWARN("can't replace immutable target type %s",
1238 immutable_target_type->name);
1239 dm_table_destroy(t);
1240 r = -EINVAL;
1241 goto out;
1242 }
1243
1233 /* Protect md->type and md->queue against concurrent table loads. */ 1244 /* Protect md->type and md->queue against concurrent table loads. */
1234 dm_lock_md_type(md); 1245 dm_lock_md_type(md);
1235 if (dm_get_md_type(md) == DM_TYPE_NONE) 1246 if (dm_get_md_type(md) == DM_TYPE_NONE)
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index 32ac70861d6..bed444c93d8 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -66,6 +66,8 @@ struct dm_kcopyd_client {
66 struct list_head pages_jobs; 66 struct list_head pages_jobs;
67}; 67};
68 68
69static struct page_list zero_page_list;
70
69static void wake(struct dm_kcopyd_client *kc) 71static void wake(struct dm_kcopyd_client *kc)
70{ 72{
71 queue_work(kc->kcopyd_wq, &kc->kcopyd_work); 73 queue_work(kc->kcopyd_wq, &kc->kcopyd_work);
@@ -254,6 +256,9 @@ int __init dm_kcopyd_init(void)
254 if (!_job_cache) 256 if (!_job_cache)
255 return -ENOMEM; 257 return -ENOMEM;
256 258
259 zero_page_list.next = &zero_page_list;
260 zero_page_list.page = ZERO_PAGE(0);
261
257 return 0; 262 return 0;
258} 263}
259 264
@@ -322,7 +327,7 @@ static int run_complete_job(struct kcopyd_job *job)
322 dm_kcopyd_notify_fn fn = job->fn; 327 dm_kcopyd_notify_fn fn = job->fn;
323 struct dm_kcopyd_client *kc = job->kc; 328 struct dm_kcopyd_client *kc = job->kc;
324 329
325 if (job->pages) 330 if (job->pages && job->pages != &zero_page_list)
326 kcopyd_put_pages(kc, job->pages); 331 kcopyd_put_pages(kc, job->pages);
327 /* 332 /*
328 * If this is the master job, the sub jobs have already 333 * If this is the master job, the sub jobs have already
@@ -484,6 +489,8 @@ static void dispatch_job(struct kcopyd_job *job)
484 atomic_inc(&kc->nr_jobs); 489 atomic_inc(&kc->nr_jobs);
485 if (unlikely(!job->source.count)) 490 if (unlikely(!job->source.count))
486 push(&kc->complete_jobs, job); 491 push(&kc->complete_jobs, job);
492 else if (job->pages == &zero_page_list)
493 push(&kc->io_jobs, job);
487 else 494 else
488 push(&kc->pages_jobs, job); 495 push(&kc->pages_jobs, job);
489 wake(kc); 496 wake(kc);
@@ -592,14 +599,20 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
592 job->flags = flags; 599 job->flags = flags;
593 job->read_err = 0; 600 job->read_err = 0;
594 job->write_err = 0; 601 job->write_err = 0;
595 job->rw = READ;
596
597 job->source = *from;
598 602
599 job->num_dests = num_dests; 603 job->num_dests = num_dests;
600 memcpy(&job->dests, dests, sizeof(*dests) * num_dests); 604 memcpy(&job->dests, dests, sizeof(*dests) * num_dests);
601 605
602 job->pages = NULL; 606 if (from) {
607 job->source = *from;
608 job->pages = NULL;
609 job->rw = READ;
610 } else {
611 memset(&job->source, 0, sizeof job->source);
612 job->source.count = job->dests[0].count;
613 job->pages = &zero_page_list;
614 job->rw = WRITE;
615 }
603 616
604 job->fn = fn; 617 job->fn = fn;
605 job->context = context; 618 job->context = context;
@@ -617,6 +630,14 @@ int dm_kcopyd_copy(struct dm_kcopyd_client *kc, struct dm_io_region *from,
617} 630}
618EXPORT_SYMBOL(dm_kcopyd_copy); 631EXPORT_SYMBOL(dm_kcopyd_copy);
619 632
633int dm_kcopyd_zero(struct dm_kcopyd_client *kc,
634 unsigned num_dests, struct dm_io_region *dests,
635 unsigned flags, dm_kcopyd_notify_fn fn, void *context)
636{
637 return dm_kcopyd_copy(kc, NULL, num_dests, dests, flags, fn, context);
638}
639EXPORT_SYMBOL(dm_kcopyd_zero);
640
620void *dm_kcopyd_prepare_callback(struct dm_kcopyd_client *kc, 641void *dm_kcopyd_prepare_callback(struct dm_kcopyd_client *kc,
621 dm_kcopyd_notify_fn fn, void *context) 642 dm_kcopyd_notify_fn fn, void *context)
622{ 643{
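
The new dm_kcopyd_zero() entry point above reuses the copy path with a NULL source and the shared zero_page_list. A hypothetical caller-side sketch follows; the helper names and the synchronous wait on a completion are illustrative, not part of this patch.

static void example_zero_done(int read_err, unsigned long write_err, void *context)
{
	/* read_err is always 0 for a zeroing job; write_err reports failed blocks. */
	complete((struct completion *)context);
}

static void example_zero_region(struct dm_kcopyd_client *kc,
				struct block_device *bdev,
				sector_t start, sector_t len)
{
	struct dm_io_region dest = {
		.bdev   = bdev,
		.sector = start,
		.count  = len,
	};
	DECLARE_COMPLETION_ONSTACK(done);

	dm_kcopyd_zero(kc, 1, &dest, 0, example_zero_done, &done);
	wait_for_completion(&done);
}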
diff --git a/drivers/md/dm-log-userspace-base.c b/drivers/md/dm-log-userspace-base.c
index 1021c898601..8db3862dade 100644
--- a/drivers/md/dm-log-userspace-base.c
+++ b/drivers/md/dm-log-userspace-base.c
@@ -30,6 +30,7 @@ struct flush_entry {
30 30
31struct log_c { 31struct log_c {
32 struct dm_target *ti; 32 struct dm_target *ti;
33 struct dm_dev *log_dev;
33 uint32_t region_size; 34 uint32_t region_size;
34 region_t region_count; 35 region_t region_count;
35 uint64_t luid; 36 uint64_t luid;
@@ -146,7 +147,7 @@ static int build_constructor_string(struct dm_target *ti,
146 * <UUID> <other args> 147 * <UUID> <other args>
147 * Where 'other args' is the userspace implementation specific log 148 * Where 'other args' is the userspace implementation specific log
148 * arguments. An example might be: 149 * arguments. An example might be:
149 * <UUID> clustered_disk <arg count> <log dev> <region_size> [[no]sync] 150 * <UUID> clustered-disk <arg count> <log dev> <region_size> [[no]sync]
150 * 151 *
151 * So, this module will strip off the <UUID> for identification purposes 152 * So, this module will strip off the <UUID> for identification purposes
152 * when communicating with userspace about a log; but will pass on everything 153 * when communicating with userspace about a log; but will pass on everything
@@ -161,13 +162,15 @@ static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
161 struct log_c *lc = NULL; 162 struct log_c *lc = NULL;
162 uint64_t rdata; 163 uint64_t rdata;
163 size_t rdata_size = sizeof(rdata); 164 size_t rdata_size = sizeof(rdata);
165 char *devices_rdata = NULL;
166 size_t devices_rdata_size = DM_NAME_LEN;
164 167
165 if (argc < 3) { 168 if (argc < 3) {
166 DMWARN("Too few arguments to userspace dirty log"); 169 DMWARN("Too few arguments to userspace dirty log");
167 return -EINVAL; 170 return -EINVAL;
168 } 171 }
169 172
170 lc = kmalloc(sizeof(*lc), GFP_KERNEL); 173 lc = kzalloc(sizeof(*lc), GFP_KERNEL);
171 if (!lc) { 174 if (!lc) {
172 DMWARN("Unable to allocate userspace log context."); 175 DMWARN("Unable to allocate userspace log context.");
173 return -ENOMEM; 176 return -ENOMEM;
@@ -195,9 +198,19 @@ static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
195 return str_size; 198 return str_size;
196 } 199 }
197 200
198 /* Send table string */ 201 devices_rdata = kzalloc(devices_rdata_size, GFP_KERNEL);
202 if (!devices_rdata) {
203 DMERR("Failed to allocate memory for device information");
204 r = -ENOMEM;
205 goto out;
206 }
207
208 /*
209 * Send table string and get back any opened device.
210 */
199 r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_CTR, 211 r = dm_consult_userspace(lc->uuid, lc->luid, DM_ULOG_CTR,
200 ctr_str, str_size, NULL, NULL); 212 ctr_str, str_size,
213 devices_rdata, &devices_rdata_size);
201 214
202 if (r < 0) { 215 if (r < 0) {
203 if (r == -ESRCH) 216 if (r == -ESRCH)
@@ -220,7 +233,20 @@ static int userspace_ctr(struct dm_dirty_log *log, struct dm_target *ti,
220 lc->region_size = (uint32_t)rdata; 233 lc->region_size = (uint32_t)rdata;
221 lc->region_count = dm_sector_div_up(ti->len, lc->region_size); 234 lc->region_count = dm_sector_div_up(ti->len, lc->region_size);
222 235
236 if (devices_rdata_size) {
237 if (devices_rdata[devices_rdata_size - 1] != '\0') {
238 DMERR("DM_ULOG_CTR device return string not properly terminated");
239 r = -EINVAL;
240 goto out;
241 }
242 r = dm_get_device(ti, devices_rdata,
243 dm_table_get_mode(ti->table), &lc->log_dev);
244 if (r)
245 DMERR("Failed to register %s with device-mapper",
246 devices_rdata);
247 }
223out: 248out:
249 kfree(devices_rdata);
224 if (r) { 250 if (r) {
225 kfree(lc); 251 kfree(lc);
226 kfree(ctr_str); 252 kfree(ctr_str);
@@ -241,6 +267,9 @@ static void userspace_dtr(struct dm_dirty_log *log)
241 NULL, 0, 267 NULL, 0,
242 NULL, NULL); 268 NULL, NULL);
243 269
270 if (lc->log_dev)
271 dm_put_device(lc->ti, lc->log_dev);
272
244 kfree(lc->usr_argv_str); 273 kfree(lc->usr_argv_str);
245 kfree(lc); 274 kfree(lc);
246 275
diff --git a/drivers/md/dm-raid.c b/drivers/md/dm-raid.c
index 37a37266a1e..11fa96df4b0 100644
--- a/drivers/md/dm-raid.c
+++ b/drivers/md/dm-raid.c
@@ -1017,30 +1017,56 @@ static int raid_status(struct dm_target *ti, status_type_t type,
1017 struct raid_set *rs = ti->private; 1017 struct raid_set *rs = ti->private;
1018 unsigned raid_param_cnt = 1; /* at least 1 for chunksize */ 1018 unsigned raid_param_cnt = 1; /* at least 1 for chunksize */
1019 unsigned sz = 0; 1019 unsigned sz = 0;
1020 int i; 1020 int i, array_in_sync = 0;
1021 sector_t sync; 1021 sector_t sync;
1022 1022
1023 switch (type) { 1023 switch (type) {
1024 case STATUSTYPE_INFO: 1024 case STATUSTYPE_INFO:
1025 DMEMIT("%s %d ", rs->raid_type->name, rs->md.raid_disks); 1025 DMEMIT("%s %d ", rs->raid_type->name, rs->md.raid_disks);
1026 1026
1027 for (i = 0; i < rs->md.raid_disks; i++) {
1028 if (test_bit(Faulty, &rs->dev[i].rdev.flags))
1029 DMEMIT("D");
1030 else if (test_bit(In_sync, &rs->dev[i].rdev.flags))
1031 DMEMIT("A");
1032 else
1033 DMEMIT("a");
1034 }
1035
1036 if (test_bit(MD_RECOVERY_RUNNING, &rs->md.recovery)) 1027 if (test_bit(MD_RECOVERY_RUNNING, &rs->md.recovery))
1037 sync = rs->md.curr_resync_completed; 1028 sync = rs->md.curr_resync_completed;
1038 else 1029 else
1039 sync = rs->md.recovery_cp; 1030 sync = rs->md.recovery_cp;
1040 1031
1041 if (sync > rs->md.resync_max_sectors) 1032 if (sync >= rs->md.resync_max_sectors) {
1033 array_in_sync = 1;
1042 sync = rs->md.resync_max_sectors; 1034 sync = rs->md.resync_max_sectors;
1035 } else {
1036 /*
1037 * The array may be doing an initial sync, or it may
1038 * be rebuilding individual components. If all the
1039 * devices are In_sync, then it is the array that is
1040 * being initialized.
1041 */
1042 for (i = 0; i < rs->md.raid_disks; i++)
1043 if (!test_bit(In_sync, &rs->dev[i].rdev.flags))
1044 array_in_sync = 1;
1045 }
1046 /*
1047 * Status characters:
1048 * 'D' = Dead/Failed device
1049 * 'a' = Alive but not in-sync
1050 * 'A' = Alive and in-sync
1051 */
1052 for (i = 0; i < rs->md.raid_disks; i++) {
1053 if (test_bit(Faulty, &rs->dev[i].rdev.flags))
1054 DMEMIT("D");
1055 else if (!array_in_sync ||
1056 !test_bit(In_sync, &rs->dev[i].rdev.flags))
1057 DMEMIT("a");
1058 else
1059 DMEMIT("A");
1060 }
1043 1061
1062 /*
1063 * In-sync ratio:
1064 * The in-sync ratio shows the progress of:
1065 * - Initializing the array
1066 * - Rebuilding a subset of devices of the array
1067 * The user can distinguish between the two by referring
1068 * to the status characters.
1069 */
1044 DMEMIT(" %llu/%llu", 1070 DMEMIT(" %llu/%llu",
1045 (unsigned long long) sync, 1071 (unsigned long long) sync,
1046 (unsigned long long) rs->md.resync_max_sectors); 1072 (unsigned long long) rs->md.resync_max_sectors);
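
Given the DMEMIT calls above, an illustrative STATUSTYPE_INFO line for a two-device raid1 set whose second member is still being rebuilt might read (the numbers are invented for the example):

    raid1 2 Aa 10240/81920

i.e. the raid type, the number of devices, one status character per device ('A' = alive and in-sync, 'a' = alive but not in-sync, 'D' = dead/failed), and finally the in-sync ratio in sectors.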
diff --git a/drivers/md/dm-table.c b/drivers/md/dm-table.c
index bc04518e9d8..8e913213014 100644
--- a/drivers/md/dm-table.c
+++ b/drivers/md/dm-table.c
@@ -54,7 +54,9 @@ struct dm_table {
54 sector_t *highs; 54 sector_t *highs;
55 struct dm_target *targets; 55 struct dm_target *targets;
56 56
57 struct target_type *immutable_target_type;
57 unsigned integrity_supported:1; 58 unsigned integrity_supported:1;
59 unsigned singleton:1;
58 60
59 /* 61 /*
60 * Indicates the rw permissions for the new logical 62 * Indicates the rw permissions for the new logical
@@ -740,6 +742,12 @@ int dm_table_add_target(struct dm_table *t, const char *type,
740 char **argv; 742 char **argv;
741 struct dm_target *tgt; 743 struct dm_target *tgt;
742 744
745 if (t->singleton) {
746 DMERR("%s: target type %s must appear alone in table",
747 dm_device_name(t->md), t->targets->type->name);
748 return -EINVAL;
749 }
750
743 if ((r = check_space(t))) 751 if ((r = check_space(t)))
744 return r; 752 return r;
745 753
@@ -758,6 +766,36 @@ int dm_table_add_target(struct dm_table *t, const char *type,
758 return -EINVAL; 766 return -EINVAL;
759 } 767 }
760 768
769 if (dm_target_needs_singleton(tgt->type)) {
770 if (t->num_targets) {
771 DMERR("%s: target type %s must appear alone in table",
772 dm_device_name(t->md), type);
773 return -EINVAL;
774 }
775 t->singleton = 1;
776 }
777
778 if (dm_target_always_writeable(tgt->type) && !(t->mode & FMODE_WRITE)) {
779 DMERR("%s: target type %s may not be included in read-only tables",
780 dm_device_name(t->md), type);
781 return -EINVAL;
782 }
783
784 if (t->immutable_target_type) {
785 if (t->immutable_target_type != tgt->type) {
786 DMERR("%s: immutable target type %s cannot be mixed with other target types",
787 dm_device_name(t->md), t->immutable_target_type->name);
788 return -EINVAL;
789 }
790 } else if (dm_target_is_immutable(tgt->type)) {
791 if (t->num_targets) {
792 DMERR("%s: immutable target type %s cannot be mixed with other target types",
793 dm_device_name(t->md), tgt->type->name);
794 return -EINVAL;
795 }
796 t->immutable_target_type = tgt->type;
797 }
798
761 tgt->table = t; 799 tgt->table = t;
762 tgt->begin = start; 800 tgt->begin = start;
763 tgt->len = len; 801 tgt->len = len;
@@ -915,6 +953,11 @@ unsigned dm_table_get_type(struct dm_table *t)
915 return t->type; 953 return t->type;
916} 954}
917 955
956struct target_type *dm_table_get_immutable_target_type(struct dm_table *t)
957{
958 return t->immutable_target_type;
959}
960
918bool dm_table_request_based(struct dm_table *t) 961bool dm_table_request_based(struct dm_table *t)
919{ 962{
920 return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED; 963 return dm_table_get_type(t) == DM_TYPE_REQUEST_BASED;
@@ -1299,6 +1342,31 @@ static bool dm_table_discard_zeroes_data(struct dm_table *t)
1299 return 1; 1342 return 1;
1300} 1343}
1301 1344
1345static int device_is_nonrot(struct dm_target *ti, struct dm_dev *dev,
1346 sector_t start, sector_t len, void *data)
1347{
1348 struct request_queue *q = bdev_get_queue(dev->bdev);
1349
1350 return q && blk_queue_nonrot(q);
1351}
1352
1353static bool dm_table_is_nonrot(struct dm_table *t)
1354{
1355 struct dm_target *ti;
1356 unsigned i = 0;
1357
1358 /* Ensure that all underlying device are non-rotational. */
1359 while (i < dm_table_get_num_targets(t)) {
1360 ti = dm_table_get_target(t, i++);
1361
1362 if (!ti->type->iterate_devices ||
1363 !ti->type->iterate_devices(ti, device_is_nonrot, NULL))
1364 return 0;
1365 }
1366
1367 return 1;
1368}
1369
1302void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q, 1370void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
1303 struct queue_limits *limits) 1371 struct queue_limits *limits)
1304{ 1372{
@@ -1324,6 +1392,11 @@ void dm_table_set_restrictions(struct dm_table *t, struct request_queue *q,
1324 if (!dm_table_discard_zeroes_data(t)) 1392 if (!dm_table_discard_zeroes_data(t))
1325 q->limits.discard_zeroes_data = 0; 1393 q->limits.discard_zeroes_data = 0;
1326 1394
1395 if (dm_table_is_nonrot(t))
1396 queue_flag_set_unlocked(QUEUE_FLAG_NONROT, q);
1397 else
1398 queue_flag_clear_unlocked(QUEUE_FLAG_NONROT, q);
1399
1327 dm_table_set_integrity(t); 1400 dm_table_set_integrity(t);
1328 1401
1329 /* 1402 /*
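
The singleton, always-writeable and immutable checks added above key off feature flags in the target's target_type. A hypothetical target declaration exercising all three might look like the sketch below; the DM_TARGET_* flag names come from the companion include/linux/device-mapper.h change in this series, while the "example-pool" target and its ctr/dtr/map functions are invented for illustration.

static int  example_pool_ctr(struct dm_target *ti, unsigned argc, char **argv);
static void example_pool_dtr(struct dm_target *ti);
static int  example_pool_map(struct dm_target *ti, struct bio *bio,
			     union map_info *map_context);

static struct target_type example_pool_target = {
	.name     = "example-pool",
	.version  = {1, 0, 0},
	.module   = THIS_MODULE,
	/*
	 * Must be the only target in its table, needs a writeable table,
	 * and cannot later be replaced by a different target type.
	 */
	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
		    DM_TARGET_IMMUTABLE,
	.ctr      = example_pool_ctr,
	.dtr      = example_pool_dtr,
	.map      = example_pool_map,
};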
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
new file mode 100644
index 00000000000..59c4f0446ff
--- /dev/null
+++ b/drivers/md/dm-thin-metadata.c
@@ -0,0 +1,1391 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-thin-metadata.h"
8#include "persistent-data/dm-btree.h"
9#include "persistent-data/dm-space-map.h"
10#include "persistent-data/dm-space-map-disk.h"
11#include "persistent-data/dm-transaction-manager.h"
12
13#include <linux/list.h>
14#include <linux/device-mapper.h>
15#include <linux/workqueue.h>
16
17/*--------------------------------------------------------------------------
18 * As far as the metadata goes, there is:
19 *
20 * - A superblock in block zero, taking up fewer than 512 bytes for
21 * atomic writes.
22 *
23 * - A space map managing the metadata blocks.
24 *
25 * - A space map managing the data blocks.
26 *
27 * - A btree mapping our internal thin dev ids onto struct disk_device_details.
28 *
29 * - A hierarchical btree, with 2 levels which effectively maps (thin
30 * dev id, virtual block) -> block_time. Block time is a 64-bit
31 * field holding the time in the low 24 bits and the block in the
32 * remaining top 40 bits.
33 *
34 * BTrees consist solely of btree_nodes, each of which fills a block. Some
35 * are internal nodes; their values are __le64 pointers to other
36 * nodes. Leaf nodes can store data of any reasonable size (i.e. much
37 * smaller than the block size). Each node consists of the header,
38 * followed by an array of keys, followed by an array of values. We
39 * binary search on the keys, so they're all held together to help the
40 * cpu cache.
41 *
42 * Space maps have 2 btrees:
43 *
44 * - One maps a uint64_t onto a struct index_entry, which points to a
45 * bitmap block and records some details, such as how many free entries
46 * there are.
47 *
48 * - The bitmap blocks have a header (for the checksum). Then the rest
49 * of the block is pairs of bits. With the meaning being:
50 *
51 * 0 - ref count is 0
52 * 1 - ref count is 1
53 * 2 - ref count is 2
54 * 3 - ref count is higher than 2
55 *
56 * - If the count is higher than 2 then the ref count is entered in a
57 * second btree that directly maps the block_address to a uint32_t ref
58 * count.
59 *
60 * The space map metadata variant doesn't have a bitmaps btree. Instead
61 * it has a single block's worth of index_entries. This avoids
62 * recursive issues with the bitmap btree needing to allocate space in
63 * order to insert. With a small data block size such as 64k, the
64 * metadata supports data devices that are hundreds of terabytes in size.
65 *
66 * The space maps allocate space linearly from front to back. Space that
67 * is freed in a transaction is never recycled within that transaction.
68 * To try and avoid fragmenting _free_ space the allocator always goes
69 * back and fills in gaps.
70 *
71 * All metadata I/O is done in THIN_METADATA_BLOCK_SIZE sized/aligned chunks
72 * from the block manager.
73 *--------------------------------------------------------------------------*/
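
To make the block_time encoding described above concrete (the pack/unpack helpers that implement it appear further down in this file), here is a worked example with invented values:

    pack_block_time(0x12345, 7):
        (0x12345ULL << 24) | 7        == 0x12345000007
    unpack_block_time(0x12345000007):
        block = v >> 24               == 0x12345
        time  = v & ((1 << 24) - 1)   == 7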
74
75#define DM_MSG_PREFIX "thin metadata"
76
77#define THIN_SUPERBLOCK_MAGIC 27022010
78#define THIN_SUPERBLOCK_LOCATION 0
79#define THIN_VERSION 1
80#define THIN_METADATA_CACHE_SIZE 64
81#define SECTOR_TO_BLOCK_SHIFT 3
82
83/* This should be plenty */
84#define SPACE_MAP_ROOT_SIZE 128
85
86/*
87 * Little endian on-disk superblock and device details.
88 */
89struct thin_disk_superblock {
90 __le32 csum; /* Checksum of superblock except for this field. */
91 __le32 flags;
92 __le64 blocknr; /* This block number, dm_block_t. */
93
94 __u8 uuid[16];
95 __le64 magic;
96 __le32 version;
97 __le32 time;
98
99 __le64 trans_id;
100
101 /*
102 * Root held by userspace transactions.
103 */
104 __le64 held_root;
105
106 __u8 data_space_map_root[SPACE_MAP_ROOT_SIZE];
107 __u8 metadata_space_map_root[SPACE_MAP_ROOT_SIZE];
108
109 /*
110 * 2-level btree mapping (dev_id, (dev block, time)) -> data block
111 */
112 __le64 data_mapping_root;
113
114 /*
115 * Device detail root mapping dev_id -> device_details
116 */
117 __le64 device_details_root;
118
119 __le32 data_block_size; /* In 512-byte sectors. */
120
121 __le32 metadata_block_size; /* In 512-byte sectors. */
122 __le64 metadata_nr_blocks;
123
124 __le32 compat_flags;
125 __le32 compat_ro_flags;
126 __le32 incompat_flags;
127} __packed;
128
129struct disk_device_details {
130 __le64 mapped_blocks;
131 __le64 transaction_id; /* When created. */
132 __le32 creation_time;
133 __le32 snapshotted_time;
134} __packed;
135
136struct dm_pool_metadata {
137 struct hlist_node hash;
138
139 struct block_device *bdev;
140 struct dm_block_manager *bm;
141 struct dm_space_map *metadata_sm;
142 struct dm_space_map *data_sm;
143 struct dm_transaction_manager *tm;
144 struct dm_transaction_manager *nb_tm;
145
146 /*
147 * Two-level btree.
148 * First level holds thin_dev_t.
149 * Second level holds mappings.
150 */
151 struct dm_btree_info info;
152
153 /*
154 * Non-blocking version of the above.
155 */
156 struct dm_btree_info nb_info;
157
158 /*
159 * Just the top level for deleting whole devices.
160 */
161 struct dm_btree_info tl_info;
162
163 /*
164 * Just the bottom level for creating new devices.
165 */
166 struct dm_btree_info bl_info;
167
168 /*
169 * Describes the device details btree.
170 */
171 struct dm_btree_info details_info;
172
173 struct rw_semaphore root_lock;
174 uint32_t time;
175 int need_commit;
176 dm_block_t root;
177 dm_block_t details_root;
178 struct list_head thin_devices;
179 uint64_t trans_id;
180 unsigned long flags;
181 sector_t data_block_size;
182};
183
184struct dm_thin_device {
185 struct list_head list;
186 struct dm_pool_metadata *pmd;
187 dm_thin_id id;
188
189 int open_count;
190 int changed;
191 uint64_t mapped_blocks;
192 uint64_t transaction_id;
193 uint32_t creation_time;
194 uint32_t snapshotted_time;
195};
196
197/*----------------------------------------------------------------
198 * superblock validator
199 *--------------------------------------------------------------*/
200
201#define SUPERBLOCK_CSUM_XOR 160774
202
203static void sb_prepare_for_write(struct dm_block_validator *v,
204 struct dm_block *b,
205 size_t block_size)
206{
207 struct thin_disk_superblock *disk_super = dm_block_data(b);
208
209 disk_super->blocknr = cpu_to_le64(dm_block_location(b));
210 disk_super->csum = cpu_to_le32(dm_bm_checksum(&disk_super->flags,
211 block_size - sizeof(__le32),
212 SUPERBLOCK_CSUM_XOR));
213}
214
215static int sb_check(struct dm_block_validator *v,
216 struct dm_block *b,
217 size_t block_size)
218{
219 struct thin_disk_superblock *disk_super = dm_block_data(b);
220 __le32 csum_le;
221
222 if (dm_block_location(b) != le64_to_cpu(disk_super->blocknr)) {
223 DMERR("sb_check failed: blocknr %llu: "
224 "wanted %llu", le64_to_cpu(disk_super->blocknr),
225 (unsigned long long)dm_block_location(b));
226 return -ENOTBLK;
227 }
228
229 if (le64_to_cpu(disk_super->magic) != THIN_SUPERBLOCK_MAGIC) {
230 DMERR("sb_check failed: magic %llu: "
231 "wanted %llu", le64_to_cpu(disk_super->magic),
232 (unsigned long long)THIN_SUPERBLOCK_MAGIC);
233 return -EILSEQ;
234 }
235
236 csum_le = cpu_to_le32(dm_bm_checksum(&disk_super->flags,
237 block_size - sizeof(__le32),
238 SUPERBLOCK_CSUM_XOR));
239 if (csum_le != disk_super->csum) {
240 DMERR("sb_check failed: csum %u: wanted %u",
241 le32_to_cpu(csum_le), le32_to_cpu(disk_super->csum));
242 return -EILSEQ;
243 }
244
245 return 0;
246}
247
248static struct dm_block_validator sb_validator = {
249 .name = "superblock",
250 .prepare_for_write = sb_prepare_for_write,
251 .check = sb_check
252};
253
254/*----------------------------------------------------------------
255 * Methods for the btree value types
256 *--------------------------------------------------------------*/
257
258static uint64_t pack_block_time(dm_block_t b, uint32_t t)
259{
260 return (b << 24) | t;
261}
262
263static void unpack_block_time(uint64_t v, dm_block_t *b, uint32_t *t)
264{
265 *b = v >> 24;
266 *t = v & ((1 << 24) - 1);
267}
268
269static void data_block_inc(void *context, void *value_le)
270{
271 struct dm_space_map *sm = context;
272 __le64 v_le;
273 uint64_t b;
274 uint32_t t;
275
276 memcpy(&v_le, value_le, sizeof(v_le));
277 unpack_block_time(le64_to_cpu(v_le), &b, &t);
278 dm_sm_inc_block(sm, b);
279}
280
281static void data_block_dec(void *context, void *value_le)
282{
283 struct dm_space_map *sm = context;
284 __le64 v_le;
285 uint64_t b;
286 uint32_t t;
287
288 memcpy(&v_le, value_le, sizeof(v_le));
289 unpack_block_time(le64_to_cpu(v_le), &b, &t);
290 dm_sm_dec_block(sm, b);
291}
292
293static int data_block_equal(void *context, void *value1_le, void *value2_le)
294{
295 __le64 v1_le, v2_le;
296 uint64_t b1, b2;
297 uint32_t t;
298
299 memcpy(&v1_le, value1_le, sizeof(v1_le));
300 memcpy(&v2_le, value2_le, sizeof(v2_le));
301 unpack_block_time(le64_to_cpu(v1_le), &b1, &t);
302 unpack_block_time(le64_to_cpu(v2_le), &b2, &t);
303
304 return b1 == b2;
305}
306
307static void subtree_inc(void *context, void *value)
308{
309 struct dm_btree_info *info = context;
310 __le64 root_le;
311 uint64_t root;
312
313 memcpy(&root_le, value, sizeof(root_le));
314 root = le64_to_cpu(root_le);
315 dm_tm_inc(info->tm, root);
316}
317
318static void subtree_dec(void *context, void *value)
319{
320 struct dm_btree_info *info = context;
321 __le64 root_le;
322 uint64_t root;
323
324 memcpy(&root_le, value, sizeof(root_le));
325 root = le64_to_cpu(root_le);
326 if (dm_btree_del(info, root))
327 DMERR("btree delete failed");
328}
329
330static int subtree_equal(void *context, void *value1_le, void *value2_le)
331{
332 __le64 v1_le, v2_le;
333 memcpy(&v1_le, value1_le, sizeof(v1_le));
334 memcpy(&v2_le, value2_le, sizeof(v2_le));
335
336 return v1_le == v2_le;
337}
338
339/*----------------------------------------------------------------*/
340
341static int superblock_all_zeroes(struct dm_block_manager *bm, int *result)
342{
343 int r;
344 unsigned i;
345 struct dm_block *b;
346 __le64 *data_le, zero = cpu_to_le64(0);
347 unsigned block_size = dm_bm_block_size(bm) / sizeof(__le64);
348
349 /*
350 * We can't use a validator here - it may be all zeroes.
351 */
352 r = dm_bm_read_lock(bm, THIN_SUPERBLOCK_LOCATION, NULL, &b);
353 if (r)
354 return r;
355
356 data_le = dm_block_data(b);
357 *result = 1;
358 for (i = 0; i < block_size; i++) {
359 if (data_le[i] != zero) {
360 *result = 0;
361 break;
362 }
363 }
364
365 return dm_bm_unlock(b);
366}
367
368static int init_pmd(struct dm_pool_metadata *pmd,
369 struct dm_block_manager *bm,
370 dm_block_t nr_blocks, int create)
371{
372 int r;
373 struct dm_space_map *sm, *data_sm;
374 struct dm_transaction_manager *tm;
375 struct dm_block *sblock;
376
377 if (create) {
378 r = dm_tm_create_with_sm(bm, THIN_SUPERBLOCK_LOCATION,
379 &sb_validator, &tm, &sm, &sblock);
380 if (r < 0) {
381 DMERR("tm_create_with_sm failed");
382 return r;
383 }
384
385 data_sm = dm_sm_disk_create(tm, nr_blocks);
386 if (IS_ERR(data_sm)) {
387 DMERR("sm_disk_create failed");
388 r = PTR_ERR(data_sm);
389 goto bad;
390 }
391 } else {
392 struct thin_disk_superblock *disk_super = NULL;
393 size_t space_map_root_offset =
394 offsetof(struct thin_disk_superblock, metadata_space_map_root);
395
396 r = dm_tm_open_with_sm(bm, THIN_SUPERBLOCK_LOCATION,
397 &sb_validator, space_map_root_offset,
398 SPACE_MAP_ROOT_SIZE, &tm, &sm, &sblock);
399 if (r < 0) {
400 DMERR("tm_open_with_sm failed");
401 return r;
402 }
403
404 disk_super = dm_block_data(sblock);
405 data_sm = dm_sm_disk_open(tm, disk_super->data_space_map_root,
406 sizeof(disk_super->data_space_map_root));
407 if (IS_ERR(data_sm)) {
408 DMERR("sm_disk_open failed");
409 r = PTR_ERR(data_sm);
410 goto bad;
411 }
412 }
413
414
415 r = dm_tm_unlock(tm, sblock);
416 if (r < 0) {
417 DMERR("couldn't unlock superblock");
418 goto bad_data_sm;
419 }
420
421 pmd->bm = bm;
422 pmd->metadata_sm = sm;
423 pmd->data_sm = data_sm;
424 pmd->tm = tm;
425 pmd->nb_tm = dm_tm_create_non_blocking_clone(tm);
426 if (!pmd->nb_tm) {
427 DMERR("could not create clone tm");
428 r = -ENOMEM;
429 goto bad_data_sm;
430 }
431
432 pmd->info.tm = tm;
433 pmd->info.levels = 2;
434 pmd->info.value_type.context = pmd->data_sm;
435 pmd->info.value_type.size = sizeof(__le64);
436 pmd->info.value_type.inc = data_block_inc;
437 pmd->info.value_type.dec = data_block_dec;
438 pmd->info.value_type.equal = data_block_equal;
439
440 memcpy(&pmd->nb_info, &pmd->info, sizeof(pmd->nb_info));
441 pmd->nb_info.tm = pmd->nb_tm;
442
443 pmd->tl_info.tm = tm;
444 pmd->tl_info.levels = 1;
445 pmd->tl_info.value_type.context = &pmd->info;
446 pmd->tl_info.value_type.size = sizeof(__le64);
447 pmd->tl_info.value_type.inc = subtree_inc;
448 pmd->tl_info.value_type.dec = subtree_dec;
449 pmd->tl_info.value_type.equal = subtree_equal;
450
451 pmd->bl_info.tm = tm;
452 pmd->bl_info.levels = 1;
453 pmd->bl_info.value_type.context = pmd->data_sm;
454 pmd->bl_info.value_type.size = sizeof(__le64);
455 pmd->bl_info.value_type.inc = data_block_inc;
456 pmd->bl_info.value_type.dec = data_block_dec;
457 pmd->bl_info.value_type.equal = data_block_equal;
458
459 pmd->details_info.tm = tm;
460 pmd->details_info.levels = 1;
461 pmd->details_info.value_type.context = NULL;
462 pmd->details_info.value_type.size = sizeof(struct disk_device_details);
463 pmd->details_info.value_type.inc = NULL;
464 pmd->details_info.value_type.dec = NULL;
465 pmd->details_info.value_type.equal = NULL;
466
467 pmd->root = 0;
468
469 init_rwsem(&pmd->root_lock);
470 pmd->time = 0;
471 pmd->need_commit = 0;
472 pmd->details_root = 0;
473 pmd->trans_id = 0;
474 pmd->flags = 0;
475 INIT_LIST_HEAD(&pmd->thin_devices);
476
477 return 0;
478
479bad_data_sm:
480 dm_sm_destroy(data_sm);
481bad:
482 dm_tm_destroy(tm);
483 dm_sm_destroy(sm);
484
485 return r;
486}
487
488static int __begin_transaction(struct dm_pool_metadata *pmd)
489{
490 int r;
491 u32 features;
492 struct thin_disk_superblock *disk_super;
493 struct dm_block *sblock;
494
495 /*
496 * __commit_transaction() resets these
497 */
498 WARN_ON(pmd->need_commit);
499
500 /*
501 * We re-read the superblock every time. Shouldn't need to do this
502 * really.
503 */
504 r = dm_bm_read_lock(pmd->bm, THIN_SUPERBLOCK_LOCATION,
505 &sb_validator, &sblock);
506 if (r)
507 return r;
508
509 disk_super = dm_block_data(sblock);
510 pmd->time = le32_to_cpu(disk_super->time);
511 pmd->root = le64_to_cpu(disk_super->data_mapping_root);
512 pmd->details_root = le64_to_cpu(disk_super->device_details_root);
513 pmd->trans_id = le64_to_cpu(disk_super->trans_id);
514 pmd->flags = le32_to_cpu(disk_super->flags);
515 pmd->data_block_size = le32_to_cpu(disk_super->data_block_size);
516
517 features = le32_to_cpu(disk_super->incompat_flags) & ~THIN_FEATURE_INCOMPAT_SUPP;
518 if (features) {
519 DMERR("could not access metadata due to "
520 "unsupported optional features (%lx).",
521 (unsigned long)features);
522 r = -EINVAL;
523 goto out;
524 }
525
526 /*
527 * Check for read-only metadata to skip the following RDWR checks.
528 */
529 if (get_disk_ro(pmd->bdev->bd_disk))
530 goto out;
531
532 features = le32_to_cpu(disk_super->compat_ro_flags) & ~THIN_FEATURE_COMPAT_RO_SUPP;
533 if (features) {
534 DMERR("could not access metadata RDWR due to "
535 "unsupported optional features (%lx).",
536 (unsigned long)features);
537 r = -EINVAL;
538 }
539
540out:
541 dm_bm_unlock(sblock);
542 return r;
543}
544
545static int __write_changed_details(struct dm_pool_metadata *pmd)
546{
547 int r;
548 struct dm_thin_device *td, *tmp;
549 struct disk_device_details details;
550 uint64_t key;
551
552 list_for_each_entry_safe(td, tmp, &pmd->thin_devices, list) {
553 if (!td->changed)
554 continue;
555
556 key = td->id;
557
558 details.mapped_blocks = cpu_to_le64(td->mapped_blocks);
559 details.transaction_id = cpu_to_le64(td->transaction_id);
560 details.creation_time = cpu_to_le32(td->creation_time);
561 details.snapshotted_time = cpu_to_le32(td->snapshotted_time);
562 __dm_bless_for_disk(&details);
563
564 r = dm_btree_insert(&pmd->details_info, pmd->details_root,
565 &key, &details, &pmd->details_root);
566 if (r)
567 return r;
568
569 if (td->open_count)
570 td->changed = 0;
571 else {
572 list_del(&td->list);
573 kfree(td);
574 }
575
576 pmd->need_commit = 1;
577 }
578
579 return 0;
580}
581
582static int __commit_transaction(struct dm_pool_metadata *pmd)
583{
584 /*
585 * FIXME: Associated pool should be made read-only on failure.
586 */
587 int r;
588 size_t metadata_len, data_len;
589 struct thin_disk_superblock *disk_super;
590 struct dm_block *sblock;
591
592 /*
593 * We need to know if the thin_disk_superblock exceeds a 512-byte sector.
594 */
595 BUILD_BUG_ON(sizeof(struct thin_disk_superblock) > 512);
596
597 r = __write_changed_details(pmd);
598 if (r < 0)
599 goto out;
600
601 if (!pmd->need_commit)
602 goto out;
603
604 r = dm_sm_commit(pmd->data_sm);
605 if (r < 0)
606 goto out;
607
608 r = dm_tm_pre_commit(pmd->tm);
609 if (r < 0)
610 goto out;
611
612 r = dm_sm_root_size(pmd->metadata_sm, &metadata_len);
613 if (r < 0)
614 goto out;
615
616 r = dm_sm_root_size(pmd->data_sm, &data_len);
617 if (r < 0)
618 goto out;
619
620 r = dm_bm_write_lock(pmd->bm, THIN_SUPERBLOCK_LOCATION,
621 &sb_validator, &sblock);
622 if (r)
623 goto out;
624
625 disk_super = dm_block_data(sblock);
626 disk_super->time = cpu_to_le32(pmd->time);
627 disk_super->data_mapping_root = cpu_to_le64(pmd->root);
628 disk_super->device_details_root = cpu_to_le64(pmd->details_root);
629 disk_super->trans_id = cpu_to_le64(pmd->trans_id);
630 disk_super->flags = cpu_to_le32(pmd->flags);
631
632 r = dm_sm_copy_root(pmd->metadata_sm, &disk_super->metadata_space_map_root,
633 metadata_len);
634 if (r < 0)
635 goto out_locked;
636
637 r = dm_sm_copy_root(pmd->data_sm, &disk_super->data_space_map_root,
638 data_len);
639 if (r < 0)
640 goto out_locked;
641
642 r = dm_tm_commit(pmd->tm, sblock);
643 if (!r)
644 pmd->need_commit = 0;
645
646out:
647 return r;
648
649out_locked:
650 dm_bm_unlock(sblock);
651 return r;
652}
653
654struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
655 sector_t data_block_size)
656{
657 int r;
658 struct thin_disk_superblock *disk_super;
659 struct dm_pool_metadata *pmd;
660 sector_t bdev_size = i_size_read(bdev->bd_inode) >> SECTOR_SHIFT;
661 struct dm_block_manager *bm;
662 int create;
663 struct dm_block *sblock;
664
665 pmd = kmalloc(sizeof(*pmd), GFP_KERNEL);
666 if (!pmd) {
667 DMERR("could not allocate metadata struct");
668 return ERR_PTR(-ENOMEM);
669 }
670
671 /*
672 * Max held locks:
673 * 3 for btree insert +
674 * 2 for btree lookup used within space map
675 */
676 bm = dm_block_manager_create(bdev, THIN_METADATA_BLOCK_SIZE,
677 THIN_METADATA_CACHE_SIZE, 5);
678 if (!bm) {
679 DMERR("could not create block manager");
680 kfree(pmd);
681 return ERR_PTR(-ENOMEM);
682 }
683
684 r = superblock_all_zeroes(bm, &create);
685 if (r) {
686 dm_block_manager_destroy(bm);
687 kfree(pmd);
688 return ERR_PTR(r);
689 }
690
691
692 r = init_pmd(pmd, bm, 0, create);
693 if (r) {
694 dm_block_manager_destroy(bm);
695 kfree(pmd);
696 return ERR_PTR(r);
697 }
698 pmd->bdev = bdev;
699
700 if (!create) {
701 r = __begin_transaction(pmd);
702 if (r < 0)
703 goto bad;
704 return pmd;
705 }
706
707 /*
708 * Create.
709 */
710 r = dm_bm_write_lock(pmd->bm, THIN_SUPERBLOCK_LOCATION,
711 &sb_validator, &sblock);
712 if (r)
713 goto bad;
714
715 disk_super = dm_block_data(sblock);
716 disk_super->magic = cpu_to_le64(THIN_SUPERBLOCK_MAGIC);
717 disk_super->version = cpu_to_le32(THIN_VERSION);
718 disk_super->time = 0;
719 disk_super->metadata_block_size = cpu_to_le32(THIN_METADATA_BLOCK_SIZE >> SECTOR_SHIFT);
720 disk_super->metadata_nr_blocks = cpu_to_le64(bdev_size >> SECTOR_TO_BLOCK_SHIFT);
721 disk_super->data_block_size = cpu_to_le32(data_block_size);
722
723 r = dm_bm_unlock(sblock);
724 if (r < 0)
725 goto bad;
726
727 r = dm_btree_empty(&pmd->info, &pmd->root);
728 if (r < 0)
729 goto bad;
730
731 r = dm_btree_empty(&pmd->details_info, &pmd->details_root);
732 if (r < 0) {
733 DMERR("couldn't create devices root");
734 goto bad;
735 }
736
737 pmd->flags = 0;
738 pmd->need_commit = 1;
739 r = dm_pool_commit_metadata(pmd);
740 if (r < 0) {
741 DMERR("%s: dm_pool_commit_metadata() failed, error = %d",
742 __func__, r);
743 goto bad;
744 }
745
746 return pmd;
747
748bad:
749 if (dm_pool_metadata_close(pmd) < 0)
750 DMWARN("%s: dm_pool_metadata_close() failed.", __func__);
751 return ERR_PTR(r);
752}
753
754int dm_pool_metadata_close(struct dm_pool_metadata *pmd)
755{
756 int r;
757 unsigned open_devices = 0;
758 struct dm_thin_device *td, *tmp;
759
760 down_read(&pmd->root_lock);
761 list_for_each_entry_safe(td, tmp, &pmd->thin_devices, list) {
762 if (td->open_count)
763 open_devices++;
764 else {
765 list_del(&td->list);
766 kfree(td);
767 }
768 }
769 up_read(&pmd->root_lock);
770
771 if (open_devices) {
772 DMERR("attempt to close pmd when %u device(s) are still open",
773 open_devices);
774 return -EBUSY;
775 }
776
777 r = __commit_transaction(pmd);
778 if (r < 0)
779 DMWARN("%s: __commit_transaction() failed, error = %d",
780 __func__, r);
781
782 dm_tm_destroy(pmd->tm);
783 dm_tm_destroy(pmd->nb_tm);
784 dm_block_manager_destroy(pmd->bm);
785 dm_sm_destroy(pmd->metadata_sm);
786 dm_sm_destroy(pmd->data_sm);
787 kfree(pmd);
788
789 return 0;
790}
791
792static int __open_device(struct dm_pool_metadata *pmd,
793 dm_thin_id dev, int create,
794 struct dm_thin_device **td)
795{
796 int r, changed = 0;
797 struct dm_thin_device *td2;
798 uint64_t key = dev;
799 struct disk_device_details details_le;
800
801 /*
802 * Check the device isn't already open.
803 */
804 list_for_each_entry(td2, &pmd->thin_devices, list)
805 if (td2->id == dev) {
806 td2->open_count++;
807 *td = td2;
808 return 0;
809 }
810
811 /*
812 * Check the device exists.
813 */
814 r = dm_btree_lookup(&pmd->details_info, pmd->details_root,
815 &key, &details_le);
816 if (r) {
817 if (r != -ENODATA || !create)
818 return r;
819
820 changed = 1;
821 details_le.mapped_blocks = 0;
822 details_le.transaction_id = cpu_to_le64(pmd->trans_id);
823 details_le.creation_time = cpu_to_le32(pmd->time);
824 details_le.snapshotted_time = cpu_to_le32(pmd->time);
825 }
826
827 *td = kmalloc(sizeof(**td), GFP_NOIO);
828 if (!*td)
829 return -ENOMEM;
830
831 (*td)->pmd = pmd;
832 (*td)->id = dev;
833 (*td)->open_count = 1;
834 (*td)->changed = changed;
835 (*td)->mapped_blocks = le64_to_cpu(details_le.mapped_blocks);
836 (*td)->transaction_id = le64_to_cpu(details_le.transaction_id);
837 (*td)->creation_time = le32_to_cpu(details_le.creation_time);
838 (*td)->snapshotted_time = le32_to_cpu(details_le.snapshotted_time);
839
840 list_add(&(*td)->list, &pmd->thin_devices);
841
842 return 0;
843}
844
845static void __close_device(struct dm_thin_device *td)
846{
847 --td->open_count;
848}
849
850static int __create_thin(struct dm_pool_metadata *pmd,
851 dm_thin_id dev)
852{
853 int r;
854 dm_block_t dev_root;
855 uint64_t key = dev;
856 struct disk_device_details details_le;
857 struct dm_thin_device *td;
858 __le64 value;
859
860 r = dm_btree_lookup(&pmd->details_info, pmd->details_root,
861 &key, &details_le);
862 if (!r)
863 return -EEXIST;
864
865 /*
866 * Create an empty btree for the mappings.
867 */
868 r = dm_btree_empty(&pmd->bl_info, &dev_root);
869 if (r)
870 return r;
871
872 /*
873 * Insert it into the main mapping tree.
874 */
875 value = cpu_to_le64(dev_root);
876 __dm_bless_for_disk(&value);
877 r = dm_btree_insert(&pmd->tl_info, pmd->root, &key, &value, &pmd->root);
878 if (r) {
879 dm_btree_del(&pmd->bl_info, dev_root);
880 return r;
881 }
882
883 r = __open_device(pmd, dev, 1, &td);
884 if (r) {
885 __close_device(td);
886 dm_btree_remove(&pmd->tl_info, pmd->root, &key, &pmd->root);
887 dm_btree_del(&pmd->bl_info, dev_root);
888 return r;
889 }
890 td->changed = 1;
891 __close_device(td);
892
893 return r;
894}
895
896int dm_pool_create_thin(struct dm_pool_metadata *pmd, dm_thin_id dev)
897{
898 int r;
899
900 down_write(&pmd->root_lock);
901 r = __create_thin(pmd, dev);
902 up_write(&pmd->root_lock);
903
904 return r;
905}
906
907static int __set_snapshot_details(struct dm_pool_metadata *pmd,
908 struct dm_thin_device *snap,
909 dm_thin_id origin, uint32_t time)
910{
911 int r;
912 struct dm_thin_device *td;
913
914 r = __open_device(pmd, origin, 0, &td);
915 if (r)
916 return r;
917
918 td->changed = 1;
919 td->snapshotted_time = time;
920
921 snap->mapped_blocks = td->mapped_blocks;
922 snap->snapshotted_time = time;
923 __close_device(td);
924
925 return 0;
926}
927
928static int __create_snap(struct dm_pool_metadata *pmd,
929 dm_thin_id dev, dm_thin_id origin)
930{
931 int r;
932 dm_block_t origin_root;
933 uint64_t key = origin, dev_key = dev;
934 struct dm_thin_device *td;
935 struct disk_device_details details_le;
936 __le64 value;
937
938 /* check this device is unused */
939 r = dm_btree_lookup(&pmd->details_info, pmd->details_root,
940 &dev_key, &details_le);
941 if (!r)
942 return -EEXIST;
943
944 /* find the mapping tree for the origin */
945 r = dm_btree_lookup(&pmd->tl_info, pmd->root, &key, &value);
946 if (r)
947 return r;
948 origin_root = le64_to_cpu(value);
949
950 /* clone the origin, an inc will do */
951 dm_tm_inc(pmd->tm, origin_root);
952
953 /* insert into the main mapping tree */
954 value = cpu_to_le64(origin_root);
955 __dm_bless_for_disk(&value);
956 key = dev;
957 r = dm_btree_insert(&pmd->tl_info, pmd->root, &key, &value, &pmd->root);
958 if (r) {
959 dm_tm_dec(pmd->tm, origin_root);
960 return r;
961 }
962
963 pmd->time++;
964
965 r = __open_device(pmd, dev, 1, &td);
966 if (r)
967 goto bad;
968
969 r = __set_snapshot_details(pmd, td, origin, pmd->time);
970 if (r)
971 goto bad;
972
973 __close_device(td);
974 return 0;
975
976bad:
977 __close_device(td);
978 dm_btree_remove(&pmd->tl_info, pmd->root, &key, &pmd->root);
979 dm_btree_remove(&pmd->details_info, pmd->details_root,
980 &key, &pmd->details_root);
981 return r;
982}
983
984int dm_pool_create_snap(struct dm_pool_metadata *pmd,
985 dm_thin_id dev,
986 dm_thin_id origin)
987{
988 int r;
989
990 down_write(&pmd->root_lock);
991 r = __create_snap(pmd, dev, origin);
992 up_write(&pmd->root_lock);
993
994 return r;
995}
996
997static int __delete_device(struct dm_pool_metadata *pmd, dm_thin_id dev)
998{
999 int r;
1000 uint64_t key = dev;
1001 struct dm_thin_device *td;
1002
1003 /* TODO: failure should mark the transaction invalid */
1004 r = __open_device(pmd, dev, 0, &td);
1005 if (r)
1006 return r;
1007
1008 if (td->open_count > 1) {
1009 __close_device(td);
1010 return -EBUSY;
1011 }
1012
1013 list_del(&td->list);
1014 kfree(td);
1015 r = dm_btree_remove(&pmd->details_info, pmd->details_root,
1016 &key, &pmd->details_root);
1017 if (r)
1018 return r;
1019
1020 r = dm_btree_remove(&pmd->tl_info, pmd->root, &key, &pmd->root);
1021 if (r)
1022 return r;
1023
1024 pmd->need_commit = 1;
1025
1026 return 0;
1027}
1028
1029int dm_pool_delete_thin_device(struct dm_pool_metadata *pmd,
1030 dm_thin_id dev)
1031{
1032 int r;
1033
1034 down_write(&pmd->root_lock);
1035 r = __delete_device(pmd, dev);
1036 up_write(&pmd->root_lock);
1037
1038 return r;
1039}
1040
1041int dm_pool_set_metadata_transaction_id(struct dm_pool_metadata *pmd,
1042 uint64_t current_id,
1043 uint64_t new_id)
1044{
1045 down_write(&pmd->root_lock);
1046 if (pmd->trans_id != current_id) {
1047 up_write(&pmd->root_lock);
1048 DMERR("mismatched transaction id");
1049 return -EINVAL;
1050 }
1051
1052 pmd->trans_id = new_id;
1053 pmd->need_commit = 1;
1054 up_write(&pmd->root_lock);
1055
1056 return 0;
1057}
1058
1059int dm_pool_get_metadata_transaction_id(struct dm_pool_metadata *pmd,
1060 uint64_t *result)
1061{
1062 down_read(&pmd->root_lock);
1063 *result = pmd->trans_id;
1064 up_read(&pmd->root_lock);
1065
1066 return 0;
1067}
1068
1069static int __get_held_metadata_root(struct dm_pool_metadata *pmd,
1070 dm_block_t *result)
1071{
1072 int r;
1073 struct thin_disk_superblock *disk_super;
1074 struct dm_block *sblock;
1075
1076 r = dm_bm_write_lock(pmd->bm, THIN_SUPERBLOCK_LOCATION,
1077 &sb_validator, &sblock);
1078 if (r)
1079 return r;
1080
1081 disk_super = dm_block_data(sblock);
1082 *result = le64_to_cpu(disk_super->held_root);
1083
1084 return dm_bm_unlock(sblock);
1085}
1086
1087int dm_pool_get_held_metadata_root(struct dm_pool_metadata *pmd,
1088 dm_block_t *result)
1089{
1090 int r;
1091
1092 down_read(&pmd->root_lock);
1093 r = __get_held_metadata_root(pmd, result);
1094 up_read(&pmd->root_lock);
1095
1096 return r;
1097}
1098
1099int dm_pool_open_thin_device(struct dm_pool_metadata *pmd, dm_thin_id dev,
1100 struct dm_thin_device **td)
1101{
1102 int r;
1103
1104 down_write(&pmd->root_lock);
1105 r = __open_device(pmd, dev, 0, td);
1106 up_write(&pmd->root_lock);
1107
1108 return r;
1109}
1110
1111int dm_pool_close_thin_device(struct dm_thin_device *td)
1112{
1113 down_write(&td->pmd->root_lock);
1114 __close_device(td);
1115 up_write(&td->pmd->root_lock);
1116
1117 return 0;
1118}
1119
1120dm_thin_id dm_thin_dev_id(struct dm_thin_device *td)
1121{
1122 return td->id;
1123}
1124
1125static int __snapshotted_since(struct dm_thin_device *td, uint32_t time)
1126{
1127 return td->snapshotted_time > time;
1128}
1129
1130int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
1131 int can_block, struct dm_thin_lookup_result *result)
1132{
1133 int r;
1134 uint64_t block_time = 0;
1135 __le64 value;
1136 struct dm_pool_metadata *pmd = td->pmd;
1137 dm_block_t keys[2] = { td->id, block };
1138
1139 if (can_block) {
1140 down_read(&pmd->root_lock);
1141 r = dm_btree_lookup(&pmd->info, pmd->root, keys, &value);
1142 if (!r)
1143 block_time = le64_to_cpu(value);
1144 up_read(&pmd->root_lock);
1145
1146 } else if (down_read_trylock(&pmd->root_lock)) {
1147 r = dm_btree_lookup(&pmd->nb_info, pmd->root, keys, &value);
1148 if (!r)
1149 block_time = le64_to_cpu(value);
1150 up_read(&pmd->root_lock);
1151
1152 } else
1153 return -EWOULDBLOCK;
1154
1155 if (!r) {
1156 dm_block_t exception_block;
1157 uint32_t exception_time;
1158 unpack_block_time(block_time, &exception_block,
1159 &exception_time);
1160 result->block = exception_block;
1161 result->shared = __snapshotted_since(td, exception_time);
1162 }
1163
1164 return r;
1165}
1166
1167static int __insert(struct dm_thin_device *td, dm_block_t block,
1168 dm_block_t data_block)
1169{
1170 int r, inserted;
1171 __le64 value;
1172 struct dm_pool_metadata *pmd = td->pmd;
1173 dm_block_t keys[2] = { td->id, block };
1174
1175 pmd->need_commit = 1;
1176 value = cpu_to_le64(pack_block_time(data_block, pmd->time));
1177 __dm_bless_for_disk(&value);
1178
1179 r = dm_btree_insert_notify(&pmd->info, pmd->root, keys, &value,
1180 &pmd->root, &inserted);
1181 if (r)
1182 return r;
1183
1184 if (inserted) {
1185 td->mapped_blocks++;
1186 td->changed = 1;
1187 }
1188
1189 return 0;
1190}
1191
1192int dm_thin_insert_block(struct dm_thin_device *td, dm_block_t block,
1193 dm_block_t data_block)
1194{
1195 int r;
1196
1197 down_write(&td->pmd->root_lock);
1198 r = __insert(td, block, data_block);
1199 up_write(&td->pmd->root_lock);
1200
1201 return r;
1202}
1203
1204static int __remove(struct dm_thin_device *td, dm_block_t block)
1205{
1206 int r;
1207 struct dm_pool_metadata *pmd = td->pmd;
1208 dm_block_t keys[2] = { td->id, block };
1209
1210 r = dm_btree_remove(&pmd->info, pmd->root, keys, &pmd->root);
1211 if (r)
1212 return r;
1213
1214 pmd->need_commit = 1;
1215
1216 return 0;
1217}
1218
1219int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block)
1220{
1221 int r;
1222
1223 down_write(&td->pmd->root_lock);
1224 r = __remove(td, block);
1225 up_write(&td->pmd->root_lock);
1226
1227 return r;
1228}
1229
1230int dm_pool_alloc_data_block(struct dm_pool_metadata *pmd, dm_block_t *result)
1231{
1232 int r;
1233
1234 down_write(&pmd->root_lock);
1235
1236 r = dm_sm_new_block(pmd->data_sm, result);
1237 pmd->need_commit = 1;
1238
1239 up_write(&pmd->root_lock);
1240
1241 return r;
1242}
1243
1244int dm_pool_commit_metadata(struct dm_pool_metadata *pmd)
1245{
1246 int r;
1247
1248 down_write(&pmd->root_lock);
1249
1250 r = __commit_transaction(pmd);
1251 if (r <= 0)
1252 goto out;
1253
1254 /*
1255 * Open the next transaction.
1256 */
1257 r = __begin_transaction(pmd);
1258out:
1259 up_write(&pmd->root_lock);
1260 return r;
1261}
1262
1263int dm_pool_get_free_block_count(struct dm_pool_metadata *pmd, dm_block_t *result)
1264{
1265 int r;
1266
1267 down_read(&pmd->root_lock);
1268 r = dm_sm_get_nr_free(pmd->data_sm, result);
1269 up_read(&pmd->root_lock);
1270
1271 return r;
1272}
1273
1274int dm_pool_get_free_metadata_block_count(struct dm_pool_metadata *pmd,
1275 dm_block_t *result)
1276{
1277 int r;
1278
1279 down_read(&pmd->root_lock);
1280 r = dm_sm_get_nr_free(pmd->metadata_sm, result);
1281 up_read(&pmd->root_lock);
1282
1283 return r;
1284}
1285
1286int dm_pool_get_metadata_dev_size(struct dm_pool_metadata *pmd,
1287 dm_block_t *result)
1288{
1289 int r;
1290
1291 down_read(&pmd->root_lock);
1292 r = dm_sm_get_nr_blocks(pmd->metadata_sm, result);
1293 up_read(&pmd->root_lock);
1294
1295 return r;
1296}
1297
1298int dm_pool_get_data_block_size(struct dm_pool_metadata *pmd, sector_t *result)
1299{
1300 down_read(&pmd->root_lock);
1301 *result = pmd->data_block_size;
1302 up_read(&pmd->root_lock);
1303
1304 return 0;
1305}
1306
1307int dm_pool_get_data_dev_size(struct dm_pool_metadata *pmd, dm_block_t *result)
1308{
1309 int r;
1310
1311 down_read(&pmd->root_lock);
1312 r = dm_sm_get_nr_blocks(pmd->data_sm, result);
1313 up_read(&pmd->root_lock);
1314
1315 return r;
1316}
1317
1318int dm_thin_get_mapped_count(struct dm_thin_device *td, dm_block_t *result)
1319{
1320 struct dm_pool_metadata *pmd = td->pmd;
1321
1322 down_read(&pmd->root_lock);
1323 *result = td->mapped_blocks;
1324 up_read(&pmd->root_lock);
1325
1326 return 0;
1327}
1328
1329static int __highest_block(struct dm_thin_device *td, dm_block_t *result)
1330{
1331 int r;
1332 __le64 value_le;
1333 dm_block_t thin_root;
1334 struct dm_pool_metadata *pmd = td->pmd;
1335
1336 r = dm_btree_lookup(&pmd->tl_info, pmd->root, &td->id, &value_le);
1337 if (r)
1338 return r;
1339
1340 thin_root = le64_to_cpu(value_le);
1341
1342 return dm_btree_find_highest_key(&pmd->bl_info, thin_root, result);
1343}
1344
1345int dm_thin_get_highest_mapped_block(struct dm_thin_device *td,
1346 dm_block_t *result)
1347{
1348 int r;
1349 struct dm_pool_metadata *pmd = td->pmd;
1350
1351 down_read(&pmd->root_lock);
1352 r = __highest_block(td, result);
1353 up_read(&pmd->root_lock);
1354
1355 return r;
1356}
1357
1358static int __resize_data_dev(struct dm_pool_metadata *pmd, dm_block_t new_count)
1359{
1360 int r;
1361 dm_block_t old_count;
1362
1363 r = dm_sm_get_nr_blocks(pmd->data_sm, &old_count);
1364 if (r)
1365 return r;
1366
1367 if (new_count == old_count)
1368 return 0;
1369
1370 if (new_count < old_count) {
1371 DMERR("cannot reduce size of data device");
1372 return -EINVAL;
1373 }
1374
1375 r = dm_sm_extend(pmd->data_sm, new_count - old_count);
1376 if (!r)
1377 pmd->need_commit = 1;
1378
1379 return r;
1380}
1381
1382int dm_pool_resize_data_dev(struct dm_pool_metadata *pmd, dm_block_t new_count)
1383{
1384 int r;
1385
1386 down_write(&pmd->root_lock);
1387 r = __resize_data_dev(pmd, new_count);
1388 up_write(&pmd->root_lock);
1389
1390 return r;
1391}
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
new file mode 100644
index 00000000000..859c1689687
--- /dev/null
+++ b/drivers/md/dm-thin-metadata.h
@@ -0,0 +1,156 @@
1/*
2 * Copyright (C) 2010-2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef DM_THIN_METADATA_H
8#define DM_THIN_METADATA_H
9
10#include "persistent-data/dm-block-manager.h"
11
12#define THIN_METADATA_BLOCK_SIZE 4096
13
14/*----------------------------------------------------------------*/
15
16struct dm_pool_metadata;
17struct dm_thin_device;
18
19/*
20 * Device identifier
21 */
22typedef uint64_t dm_thin_id;
23
24/*
25 * Reopens or creates a new, empty metadata volume.
26 */
27struct dm_pool_metadata *dm_pool_metadata_open(struct block_device *bdev,
28 sector_t data_block_size);
29
30int dm_pool_metadata_close(struct dm_pool_metadata *pmd);
31
32/*
33 * Compat feature flags. Any incompat flags beyond the ones
34 * specified below will prevent use of the thin metadata.
35 */
36#define THIN_FEATURE_COMPAT_SUPP 0UL
37#define THIN_FEATURE_COMPAT_RO_SUPP 0UL
38#define THIN_FEATURE_INCOMPAT_SUPP 0UL
39
40/*
41 * Device creation/deletion.
42 */
43int dm_pool_create_thin(struct dm_pool_metadata *pmd, dm_thin_id dev);
44
45/*
46 * An internal snapshot.
47 *
48 * You can only snapshot a quiesced origin i.e. one that is either
49 * suspended or not instantiated at all.
50 */
51int dm_pool_create_snap(struct dm_pool_metadata *pmd, dm_thin_id dev,
52 dm_thin_id origin);
53
54/*
55 * Deletes a virtual device from the metadata. It _is_ safe to call this
56 * when that device is open. Operations on that device will just start
57 * failing. You still need to call close() on the device.
58 */
59int dm_pool_delete_thin_device(struct dm_pool_metadata *pmd,
60 dm_thin_id dev);
61
62/*
63 * Commits _all_ metadata changes: device creation, deletion, mapping
64 * updates.
65 */
66int dm_pool_commit_metadata(struct dm_pool_metadata *pmd);
67
68/*
69 * Set/get userspace transaction id.
70 */
71int dm_pool_set_metadata_transaction_id(struct dm_pool_metadata *pmd,
72 uint64_t current_id,
73 uint64_t new_id);
74
75int dm_pool_get_metadata_transaction_id(struct dm_pool_metadata *pmd,
76 uint64_t *result);
77
78/*
79 * Hold/get root for userspace transaction.
80 */
81int dm_pool_hold_metadata_root(struct dm_pool_metadata *pmd);
82
83int dm_pool_get_held_metadata_root(struct dm_pool_metadata *pmd,
84 dm_block_t *result);
85
86/*
87 * Actions on a single virtual device.
88 */
89
90/*
91 * A device may be opened more than once; pair each open with a close.
92 */
93int dm_pool_open_thin_device(struct dm_pool_metadata *pmd, dm_thin_id dev,
94 struct dm_thin_device **td);
95
96int dm_pool_close_thin_device(struct dm_thin_device *td);
97
98dm_thin_id dm_thin_dev_id(struct dm_thin_device *td);
99
100struct dm_thin_lookup_result {
101 dm_block_t block;
102 int shared;
103};
104
105/*
106 * Returns:
107 * -EWOULDBLOCK iff @can_block is set and would block.
108 * -ENODATA iff that mapping is not present.
109 * 0 success
110 */
111int dm_thin_find_block(struct dm_thin_device *td, dm_block_t block,
112 int can_block, struct dm_thin_lookup_result *result);
113
114/*
115 * Obtain an unused block.
116 */
117int dm_pool_alloc_data_block(struct dm_pool_metadata *pmd, dm_block_t *result);
118
119/*
120 * Insert or remove block.
121 */
122int dm_thin_insert_block(struct dm_thin_device *td, dm_block_t block,
123 dm_block_t data_block);
124
125int dm_thin_remove_block(struct dm_thin_device *td, dm_block_t block);
126
127/*
128 * Queries.
129 */
130int dm_thin_get_highest_mapped_block(struct dm_thin_device *td,
131 dm_block_t *highest_mapped);
132
133int dm_thin_get_mapped_count(struct dm_thin_device *td, dm_block_t *result);
134
135int dm_pool_get_free_block_count(struct dm_pool_metadata *pmd,
136 dm_block_t *result);
137
138int dm_pool_get_free_metadata_block_count(struct dm_pool_metadata *pmd,
139 dm_block_t *result);
140
141int dm_pool_get_metadata_dev_size(struct dm_pool_metadata *pmd,
142 dm_block_t *result);
143
144int dm_pool_get_data_block_size(struct dm_pool_metadata *pmd, sector_t *result);
145
146int dm_pool_get_data_dev_size(struct dm_pool_metadata *pmd, dm_block_t *result);
147
148/*
149 * Fails if the new size is smaller than the current size, since already
150 * allocated blocks would be lost.
151 */
152int dm_pool_resize_data_dev(struct dm_pool_metadata *pmd, dm_block_t new_size);
153
154/*----------------------------------------------------------------*/
155
156#endif
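
As an illustrative aside (not part of the patch above): the following is a minimal sketch of how a caller might compose the API declared in dm-thin-metadata.h to create a thin device and map a single block. It uses only functions declared in the header; the device id (0), virtual block (0) and 128-sector (64KB) data block size are assumptions made up for the example, and error handling is abbreviated.

#include <linux/err.h>
#include "dm-thin-metadata.h"

static int example_provision_one_block(struct block_device *metadata_bdev)
{
	struct dm_pool_metadata *pmd;
	struct dm_thin_device *td;
	dm_block_t data_block;
	int r;

	/* Reopen (or create) the pool metadata with 64KB data blocks. */
	pmd = dm_pool_metadata_open(metadata_bdev, 128);
	if (IS_ERR(pmd))
		return PTR_ERR(pmd);

	/* Create thin device 0 and open it. */
	r = dm_pool_create_thin(pmd, 0);
	if (!r)
		r = dm_pool_open_thin_device(pmd, 0, &td);
	if (r)
		goto out_close_pool;

	/* Grab an unused data block and map virtual block 0 onto it. */
	r = dm_pool_alloc_data_block(pmd, &data_block);
	if (!r)
		r = dm_thin_insert_block(td, 0, data_block);

	/* Nothing is durable until the metadata is committed. */
	if (!r)
		r = dm_pool_commit_metadata(pmd);

	dm_pool_close_thin_device(td);
out_close_pool:
	dm_pool_metadata_close(pmd);
	return r;
}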
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
new file mode 100644
index 00000000000..c3087575fef
--- /dev/null
+++ b/drivers/md/dm-thin.c
@@ -0,0 +1,2428 @@
1/*
2 * Copyright (C) 2011 Red Hat UK.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-thin-metadata.h"
8
9#include <linux/device-mapper.h>
10#include <linux/dm-io.h>
11#include <linux/dm-kcopyd.h>
12#include <linux/list.h>
13#include <linux/init.h>
14#include <linux/module.h>
15#include <linux/slab.h>
16
17#define DM_MSG_PREFIX "thin"
18
19/*
20 * Tunable constants
21 */
22#define ENDIO_HOOK_POOL_SIZE 10240
23#define DEFERRED_SET_SIZE 64
24#define MAPPING_POOL_SIZE 1024
25#define PRISON_CELLS 1024
26
27/*
28 * The block size of the device holding pool data must be
29 * between 64KB and 1GB.
30 */
31#define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (64 * 1024 >> SECTOR_SHIFT)
32#define DATA_DEV_BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)
33
34/*
35 * The metadata device is currently limited in size. The limitation is
36 * checked lower down in dm-space-map-metadata, but we also check it here
37 * so we can fail early.
38 *
39 * We have one block of index, which can hold 255 index entries. Each
40 * index entry contains allocation info about 16k metadata blocks.
41 */
42#define METADATA_DEV_MAX_SECTORS (255 * (1 << 14) * (THIN_METADATA_BLOCK_SIZE / (1 << SECTOR_SHIFT)))
43
44/*
45 * Device id is restricted to 24 bits.
46 */
47#define MAX_DEV_ID ((1 << 24) - 1)
48
49/*
50 * How do we handle breaking sharing of data blocks?
51 * =================================================
52 *
53 * We use a standard copy-on-write btree to store the mappings for the
54 * devices (note I'm talking about copy-on-write of the metadata here, not
55 * the data). When you take an internal snapshot you clone the root node
56 * of the origin btree. After this there is no concept of an origin or a
57 * snapshot. They are just two device trees that happen to point to the
58 * same data blocks.
59 *
60 * When we get a write in we decide if it's to a shared data block using
61 * some timestamp magic. If it is, we have to break sharing.
62 *
63 * Let's say we write to a shared block in what was the origin. The
64 * steps are:
65 *
66 * i) plug io further to this physical block. (see bio_prison code).
67 *
68 * ii) quiesce any read io to that shared data block. Obviously
69 * including all devices that share this block. (see deferred_set code)
70 *
71 * iii) copy the data block to a newly allocated block. This step can be
72 * skipped if the io covers the whole block. (schedule_copy).
73 *
74 * iv) insert the new mapping into the origin's btree
75 * (process_prepared_mappings). This act of inserting breaks some
76 * sharing of btree nodes between the two devices. Breaking sharing only
77 * affects the btree of that specific device. Btrees for the other
78 * devices that share the block never change. The btree for the origin
79 * device as it was after the last commit is untouched, i.e. we're using
80 * persistent data structures in the functional programming sense.
81 *
82 * v) unplug io to this physical block, including the io that triggered
83 * the breaking of sharing.
84 *
85 * Steps (ii) and (iii) occur in parallel.
86 *
87 * The metadata _doesn't_ need to be committed before the io continues. We
88 * get away with this because the io is always written to a _new_ block.
89 * If there's a crash, then:
90 *
91 * - The origin mapping will point to the old origin block (the shared
92 * one). This will contain the data as it was before the io that triggered
93 * the breaking of sharing came in.
94 *
95 * - The snap mapping still points to the old block. As it would after
96 * the commit.
97 *
98 * The downside of this scheme is the timestamp magic isn't perfect, and
99 * will continue to think that data block in the snapshot device is shared
100 * even after the write to the origin has broken sharing. I suspect data
101 * blocks will typically be shared by many different devices, so we're
102 * breaking sharing n + 1 times, rather than n, where n is the number of
103 * devices that reference this data block. At the moment I think the
104 * benefits far, far outweigh the disadvantages.
105 */
106
107/*----------------------------------------------------------------*/
108
109/*
110 * Sometimes we can't deal with a bio straight away, so we put it in prison
111 * where it can't cause any mischief. Bios are put in a cell identified
112 * by a key; multiple bios can be in the same cell. When the cell is
113 * subsequently unlocked the bios become available.
114 */
115struct bio_prison;
116
117struct cell_key {
118 int virtual;
119 dm_thin_id dev;
120 dm_block_t block;
121};
122
123struct cell {
124 struct hlist_node list;
125 struct bio_prison *prison;
126 struct cell_key key;
127 unsigned count;
128 struct bio_list bios;
129};
130
131struct bio_prison {
132 spinlock_t lock;
133 mempool_t *cell_pool;
134
135 unsigned nr_buckets;
136 unsigned hash_mask;
137 struct hlist_head *cells;
138};
139
140static uint32_t calc_nr_buckets(unsigned nr_cells)
141{
142 uint32_t n = 128;
143
144 nr_cells /= 4;
145 nr_cells = min(nr_cells, 8192u);
146
147 while (n < nr_cells)
148 n <<= 1;
149
150 return n;
151}
152
153/*
154 * @nr_cells should be the number of cells you want in use _concurrently_.
155 * Don't confuse it with the number of distinct keys.
156 */
157static struct bio_prison *prison_create(unsigned nr_cells)
158{
159 unsigned i;
160 uint32_t nr_buckets = calc_nr_buckets(nr_cells);
161 size_t len = sizeof(struct bio_prison) +
162 (sizeof(struct hlist_head) * nr_buckets);
163 struct bio_prison *prison = kmalloc(len, GFP_KERNEL);
164
165 if (!prison)
166 return NULL;
167
168 spin_lock_init(&prison->lock);
169 prison->cell_pool = mempool_create_kmalloc_pool(nr_cells,
170 sizeof(struct cell));
171 if (!prison->cell_pool) {
172 kfree(prison);
173 return NULL;
174 }
175
176 prison->nr_buckets = nr_buckets;
177 prison->hash_mask = nr_buckets - 1;
178 prison->cells = (struct hlist_head *) (prison + 1);
179 for (i = 0; i < nr_buckets; i++)
180 INIT_HLIST_HEAD(prison->cells + i);
181
182 return prison;
183}
184
185static void prison_destroy(struct bio_prison *prison)
186{
187 mempool_destroy(prison->cell_pool);
188 kfree(prison);
189}
190
191static uint32_t hash_key(struct bio_prison *prison, struct cell_key *key)
192{
193 const unsigned long BIG_PRIME = 4294967291UL;
194 uint64_t hash = key->block * BIG_PRIME;
195
196 return (uint32_t) (hash & prison->hash_mask);
197}
198
199static int keys_equal(struct cell_key *lhs, struct cell_key *rhs)
200{
201 return (lhs->virtual == rhs->virtual) &&
202 (lhs->dev == rhs->dev) &&
203 (lhs->block == rhs->block);
204}
205
206static struct cell *__search_bucket(struct hlist_head *bucket,
207 struct cell_key *key)
208{
209 struct cell *cell;
210 struct hlist_node *tmp;
211
212 hlist_for_each_entry(cell, tmp, bucket, list)
213 if (keys_equal(&cell->key, key))
214 return cell;
215
216 return NULL;
217}
218
219/*
220 * This may block if a new cell needs allocating. You must ensure that
221 * cells will be unlocked even if the calling thread is blocked.
222 *
223 * Returns the number of entries in the cell prior to the new addition
224 * or < 0 on failure.
225 */
226static int bio_detain(struct bio_prison *prison, struct cell_key *key,
227 struct bio *inmate, struct cell **ref)
228{
229 int r;
230 unsigned long flags;
231 uint32_t hash = hash_key(prison, key);
232 struct cell *uninitialized_var(cell), *cell2 = NULL;
233
234 BUG_ON(hash > prison->nr_buckets);
235
236 spin_lock_irqsave(&prison->lock, flags);
237 cell = __search_bucket(prison->cells + hash, key);
238
239 if (!cell) {
240 /*
241 * Allocate a new cell
242 */
243 spin_unlock_irqrestore(&prison->lock, flags);
244 cell2 = mempool_alloc(prison->cell_pool, GFP_NOIO);
245 spin_lock_irqsave(&prison->lock, flags);
246
247 /*
248 * We've been unlocked, so we have to double check that
249 * nobody else has inserted this cell in the meantime.
250 */
251 cell = __search_bucket(prison->cells + hash, key);
252
253 if (!cell) {
254 cell = cell2;
255 cell2 = NULL;
256
257 cell->prison = prison;
258 memcpy(&cell->key, key, sizeof(cell->key));
259 cell->count = 0;
260 bio_list_init(&cell->bios);
261 hlist_add_head(&cell->list, prison->cells + hash);
262 }
263 }
264
265 r = cell->count++;
266 bio_list_add(&cell->bios, inmate);
267 spin_unlock_irqrestore(&prison->lock, flags);
268
269 if (cell2)
270 mempool_free(cell2, prison->cell_pool);
271
272 *ref = cell;
273
274 return r;
275}
276
277/*
278 * @inmates must have been initialised prior to this call
279 */
280static void __cell_release(struct cell *cell, struct bio_list *inmates)
281{
282 struct bio_prison *prison = cell->prison;
283
284 hlist_del(&cell->list);
285
286 if (inmates)
287 bio_list_merge(inmates, &cell->bios);
288
289 mempool_free(cell, prison->cell_pool);
290}
291
292static void cell_release(struct cell *cell, struct bio_list *bios)
293{
294 unsigned long flags;
295 struct bio_prison *prison = cell->prison;
296
297 spin_lock_irqsave(&prison->lock, flags);
298 __cell_release(cell, bios);
299 spin_unlock_irqrestore(&prison->lock, flags);
300}
301
302/*
303 * There are a couple of places where we put a bio into a cell briefly
304 * before taking it out again. In these situations we know that no other
305 * bio may be in the cell. This function releases the cell, and also does
306 * a sanity check.
307 */
308static void cell_release_singleton(struct cell *cell, struct bio *bio)
309{
310 struct bio_prison *prison = cell->prison;
311 struct bio_list bios;
312 struct bio *b;
313 unsigned long flags;
314
315 bio_list_init(&bios);
316
317 spin_lock_irqsave(&prison->lock, flags);
318 __cell_release(cell, &bios);
319 spin_unlock_irqrestore(&prison->lock, flags);
320
321 b = bio_list_pop(&bios);
322 BUG_ON(b != bio);
323 BUG_ON(!bio_list_empty(&bios));
324}
325
326static void cell_error(struct cell *cell)
327{
328 struct bio_prison *prison = cell->prison;
329 struct bio_list bios;
330 struct bio *bio;
331 unsigned long flags;
332
333 bio_list_init(&bios);
334
335 spin_lock_irqsave(&prison->lock, flags);
336 __cell_release(cell, &bios);
337 spin_unlock_irqrestore(&prison->lock, flags);
338
339 while ((bio = bio_list_pop(&bios)))
340 bio_io_error(bio);
341}
342
343/*----------------------------------------------------------------*/
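
/*
 * Illustrative sketch (not part of the driver): the detain /
 * release-singleton idiom described above.  A caller builds a key for
 * the block it wants to work on and detains its bio; if the cell was
 * already occupied the bio simply waits there, otherwise the caller
 * knows it is alone in the cell and can take the bio straight back out.
 * The key values are assumptions for the example, and this relies on
 * the driver's convention that a single worker thread is the only one
 * detaining bios for a given key.
 */
static int example_try_claim(struct bio_prison *prison, struct bio *bio,
			     dm_thin_id dev, dm_block_t block)
{
	struct cell_key key = { .virtual = 1, .dev = dev, .block = block };
	struct cell *cell;

	if (bio_detain(prison, &key, bio, &cell))
		return 0;	/* someone is already handling this block */

	/* We were first in; reclaim the lone bio and process it ourselves. */
	cell_release_singleton(cell, bio);
	return 1;
}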
344
345/*
346 * We use the deferred set to keep track of pending reads to shared blocks.
347 * We do this to ensure the new mapping caused by a write isn't performed
348 * until these prior reads have completed. Otherwise the insertion of the
349 * new mapping could free the old block that the read bios are mapped to.
350 */
351
352struct deferred_set;
353struct deferred_entry {
354 struct deferred_set *ds;
355 unsigned count;
356 struct list_head work_items;
357};
358
359struct deferred_set {
360 spinlock_t lock;
361 unsigned current_entry;
362 unsigned sweeper;
363 struct deferred_entry entries[DEFERRED_SET_SIZE];
364};
365
366static void ds_init(struct deferred_set *ds)
367{
368 int i;
369
370 spin_lock_init(&ds->lock);
371 ds->current_entry = 0;
372 ds->sweeper = 0;
373 for (i = 0; i < DEFERRED_SET_SIZE; i++) {
374 ds->entries[i].ds = ds;
375 ds->entries[i].count = 0;
376 INIT_LIST_HEAD(&ds->entries[i].work_items);
377 }
378}
379
380static struct deferred_entry *ds_inc(struct deferred_set *ds)
381{
382 unsigned long flags;
383 struct deferred_entry *entry;
384
385 spin_lock_irqsave(&ds->lock, flags);
386 entry = ds->entries + ds->current_entry;
387 entry->count++;
388 spin_unlock_irqrestore(&ds->lock, flags);
389
390 return entry;
391}
392
393static unsigned ds_next(unsigned index)
394{
395 return (index + 1) % DEFERRED_SET_SIZE;
396}
397
398static void __sweep(struct deferred_set *ds, struct list_head *head)
399{
400 while ((ds->sweeper != ds->current_entry) &&
401 !ds->entries[ds->sweeper].count) {
402 list_splice_init(&ds->entries[ds->sweeper].work_items, head);
403 ds->sweeper = ds_next(ds->sweeper);
404 }
405
406 if ((ds->sweeper == ds->current_entry) && !ds->entries[ds->sweeper].count)
407 list_splice_init(&ds->entries[ds->sweeper].work_items, head);
408}
409
410static void ds_dec(struct deferred_entry *entry, struct list_head *head)
411{
412 unsigned long flags;
413
414 spin_lock_irqsave(&entry->ds->lock, flags);
415 BUG_ON(!entry->count);
416 --entry->count;
417 __sweep(entry->ds, head);
418 spin_unlock_irqrestore(&entry->ds->lock, flags);
419}
420
421/*
422 * Returns 1 if the work was deferred, or 0 if there were no pending items to delay the job.
423 */
424static int ds_add_work(struct deferred_set *ds, struct list_head *work)
425{
426 int r = 1;
427 unsigned long flags;
428 unsigned next_entry;
429
430 spin_lock_irqsave(&ds->lock, flags);
431 if ((ds->sweeper == ds->current_entry) &&
432 !ds->entries[ds->current_entry].count)
433 r = 0;
434 else {
435 list_add(work, &ds->entries[ds->current_entry].work_items);
436 next_entry = ds_next(ds->current_entry);
437 if (!ds->entries[next_entry].count)
438 ds->current_entry = next_entry;
439 }
440 spin_unlock_irqrestore(&ds->lock, flags);
441
442 return r;
443}
444
445/*----------------------------------------------------------------*/
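
/*
 * Illustrative sketch (not part of the driver): how the deferred set
 * above is meant to be used.  A reader registers itself with ds_inc()
 * before touching a shared block and drops out with ds_dec() when done;
 * dependent work queued with ds_add_work() either runs straight away
 * (return value 0) or is handed back, via the list passed to ds_dec(),
 * once every earlier reader has finished.  The example_job structure
 * and run_example_job() are assumptions made up for the example.
 */
struct example_job {
	struct list_head list;
};

static void run_example_job(struct example_job *job)
{
}

static void example_writer(struct deferred_set *ds, struct example_job *job)
{
	if (!ds_add_work(ds, &job->list))
		run_example_job(job);	/* no readers in flight */
}

static void example_reader_done(struct deferred_entry *entry)
{
	struct example_job *job, *tmp;
	LIST_HEAD(ready);

	ds_dec(entry, &ready);		/* entry came from a prior ds_inc() */
	list_for_each_entry_safe(job, tmp, &ready, list) {
		list_del(&job->list);
		run_example_job(job);
	}
}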
446
447/*
448 * Key building.
449 */
450static void build_data_key(struct dm_thin_device *td,
451 dm_block_t b, struct cell_key *key)
452{
453 key->virtual = 0;
454 key->dev = dm_thin_dev_id(td);
455 key->block = b;
456}
457
458static void build_virtual_key(struct dm_thin_device *td, dm_block_t b,
459 struct cell_key *key)
460{
461 key->virtual = 1;
462 key->dev = dm_thin_dev_id(td);
463 key->block = b;
464}
465
466/*----------------------------------------------------------------*/
467
468/*
469 * A pool device ties together a metadata device and a data device. It
470 * also provides the interface for creating and destroying internal
471 * devices.
472 */
473struct new_mapping;
474struct pool {
475 struct list_head list;
476 struct dm_target *ti; /* Only set if a pool target is bound */
477
478 struct mapped_device *pool_md;
479 struct block_device *md_dev;
480 struct dm_pool_metadata *pmd;
481
482 uint32_t sectors_per_block;
483 unsigned block_shift;
484 dm_block_t offset_mask;
485 dm_block_t low_water_blocks;
486
487 unsigned zero_new_blocks:1;
488 unsigned low_water_triggered:1; /* A dm event has been sent */
489 unsigned no_free_space:1; /* A -ENOSPC warning has been issued */
490
491 struct bio_prison *prison;
492 struct dm_kcopyd_client *copier;
493
494 struct workqueue_struct *wq;
495 struct work_struct worker;
496
497 unsigned ref_count;
498
499 spinlock_t lock;
500 struct bio_list deferred_bios;
501 struct bio_list deferred_flush_bios;
502 struct list_head prepared_mappings;
503
504 struct bio_list retry_on_resume_list;
505
506 struct deferred_set ds; /* FIXME: move to thin_c */
507
508 struct new_mapping *next_mapping;
509 mempool_t *mapping_pool;
510 mempool_t *endio_hook_pool;
511};
512
513/*
514 * Target context for a pool.
515 */
516struct pool_c {
517 struct dm_target *ti;
518 struct pool *pool;
519 struct dm_dev *data_dev;
520 struct dm_dev *metadata_dev;
521 struct dm_target_callbacks callbacks;
522
523 dm_block_t low_water_blocks;
524 unsigned zero_new_blocks:1;
525};
526
527/*
528 * Target context for a thin.
529 */
530struct thin_c {
531 struct dm_dev *pool_dev;
532 dm_thin_id dev_id;
533
534 struct pool *pool;
535 struct dm_thin_device *td;
536};
537
538/*----------------------------------------------------------------*/
539
540/*
541 * A global list of pools that uses a struct mapped_device as a key.
542 */
543static struct dm_thin_pool_table {
544 struct mutex mutex;
545 struct list_head pools;
546} dm_thin_pool_table;
547
548static void pool_table_init(void)
549{
550 mutex_init(&dm_thin_pool_table.mutex);
551 INIT_LIST_HEAD(&dm_thin_pool_table.pools);
552}
553
554static void __pool_table_insert(struct pool *pool)
555{
556 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
557 list_add(&pool->list, &dm_thin_pool_table.pools);
558}
559
560static void __pool_table_remove(struct pool *pool)
561{
562 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
563 list_del(&pool->list);
564}
565
566static struct pool *__pool_table_lookup(struct mapped_device *md)
567{
568 struct pool *pool = NULL, *tmp;
569
570 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
571
572 list_for_each_entry(tmp, &dm_thin_pool_table.pools, list) {
573 if (tmp->pool_md == md) {
574 pool = tmp;
575 break;
576 }
577 }
578
579 return pool;
580}
581
582static struct pool *__pool_table_lookup_metadata_dev(struct block_device *md_dev)
583{
584 struct pool *pool = NULL, *tmp;
585
586 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
587
588 list_for_each_entry(tmp, &dm_thin_pool_table.pools, list) {
589 if (tmp->md_dev == md_dev) {
590 pool = tmp;
591 break;
592 }
593 }
594
595 return pool;
596}
597
598/*----------------------------------------------------------------*/
599
600static void __requeue_bio_list(struct thin_c *tc, struct bio_list *master)
601{
602 struct bio *bio;
603 struct bio_list bios;
604
605 bio_list_init(&bios);
606 bio_list_merge(&bios, master);
607 bio_list_init(master);
608
609 while ((bio = bio_list_pop(&bios))) {
610 if (dm_get_mapinfo(bio)->ptr == tc)
611 bio_endio(bio, DM_ENDIO_REQUEUE);
612 else
613 bio_list_add(master, bio);
614 }
615}
616
617static void requeue_io(struct thin_c *tc)
618{
619 struct pool *pool = tc->pool;
620 unsigned long flags;
621
622 spin_lock_irqsave(&pool->lock, flags);
623 __requeue_bio_list(tc, &pool->deferred_bios);
624 __requeue_bio_list(tc, &pool->retry_on_resume_list);
625 spin_unlock_irqrestore(&pool->lock, flags);
626}
627
628/*
629 * This section of code contains the logic for processing a thin device's IO.
630 * Much of the code depends on pool object resources (lists, workqueues, etc.),
631 * but most of it is called exclusively from the thin target rather than the
632 * thin-pool target.
633 */
634
635static dm_block_t get_bio_block(struct thin_c *tc, struct bio *bio)
636{
637 return bio->bi_sector >> tc->pool->block_shift;
638}
639
640static void remap(struct thin_c *tc, struct bio *bio, dm_block_t block)
641{
642 struct pool *pool = tc->pool;
643
644 bio->bi_bdev = tc->pool_dev->bdev;
645 bio->bi_sector = (block << pool->block_shift) +
646 (bio->bi_sector & pool->offset_mask);
647}
648
649static void remap_and_issue(struct thin_c *tc, struct bio *bio,
650 dm_block_t block)
651{
652 struct pool *pool = tc->pool;
653 unsigned long flags;
654
655 remap(tc, bio, block);
656
657 /*
658 * Batch together any FUA/FLUSH bios we find and then issue
659 * a single commit for them in process_deferred_bios().
660 */
661 if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
662 spin_lock_irqsave(&pool->lock, flags);
663 bio_list_add(&pool->deferred_flush_bios, bio);
664 spin_unlock_irqrestore(&pool->lock, flags);
665 } else
666 generic_make_request(bio);
667}
668
669/*
670 * wake_worker() is used when new work is queued and when pool_resume is
671 * ready to continue deferred IO processing.
672 */
673static void wake_worker(struct pool *pool)
674{
675 queue_work(pool->wq, &pool->worker);
676}
677
678/*----------------------------------------------------------------*/
679
680/*
681 * Bio endio functions.
682 */
683struct endio_hook {
684 struct thin_c *tc;
685 bio_end_io_t *saved_bi_end_io;
686 struct deferred_entry *entry;
687};
688
689struct new_mapping {
690 struct list_head list;
691
692 int prepared;
693
694 struct thin_c *tc;
695 dm_block_t virt_block;
696 dm_block_t data_block;
697 struct cell *cell;
698 int err;
699
700 /*
701 * If the bio covers the whole area of a block then we can avoid
702 * zeroing or copying. Instead this bio is hooked. The bio will
703 * still be in the cell, so care has to be taken to avoid issuing
704 * the bio twice.
705 */
706 struct bio *bio;
707 bio_end_io_t *saved_bi_end_io;
708};
709
710static void __maybe_add_mapping(struct new_mapping *m)
711{
712 struct pool *pool = m->tc->pool;
713
714 if (list_empty(&m->list) && m->prepared) {
715 list_add(&m->list, &pool->prepared_mappings);
716 wake_worker(pool);
717 }
718}
719
720static void copy_complete(int read_err, unsigned long write_err, void *context)
721{
722 unsigned long flags;
723 struct new_mapping *m = context;
724 struct pool *pool = m->tc->pool;
725
726 m->err = read_err || write_err ? -EIO : 0;
727
728 spin_lock_irqsave(&pool->lock, flags);
729 m->prepared = 1;
730 __maybe_add_mapping(m);
731 spin_unlock_irqrestore(&pool->lock, flags);
732}
733
734static void overwrite_endio(struct bio *bio, int err)
735{
736 unsigned long flags;
737 struct new_mapping *m = dm_get_mapinfo(bio)->ptr;
738 struct pool *pool = m->tc->pool;
739
740 m->err = err;
741
742 spin_lock_irqsave(&pool->lock, flags);
743 m->prepared = 1;
744 __maybe_add_mapping(m);
745 spin_unlock_irqrestore(&pool->lock, flags);
746}
747
748static void shared_read_endio(struct bio *bio, int err)
749{
750 struct list_head mappings;
751 struct new_mapping *m, *tmp;
752 struct endio_hook *h = dm_get_mapinfo(bio)->ptr;
753 unsigned long flags;
754 struct pool *pool = h->tc->pool;
755
756 bio->bi_end_io = h->saved_bi_end_io;
757 bio_endio(bio, err);
758
759 INIT_LIST_HEAD(&mappings);
760 ds_dec(h->entry, &mappings);
761
762 spin_lock_irqsave(&pool->lock, flags);
763 list_for_each_entry_safe(m, tmp, &mappings, list) {
764 list_del(&m->list);
765 INIT_LIST_HEAD(&m->list);
766 __maybe_add_mapping(m);
767 }
768 spin_unlock_irqrestore(&pool->lock, flags);
769
770 mempool_free(h, pool->endio_hook_pool);
771}
772
773/*----------------------------------------------------------------*/
774
775/*
776 * Workqueue.
777 */
778
779/*
780 * Prepared mapping jobs.
781 */
782
783/*
784 * This sends the bios in the cell back to the deferred_bios list.
785 */
786static void cell_defer(struct thin_c *tc, struct cell *cell,
787 dm_block_t data_block)
788{
789 struct pool *pool = tc->pool;
790 unsigned long flags;
791
792 spin_lock_irqsave(&pool->lock, flags);
793 cell_release(cell, &pool->deferred_bios);
794 spin_unlock_irqrestore(&tc->pool->lock, flags);
795
796 wake_worker(pool);
797}
798
799/*
800 * Same as cell_defer above, except it omits one particular detainee,
801 * a write bio that covers the block and has already been processed.
802 */
803static void cell_defer_except(struct thin_c *tc, struct cell *cell,
804 struct bio *exception)
805{
806 struct bio_list bios;
807 struct bio *bio;
808 struct pool *pool = tc->pool;
809 unsigned long flags;
810
811 bio_list_init(&bios);
812 cell_release(cell, &bios);
813
814 spin_lock_irqsave(&pool->lock, flags);
815 while ((bio = bio_list_pop(&bios)))
816 if (bio != exception)
817 bio_list_add(&pool->deferred_bios, bio);
818 spin_unlock_irqrestore(&pool->lock, flags);
819
820 wake_worker(pool);
821}
822
823static void process_prepared_mapping(struct new_mapping *m)
824{
825 struct thin_c *tc = m->tc;
826 struct bio *bio;
827 int r;
828
829 bio = m->bio;
830 if (bio)
831 bio->bi_end_io = m->saved_bi_end_io;
832
833 if (m->err) {
834 cell_error(m->cell);
835 return;
836 }
837
838 /*
839 * Commit the prepared block into the mapping btree.
840 * Any I/O for this block arriving after this point will get
841 * remapped to it directly.
842 */
843 r = dm_thin_insert_block(tc->td, m->virt_block, m->data_block);
844 if (r) {
845 DMERR("dm_thin_insert_block() failed");
846 cell_error(m->cell);
847 return;
848 }
849
850 /*
851 * Release any bios held while the block was being provisioned.
852 * If we are processing a write bio that completely covers the block,
853 * we have already processed it, so we can ignore it now when processing
854 * the bios in the cell.
855 */
856 if (bio) {
857 cell_defer_except(tc, m->cell, bio);
858 bio_endio(bio, 0);
859 } else
860 cell_defer(tc, m->cell, m->data_block);
861
862 list_del(&m->list);
863 mempool_free(m, tc->pool->mapping_pool);
864}
865
866static void process_prepared_mappings(struct pool *pool)
867{
868 unsigned long flags;
869 struct list_head maps;
870 struct new_mapping *m, *tmp;
871
872 INIT_LIST_HEAD(&maps);
873 spin_lock_irqsave(&pool->lock, flags);
874 list_splice_init(&pool->prepared_mappings, &maps);
875 spin_unlock_irqrestore(&pool->lock, flags);
876
877 list_for_each_entry_safe(m, tmp, &maps, list)
878 process_prepared_mapping(m);
879}
880
881/*
882 * Deferred bio jobs.
883 */
884static int io_overwrites_block(struct pool *pool, struct bio *bio)
885{
886 return ((bio_data_dir(bio) == WRITE) &&
887 !(bio->bi_sector & pool->offset_mask)) &&
888 (bio->bi_size == (pool->sectors_per_block << SECTOR_SHIFT));
889}
890
891static void save_and_set_endio(struct bio *bio, bio_end_io_t **save,
892 bio_end_io_t *fn)
893{
894 *save = bio->bi_end_io;
895 bio->bi_end_io = fn;
896}
897
898static int ensure_next_mapping(struct pool *pool)
899{
900 if (pool->next_mapping)
901 return 0;
902
903 pool->next_mapping = mempool_alloc(pool->mapping_pool, GFP_ATOMIC);
904
905 return pool->next_mapping ? 0 : -ENOMEM;
906}
907
908static struct new_mapping *get_next_mapping(struct pool *pool)
909{
910 struct new_mapping *r = pool->next_mapping;
911
912 BUG_ON(!pool->next_mapping);
913
914 pool->next_mapping = NULL;
915
916 return r;
917}
918
919static void schedule_copy(struct thin_c *tc, dm_block_t virt_block,
920 dm_block_t data_origin, dm_block_t data_dest,
921 struct cell *cell, struct bio *bio)
922{
923 int r;
924 struct pool *pool = tc->pool;
925 struct new_mapping *m = get_next_mapping(pool);
926
927 INIT_LIST_HEAD(&m->list);
928 m->prepared = 0;
929 m->tc = tc;
930 m->virt_block = virt_block;
931 m->data_block = data_dest;
932 m->cell = cell;
933 m->err = 0;
934 m->bio = NULL;
935
936 ds_add_work(&pool->ds, &m->list);
937
938 /*
939 * IO to pool_dev remaps to the pool target's data_dev.
940 *
941 * If the whole block of data is being overwritten, we can issue the
942 * bio immediately. Otherwise we use kcopyd to clone the data first.
943 */
944 if (io_overwrites_block(pool, bio)) {
945 m->bio = bio;
946 save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
947 dm_get_mapinfo(bio)->ptr = m;
948 remap_and_issue(tc, bio, data_dest);
949 } else {
950 struct dm_io_region from, to;
951
952 from.bdev = tc->pool_dev->bdev;
953 from.sector = data_origin * pool->sectors_per_block;
954 from.count = pool->sectors_per_block;
955
956 to.bdev = tc->pool_dev->bdev;
957 to.sector = data_dest * pool->sectors_per_block;
958 to.count = pool->sectors_per_block;
959
960 r = dm_kcopyd_copy(pool->copier, &from, 1, &to,
961 0, copy_complete, m);
962 if (r < 0) {
963 mempool_free(m, pool->mapping_pool);
964 DMERR("dm_kcopyd_copy() failed");
965 cell_error(cell);
966 }
967 }
968}
969
970static void schedule_zero(struct thin_c *tc, dm_block_t virt_block,
971 dm_block_t data_block, struct cell *cell,
972 struct bio *bio)
973{
974 struct pool *pool = tc->pool;
975 struct new_mapping *m = get_next_mapping(pool);
976
977 INIT_LIST_HEAD(&m->list);
978 m->prepared = 0;
979 m->tc = tc;
980 m->virt_block = virt_block;
981 m->data_block = data_block;
982 m->cell = cell;
983 m->err = 0;
984 m->bio = NULL;
985
986 /*
987 * If we're not zeroing pre-existing data, the mapping can be inserted
988 * straight away. If the whole block is being overwritten, we can issue
989 * the bio immediately. Otherwise we use kcopyd to zero the block first.
990 */
991 if (!pool->zero_new_blocks)
992 process_prepared_mapping(m);
993
994 else if (io_overwrites_block(pool, bio)) {
995 m->bio = bio;
996 save_and_set_endio(bio, &m->saved_bi_end_io, overwrite_endio);
997 dm_get_mapinfo(bio)->ptr = m;
998 remap_and_issue(tc, bio, data_block);
999
1000 } else {
1001 int r;
1002 struct dm_io_region to;
1003
1004 to.bdev = tc->pool_dev->bdev;
1005 to.sector = data_block * pool->sectors_per_block;
1006 to.count = pool->sectors_per_block;
1007
1008 r = dm_kcopyd_zero(pool->copier, 1, &to, 0, copy_complete, m);
1009 if (r < 0) {
1010 mempool_free(m, pool->mapping_pool);
1011 DMERR("dm_kcopyd_zero() failed");
1012 cell_error(cell);
1013 }
1014 }
1015}
1016
1017static int alloc_data_block(struct thin_c *tc, dm_block_t *result)
1018{
1019 int r;
1020 dm_block_t free_blocks;
1021 unsigned long flags;
1022 struct pool *pool = tc->pool;
1023
1024 r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
1025 if (r)
1026 return r;
1027
1028 if (free_blocks <= pool->low_water_blocks && !pool->low_water_triggered) {
1029 DMWARN("%s: reached low water mark, sending event.",
1030 dm_device_name(pool->pool_md));
1031 spin_lock_irqsave(&pool->lock, flags);
1032 pool->low_water_triggered = 1;
1033 spin_unlock_irqrestore(&pool->lock, flags);
1034 dm_table_event(pool->ti->table);
1035 }
1036
1037 if (!free_blocks) {
1038 if (pool->no_free_space)
1039 return -ENOSPC;
1040 else {
1041 /*
1042 * Try to commit to see if that will free up some
1043 * more space.
1044 */
1045 r = dm_pool_commit_metadata(pool->pmd);
1046 if (r) {
1047 DMERR("%s: dm_pool_commit_metadata() failed, error = %d",
1048 __func__, r);
1049 return r;
1050 }
1051
1052 r = dm_pool_get_free_block_count(pool->pmd, &free_blocks);
1053 if (r)
1054 return r;
1055
1056 /*
1057 * If we still have no space we set a flag to avoid
1058 * doing all this checking and return -ENOSPC.
1059 */
1060 if (!free_blocks) {
1061 DMWARN("%s: no free space available.",
1062 dm_device_name(pool->pool_md));
1063 spin_lock_irqsave(&pool->lock, flags);
1064 pool->no_free_space = 1;
1065 spin_unlock_irqrestore(&pool->lock, flags);
1066 return -ENOSPC;
1067 }
1068 }
1069 }
1070
1071 r = dm_pool_alloc_data_block(pool->pmd, result);
1072 if (r)
1073 return r;
1074
1075 return 0;
1076}
1077
1078/*
1079 * If we have run out of space, queue bios until the device is
1080 * resumed, presumably after having been reloaded with more space.
1081 */
1082static void retry_on_resume(struct bio *bio)
1083{
1084 struct thin_c *tc = dm_get_mapinfo(bio)->ptr;
1085 struct pool *pool = tc->pool;
1086 unsigned long flags;
1087
1088 spin_lock_irqsave(&pool->lock, flags);
1089 bio_list_add(&pool->retry_on_resume_list, bio);
1090 spin_unlock_irqrestore(&pool->lock, flags);
1091}
1092
1093static void no_space(struct cell *cell)
1094{
1095 struct bio *bio;
1096 struct bio_list bios;
1097
1098 bio_list_init(&bios);
1099 cell_release(cell, &bios);
1100
1101 while ((bio = bio_list_pop(&bios)))
1102 retry_on_resume(bio);
1103}
1104
1105static void break_sharing(struct thin_c *tc, struct bio *bio, dm_block_t block,
1106 struct cell_key *key,
1107 struct dm_thin_lookup_result *lookup_result,
1108 struct cell *cell)
1109{
1110 int r;
1111 dm_block_t data_block;
1112
1113 r = alloc_data_block(tc, &data_block);
1114 switch (r) {
1115 case 0:
1116 schedule_copy(tc, block, lookup_result->block,
1117 data_block, cell, bio);
1118 break;
1119
1120 case -ENOSPC:
1121 no_space(cell);
1122 break;
1123
1124 default:
1125 DMERR("%s: alloc_data_block() failed, error = %d", __func__, r);
1126 cell_error(cell);
1127 break;
1128 }
1129}
1130
1131static void process_shared_bio(struct thin_c *tc, struct bio *bio,
1132 dm_block_t block,
1133 struct dm_thin_lookup_result *lookup_result)
1134{
1135 struct cell *cell;
1136 struct pool *pool = tc->pool;
1137 struct cell_key key;
1138
1139 /*
1140 * If cell is already occupied, then sharing is already in the process
1141 * of being broken so we have nothing further to do here.
1142 */
1143 build_data_key(tc->td, lookup_result->block, &key);
1144 if (bio_detain(pool->prison, &key, bio, &cell))
1145 return;
1146
1147 if (bio_data_dir(bio) == WRITE)
1148 break_sharing(tc, bio, block, &key, lookup_result, cell);
1149 else {
1150 struct endio_hook *h;
1151 h = mempool_alloc(pool->endio_hook_pool, GFP_NOIO);
1152
1153 h->tc = tc;
1154 h->entry = ds_inc(&pool->ds);
1155 save_and_set_endio(bio, &h->saved_bi_end_io, shared_read_endio);
1156 dm_get_mapinfo(bio)->ptr = h;
1157
1158 cell_release_singleton(cell, bio);
1159 remap_and_issue(tc, bio, lookup_result->block);
1160 }
1161}
1162
1163static void provision_block(struct thin_c *tc, struct bio *bio, dm_block_t block,
1164 struct cell *cell)
1165{
1166 int r;
1167 dm_block_t data_block;
1168
1169 /*
1170 * Remap empty bios (flushes) immediately, without provisioning.
1171 */
1172 if (!bio->bi_size) {
1173 cell_release_singleton(cell, bio);
1174 remap_and_issue(tc, bio, 0);
1175 return;
1176 }
1177
1178 /*
1179 * Fill read bios with zeroes and complete them immediately.
1180 */
1181 if (bio_data_dir(bio) == READ) {
1182 zero_fill_bio(bio);
1183 cell_release_singleton(cell, bio);
1184 bio_endio(bio, 0);
1185 return;
1186 }
1187
1188 r = alloc_data_block(tc, &data_block);
1189 switch (r) {
1190 case 0:
1191 schedule_zero(tc, block, data_block, cell, bio);
1192 break;
1193
1194 case -ENOSPC:
1195 no_space(cell);
1196 break;
1197
1198 default:
1199 DMERR("%s: alloc_data_block() failed, error = %d", __func__, r);
1200 cell_error(cell);
1201 break;
1202 }
1203}
1204
1205static void process_bio(struct thin_c *tc, struct bio *bio)
1206{
1207 int r;
1208 dm_block_t block = get_bio_block(tc, bio);
1209 struct cell *cell;
1210 struct cell_key key;
1211 struct dm_thin_lookup_result lookup_result;
1212
1213 /*
1214 * If cell is already occupied, then the block is already
1215 * being provisioned so we have nothing further to do here.
1216 */
1217 build_virtual_key(tc->td, block, &key);
1218 if (bio_detain(tc->pool->prison, &key, bio, &cell))
1219 return;
1220
1221 r = dm_thin_find_block(tc->td, block, 1, &lookup_result);
1222 switch (r) {
1223 case 0:
1224 /*
1225 * We can release this cell now. This thread is the only
1226 * one that puts bios into a cell, and we know there were
1227 * no preceding bios.
1228 */
1229 /*
1230 * TODO: this will probably have to change when discard goes
1231 * back in.
1232 */
1233 cell_release_singleton(cell, bio);
1234
1235 if (lookup_result.shared)
1236 process_shared_bio(tc, bio, block, &lookup_result);
1237 else
1238 remap_and_issue(tc, bio, lookup_result.block);
1239 break;
1240
1241 case -ENODATA:
1242 provision_block(tc, bio, block, cell);
1243 break;
1244
1245 default:
1246 DMERR("dm_thin_find_block() failed, error = %d", r);
1247 bio_io_error(bio);
1248 break;
1249 }
1250}
1251
1252static void process_deferred_bios(struct pool *pool)
1253{
1254 unsigned long flags;
1255 struct bio *bio;
1256 struct bio_list bios;
1257 int r;
1258
1259 bio_list_init(&bios);
1260
1261 spin_lock_irqsave(&pool->lock, flags);
1262 bio_list_merge(&bios, &pool->deferred_bios);
1263 bio_list_init(&pool->deferred_bios);
1264 spin_unlock_irqrestore(&pool->lock, flags);
1265
1266 while ((bio = bio_list_pop(&bios))) {
1267 struct thin_c *tc = dm_get_mapinfo(bio)->ptr;
1268 /*
1269 * If we've got no free new_mapping structs, and processing
1270 * this bio might require one, we pause until there are some
1271 * prepared mappings to process.
1272 */
1273 if (ensure_next_mapping(pool)) {
1274 spin_lock_irqsave(&pool->lock, flags);
1275 bio_list_merge(&pool->deferred_bios, &bios);
1276 spin_unlock_irqrestore(&pool->lock, flags);
1277
1278 break;
1279 }
1280 process_bio(tc, bio);
1281 }
1282
1283 /*
1284 * If there are any deferred flush bios, we must commit
1285 * the metadata before issuing them.
1286 */
1287 bio_list_init(&bios);
1288 spin_lock_irqsave(&pool->lock, flags);
1289 bio_list_merge(&bios, &pool->deferred_flush_bios);
1290 bio_list_init(&pool->deferred_flush_bios);
1291 spin_unlock_irqrestore(&pool->lock, flags);
1292
1293 if (bio_list_empty(&bios))
1294 return;
1295
1296 r = dm_pool_commit_metadata(pool->pmd);
1297 if (r) {
1298 DMERR("%s: dm_pool_commit_metadata() failed, error = %d",
1299 __func__, r);
1300 while ((bio = bio_list_pop(&bios)))
1301 bio_io_error(bio);
1302 return;
1303 }
1304
1305 while ((bio = bio_list_pop(&bios)))
1306 generic_make_request(bio);
1307}
1308
1309static void do_worker(struct work_struct *ws)
1310{
1311 struct pool *pool = container_of(ws, struct pool, worker);
1312
1313 process_prepared_mappings(pool);
1314 process_deferred_bios(pool);
1315}
1316
1317/*----------------------------------------------------------------*/
1318
1319/*
1320 * Mapping functions.
1321 */
1322
1323/*
1324 * Called only while mapping a thin bio to hand it over to the workqueue.
1325 */
1326static void thin_defer_bio(struct thin_c *tc, struct bio *bio)
1327{
1328 unsigned long flags;
1329 struct pool *pool = tc->pool;
1330
1331 spin_lock_irqsave(&pool->lock, flags);
1332 bio_list_add(&pool->deferred_bios, bio);
1333 spin_unlock_irqrestore(&pool->lock, flags);
1334
1335 wake_worker(pool);
1336}
1337
1338/*
1339 * Non-blocking function called from the thin target's map function.
1340 */
1341static int thin_bio_map(struct dm_target *ti, struct bio *bio,
1342 union map_info *map_context)
1343{
1344 int r;
1345 struct thin_c *tc = ti->private;
1346 dm_block_t block = get_bio_block(tc, bio);
1347 struct dm_thin_device *td = tc->td;
1348 struct dm_thin_lookup_result result;
1349
1350 /*
1351 * Save the thin context for easy access from the deferred bio later.
1352 */
1353 map_context->ptr = tc;
1354
1355 if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
1356 thin_defer_bio(tc, bio);
1357 return DM_MAPIO_SUBMITTED;
1358 }
1359
1360 r = dm_thin_find_block(td, block, 0, &result);
1361
1362 /*
1363 * Note that we defer readahead too.
1364 */
1365 switch (r) {
1366 case 0:
1367 if (unlikely(result.shared)) {
1368 /*
1369 * We have a race condition here between the
1370 * result.shared value returned by the lookup and
1371 * snapshot creation, which may cause new
1372 * sharing.
1373 *
1374 * To avoid this, always quiesce the origin before
1375 * taking the snap. You want to do this anyway to
1376 * ensure a consistent application view
1377 * (i.e. lockfs).
1378 *
1379 * More distant ancestors are irrelevant. The
1380 * shared flag will be set in their case.
1381 */
1382 thin_defer_bio(tc, bio);
1383 r = DM_MAPIO_SUBMITTED;
1384 } else {
1385 remap(tc, bio, result.block);
1386 r = DM_MAPIO_REMAPPED;
1387 }
1388 break;
1389
1390 case -ENODATA:
1391 /*
1392 * In future, the failed dm_thin_find_block above could
1393 * provide the hint to load the metadata into cache.
1394 */
1395 case -EWOULDBLOCK:
1396 thin_defer_bio(tc, bio);
1397 r = DM_MAPIO_SUBMITTED;
1398 break;
1399 }
1400
1401 return r;
1402}
1403
1404static int pool_is_congested(struct dm_target_callbacks *cb, int bdi_bits)
1405{
1406 int r;
1407 unsigned long flags;
1408 struct pool_c *pt = container_of(cb, struct pool_c, callbacks);
1409
1410 spin_lock_irqsave(&pt->pool->lock, flags);
1411 r = !bio_list_empty(&pt->pool->retry_on_resume_list);
1412 spin_unlock_irqrestore(&pt->pool->lock, flags);
1413
1414 if (!r) {
1415 struct request_queue *q = bdev_get_queue(pt->data_dev->bdev);
1416 r = bdi_congested(&q->backing_dev_info, bdi_bits);
1417 }
1418
1419 return r;
1420}
1421
1422static void __requeue_bios(struct pool *pool)
1423{
1424 bio_list_merge(&pool->deferred_bios, &pool->retry_on_resume_list);
1425 bio_list_init(&pool->retry_on_resume_list);
1426}
1427
1428/*----------------------------------------------------------------
1429 * Binding of control targets to a pool object
1430 *--------------------------------------------------------------*/
1431static int bind_control_target(struct pool *pool, struct dm_target *ti)
1432{
1433 struct pool_c *pt = ti->private;
1434
1435 pool->ti = ti;
1436 pool->low_water_blocks = pt->low_water_blocks;
1437 pool->zero_new_blocks = pt->zero_new_blocks;
1438
1439 return 0;
1440}
1441
1442static void unbind_control_target(struct pool *pool, struct dm_target *ti)
1443{
1444 if (pool->ti == ti)
1445 pool->ti = NULL;
1446}
1447
1448/*----------------------------------------------------------------
1449 * Pool creation
1450 *--------------------------------------------------------------*/
1451static void __pool_destroy(struct pool *pool)
1452{
1453 __pool_table_remove(pool);
1454
1455 if (dm_pool_metadata_close(pool->pmd) < 0)
1456 DMWARN("%s: dm_pool_metadata_close() failed.", __func__);
1457
1458 prison_destroy(pool->prison);
1459 dm_kcopyd_client_destroy(pool->copier);
1460
1461 if (pool->wq)
1462 destroy_workqueue(pool->wq);
1463
1464 if (pool->next_mapping)
1465 mempool_free(pool->next_mapping, pool->mapping_pool);
1466 mempool_destroy(pool->mapping_pool);
1467 mempool_destroy(pool->endio_hook_pool);
1468 kfree(pool);
1469}
1470
1471static struct pool *pool_create(struct mapped_device *pool_md,
1472 struct block_device *metadata_dev,
1473 unsigned long block_size, char **error)
1474{
1475 int r;
1476 void *err_p;
1477 struct pool *pool;
1478 struct dm_pool_metadata *pmd;
1479
1480 pmd = dm_pool_metadata_open(metadata_dev, block_size);
1481 if (IS_ERR(pmd)) {
1482 *error = "Error creating metadata object";
1483 return (struct pool *)pmd;
1484 }
1485
1486 pool = kmalloc(sizeof(*pool), GFP_KERNEL);
1487 if (!pool) {
1488 *error = "Error allocating memory for pool";
1489 err_p = ERR_PTR(-ENOMEM);
1490 goto bad_pool;
1491 }
1492
1493 pool->pmd = pmd;
1494 pool->sectors_per_block = block_size;
1495 pool->block_shift = ffs(block_size) - 1;
1496 pool->offset_mask = block_size - 1;
1497 pool->low_water_blocks = 0;
1498 pool->zero_new_blocks = 1;
1499 pool->prison = prison_create(PRISON_CELLS);
1500 if (!pool->prison) {
1501 *error = "Error creating pool's bio prison";
1502 err_p = ERR_PTR(-ENOMEM);
1503 goto bad_prison;
1504 }
1505
1506 pool->copier = dm_kcopyd_client_create();
1507 if (IS_ERR(pool->copier)) {
1508 r = PTR_ERR(pool->copier);
1509 *error = "Error creating pool's kcopyd client";
1510 err_p = ERR_PTR(r);
1511 goto bad_kcopyd_client;
1512 }
1513
1514 /*
1515 * Create a single-threaded workqueue that will service all devices
1516 * that use this metadata.
1517 */
1518 pool->wq = alloc_ordered_workqueue("dm-" DM_MSG_PREFIX, WQ_MEM_RECLAIM);
1519 if (!pool->wq) {
1520 *error = "Error creating pool's workqueue";
1521 err_p = ERR_PTR(-ENOMEM);
1522 goto bad_wq;
1523 }
1524
1525 INIT_WORK(&pool->worker, do_worker);
1526 spin_lock_init(&pool->lock);
1527 bio_list_init(&pool->deferred_bios);
1528 bio_list_init(&pool->deferred_flush_bios);
1529 INIT_LIST_HEAD(&pool->prepared_mappings);
1530 pool->low_water_triggered = 0;
1531 pool->no_free_space = 0;
1532 bio_list_init(&pool->retry_on_resume_list);
1533 ds_init(&pool->ds);
1534
1535 pool->next_mapping = NULL;
1536 pool->mapping_pool =
1537 mempool_create_kmalloc_pool(MAPPING_POOL_SIZE, sizeof(struct new_mapping));
1538 if (!pool->mapping_pool) {
1539 *error = "Error creating pool's mapping mempool";
1540 err_p = ERR_PTR(-ENOMEM);
1541 goto bad_mapping_pool;
1542 }
1543
1544 pool->endio_hook_pool =
1545 mempool_create_kmalloc_pool(ENDIO_HOOK_POOL_SIZE, sizeof(struct endio_hook));
1546 if (!pool->endio_hook_pool) {
1547 *error = "Error creating pool's endio_hook mempool";
1548 err_p = ERR_PTR(-ENOMEM);
1549 goto bad_endio_hook_pool;
1550 }
1551 pool->ref_count = 1;
1552 pool->pool_md = pool_md;
1553 pool->md_dev = metadata_dev;
1554 __pool_table_insert(pool);
1555
1556 return pool;
1557
1558bad_endio_hook_pool:
1559 mempool_destroy(pool->mapping_pool);
1560bad_mapping_pool:
1561 destroy_workqueue(pool->wq);
1562bad_wq:
1563 dm_kcopyd_client_destroy(pool->copier);
1564bad_kcopyd_client:
1565 prison_destroy(pool->prison);
1566bad_prison:
1567 kfree(pool);
1568bad_pool:
1569 if (dm_pool_metadata_close(pmd))
1570 DMWARN("%s: dm_pool_metadata_close() failed.", __func__);
1571
1572 return err_p;
1573}
1574
1575static void __pool_inc(struct pool *pool)
1576{
1577 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
1578 pool->ref_count++;
1579}
1580
1581static void __pool_dec(struct pool *pool)
1582{
1583 BUG_ON(!mutex_is_locked(&dm_thin_pool_table.mutex));
1584 BUG_ON(!pool->ref_count);
1585 if (!--pool->ref_count)
1586 __pool_destroy(pool);
1587}
1588
1589static struct pool *__pool_find(struct mapped_device *pool_md,
1590 struct block_device *metadata_dev,
1591 unsigned long block_size, char **error)
1592{
1593 struct pool *pool = __pool_table_lookup_metadata_dev(metadata_dev);
1594
1595 if (pool) {
1596 if (pool->pool_md != pool_md)
1597 return ERR_PTR(-EBUSY);
1598 __pool_inc(pool);
1599
1600 } else {
1601 pool = __pool_table_lookup(pool_md);
1602 if (pool) {
1603 if (pool->md_dev != metadata_dev)
1604 return ERR_PTR(-EINVAL);
1605 __pool_inc(pool);
1606
1607 } else
1608 pool = pool_create(pool_md, metadata_dev, block_size, error);
1609 }
1610
1611 return pool;
1612}
1613
1614/*----------------------------------------------------------------
1615 * Pool target methods
1616 *--------------------------------------------------------------*/
1617static void pool_dtr(struct dm_target *ti)
1618{
1619 struct pool_c *pt = ti->private;
1620
1621 mutex_lock(&dm_thin_pool_table.mutex);
1622
1623 unbind_control_target(pt->pool, ti);
1624 __pool_dec(pt->pool);
1625 dm_put_device(ti, pt->metadata_dev);
1626 dm_put_device(ti, pt->data_dev);
1627 kfree(pt);
1628
1629 mutex_unlock(&dm_thin_pool_table.mutex);
1630}
1631
1632struct pool_features {
1633 unsigned zero_new_blocks:1;
1634};
1635
1636static int parse_pool_features(struct dm_arg_set *as, struct pool_features *pf,
1637 struct dm_target *ti)
1638{
1639 int r;
1640 unsigned argc;
1641 const char *arg_name;
1642
1643 static struct dm_arg _args[] = {
1644 {0, 1, "Invalid number of pool feature arguments"},
1645 };
1646
1647 /*
1648 * No feature arguments supplied.
1649 */
1650 if (!as->argc)
1651 return 0;
1652
1653 r = dm_read_arg_group(_args, as, &argc, &ti->error);
1654 if (r)
1655 return -EINVAL;
1656
1657 while (argc && !r) {
1658 arg_name = dm_shift_arg(as);
1659 argc--;
1660
1661 if (!strcasecmp(arg_name, "skip_block_zeroing")) {
1662 pf->zero_new_blocks = 0;
1663 continue;
1664 }
1665
1666 ti->error = "Unrecognised pool feature requested";
1667 r = -EINVAL;
1668 }
1669
1670 return r;
1671}
1672
1673/*
1674 * thin-pool <metadata dev> <data dev>
1675 * <data block size (sectors)>
1676 * <low water mark (blocks)>
1677 * [<#feature args> [<arg>]*]
1678 *
1679 * Optional feature arguments are:
1680 * skip_block_zeroing: skips the zeroing of newly-provisioned blocks.
1681 */
1682static int pool_ctr(struct dm_target *ti, unsigned argc, char **argv)
1683{
1684 int r;
1685 struct pool_c *pt;
1686 struct pool *pool;
1687 struct pool_features pf;
1688 struct dm_arg_set as;
1689 struct dm_dev *data_dev;
1690 unsigned long block_size;
1691 dm_block_t low_water_blocks;
1692 struct dm_dev *metadata_dev;
1693 sector_t metadata_dev_size;
1694
1695 /*
1696 * FIXME Remove validation from scope of lock.
1697 */
1698 mutex_lock(&dm_thin_pool_table.mutex);
1699
1700 if (argc < 4) {
1701 ti->error = "Invalid argument count";
1702 r = -EINVAL;
1703 goto out_unlock;
1704 }
1705 as.argc = argc;
1706 as.argv = argv;
1707
1708 r = dm_get_device(ti, argv[0], FMODE_READ | FMODE_WRITE, &metadata_dev);
1709 if (r) {
1710 ti->error = "Error opening metadata block device";
1711 goto out_unlock;
1712 }
1713
1714 metadata_dev_size = i_size_read(metadata_dev->bdev->bd_inode) >> SECTOR_SHIFT;
1715 if (metadata_dev_size > METADATA_DEV_MAX_SECTORS) {
1716 ti->error = "Metadata device is too large";
1717 r = -EINVAL;
1718 goto out_metadata;
1719 }
1720
1721 r = dm_get_device(ti, argv[1], FMODE_READ | FMODE_WRITE, &data_dev);
1722 if (r) {
1723 ti->error = "Error getting data device";
1724 goto out_metadata;
1725 }
1726
1727 if (kstrtoul(argv[2], 10, &block_size) || !block_size ||
1728 block_size < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
1729 block_size > DATA_DEV_BLOCK_SIZE_MAX_SECTORS ||
1730 !is_power_of_2(block_size)) {
1731 ti->error = "Invalid block size";
1732 r = -EINVAL;
1733 goto out;
1734 }
1735
1736 if (kstrtoull(argv[3], 10, (unsigned long long *)&low_water_blocks)) {
1737 ti->error = "Invalid low water mark";
1738 r = -EINVAL;
1739 goto out;
1740 }
1741
1742 /*
1743 * Set default pool features.
1744 */
1745 memset(&pf, 0, sizeof(pf));
1746 pf.zero_new_blocks = 1;
1747
1748 dm_consume_args(&as, 4);
1749 r = parse_pool_features(&as, &pf, ti);
1750 if (r)
1751 goto out;
1752
1753 pt = kzalloc(sizeof(*pt), GFP_KERNEL);
1754 if (!pt) {
1755 r = -ENOMEM;
1756 goto out;
1757 }
1758
1759 pool = __pool_find(dm_table_get_md(ti->table), metadata_dev->bdev,
1760 block_size, &ti->error);
1761 if (IS_ERR(pool)) {
1762 r = PTR_ERR(pool);
1763 goto out_free_pt;
1764 }
1765
1766 pt->pool = pool;
1767 pt->ti = ti;
1768 pt->metadata_dev = metadata_dev;
1769 pt->data_dev = data_dev;
1770 pt->low_water_blocks = low_water_blocks;
1771 pt->zero_new_blocks = pf.zero_new_blocks;
1772 ti->num_flush_requests = 1;
1773 ti->num_discard_requests = 0;
1774 ti->private = pt;
1775
1776 pt->callbacks.congested_fn = pool_is_congested;
1777 dm_table_add_target_callbacks(ti->table, &pt->callbacks);
1778
1779 mutex_unlock(&dm_thin_pool_table.mutex);
1780
1781 return 0;
1782
1783out_free_pt:
1784 kfree(pt);
1785out:
1786 dm_put_device(ti, data_dev);
1787out_metadata:
1788 dm_put_device(ti, metadata_dev);
1789out_unlock:
1790 mutex_unlock(&dm_thin_pool_table.mutex);
1791
1792 return r;
1793}
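As an illustrative example (not part of this patch), a pool table line matching the constructor arguments documented above might look like the following, with hypothetical device names, a 64KiB (128-sector) data block size, a low water mark of 16384 blocks and one feature argument:

	0 20971520 thin-pool /dev/mapper/pool_meta /dev/mapper/pool_data 128 16384 1 skip_block_zeroing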
1794
1795static int pool_map(struct dm_target *ti, struct bio *bio,
1796 union map_info *map_context)
1797{
1798 int r;
1799 struct pool_c *pt = ti->private;
1800 struct pool *pool = pt->pool;
1801 unsigned long flags;
1802
1803 /*
1804 * As this is a singleton target, ti->begin is always zero.
1805 */
1806 spin_lock_irqsave(&pool->lock, flags);
1807 bio->bi_bdev = pt->data_dev->bdev;
1808 r = DM_MAPIO_REMAPPED;
1809 spin_unlock_irqrestore(&pool->lock, flags);
1810
1811 return r;
1812}
1813
1814/*
1815 * Retrieves the number of blocks of the data device from
1816 * the superblock and compares it to the actual device size,
1817 * thus resizing the data device in case it has grown.
1818 *
1819 * This both copes with opening preallocated data devices in the ctr
1820 * being followed by a resume
1821 * -and-
1822 * calling the resume method individually after userspace has
1823 * grown the data device in reaction to a table event.
1824 */
1825static int pool_preresume(struct dm_target *ti)
1826{
1827 int r;
1828 struct pool_c *pt = ti->private;
1829 struct pool *pool = pt->pool;
1830 dm_block_t data_size, sb_data_size;
1831
1832 /*
1833 * Take control of the pool object.
1834 */
1835 r = bind_control_target(pool, ti);
1836 if (r)
1837 return r;
1838
1839 data_size = ti->len >> pool->block_shift;
1840 r = dm_pool_get_data_dev_size(pool->pmd, &sb_data_size);
1841 if (r) {
1842 DMERR("failed to retrieve data device size");
1843 return r;
1844 }
1845
1846 if (data_size < sb_data_size) {
1847 DMERR("pool target too small, is %llu blocks (expected %llu)",
1848 data_size, sb_data_size);
1849 return -EINVAL;
1850
1851 } else if (data_size > sb_data_size) {
1852 r = dm_pool_resize_data_dev(pool->pmd, data_size);
1853 if (r) {
1854 DMERR("failed to resize data device");
1855 return r;
1856 }
1857
1858 r = dm_pool_commit_metadata(pool->pmd);
1859 if (r) {
1860 DMERR("%s: dm_pool_commit_metadata() failed, error = %d",
1861 __func__, r);
1862 return r;
1863 }
1864 }
1865
1866 return 0;
1867}
1868
1869static void pool_resume(struct dm_target *ti)
1870{
1871 struct pool_c *pt = ti->private;
1872 struct pool *pool = pt->pool;
1873 unsigned long flags;
1874
1875 spin_lock_irqsave(&pool->lock, flags);
1876 pool->low_water_triggered = 0;
1877 pool->no_free_space = 0;
1878 __requeue_bios(pool);
1879 spin_unlock_irqrestore(&pool->lock, flags);
1880
1881 wake_worker(pool);
1882}
1883
1884static void pool_postsuspend(struct dm_target *ti)
1885{
1886 int r;
1887 struct pool_c *pt = ti->private;
1888 struct pool *pool = pt->pool;
1889
1890 flush_workqueue(pool->wq);
1891
1892 r = dm_pool_commit_metadata(pool->pmd);
1893 if (r < 0) {
1894 DMERR("%s: dm_pool_commit_metadata() failed, error = %d",
1895 __func__, r);
1896 /* FIXME: invalidate device? error the next FUA or FLUSH bio ?*/
1897 }
1898}
1899
1900static int check_arg_count(unsigned argc, unsigned args_required)
1901{
1902 if (argc != args_required) {
1903 DMWARN("Message received with %u arguments instead of %u.",
1904 argc, args_required);
1905 return -EINVAL;
1906 }
1907
1908 return 0;
1909}
1910
1911static int read_dev_id(char *arg, dm_thin_id *dev_id, int warning)
1912{
1913 if (!kstrtoull(arg, 10, (unsigned long long *)dev_id) &&
1914 *dev_id <= MAX_DEV_ID)
1915 return 0;
1916
1917 if (warning)
1918 DMWARN("Message received with invalid device id: %s", arg);
1919
1920 return -EINVAL;
1921}
1922
1923static int process_create_thin_mesg(unsigned argc, char **argv, struct pool *pool)
1924{
1925 dm_thin_id dev_id;
1926 int r;
1927
1928 r = check_arg_count(argc, 2);
1929 if (r)
1930 return r;
1931
1932 r = read_dev_id(argv[1], &dev_id, 1);
1933 if (r)
1934 return r;
1935
1936 r = dm_pool_create_thin(pool->pmd, dev_id);
1937 if (r) {
1938 DMWARN("Creation of new thinly-provisioned device with id %s failed.",
1939 argv[1]);
1940 return r;
1941 }
1942
1943 return 0;
1944}
1945
1946static int process_create_snap_mesg(unsigned argc, char **argv, struct pool *pool)
1947{
1948 dm_thin_id dev_id;
1949 dm_thin_id origin_dev_id;
1950 int r;
1951
1952 r = check_arg_count(argc, 3);
1953 if (r)
1954 return r;
1955
1956 r = read_dev_id(argv[1], &dev_id, 1);
1957 if (r)
1958 return r;
1959
1960 r = read_dev_id(argv[2], &origin_dev_id, 1);
1961 if (r)
1962 return r;
1963
1964 r = dm_pool_create_snap(pool->pmd, dev_id, origin_dev_id);
1965 if (r) {
1966 DMWARN("Creation of new snapshot %s of device %s failed.",
1967 argv[1], argv[2]);
1968 return r;
1969 }
1970
1971 return 0;
1972}
1973
1974static int process_delete_mesg(unsigned argc, char **argv, struct pool *pool)
1975{
1976 dm_thin_id dev_id;
1977 int r;
1978
1979 r = check_arg_count(argc, 2);
1980 if (r)
1981 return r;
1982
1983 r = read_dev_id(argv[1], &dev_id, 1);
1984 if (r)
1985 return r;
1986
1987 r = dm_pool_delete_thin_device(pool->pmd, dev_id);
1988 if (r)
1989 DMWARN("Deletion of thin device %s failed.", argv[1]);
1990
1991 return r;
1992}
1993
1994static int process_set_transaction_id_mesg(unsigned argc, char **argv, struct pool *pool)
1995{
1996 dm_thin_id old_id, new_id;
1997 int r;
1998
1999 r = check_arg_count(argc, 3);
2000 if (r)
2001 return r;
2002
2003 if (kstrtoull(argv[1], 10, (unsigned long long *)&old_id)) {
2004 DMWARN("set_transaction_id message: Unrecognised id %s.", argv[1]);
2005 return -EINVAL;
2006 }
2007
2008 if (kstrtoull(argv[2], 10, (unsigned long long *)&new_id)) {
2009 DMWARN("set_transaction_id message: Unrecognised new id %s.", argv[2]);
2010 return -EINVAL;
2011 }
2012
2013 r = dm_pool_set_metadata_transaction_id(pool->pmd, old_id, new_id);
2014 if (r) {
2015 DMWARN("Failed to change transaction id from %s to %s.",
2016 argv[1], argv[2]);
2017 return r;
2018 }
2019
2020 return 0;
2021}
2022
2023/*
2024 * Messages supported:
2025 * create_thin <dev_id>
2026 * create_snap <dev_id> <origin_id>
2027 * delete <dev_id>
2029 * set_transaction_id <current_trans_id> <new_trans_id>
2030 */
2031static int pool_message(struct dm_target *ti, unsigned argc, char **argv)
2032{
2033 int r = -EINVAL;
2034 struct pool_c *pt = ti->private;
2035 struct pool *pool = pt->pool;
2036
2037 if (!strcasecmp(argv[0], "create_thin"))
2038 r = process_create_thin_mesg(argc, argv, pool);
2039
2040 else if (!strcasecmp(argv[0], "create_snap"))
2041 r = process_create_snap_mesg(argc, argv, pool);
2042
2043 else if (!strcasecmp(argv[0], "delete"))
2044 r = process_delete_mesg(argc, argv, pool);
2045
2046 else if (!strcasecmp(argv[0], "set_transaction_id"))
2047 r = process_set_transaction_id_mesg(argc, argv, pool);
2048
2049 else
2050 DMWARN("Unrecognised thin pool target message received: %s", argv[0]);
2051
2052 if (!r) {
2053 r = dm_pool_commit_metadata(pool->pmd);
2054 if (r)
2055 DMERR("%s message: dm_pool_commit_metadata() failed, error = %d",
2056 argv[0], r);
2057 }
2058
2059 return r;
2060}
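For illustration only, not part of this patch: these messages are normally issued from userspace with dmsetup. The pool is a singleton target, so sector 0 addresses it; the device name and ids below are hypothetical:

	dmsetup message /dev/mapper/my_pool 0 "create_thin 0"
	dmsetup message /dev/mapper/my_pool 0 "create_snap 1 0"
	dmsetup message /dev/mapper/my_pool 0 "delete 1"
	dmsetup message /dev/mapper/my_pool 0 "set_transaction_id 0 1"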
2061
2062/*
2063 * Status line is:
2064 * <transaction id> <used metadata blocks>/<total metadata blocks>
2065 * <used data blocks>/<total data blocks> <held metadata root>
2066 */
2067static int pool_status(struct dm_target *ti, status_type_t type,
2068 char *result, unsigned maxlen)
2069{
2070 int r;
2071 unsigned sz = 0;
2072 uint64_t transaction_id;
2073 dm_block_t nr_free_blocks_data;
2074 dm_block_t nr_free_blocks_metadata;
2075 dm_block_t nr_blocks_data;
2076 dm_block_t nr_blocks_metadata;
2077 dm_block_t held_root;
2078 char buf[BDEVNAME_SIZE];
2079 char buf2[BDEVNAME_SIZE];
2080 struct pool_c *pt = ti->private;
2081 struct pool *pool = pt->pool;
2082
2083 switch (type) {
2084 case STATUSTYPE_INFO:
2085 r = dm_pool_get_metadata_transaction_id(pool->pmd,
2086 &transaction_id);
2087 if (r)
2088 return r;
2089
2090 r = dm_pool_get_free_metadata_block_count(pool->pmd,
2091 &nr_free_blocks_metadata);
2092 if (r)
2093 return r;
2094
2095 r = dm_pool_get_metadata_dev_size(pool->pmd, &nr_blocks_metadata);
2096 if (r)
2097 return r;
2098
2099 r = dm_pool_get_free_block_count(pool->pmd,
2100 &nr_free_blocks_data);
2101 if (r)
2102 return r;
2103
2104 r = dm_pool_get_data_dev_size(pool->pmd, &nr_blocks_data);
2105 if (r)
2106 return r;
2107
2108 r = dm_pool_get_held_metadata_root(pool->pmd, &held_root);
2109 if (r)
2110 return r;
2111
2112 DMEMIT("%llu %llu/%llu %llu/%llu ",
2113 (unsigned long long)transaction_id,
2114 (unsigned long long)(nr_blocks_metadata - nr_free_blocks_metadata),
2115 (unsigned long long)nr_blocks_metadata,
2116 (unsigned long long)(nr_blocks_data - nr_free_blocks_data),
2117 (unsigned long long)nr_blocks_data);
2118
2119 if (held_root)
2120 DMEMIT("%llu", held_root);
2121 else
2122 DMEMIT("-");
2123
2124 break;
2125
2126 case STATUSTYPE_TABLE:
2127 DMEMIT("%s %s %lu %llu ",
2128 format_dev_t(buf, pt->metadata_dev->bdev->bd_dev),
2129 format_dev_t(buf2, pt->data_dev->bdev->bd_dev),
2130 (unsigned long)pool->sectors_per_block,
2131 (unsigned long long)pt->low_water_blocks);
2132
2133 DMEMIT("%u ", !pool->zero_new_blocks);
2134
2135 if (!pool->zero_new_blocks)
2136 DMEMIT("skip_block_zeroing ");
2137 break;
2138 }
2139
2140 return 0;
2141}
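A hypothetical STATUSTYPE_INFO line in the format emitted above: transaction id 1, 126 of 4096 metadata blocks used, 257 of 1048576 data blocks used, and no held metadata root:

	1 126/4096 257/1048576 -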
2142
2143static int pool_iterate_devices(struct dm_target *ti,
2144 iterate_devices_callout_fn fn, void *data)
2145{
2146 struct pool_c *pt = ti->private;
2147
2148 return fn(ti, pt->data_dev, 0, ti->len, data);
2149}
2150
2151static int pool_merge(struct dm_target *ti, struct bvec_merge_data *bvm,
2152 struct bio_vec *biovec, int max_size)
2153{
2154 struct pool_c *pt = ti->private;
2155 struct request_queue *q = bdev_get_queue(pt->data_dev->bdev);
2156
2157 if (!q->merge_bvec_fn)
2158 return max_size;
2159
2160 bvm->bi_bdev = pt->data_dev->bdev;
2161
2162 return min(max_size, q->merge_bvec_fn(q, bvm, biovec));
2163}
2164
2165static void pool_io_hints(struct dm_target *ti, struct queue_limits *limits)
2166{
2167 struct pool_c *pt = ti->private;
2168 struct pool *pool = pt->pool;
2169
2170 blk_limits_io_min(limits, 0);
2171 blk_limits_io_opt(limits, pool->sectors_per_block << SECTOR_SHIFT);
2172}
2173
2174static struct target_type pool_target = {
2175 .name = "thin-pool",
2176 .features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
2177 DM_TARGET_IMMUTABLE,
2178 .version = {1, 0, 0},
2179 .module = THIS_MODULE,
2180 .ctr = pool_ctr,
2181 .dtr = pool_dtr,
2182 .map = pool_map,
2183 .postsuspend = pool_postsuspend,
2184 .preresume = pool_preresume,
2185 .resume = pool_resume,
2186 .message = pool_message,
2187 .status = pool_status,
2188 .merge = pool_merge,
2189 .iterate_devices = pool_iterate_devices,
2190 .io_hints = pool_io_hints,
2191};
2192
2193/*----------------------------------------------------------------
2194 * Thin target methods
2195 *--------------------------------------------------------------*/
2196static void thin_dtr(struct dm_target *ti)
2197{
2198 struct thin_c *tc = ti->private;
2199
2200 mutex_lock(&dm_thin_pool_table.mutex);
2201
2202 __pool_dec(tc->pool);
2203 dm_pool_close_thin_device(tc->td);
2204 dm_put_device(ti, tc->pool_dev);
2205 kfree(tc);
2206
2207 mutex_unlock(&dm_thin_pool_table.mutex);
2208}
2209
2210/*
2211 * Thin target parameters:
2212 *
2213 * <pool_dev> <dev_id>
2214 *
2215 * pool_dev: the path to the pool (eg, /dev/mapper/my_pool)
2216 * dev_id: the internal device identifier
2217 */
2218static int thin_ctr(struct dm_target *ti, unsigned argc, char **argv)
2219{
2220 int r;
2221 struct thin_c *tc;
2222 struct dm_dev *pool_dev;
2223 struct mapped_device *pool_md;
2224
2225 mutex_lock(&dm_thin_pool_table.mutex);
2226
2227 if (argc != 2) {
2228 ti->error = "Invalid argument count";
2229 r = -EINVAL;
2230 goto out_unlock;
2231 }
2232
2233 tc = ti->private = kzalloc(sizeof(*tc), GFP_KERNEL);
2234 if (!tc) {
2235 ti->error = "Out of memory";
2236 r = -ENOMEM;
2237 goto out_unlock;
2238 }
2239
2240 r = dm_get_device(ti, argv[0], dm_table_get_mode(ti->table), &pool_dev);
2241 if (r) {
2242 ti->error = "Error opening pool device";
2243 goto bad_pool_dev;
2244 }
2245 tc->pool_dev = pool_dev;
2246
2247 if (read_dev_id(argv[1], (unsigned long long *)&tc->dev_id, 0)) {
2248 ti->error = "Invalid device id";
2249 r = -EINVAL;
2250 goto bad_common;
2251 }
2252
2253 pool_md = dm_get_md(tc->pool_dev->bdev->bd_dev);
2254 if (!pool_md) {
2255 ti->error = "Couldn't get pool mapped device";
2256 r = -EINVAL;
2257 goto bad_common;
2258 }
2259
2260 tc->pool = __pool_table_lookup(pool_md);
2261 if (!tc->pool) {
2262 ti->error = "Couldn't find pool object";
2263 r = -EINVAL;
2264 goto bad_pool_lookup;
2265 }
2266 __pool_inc(tc->pool);
2267
2268 r = dm_pool_open_thin_device(tc->pool->pmd, tc->dev_id, &tc->td);
2269 if (r) {
2270 ti->error = "Couldn't open thin internal device";
2271 goto bad_thin_open;
2272 }
2273
2274 ti->split_io = tc->pool->sectors_per_block;
2275 ti->num_flush_requests = 1;
2276 ti->num_discard_requests = 0;
2277 ti->discards_supported = 0;
2278
2279 dm_put(pool_md);
2280
2281 mutex_unlock(&dm_thin_pool_table.mutex);
2282
2283 return 0;
2284
2285bad_thin_open:
2286 __pool_dec(tc->pool);
2287bad_pool_lookup:
2288 dm_put(pool_md);
2289bad_common:
2290 dm_put_device(ti, tc->pool_dev);
2291bad_pool_dev:
2292 kfree(tc);
2293out_unlock:
2294 mutex_unlock(&dm_thin_pool_table.mutex);
2295
2296 return r;
2297}
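Again purely illustrative: a thin device table line for internal device id 0, which must already have been created via the pool's create_thin message; the length in sectors and the pool device name are hypothetical:

	0 2097152 thin /dev/mapper/my_pool 0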
2298
2299static int thin_map(struct dm_target *ti, struct bio *bio,
2300 union map_info *map_context)
2301{
2302 bio->bi_sector -= ti->begin;
2303
2304 return thin_bio_map(ti, bio, map_context);
2305}
2306
2307static void thin_postsuspend(struct dm_target *ti)
2308{
2309 if (dm_noflush_suspending(ti))
2310 requeue_io((struct thin_c *)ti->private);
2311}
2312
2313/*
2314 * <nr mapped sectors> <highest mapped sector>
2315 */
2316static int thin_status(struct dm_target *ti, status_type_t type,
2317 char *result, unsigned maxlen)
2318{
2319 int r;
2320 ssize_t sz = 0;
2321 dm_block_t mapped, highest;
2322 char buf[BDEVNAME_SIZE];
2323 struct thin_c *tc = ti->private;
2324
2325 if (!tc->td)
2326 DMEMIT("-");
2327 else {
2328 switch (type) {
2329 case STATUSTYPE_INFO:
2330 r = dm_thin_get_mapped_count(tc->td, &mapped);
2331 if (r)
2332 return r;
2333
2334 r = dm_thin_get_highest_mapped_block(tc->td, &highest);
2335 if (r < 0)
2336 return r;
2337
2338 DMEMIT("%llu ", mapped * tc->pool->sectors_per_block);
2339 if (r)
2340 DMEMIT("%llu", ((highest + 1) *
2341 tc->pool->sectors_per_block) - 1);
2342 else
2343 DMEMIT("-");
2344 break;
2345
2346 case STATUSTYPE_TABLE:
2347 DMEMIT("%s %lu",
2348 format_dev_t(buf, tc->pool_dev->bdev->bd_dev),
2349 (unsigned long) tc->dev_id);
2350 break;
2351 }
2352 }
2353
2354 return 0;
2355}
2356
2357static int thin_iterate_devices(struct dm_target *ti,
2358 iterate_devices_callout_fn fn, void *data)
2359{
2360 dm_block_t blocks;
2361 struct thin_c *tc = ti->private;
2362
2363 /*
2364 * We can't call dm_pool_get_data_dev_size() since that blocks. So
2365 * we follow a more convoluted path through to the pool's target.
2366 */
2367 if (!tc->pool->ti)
2368 return 0; /* nothing is bound */
2369
2370 blocks = tc->pool->ti->len >> tc->pool->block_shift;
2371 if (blocks)
2372 return fn(ti, tc->pool_dev, 0, tc->pool->sectors_per_block * blocks, data);
2373
2374 return 0;
2375}
2376
2377static void thin_io_hints(struct dm_target *ti, struct queue_limits *limits)
2378{
2379 struct thin_c *tc = ti->private;
2380
2381 blk_limits_io_min(limits, 0);
2382 blk_limits_io_opt(limits, tc->pool->sectors_per_block << SECTOR_SHIFT);
2383}
2384
2385static struct target_type thin_target = {
2386 .name = "thin",
2387 .version = {1, 0, 0},
2388 .module = THIS_MODULE,
2389 .ctr = thin_ctr,
2390 .dtr = thin_dtr,
2391 .map = thin_map,
2392 .postsuspend = thin_postsuspend,
2393 .status = thin_status,
2394 .iterate_devices = thin_iterate_devices,
2395 .io_hints = thin_io_hints,
2396};
2397
2398/*----------------------------------------------------------------*/
2399
2400static int __init dm_thin_init(void)
2401{
2402 int r;
2403
2404 pool_table_init();
2405
2406 r = dm_register_target(&thin_target);
2407 if (r)
2408 return r;
2409
2410 r = dm_register_target(&pool_target);
2411 if (r)
2412 dm_unregister_target(&thin_target);
2413
2414 return r;
2415}
2416
2417static void dm_thin_exit(void)
2418{
2419 dm_unregister_target(&thin_target);
2420 dm_unregister_target(&pool_target);
2421}
2422
2423module_init(dm_thin_init);
2424module_exit(dm_thin_exit);
2425
2426MODULE_DESCRIPTION(DM_NAME "device-mapper thin provisioning target");
2427MODULE_AUTHOR("Joe Thornber <dm-devel@redhat.com>");
2428MODULE_LICENSE("GPL");
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index 52b39f335bb..6b6616a41ba 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -25,6 +25,16 @@
25 25
26#define DM_MSG_PREFIX "core" 26#define DM_MSG_PREFIX "core"
27 27
28#ifdef CONFIG_PRINTK
29/*
30 * ratelimit state to be used in DMXXX_LIMIT().
31 */
32DEFINE_RATELIMIT_STATE(dm_ratelimit_state,
33 DEFAULT_RATELIMIT_INTERVAL,
34 DEFAULT_RATELIMIT_BURST);
35EXPORT_SYMBOL(dm_ratelimit_state);
36#endif
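For illustration (not part of this hunk), targets that define DM_MSG_PREFIX can then use the rate-limited logging variants, which consult this shared state so a flood of identical messages is throttled; the error variable is hypothetical:

	/* sketch: in a target source file that defines DM_MSG_PREFIX */
	if (io_error)
		DMERR_LIMIT("metadata operation failed: %d", io_error);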
37
28/* 38/*
29 * Cookies are numeric values sent with CHANGE and REMOVE 39 * Cookies are numeric values sent with CHANGE and REMOVE
30 * uevents while resuming, removing or renaming the device. 40 * uevents while resuming, removing or renaming the device.
@@ -130,6 +140,8 @@ struct mapped_device {
130 /* Protect queue and type against concurrent access. */ 140 /* Protect queue and type against concurrent access. */
131 struct mutex type_lock; 141 struct mutex type_lock;
132 142
143 struct target_type *immutable_target_type;
144
133 struct gendisk *disk; 145 struct gendisk *disk;
134 char name[16]; 146 char name[16];
135 147
@@ -2086,6 +2098,8 @@ static struct dm_table *__bind(struct mapped_device *md, struct dm_table *t,
2086 write_lock_irqsave(&md->map_lock, flags); 2098 write_lock_irqsave(&md->map_lock, flags);
2087 old_map = md->map; 2099 old_map = md->map;
2088 md->map = t; 2100 md->map = t;
2101 md->immutable_target_type = dm_table_get_immutable_target_type(t);
2102
2089 dm_table_set_restrictions(t, q, limits); 2103 dm_table_set_restrictions(t, q, limits);
2090 if (merge_is_optional) 2104 if (merge_is_optional)
2091 set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags); 2105 set_bit(DMF_MERGE_IS_OPTIONAL, &md->flags);
@@ -2156,6 +2170,11 @@ unsigned dm_get_md_type(struct mapped_device *md)
2156 return md->type; 2170 return md->type;
2157} 2171}
2158 2172
2173struct target_type *dm_get_immutable_target_type(struct mapped_device *md)
2174{
2175 return md->immutable_target_type;
2176}
2177
2159/* 2178/*
2160 * Fully initialize a request-based queue (->elevator, ->request_fn, etc). 2179 * Fully initialize a request-based queue (->elevator, ->request_fn, etc).
2161 */ 2180 */
@@ -2231,6 +2250,7 @@ struct mapped_device *dm_get_md(dev_t dev)
2231 2250
2232 return md; 2251 return md;
2233} 2252}
2253EXPORT_SYMBOL_GPL(dm_get_md);
2234 2254
2235void *dm_get_mdptr(struct mapped_device *md) 2255void *dm_get_mdptr(struct mapped_device *md)
2236{ 2256{
@@ -2316,7 +2336,6 @@ static int dm_wait_for_completion(struct mapped_device *md, int interruptible)
2316 while (1) { 2336 while (1) {
2317 set_current_state(interruptible); 2337 set_current_state(interruptible);
2318 2338
2319 smp_mb();
2320 if (!md_in_flight(md)) 2339 if (!md_in_flight(md))
2321 break; 2340 break;
2322 2341
diff --git a/drivers/md/dm.h b/drivers/md/dm.h
index 6745dbd278a..b7dacd59d8d 100644
--- a/drivers/md/dm.h
+++ b/drivers/md/dm.h
@@ -60,6 +60,7 @@ int dm_table_resume_targets(struct dm_table *t);
60int dm_table_any_congested(struct dm_table *t, int bdi_bits); 60int dm_table_any_congested(struct dm_table *t, int bdi_bits);
61int dm_table_any_busy_target(struct dm_table *t); 61int dm_table_any_busy_target(struct dm_table *t);
62unsigned dm_table_get_type(struct dm_table *t); 62unsigned dm_table_get_type(struct dm_table *t);
63struct target_type *dm_table_get_immutable_target_type(struct dm_table *t);
63bool dm_table_request_based(struct dm_table *t); 64bool dm_table_request_based(struct dm_table *t);
64bool dm_table_supports_discards(struct dm_table *t); 65bool dm_table_supports_discards(struct dm_table *t);
65int dm_table_alloc_md_mempools(struct dm_table *t); 66int dm_table_alloc_md_mempools(struct dm_table *t);
@@ -72,6 +73,7 @@ void dm_lock_md_type(struct mapped_device *md);
72void dm_unlock_md_type(struct mapped_device *md); 73void dm_unlock_md_type(struct mapped_device *md);
73void dm_set_md_type(struct mapped_device *md, unsigned type); 74void dm_set_md_type(struct mapped_device *md, unsigned type);
74unsigned dm_get_md_type(struct mapped_device *md); 75unsigned dm_get_md_type(struct mapped_device *md);
76struct target_type *dm_get_immutable_target_type(struct mapped_device *md);
75 77
76int dm_setup_md_queue(struct mapped_device *md); 78int dm_setup_md_queue(struct mapped_device *md);
77 79
diff --git a/drivers/md/persistent-data/Kconfig b/drivers/md/persistent-data/Kconfig
new file mode 100644
index 00000000000..ceb359050a5
--- /dev/null
+++ b/drivers/md/persistent-data/Kconfig
@@ -0,0 +1,8 @@
1config DM_PERSISTENT_DATA
2 tristate
3 depends on BLK_DEV_DM && EXPERIMENTAL
4 select LIBCRC32C
5 select DM_BUFIO
6 ---help---
7 Library providing immutable on-disk data structure support for
8 device-mapper targets such as the thin provisioning target.
diff --git a/drivers/md/persistent-data/Makefile b/drivers/md/persistent-data/Makefile
new file mode 100644
index 00000000000..cfa95f66223
--- /dev/null
+++ b/drivers/md/persistent-data/Makefile
@@ -0,0 +1,11 @@
1obj-$(CONFIG_DM_PERSISTENT_DATA) += dm-persistent-data.o
2dm-persistent-data-objs := \
3 dm-block-manager.o \
4 dm-space-map-checker.o \
5 dm-space-map-common.o \
6 dm-space-map-disk.o \
7 dm-space-map-metadata.o \
8 dm-transaction-manager.o \
9 dm-btree.o \
10 dm-btree-remove.o \
11 dm-btree-spine.o
diff --git a/drivers/md/persistent-data/dm-block-manager.c b/drivers/md/persistent-data/dm-block-manager.c
new file mode 100644
index 00000000000..0317ecdc6e5
--- /dev/null
+++ b/drivers/md/persistent-data/dm-block-manager.c
@@ -0,0 +1,620 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6#include "dm-block-manager.h"
7#include "dm-persistent-data-internal.h"
8#include "../dm-bufio.h"
9
10#include <linux/crc32c.h>
11#include <linux/module.h>
12#include <linux/slab.h>
13#include <linux/rwsem.h>
14#include <linux/device-mapper.h>
15#include <linux/stacktrace.h>
16
17#define DM_MSG_PREFIX "block manager"
18
19/*----------------------------------------------------------------*/
20
21/*
22 * This is a read/write semaphore with a couple of differences.
23 *
24 * i) There is a restriction on the number of concurrent read locks that
25 * may be held at once. This is just an implementation detail.
26 *
27 * ii) Recursive locking attempts are detected and return EINVAL. A stack
28 * trace is also emitted for the previous lock acquisition.
29 *
30 * iii) Priority is given to write locks.
31 */
32#define MAX_HOLDERS 4
33#define MAX_STACK 10
34
35typedef unsigned long stack_entries[MAX_STACK];
36
37struct block_lock {
38 spinlock_t lock;
39 __s32 count;
40 struct list_head waiters;
41 struct task_struct *holders[MAX_HOLDERS];
42
43#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
44 struct stack_trace traces[MAX_HOLDERS];
45 stack_entries entries[MAX_HOLDERS];
46#endif
47};
48
49struct waiter {
50 struct list_head list;
51 struct task_struct *task;
52 int wants_write;
53};
54
55static unsigned __find_holder(struct block_lock *lock,
56 struct task_struct *task)
57{
58 unsigned i;
59
60 for (i = 0; i < MAX_HOLDERS; i++)
61 if (lock->holders[i] == task)
62 break;
63
64 BUG_ON(i == MAX_HOLDERS);
65 return i;
66}
67
68/* call this *after* you increment lock->count */
69static void __add_holder(struct block_lock *lock, struct task_struct *task)
70{
71 unsigned h = __find_holder(lock, NULL);
72#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
73 struct stack_trace *t;
74#endif
75
76 get_task_struct(task);
77 lock->holders[h] = task;
78
79#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
80 t = lock->traces + h;
81 t->nr_entries = 0;
82 t->max_entries = MAX_STACK;
83 t->entries = lock->entries[h];
84 t->skip = 2;
85 save_stack_trace(t);
86#endif
87}
88
89/* call this *before* you decrement lock->count */
90static void __del_holder(struct block_lock *lock, struct task_struct *task)
91{
92 unsigned h = __find_holder(lock, task);
93 lock->holders[h] = NULL;
94 put_task_struct(task);
95}
96
97static int __check_holder(struct block_lock *lock)
98{
99 unsigned i;
100#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
101 static struct stack_trace t;
102 static stack_entries entries;
103#endif
104
105 for (i = 0; i < MAX_HOLDERS; i++) {
106 if (lock->holders[i] == current) {
107 DMERR("recursive lock detected in pool metadata");
108#ifdef CONFIG_DM_DEBUG_BLOCK_STACK_TRACING
109 DMERR("previously held here:");
110 print_stack_trace(lock->traces + i, 4);
111
112 DMERR("subsequent aquisition attempted here:");
113 t.nr_entries = 0;
114 t.max_entries = MAX_STACK;
115 t.entries = entries;
116 t.skip = 3;
117 save_stack_trace(&t);
118 print_stack_trace(&t, 4);
119#endif
120 return -EINVAL;
121 }
122 }
123
124 return 0;
125}
126
127static void __wait(struct waiter *w)
128{
129 for (;;) {
130 set_task_state(current, TASK_UNINTERRUPTIBLE);
131
132 if (!w->task)
133 break;
134
135 schedule();
136 }
137
138 set_task_state(current, TASK_RUNNING);
139}
140
141static void __wake_waiter(struct waiter *w)
142{
143 struct task_struct *task;
144
145 list_del(&w->list);
146 task = w->task;
147 smp_mb();
148 w->task = NULL;
149 wake_up_process(task);
150}
151
152/*
153 * We either wake a few readers or a single writer.
154 */
155static void __wake_many(struct block_lock *lock)
156{
157 struct waiter *w, *tmp;
158
159 BUG_ON(lock->count < 0);
160 list_for_each_entry_safe(w, tmp, &lock->waiters, list) {
161 if (lock->count >= MAX_HOLDERS)
162 return;
163
164 if (w->wants_write) {
165 if (lock->count > 0)
166 return; /* still read locked */
167
168 lock->count = -1;
169 __add_holder(lock, w->task);
170 __wake_waiter(w);
171 return;
172 }
173
174 lock->count++;
175 __add_holder(lock, w->task);
176 __wake_waiter(w);
177 }
178}
179
180static void bl_init(struct block_lock *lock)
181{
182 int i;
183
184 spin_lock_init(&lock->lock);
185 lock->count = 0;
186 INIT_LIST_HEAD(&lock->waiters);
187 for (i = 0; i < MAX_HOLDERS; i++)
188 lock->holders[i] = NULL;
189}
190
191static int __available_for_read(struct block_lock *lock)
192{
193 return lock->count >= 0 &&
194 lock->count < MAX_HOLDERS &&
195 list_empty(&lock->waiters);
196}
197
198static int bl_down_read(struct block_lock *lock)
199{
200 int r;
201 struct waiter w;
202
203 spin_lock(&lock->lock);
204 r = __check_holder(lock);
205 if (r) {
206 spin_unlock(&lock->lock);
207 return r;
208 }
209
210 if (__available_for_read(lock)) {
211 lock->count++;
212 __add_holder(lock, current);
213 spin_unlock(&lock->lock);
214 return 0;
215 }
216
217 get_task_struct(current);
218
219 w.task = current;
220 w.wants_write = 0;
221 list_add_tail(&w.list, &lock->waiters);
222 spin_unlock(&lock->lock);
223
224 __wait(&w);
225 put_task_struct(current);
226 return 0;
227}
228
229static int bl_down_read_nonblock(struct block_lock *lock)
230{
231 int r;
232
233 spin_lock(&lock->lock);
234 r = __check_holder(lock);
235 if (r)
236 goto out;
237
238 if (__available_for_read(lock)) {
239 lock->count++;
240 __add_holder(lock, current);
241 r = 0;
242 } else
243 r = -EWOULDBLOCK;
244
245out:
246 spin_unlock(&lock->lock);
247 return r;
248}
249
250static void bl_up_read(struct block_lock *lock)
251{
252 spin_lock(&lock->lock);
253 BUG_ON(lock->count <= 0);
254 __del_holder(lock, current);
255 --lock->count;
256 if (!list_empty(&lock->waiters))
257 __wake_many(lock);
258 spin_unlock(&lock->lock);
259}
260
261static int bl_down_write(struct block_lock *lock)
262{
263 int r;
264 struct waiter w;
265
266 spin_lock(&lock->lock);
267 r = __check_holder(lock);
268 if (r) {
269 spin_unlock(&lock->lock);
270 return r;
271 }
272
273 if (lock->count == 0 && list_empty(&lock->waiters)) {
274 lock->count = -1;
275 __add_holder(lock, current);
276 spin_unlock(&lock->lock);
277 return 0;
278 }
279
280 get_task_struct(current);
281 w.task = current;
282 w.wants_write = 1;
283
284 /*
285 * Writers are given priority. We know there's only one mutator in the
286 * system, so we ignore the ordering reversal.
287 */
288 list_add(&w.list, &lock->waiters);
289 spin_unlock(&lock->lock);
290
291 __wait(&w);
292 put_task_struct(current);
293
294 return 0;
295}
296
297static void bl_up_write(struct block_lock *lock)
298{
299 spin_lock(&lock->lock);
300 __del_holder(lock, current);
301 lock->count = 0;
302 if (!list_empty(&lock->waiters))
303 __wake_many(lock);
304 spin_unlock(&lock->lock);
305}
306
307static void report_recursive_bug(dm_block_t b, int r)
308{
309 if (r == -EINVAL)
310 DMERR("recursive acquisition of block %llu requested.",
311 (unsigned long long) b);
312}
313
314/*----------------------------------------------------------------*/
315
316/*
317 * Block manager is currently implemented using dm-bufio. struct
318 * dm_block_manager and struct dm_block map directly onto a couple of
319 * structs in the bufio interface. I want to retain the freedom to move
320 * away from bufio in the future. So these structs are just cast within
321 * this .c file, rather than making it through to the public interface.
322 */
323static struct dm_buffer *to_buffer(struct dm_block *b)
324{
325 return (struct dm_buffer *) b;
326}
327
328static struct dm_bufio_client *to_bufio(struct dm_block_manager *bm)
329{
330 return (struct dm_bufio_client *) bm;
331}
332
333dm_block_t dm_block_location(struct dm_block *b)
334{
335 return dm_bufio_get_block_number(to_buffer(b));
336}
337EXPORT_SYMBOL_GPL(dm_block_location);
338
339void *dm_block_data(struct dm_block *b)
340{
341 return dm_bufio_get_block_data(to_buffer(b));
342}
343EXPORT_SYMBOL_GPL(dm_block_data);
344
345struct buffer_aux {
346 struct dm_block_validator *validator;
347 struct block_lock lock;
348 int write_locked;
349};
350
351static void dm_block_manager_alloc_callback(struct dm_buffer *buf)
352{
353 struct buffer_aux *aux = dm_bufio_get_aux_data(buf);
354 aux->validator = NULL;
355 bl_init(&aux->lock);
356}
357
358static void dm_block_manager_write_callback(struct dm_buffer *buf)
359{
360 struct buffer_aux *aux = dm_bufio_get_aux_data(buf);
361 if (aux->validator) {
362 aux->validator->prepare_for_write(aux->validator, (struct dm_block *) buf,
363 dm_bufio_get_block_size(dm_bufio_get_client(buf)));
364 }
365}
366
367/*----------------------------------------------------------------
368 * Public interface
369 *--------------------------------------------------------------*/
370struct dm_block_manager *dm_block_manager_create(struct block_device *bdev,
371 unsigned block_size,
372 unsigned cache_size,
373 unsigned max_held_per_thread)
374{
375 return (struct dm_block_manager *)
376 dm_bufio_client_create(bdev, block_size, max_held_per_thread,
377 sizeof(struct buffer_aux),
378 dm_block_manager_alloc_callback,
379 dm_block_manager_write_callback);
380}
381EXPORT_SYMBOL_GPL(dm_block_manager_create);
382
383void dm_block_manager_destroy(struct dm_block_manager *bm)
384{
385 return dm_bufio_client_destroy(to_bufio(bm));
386}
387EXPORT_SYMBOL_GPL(dm_block_manager_destroy);
388
389unsigned dm_bm_block_size(struct dm_block_manager *bm)
390{
391 return dm_bufio_get_block_size(to_bufio(bm));
392}
393EXPORT_SYMBOL_GPL(dm_bm_block_size);
394
395dm_block_t dm_bm_nr_blocks(struct dm_block_manager *bm)
396{
397 return dm_bufio_get_device_size(to_bufio(bm));
398}
399
400static int dm_bm_validate_buffer(struct dm_block_manager *bm,
401 struct dm_buffer *buf,
402 struct buffer_aux *aux,
403 struct dm_block_validator *v)
404{
405 if (unlikely(!aux->validator)) {
406 int r;
407 if (!v)
408 return 0;
409 r = v->check(v, (struct dm_block *) buf, dm_bufio_get_block_size(to_bufio(bm)));
410 if (unlikely(r))
411 return r;
412 aux->validator = v;
413 } else {
414 if (unlikely(aux->validator != v)) {
415 DMERR("validator mismatch (old=%s vs new=%s) for block %llu",
416 aux->validator->name, v ? v->name : "NULL",
417 (unsigned long long)
418 dm_bufio_get_block_number(buf));
419 return -EINVAL;
420 }
421 }
422
423 return 0;
424}
425int dm_bm_read_lock(struct dm_block_manager *bm, dm_block_t b,
426 struct dm_block_validator *v,
427 struct dm_block **result)
428{
429 struct buffer_aux *aux;
430 void *p;
431 int r;
432
433 p = dm_bufio_read(to_bufio(bm), b, (struct dm_buffer **) result);
434 if (unlikely(IS_ERR(p)))
435 return PTR_ERR(p);
436
437 aux = dm_bufio_get_aux_data(to_buffer(*result));
438 r = bl_down_read(&aux->lock);
439 if (unlikely(r)) {
440 dm_bufio_release(to_buffer(*result));
441 report_recursive_bug(b, r);
442 return r;
443 }
444
445 aux->write_locked = 0;
446
447 r = dm_bm_validate_buffer(bm, to_buffer(*result), aux, v);
448 if (unlikely(r)) {
449 bl_up_read(&aux->lock);
450 dm_bufio_release(to_buffer(*result));
451 return r;
452 }
453
454 return 0;
455}
456EXPORT_SYMBOL_GPL(dm_bm_read_lock);
457
458int dm_bm_write_lock(struct dm_block_manager *bm,
459 dm_block_t b, struct dm_block_validator *v,
460 struct dm_block **result)
461{
462 struct buffer_aux *aux;
463 void *p;
464 int r;
465
466 p = dm_bufio_read(to_bufio(bm), b, (struct dm_buffer **) result);
467 if (unlikely(IS_ERR(p)))
468 return PTR_ERR(p);
469
470 aux = dm_bufio_get_aux_data(to_buffer(*result));
471 r = bl_down_write(&aux->lock);
472 if (r) {
473 dm_bufio_release(to_buffer(*result));
474 report_recursive_bug(b, r);
475 return r;
476 }
477
478 aux->write_locked = 1;
479
480 r = dm_bm_validate_buffer(bm, to_buffer(*result), aux, v);
481 if (unlikely(r)) {
482 bl_up_write(&aux->lock);
483 dm_bufio_release(to_buffer(*result));
484 return r;
485 }
486
487 return 0;
488}
489EXPORT_SYMBOL_GPL(dm_bm_write_lock);
490
491int dm_bm_read_try_lock(struct dm_block_manager *bm,
492 dm_block_t b, struct dm_block_validator *v,
493 struct dm_block **result)
494{
495 struct buffer_aux *aux;
496 void *p;
497 int r;
498
499 p = dm_bufio_get(to_bufio(bm), b, (struct dm_buffer **) result);
500 if (unlikely(IS_ERR(p)))
501 return PTR_ERR(p);
502 if (unlikely(!p))
503 return -EWOULDBLOCK;
504
505 aux = dm_bufio_get_aux_data(to_buffer(*result));
506 r = bl_down_read_nonblock(&aux->lock);
507 if (r < 0) {
508 dm_bufio_release(to_buffer(*result));
509 report_recursive_bug(b, r);
510 return r;
511 }
512 aux->write_locked = 0;
513
514 r = dm_bm_validate_buffer(bm, to_buffer(*result), aux, v);
515 if (unlikely(r)) {
516 bl_up_read(&aux->lock);
517 dm_bufio_release(to_buffer(*result));
518 return r;
519 }
520
521 return 0;
522}
523
524int dm_bm_write_lock_zero(struct dm_block_manager *bm,
525 dm_block_t b, struct dm_block_validator *v,
526 struct dm_block **result)
527{
528 int r;
529 struct buffer_aux *aux;
530 void *p;
531
532 p = dm_bufio_new(to_bufio(bm), b, (struct dm_buffer **) result);
533 if (unlikely(IS_ERR(p)))
534 return PTR_ERR(p);
535
536 memset(p, 0, dm_bm_block_size(bm));
537
538 aux = dm_bufio_get_aux_data(to_buffer(*result));
539 r = bl_down_write(&aux->lock);
540 if (r) {
541 dm_bufio_release(to_buffer(*result));
542 return r;
543 }
544
545 aux->write_locked = 1;
546 aux->validator = v;
547
548 return 0;
549}
550
551int dm_bm_unlock(struct dm_block *b)
552{
553 struct buffer_aux *aux;
554 aux = dm_bufio_get_aux_data(to_buffer(b));
555
556 if (aux->write_locked) {
557 dm_bufio_mark_buffer_dirty(to_buffer(b));
558 bl_up_write(&aux->lock);
559 } else
560 bl_up_read(&aux->lock);
561
562 dm_bufio_release(to_buffer(b));
563
564 return 0;
565}
566EXPORT_SYMBOL_GPL(dm_bm_unlock);
567
568int dm_bm_unlock_move(struct dm_block *b, dm_block_t n)
569{
570 struct buffer_aux *aux;
571
572 aux = dm_bufio_get_aux_data(to_buffer(b));
573
574 if (aux->write_locked) {
575 dm_bufio_mark_buffer_dirty(to_buffer(b));
576 bl_up_write(&aux->lock);
577 } else
578 bl_up_read(&aux->lock);
579
580 dm_bufio_release_move(to_buffer(b), n);
581 return 0;
582}
583
584int dm_bm_flush_and_unlock(struct dm_block_manager *bm,
585 struct dm_block *superblock)
586{
587 int r;
588
589 r = dm_bufio_write_dirty_buffers(to_bufio(bm));
590 if (unlikely(r))
591 return r;
592 r = dm_bufio_issue_flush(to_bufio(bm));
593 if (unlikely(r))
594 return r;
595
596 dm_bm_unlock(superblock);
597
598 r = dm_bufio_write_dirty_buffers(to_bufio(bm));
599 if (unlikely(r))
600 return r;
601 r = dm_bufio_issue_flush(to_bufio(bm));
602 if (unlikely(r))
603 return r;
604
605 return 0;
606}
607
608u32 dm_bm_checksum(const void *data, size_t len, u32 init_xor)
609{
610 return crc32c(~(u32) 0, data, len) ^ init_xor;
611}
612EXPORT_SYMBOL_GPL(dm_bm_checksum);
613
614/*----------------------------------------------------------------*/
615
616MODULE_LICENSE("GPL");
617MODULE_AUTHOR("Joe Thornber <dm-devel@redhat.com>");
618MODULE_DESCRIPTION("Immutable metadata library for dm");
619
620/*----------------------------------------------------------------*/
diff --git a/drivers/md/persistent-data/dm-block-manager.h b/drivers/md/persistent-data/dm-block-manager.h
new file mode 100644
index 00000000000..924833d2dfa
--- /dev/null
+++ b/drivers/md/persistent-data/dm-block-manager.h
@@ -0,0 +1,123 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef _LINUX_DM_BLOCK_MANAGER_H
8#define _LINUX_DM_BLOCK_MANAGER_H
9
10#include <linux/types.h>
11#include <linux/blkdev.h>
12
13/*----------------------------------------------------------------*/
14
15/*
16 * Block number.
17 */
18typedef uint64_t dm_block_t;
19struct dm_block;
20
21dm_block_t dm_block_location(struct dm_block *b);
22void *dm_block_data(struct dm_block *b);
23
24/*----------------------------------------------------------------*/
25
26/*
27 * @name should be a unique identifier for the block manager, no longer
28 * than 32 chars.
29 *
30 * @max_held_per_thread should be the maximum number of locks, read or
31 * write, that an individual thread holds at any one time.
32 */
33struct dm_block_manager;
34struct dm_block_manager *dm_block_manager_create(
35 struct block_device *bdev, unsigned block_size,
36 unsigned cache_size, unsigned max_held_per_thread);
37void dm_block_manager_destroy(struct dm_block_manager *bm);
38
39unsigned dm_bm_block_size(struct dm_block_manager *bm);
40dm_block_t dm_bm_nr_blocks(struct dm_block_manager *bm);
41
42/*----------------------------------------------------------------*/
43
44/*
45 * The validator allows the caller to verify newly-read data and modify
46 * the data just before writing, e.g. to calculate checksums. It's
47 * important to be consistent with your use of validators. The only time
48 * you can change validators is if you call dm_bm_write_lock_zero.
49 */
50struct dm_block_validator {
51 const char *name;
52 void (*prepare_for_write)(struct dm_block_validator *v, struct dm_block *b, size_t block_size);
53
54 /*
55 * Return 0 if the checksum is valid or < 0 on error.
56 */
57 int (*check)(struct dm_block_validator *v, struct dm_block *b, size_t block_size);
58};
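A minimal validator sketch, not part of this patch: it checksums everything after a leading csum field using dm_bm_checksum() and dm_block_data(), both declared in this header. The header layout and names are hypothetical; the pattern is typical for checksummed on-disk metadata.

struct example_header {
	__le32 csum;
	/* ... rest of the on-disk block ... */
} __packed;

static void example_prepare_for_write(struct dm_block_validator *v,
				      struct dm_block *b, size_t block_size)
{
	struct example_header *h = dm_block_data(b);

	/* Checksum covers everything after the csum field itself. */
	h->csum = cpu_to_le32(dm_bm_checksum(&h->csum + 1,
					     block_size - sizeof(__le32), 0));
}

static int example_check(struct dm_block_validator *v,
			 struct dm_block *b, size_t block_size)
{
	struct example_header *h = dm_block_data(b);
	__le32 csum = cpu_to_le32(dm_bm_checksum(&h->csum + 1,
						 block_size - sizeof(__le32), 0));

	return csum == h->csum ? 0 : -EILSEQ;
}

static struct dm_block_validator example_validator = {
	.name = "example",
	.prepare_for_write = example_prepare_for_write,
	.check = example_check,
};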
59
60/*----------------------------------------------------------------*/
61
62/*
63 * You can have multiple concurrent readers or a single writer holding a
64 * block lock.
65 */
66
67/*
68 * dm_bm_lock() locks a block and returns through @result a pointer to
69 * memory that holds a copy of that block. If you have write-locked the
70 * block then any changes you make to memory pointed to by @result will be
71 * written back to the disk sometime after dm_bm_unlock is called.
72 */
73int dm_bm_read_lock(struct dm_block_manager *bm, dm_block_t b,
74 struct dm_block_validator *v,
75 struct dm_block **result);
76
77int dm_bm_write_lock(struct dm_block_manager *bm, dm_block_t b,
78 struct dm_block_validator *v,
79 struct dm_block **result);
80
81/*
82 * The *_try_lock variants return -EWOULDBLOCK if the block isn't
83 * available immediately.
84 */
85int dm_bm_read_try_lock(struct dm_block_manager *bm, dm_block_t b,
86 struct dm_block_validator *v,
87 struct dm_block **result);
88
89/*
90 * Use dm_bm_write_lock_zero() when you know you're going to
91 * overwrite the block completely. It saves a disk read.
92 */
93int dm_bm_write_lock_zero(struct dm_block_manager *bm, dm_block_t b,
94 struct dm_block_validator *v,
95 struct dm_block **result);
96
97int dm_bm_unlock(struct dm_block *b);
98
99/*
100 * An optimisation; we often want to copy a block's contents to a new
101 * block. eg, as part of the shadowing operation. It's far better for
102 * bufio to do this move behind the scenes than hold 2 locks and memcpy the
103 * data.
104 */
105int dm_bm_unlock_move(struct dm_block *b, dm_block_t n);
106
107/*
108 * It's a common idiom to have a superblock that should be committed last.
109 *
110 * @superblock should be write-locked on entry. It will be unlocked during
111 * this function. All dirty blocks are guaranteed to be written and flushed
112 * before the superblock.
113 *
114 * This method always blocks.
115 */
116int dm_bm_flush_and_unlock(struct dm_block_manager *bm,
117 struct dm_block *superblock);
118
119u32 dm_bm_checksum(const void *data, size_t len, u32 init_xor);
120
121/*----------------------------------------------------------------*/
122
123#endif /* _LINUX_DM_BLOCK_MANAGER_H */
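A brief usage sketch of the superblock idiom described above, not part of this patch; the block manager, validator, superblock location (block 0) and block layout are assumptions supplied by the caller.

static int example_commit(struct dm_block_manager *bm,
			  struct dm_block_validator *v)
{
	struct dm_block *sb;
	int r;

	/* Zero and write-lock the superblock; the validator is attached here. */
	r = dm_bm_write_lock_zero(bm, 0, v, &sb);
	if (r)
		return r;

	/* ... fill in dm_block_data(sb) with the new on-disk state ... */

	/*
	 * Writes and flushes every other dirty block, then the superblock,
	 * and unlocks sb before returning.
	 */
	return dm_bm_flush_and_unlock(bm, sb);
}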
diff --git a/drivers/md/persistent-data/dm-btree-internal.h b/drivers/md/persistent-data/dm-btree-internal.h
new file mode 100644
index 00000000000..d279c768f8f
--- /dev/null
+++ b/drivers/md/persistent-data/dm-btree-internal.h
@@ -0,0 +1,137 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef DM_BTREE_INTERNAL_H
8#define DM_BTREE_INTERNAL_H
9
10#include "dm-btree.h"
11
12/*----------------------------------------------------------------*/
13
14/*
15 * We'll need 2 accessor functions for n->csum and n->blocknr
16 * to support dm-btree-spine.c in that case.
17 */
18
19enum node_flags {
20 INTERNAL_NODE = 1,
21 LEAF_NODE = 1 << 1
22};
23
24/*
25 * Every btree node begins with this structure. Make sure it's a multiple
26 * of 8-bytes in size, otherwise the 64bit keys will be mis-aligned.
27 */
28struct node_header {
29 __le32 csum;
30 __le32 flags;
31 __le64 blocknr; /* Block this node is supposed to live in. */
32
33 __le32 nr_entries;
34 __le32 max_entries;
35 __le32 value_size;
36 __le32 padding;
37} __packed;
38
39struct node {
40 struct node_header header;
41 __le64 keys[0];
42} __packed;
43
44
45void inc_children(struct dm_transaction_manager *tm, struct node *n,
46 struct dm_btree_value_type *vt);
47
48int new_block(struct dm_btree_info *info, struct dm_block **result);
49int unlock_block(struct dm_btree_info *info, struct dm_block *b);
50
51/*
52 * Spines keep track of the rolling locks. There are 2 variants, read-only
53 * and one that uses shadowing. These are separate structs to allow the
54 * type checker to spot misuse, for example accidentally calling read_lock
55 * on a shadow spine.
56 */
57struct ro_spine {
58 struct dm_btree_info *info;
59
60 int count;
61 struct dm_block *nodes[2];
62};
63
64void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info);
65int exit_ro_spine(struct ro_spine *s);
66int ro_step(struct ro_spine *s, dm_block_t new_child);
67struct node *ro_node(struct ro_spine *s);
68
69struct shadow_spine {
70 struct dm_btree_info *info;
71
72 int count;
73 struct dm_block *nodes[2];
74
75 dm_block_t root;
76};
77
78void init_shadow_spine(struct shadow_spine *s, struct dm_btree_info *info);
79int exit_shadow_spine(struct shadow_spine *s);
80
81int shadow_step(struct shadow_spine *s, dm_block_t b,
82 struct dm_btree_value_type *vt);
83
84/*
85 * The spine must have at least one entry before calling this.
86 */
87struct dm_block *shadow_current(struct shadow_spine *s);
88
89/*
90 * The spine must have at least two entries before calling this.
91 */
92struct dm_block *shadow_parent(struct shadow_spine *s);
93
94int shadow_has_parent(struct shadow_spine *s);
95
96int shadow_root(struct shadow_spine *s);
97
98/*
99 * Some inlines.
100 */
101static inline __le64 *key_ptr(struct node *n, uint32_t index)
102{
103 return n->keys + index;
104}
105
106static inline void *value_base(struct node *n)
107{
108 return &n->keys[le32_to_cpu(n->header.max_entries)];
109}
110
111/*
112 * FIXME: Now that value size is stored in node we don't need the third parm.
113 */
114static inline void *value_ptr(struct node *n, uint32_t index, size_t value_size)
115{
116 BUG_ON(value_size != le32_to_cpu(n->header.value_size));
117 return value_base(n) + (value_size * index);
118}
119
120/*
121 * Assumes the values are suitably-aligned and converts to core format.
122 */
123static inline uint64_t value64(struct node *n, uint32_t index)
124{
125 __le64 *values_le = value_base(n);
126
127 return le64_to_cpu(values_le[index]);
128}
129
130/*
131 * Searching for a key within a single node.
132 */
133int lower_bound(struct node *n, uint64_t key);
134
135extern struct dm_block_validator btree_node_validator;
136
137#endif /* DM_BTREE_INTERNAL_H */
diff --git a/drivers/md/persistent-data/dm-btree-remove.c b/drivers/md/persistent-data/dm-btree-remove.c
new file mode 100644
index 00000000000..65fd85ec651
--- /dev/null
+++ b/drivers/md/persistent-data/dm-btree-remove.c
@@ -0,0 +1,566 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-btree.h"
8#include "dm-btree-internal.h"
9#include "dm-transaction-manager.h"
10
11#include <linux/module.h>
12
13/*
14 * Removing an entry from a btree
15 * ==============================
16 *
17 * A very important constraint for our btree is that no node, except the
18 * root, may have fewer than a certain number of entries.
19 * (MIN_ENTRIES <= nr_entries <= MAX_ENTRIES).
20 *
21 * Ensuring this is complicated by the way we want to only ever hold the
22 * locks on 2 nodes concurrently, and only change nodes in a top to bottom
23 * fashion.
24 *
25 * Each node may have a left or right sibling. When descending the spine,
26 * if a node contains only MIN_ENTRIES then we try and increase this to at
27 * least MIN_ENTRIES + 1. We do this in the following ways:
28 *
29 * [A] No siblings => this can only happen if the node is the root, in which
30 * case we copy the child's contents over the root.
31 *
32 * [B] No left sibling
33 * ==> rebalance(node, right sibling)
34 *
35 * [C] No right sibling
36 * ==> rebalance(left sibling, node)
37 *
38 * [D] Both siblings, total_entries(left, node, right) <= DEL_THRESHOLD
39 * ==> delete node, adding its contents to left and right
40 *
41 * [E] Both siblings, total_entries(left, node, right) > DEL_THRESHOLD
42 * ==> rebalance(left, node, right)
43 *
44 * After these operations it's possible that our original node no
45 * longer contains the desired subtree. For this reason this rebalancing
46 * is performed on the children of the current node. This also avoids
47 * having a special case for the root.
48 *
49 * Once this rebalancing has occurred we can then step into the child node
50 * for internal nodes. Or delete the entry for leaf nodes.
51 */
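To make the thresholds concrete using the helpers defined further down this file: for a node with max_entries = 126, del_threshold() is 126 / 3 = 42 and merge_threshold() is 2 * 42 + 1 = 85, so a sparsely populated child (roughly a third full or less) is topped up on the way down, two siblings are merged only while their combined entry count stays at or below 85, and otherwise their entries are redistributed evenly between them.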
52
53/*
54 * Some little utilities for moving node data around.
55 */
56static void node_shift(struct node *n, int shift)
57{
58 uint32_t nr_entries = le32_to_cpu(n->header.nr_entries);
59 uint32_t value_size = le32_to_cpu(n->header.value_size);
60
61 if (shift < 0) {
62 shift = -shift;
63 BUG_ON(shift > nr_entries);
64 BUG_ON((void *) key_ptr(n, shift) >= value_ptr(n, shift, value_size));
65 memmove(key_ptr(n, 0),
66 key_ptr(n, shift),
67 (nr_entries - shift) * sizeof(__le64));
68 memmove(value_ptr(n, 0, value_size),
69 value_ptr(n, shift, value_size),
70 (nr_entries - shift) * value_size);
71 } else {
72 BUG_ON(nr_entries + shift > le32_to_cpu(n->header.max_entries));
73 memmove(key_ptr(n, shift),
74 key_ptr(n, 0),
75 nr_entries * sizeof(__le64));
76 memmove(value_ptr(n, shift, value_size),
77 value_ptr(n, 0, value_size),
78 nr_entries * value_size);
79 }
80}
81
82static void node_copy(struct node *left, struct node *right, int shift)
83{
84 uint32_t nr_left = le32_to_cpu(left->header.nr_entries);
85 uint32_t value_size = le32_to_cpu(left->header.value_size);
86 BUG_ON(value_size != le32_to_cpu(right->header.value_size));
87
88 if (shift < 0) {
89 shift = -shift;
90 BUG_ON(nr_left + shift > le32_to_cpu(left->header.max_entries));
91 memcpy(key_ptr(left, nr_left),
92 key_ptr(right, 0),
93 shift * sizeof(__le64));
94 memcpy(value_ptr(left, nr_left, value_size),
95 value_ptr(right, 0, value_size),
96 shift * value_size);
97 } else {
98 BUG_ON(shift > le32_to_cpu(right->header.max_entries));
99 memcpy(key_ptr(right, 0),
100 key_ptr(left, nr_left - shift),
101 shift * sizeof(__le64));
102 memcpy(value_ptr(right, 0, value_size),
103 value_ptr(left, nr_left - shift, value_size),
104 shift * value_size);
105 }
106}
107
108/*
109 * Delete a specific entry from a leaf node.
110 */
111static void delete_at(struct node *n, unsigned index)
112{
113 unsigned nr_entries = le32_to_cpu(n->header.nr_entries);
114 unsigned nr_to_copy = nr_entries - (index + 1);
115 uint32_t value_size = le32_to_cpu(n->header.value_size);
116 BUG_ON(index >= nr_entries);
117
118 if (nr_to_copy) {
119 memmove(key_ptr(n, index),
120 key_ptr(n, index + 1),
121 nr_to_copy * sizeof(__le64));
122
123 memmove(value_ptr(n, index, value_size),
124 value_ptr(n, index + 1, value_size),
125 nr_to_copy * value_size);
126 }
127
128 n->header.nr_entries = cpu_to_le32(nr_entries - 1);
129}
130
131static unsigned del_threshold(struct node *n)
132{
133 return le32_to_cpu(n->header.max_entries) / 3;
134}
135
136static unsigned merge_threshold(struct node *n)
137{
138 /*
139 * The extra one is because we know we're potentially going to
140 * delete an entry.
141 */
142 return 2 * (le32_to_cpu(n->header.max_entries) / 3) + 1;
143}
144
145struct child {
146 unsigned index;
147 struct dm_block *block;
148 struct node *n;
149};
150
151static struct dm_btree_value_type le64_type = {
152 .context = NULL,
153 .size = sizeof(__le64),
154 .inc = NULL,
155 .dec = NULL,
156 .equal = NULL
157};
158
159static int init_child(struct dm_btree_info *info, struct node *parent,
160 unsigned index, struct child *result)
161{
162 int r, inc;
163 dm_block_t root;
164
165 result->index = index;
166 root = value64(parent, index);
167
168 r = dm_tm_shadow_block(info->tm, root, &btree_node_validator,
169 &result->block, &inc);
170 if (r)
171 return r;
172
173 result->n = dm_block_data(result->block);
174
175 if (inc)
176 inc_children(info->tm, result->n, &le64_type);
177
178 *((__le64 *) value_ptr(parent, index, sizeof(__le64))) =
179 cpu_to_le64(dm_block_location(result->block));
180
181 return 0;
182}
183
184static int exit_child(struct dm_btree_info *info, struct child *c)
185{
186 return dm_tm_unlock(info->tm, c->block);
187}
188
189static void shift(struct node *left, struct node *right, int count)
190{
191 if (!count)
192 return;
193
194 if (count > 0) {
195 node_shift(right, count);
196 node_copy(left, right, count);
197 } else {
198 node_copy(left, right, count);
199 node_shift(right, count);
200 }
201
202 left->header.nr_entries =
203 cpu_to_le32(le32_to_cpu(left->header.nr_entries) - count);
204 BUG_ON(le32_to_cpu(left->header.nr_entries) > le32_to_cpu(left->header.max_entries));
205
206 right->header.nr_entries =
207 cpu_to_le32(le32_to_cpu(right->header.nr_entries) + count);
208 BUG_ON(le32_to_cpu(right->header.nr_entries) > le32_to_cpu(right->header.max_entries));
209}
210
211static void __rebalance2(struct dm_btree_info *info, struct node *parent,
212 struct child *l, struct child *r)
213{
214 struct node *left = l->n;
215 struct node *right = r->n;
216 uint32_t nr_left = le32_to_cpu(left->header.nr_entries);
217 uint32_t nr_right = le32_to_cpu(right->header.nr_entries);
218
219 if (nr_left + nr_right <= merge_threshold(left)) {
220 /*
221 * Merge
222 */
223 node_copy(left, right, -nr_right);
224 left->header.nr_entries = cpu_to_le32(nr_left + nr_right);
225 delete_at(parent, r->index);
226
227 /*
 228 * We need to decrement the right block, but not its
 229 * children, since they're still referenced by left.
230 */
231 dm_tm_dec(info->tm, dm_block_location(r->block));
232 } else {
233 /*
234 * Rebalance.
235 */
236 unsigned target_left = (nr_left + nr_right) / 2;
237 unsigned shift_ = nr_left - target_left;
238 BUG_ON(le32_to_cpu(left->header.max_entries) <= nr_left - shift_);
239 BUG_ON(le32_to_cpu(right->header.max_entries) <= nr_right + shift_);
240 shift(left, right, nr_left - target_left);
241 *key_ptr(parent, r->index) = right->keys[0];
242 }
243}
244
245static int rebalance2(struct shadow_spine *s, struct dm_btree_info *info,
246 unsigned left_index)
247{
248 int r;
249 struct node *parent;
250 struct child left, right;
251
252 parent = dm_block_data(shadow_current(s));
253
254 r = init_child(info, parent, left_index, &left);
255 if (r)
256 return r;
257
258 r = init_child(info, parent, left_index + 1, &right);
259 if (r) {
260 exit_child(info, &left);
261 return r;
262 }
263
264 __rebalance2(info, parent, &left, &right);
265
266 r = exit_child(info, &left);
267 if (r) {
268 exit_child(info, &right);
269 return r;
270 }
271
272 return exit_child(info, &right);
273}
274
275static void __rebalance3(struct dm_btree_info *info, struct node *parent,
276 struct child *l, struct child *c, struct child *r)
277{
278 struct node *left = l->n;
279 struct node *center = c->n;
280 struct node *right = r->n;
281
282 uint32_t nr_left = le32_to_cpu(left->header.nr_entries);
283 uint32_t nr_center = le32_to_cpu(center->header.nr_entries);
284 uint32_t nr_right = le32_to_cpu(right->header.nr_entries);
285 uint32_t max_entries = le32_to_cpu(left->header.max_entries);
286
287 unsigned target;
288
289 BUG_ON(left->header.max_entries != center->header.max_entries);
290 BUG_ON(center->header.max_entries != right->header.max_entries);
291
292 if (((nr_left + nr_center + nr_right) / 2) < merge_threshold(center)) {
293 /*
294 * Delete center node:
295 *
296 * We dump as many entries from center as possible into
 297 * left, then the rest into right, then rebalance2. This
 298 * wastes some CPU, but I want something simple for now.
299 */
300 unsigned shift = min(max_entries - nr_left, nr_center);
301
302 BUG_ON(nr_left + shift > max_entries);
303 node_copy(left, center, -shift);
304 left->header.nr_entries = cpu_to_le32(nr_left + shift);
305
306 if (shift != nr_center) {
307 shift = nr_center - shift;
308 BUG_ON((nr_right + shift) >= max_entries);
309 node_shift(right, shift);
310 node_copy(center, right, shift);
311 right->header.nr_entries = cpu_to_le32(nr_right + shift);
312 }
313 *key_ptr(parent, r->index) = right->keys[0];
314
315 delete_at(parent, c->index);
316 r->index--;
317
318 dm_tm_dec(info->tm, dm_block_location(c->block));
319 __rebalance2(info, parent, l, r);
320
321 return;
322 }
323
324 /*
325 * Rebalance
326 */
327 target = (nr_left + nr_center + nr_right) / 3;
328 BUG_ON(target > max_entries);
329
330 /*
331 * Adjust the left node
332 */
333 shift(left, center, nr_left - target);
334
335 /*
336 * Adjust the right node
337 */
338 shift(center, right, target - nr_right);
339 *key_ptr(parent, c->index) = center->keys[0];
340 *key_ptr(parent, r->index) = right->keys[0];
341}
342
343static int rebalance3(struct shadow_spine *s, struct dm_btree_info *info,
344 unsigned left_index)
345{
346 int r;
347 struct node *parent = dm_block_data(shadow_current(s));
348 struct child left, center, right;
349
350 /*
351 * FIXME: fill out an array?
352 */
353 r = init_child(info, parent, left_index, &left);
354 if (r)
355 return r;
356
357 r = init_child(info, parent, left_index + 1, &center);
358 if (r) {
359 exit_child(info, &left);
360 return r;
361 }
362
363 r = init_child(info, parent, left_index + 2, &right);
364 if (r) {
365 exit_child(info, &left);
366 exit_child(info, &center);
367 return r;
368 }
369
370 __rebalance3(info, parent, &left, &center, &right);
371
372 r = exit_child(info, &left);
373 if (r) {
374 exit_child(info, &center);
375 exit_child(info, &right);
376 return r;
377 }
378
379 r = exit_child(info, &center);
380 if (r) {
381 exit_child(info, &right);
382 return r;
383 }
384
385 r = exit_child(info, &right);
386 if (r)
387 return r;
388
389 return 0;
390}
391
392static int get_nr_entries(struct dm_transaction_manager *tm,
393 dm_block_t b, uint32_t *result)
394{
395 int r;
396 struct dm_block *block;
397 struct node *n;
398
399 r = dm_tm_read_lock(tm, b, &btree_node_validator, &block);
400 if (r)
401 return r;
402
403 n = dm_block_data(block);
404 *result = le32_to_cpu(n->header.nr_entries);
405
406 return dm_tm_unlock(tm, block);
407}
408
409static int rebalance_children(struct shadow_spine *s,
410 struct dm_btree_info *info, uint64_t key)
411{
412 int i, r, has_left_sibling, has_right_sibling;
413 uint32_t child_entries;
414 struct node *n;
415
416 n = dm_block_data(shadow_current(s));
417
418 if (le32_to_cpu(n->header.nr_entries) == 1) {
419 struct dm_block *child;
420 dm_block_t b = value64(n, 0);
421
422 r = dm_tm_read_lock(info->tm, b, &btree_node_validator, &child);
423 if (r)
424 return r;
425
426 memcpy(n, dm_block_data(child),
427 dm_bm_block_size(dm_tm_get_bm(info->tm)));
428 r = dm_tm_unlock(info->tm, child);
429 if (r)
430 return r;
431
432 dm_tm_dec(info->tm, dm_block_location(child));
433 return 0;
434 }
435
436 i = lower_bound(n, key);
437 if (i < 0)
438 return -ENODATA;
439
440 r = get_nr_entries(info->tm, value64(n, i), &child_entries);
441 if (r)
442 return r;
443
444 if (child_entries > del_threshold(n))
445 return 0;
446
447 has_left_sibling = i > 0;
448 has_right_sibling = i < (le32_to_cpu(n->header.nr_entries) - 1);
449
450 if (!has_left_sibling)
451 r = rebalance2(s, info, i);
452
453 else if (!has_right_sibling)
454 r = rebalance2(s, info, i - 1);
455
456 else
457 r = rebalance3(s, info, i - 1);
458
459 return r;
460}
461
462static int do_leaf(struct node *n, uint64_t key, unsigned *index)
463{
464 int i = lower_bound(n, key);
465
466 if ((i < 0) ||
467 (i >= le32_to_cpu(n->header.nr_entries)) ||
468 (le64_to_cpu(n->keys[i]) != key))
469 return -ENODATA;
470
471 *index = i;
472
473 return 0;
474}
475
476/*
477 * Prepares for removal from one level of the hierarchy. The caller must
478 * call delete_at() to remove the entry at index.
479 */
480static int remove_raw(struct shadow_spine *s, struct dm_btree_info *info,
481 struct dm_btree_value_type *vt, dm_block_t root,
482 uint64_t key, unsigned *index)
483{
484 int i = *index, r;
485 struct node *n;
486
487 for (;;) {
488 r = shadow_step(s, root, vt);
489 if (r < 0)
490 break;
491
492 /*
 493 * We have to patch up the parent node. This is ugly, but I don't
 494 * see a way to do it automatically as part of the spine
495 * op.
496 */
497 if (shadow_has_parent(s)) {
498 __le64 location = cpu_to_le64(dm_block_location(shadow_current(s)));
499 memcpy(value_ptr(dm_block_data(shadow_parent(s)), i, sizeof(__le64)),
500 &location, sizeof(__le64));
501 }
502
503 n = dm_block_data(shadow_current(s));
504
505 if (le32_to_cpu(n->header.flags) & LEAF_NODE)
506 return do_leaf(n, key, index);
507
508 r = rebalance_children(s, info, key);
509 if (r)
510 break;
511
512 n = dm_block_data(shadow_current(s));
513 if (le32_to_cpu(n->header.flags) & LEAF_NODE)
514 return do_leaf(n, key, index);
515
516 i = lower_bound(n, key);
517
518 /*
519 * We know the key is present, or else
520 * rebalance_children would have returned
521 * -ENODATA
522 */
523 root = value64(n, i);
524 }
525
526 return r;
527}
528
529int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
530 uint64_t *keys, dm_block_t *new_root)
531{
532 unsigned level, last_level = info->levels - 1;
533 int index = 0, r = 0;
534 struct shadow_spine spine;
535 struct node *n;
536
537 init_shadow_spine(&spine, info);
538 for (level = 0; level < info->levels; level++) {
539 r = remove_raw(&spine, info,
540 (level == last_level ?
541 &info->value_type : &le64_type),
542 root, keys[level], (unsigned *)&index);
543 if (r < 0)
544 break;
545
546 n = dm_block_data(shadow_current(&spine));
547 if (level != last_level) {
548 root = value64(n, index);
549 continue;
550 }
551
552 BUG_ON(index < 0 || index >= le32_to_cpu(n->header.nr_entries));
553
554 if (info->value_type.dec)
555 info->value_type.dec(info->value_type.context,
556 value_ptr(n, index, info->value_type.size));
557
558 delete_at(n, index);
559 }
560
561 *new_root = shadow_root(&spine);
562 exit_shadow_spine(&spine);
563
564 return r;
565}
566EXPORT_SYMBOL_GPL(dm_btree_remove);
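A minimal userspace sketch of the merge-versus-redistribute arithmetic used by __rebalance2() above; the max_entries value of 126 and the entry counts are invented for the example (the real values come from the node header).

#include <stdio.h>

static unsigned del_threshold(unsigned max_entries)
{
	return max_entries / 3;
}

static unsigned merge_threshold(unsigned max_entries)
{
	/* the extra one allows for the entry we are about to delete */
	return 2 * (max_entries / 3) + 1;
}

int main(void)
{
	unsigned max_entries = 126;		/* hypothetical node capacity */
	unsigned nr_left = 30, nr_right = 40;	/* entries in the two siblings */

	printf("rebalance triggers when a child has <= %u entries\n",
	       del_threshold(max_entries));

	if (nr_left + nr_right <= merge_threshold(max_entries))
		printf("merge: left absorbs right, %u entries total\n",
		       nr_left + nr_right);
	else
		printf("redistribute: left keeps %u, right keeps %u\n",
		       (nr_left + nr_right) / 2,
		       nr_left + nr_right - (nr_left + nr_right) / 2);

	return 0;
}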
diff --git a/drivers/md/persistent-data/dm-btree-spine.c b/drivers/md/persistent-data/dm-btree-spine.c
new file mode 100644
index 00000000000..d9a7912ee8e
--- /dev/null
+++ b/drivers/md/persistent-data/dm-btree-spine.c
@@ -0,0 +1,244 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-btree-internal.h"
8#include "dm-transaction-manager.h"
9
10#include <linux/device-mapper.h>
11
12#define DM_MSG_PREFIX "btree spine"
13
14/*----------------------------------------------------------------*/
15
16#define BTREE_CSUM_XOR 121107
17
18static int node_check(struct dm_block_validator *v,
19 struct dm_block *b,
20 size_t block_size);
21
22static void node_prepare_for_write(struct dm_block_validator *v,
23 struct dm_block *b,
24 size_t block_size)
25{
26 struct node *n = dm_block_data(b);
27 struct node_header *h = &n->header;
28
29 h->blocknr = cpu_to_le64(dm_block_location(b));
30 h->csum = cpu_to_le32(dm_bm_checksum(&h->flags,
31 block_size - sizeof(__le32),
32 BTREE_CSUM_XOR));
33
 34	BUG_ON(node_check(v, b, block_size));
35}
36
37static int node_check(struct dm_block_validator *v,
38 struct dm_block *b,
39 size_t block_size)
40{
41 struct node *n = dm_block_data(b);
42 struct node_header *h = &n->header;
43 size_t value_size;
44 __le32 csum_disk;
45 uint32_t flags;
46
47 if (dm_block_location(b) != le64_to_cpu(h->blocknr)) {
48 DMERR("node_check failed blocknr %llu wanted %llu",
49 le64_to_cpu(h->blocknr), dm_block_location(b));
50 return -ENOTBLK;
51 }
52
53 csum_disk = cpu_to_le32(dm_bm_checksum(&h->flags,
54 block_size - sizeof(__le32),
55 BTREE_CSUM_XOR));
56 if (csum_disk != h->csum) {
57 DMERR("node_check failed csum %u wanted %u",
58 le32_to_cpu(csum_disk), le32_to_cpu(h->csum));
59 return -EILSEQ;
60 }
61
62 value_size = le32_to_cpu(h->value_size);
63
64 if (sizeof(struct node_header) +
65 (sizeof(__le64) + value_size) * le32_to_cpu(h->max_entries) > block_size) {
66 DMERR("node_check failed: max_entries too large");
67 return -EILSEQ;
68 }
69
70 if (le32_to_cpu(h->nr_entries) > le32_to_cpu(h->max_entries)) {
71 DMERR("node_check failed, too many entries");
72 return -EILSEQ;
73 }
74
75 /*
76 * The node must be either INTERNAL or LEAF.
77 */
78 flags = le32_to_cpu(h->flags);
79 if (!(flags & INTERNAL_NODE) && !(flags & LEAF_NODE)) {
80 DMERR("node_check failed, node is neither INTERNAL or LEAF");
81 return -EILSEQ;
82 }
83
84 return 0;
85}
86
87struct dm_block_validator btree_node_validator = {
88 .name = "btree_node",
89 .prepare_for_write = node_prepare_for_write,
90 .check = node_check
91};
92
93/*----------------------------------------------------------------*/
94
95static int bn_read_lock(struct dm_btree_info *info, dm_block_t b,
96 struct dm_block **result)
97{
98 return dm_tm_read_lock(info->tm, b, &btree_node_validator, result);
99}
100
101static int bn_shadow(struct dm_btree_info *info, dm_block_t orig,
102 struct dm_btree_value_type *vt,
103 struct dm_block **result)
104{
105 int r, inc;
106
107 r = dm_tm_shadow_block(info->tm, orig, &btree_node_validator,
108 result, &inc);
109 if (!r && inc)
110 inc_children(info->tm, dm_block_data(*result), vt);
111
112 return r;
113}
114
115int new_block(struct dm_btree_info *info, struct dm_block **result)
116{
117 return dm_tm_new_block(info->tm, &btree_node_validator, result);
118}
119
120int unlock_block(struct dm_btree_info *info, struct dm_block *b)
121{
122 return dm_tm_unlock(info->tm, b);
123}
124
125/*----------------------------------------------------------------*/
126
127void init_ro_spine(struct ro_spine *s, struct dm_btree_info *info)
128{
129 s->info = info;
130 s->count = 0;
131 s->nodes[0] = NULL;
132 s->nodes[1] = NULL;
133}
134
135int exit_ro_spine(struct ro_spine *s)
136{
137 int r = 0, i;
138
139 for (i = 0; i < s->count; i++) {
140 int r2 = unlock_block(s->info, s->nodes[i]);
141 if (r2 < 0)
142 r = r2;
143 }
144
145 return r;
146}
147
148int ro_step(struct ro_spine *s, dm_block_t new_child)
149{
150 int r;
151
152 if (s->count == 2) {
153 r = unlock_block(s->info, s->nodes[0]);
154 if (r < 0)
155 return r;
156 s->nodes[0] = s->nodes[1];
157 s->count--;
158 }
159
160 r = bn_read_lock(s->info, new_child, s->nodes + s->count);
161 if (!r)
162 s->count++;
163
164 return r;
165}
166
167struct node *ro_node(struct ro_spine *s)
168{
169 struct dm_block *block;
170
171 BUG_ON(!s->count);
172 block = s->nodes[s->count - 1];
173
174 return dm_block_data(block);
175}
176
177/*----------------------------------------------------------------*/
178
179void init_shadow_spine(struct shadow_spine *s, struct dm_btree_info *info)
180{
181 s->info = info;
182 s->count = 0;
183}
184
185int exit_shadow_spine(struct shadow_spine *s)
186{
187 int r = 0, i;
188
189 for (i = 0; i < s->count; i++) {
190 int r2 = unlock_block(s->info, s->nodes[i]);
191 if (r2 < 0)
192 r = r2;
193 }
194
195 return r;
196}
197
198int shadow_step(struct shadow_spine *s, dm_block_t b,
199 struct dm_btree_value_type *vt)
200{
201 int r;
202
203 if (s->count == 2) {
204 r = unlock_block(s->info, s->nodes[0]);
205 if (r < 0)
206 return r;
207 s->nodes[0] = s->nodes[1];
208 s->count--;
209 }
210
211 r = bn_shadow(s->info, b, vt, s->nodes + s->count);
212 if (!r) {
213 if (!s->count)
214 s->root = dm_block_location(s->nodes[0]);
215
216 s->count++;
217 }
218
219 return r;
220}
221
222struct dm_block *shadow_current(struct shadow_spine *s)
223{
224 BUG_ON(!s->count);
225
226 return s->nodes[s->count - 1];
227}
228
229struct dm_block *shadow_parent(struct shadow_spine *s)
230{
231 BUG_ON(s->count != 2);
232
233 return s->count == 2 ? s->nodes[0] : NULL;
234}
235
236int shadow_has_parent(struct shadow_spine *s)
237{
238 return s->count >= 2;
239}
240
241int shadow_root(struct shadow_spine *s)
242{
243 return s->root;
244}
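The read-only and shadow spines above both keep a two-entry sliding window: stepping to a new child unlocks the grandparent first, so at most the current node and its parent are held at any time. A toy userspace model of that behaviour (the block numbers are invented; the real code locks struct dm_block through the transaction manager):

#include <stdio.h>

struct toy_spine {
	int count;
	long nodes[2];	/* stands in for two locked struct dm_block pointers */
};

static void toy_step(struct toy_spine *s, long new_child)
{
	if (s->count == 2) {
		printf("unlock block %ld\n", s->nodes[0]);
		s->nodes[0] = s->nodes[1];
		s->count--;
	}

	printf("lock block %ld\n", new_child);
	s->nodes[s->count++] = new_child;
}

int main(void)
{
	struct toy_spine s = { .count = 0 };
	long path[] = { 7, 19, 23, 42 };	/* an invented root-to-leaf path */
	unsigned i;

	for (i = 0; i < sizeof(path) / sizeof(path[0]); i++)
		toy_step(&s, path[i]);

	printf("still holding %d block(s)\n", s.count);
	return 0;
}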
diff --git a/drivers/md/persistent-data/dm-btree.c b/drivers/md/persistent-data/dm-btree.c
new file mode 100644
index 00000000000..e0638be53ea
--- /dev/null
+++ b/drivers/md/persistent-data/dm-btree.c
@@ -0,0 +1,805 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-btree-internal.h"
8#include "dm-space-map.h"
9#include "dm-transaction-manager.h"
10
11#include <linux/module.h>
12#include <linux/device-mapper.h>
13
14#define DM_MSG_PREFIX "btree"
15
16/*----------------------------------------------------------------
17 * Array manipulation
18 *--------------------------------------------------------------*/
19static void memcpy_disk(void *dest, const void *src, size_t len)
20 __dm_written_to_disk(src)
21{
22 memcpy(dest, src, len);
23 __dm_unbless_for_disk(src);
24}
25
26static void array_insert(void *base, size_t elt_size, unsigned nr_elts,
27 unsigned index, void *elt)
28 __dm_written_to_disk(elt)
29{
30 if (index < nr_elts)
31 memmove(base + (elt_size * (index + 1)),
32 base + (elt_size * index),
33 (nr_elts - index) * elt_size);
34
35 memcpy_disk(base + (elt_size * index), elt, elt_size);
36}
37
38/*----------------------------------------------------------------*/
39
40/* makes the assumption that no two keys are the same. */
41static int bsearch(struct node *n, uint64_t key, int want_hi)
42{
43 int lo = -1, hi = le32_to_cpu(n->header.nr_entries);
44
45 while (hi - lo > 1) {
46 int mid = lo + ((hi - lo) / 2);
47 uint64_t mid_key = le64_to_cpu(n->keys[mid]);
48
49 if (mid_key == key)
50 return mid;
51
52 if (mid_key < key)
53 lo = mid;
54 else
55 hi = mid;
56 }
57
58 return want_hi ? hi : lo;
59}
60
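/*
 * lower_bound() returns the index of the largest key that is <= the
 * requested key, or -1 if every key in the node is greater.  For example,
 * with keys {10, 20, 30}: key 25 -> index 1, key 10 -> index 0, and
 * key 5 -> -1.  Internal-node lookups use this to pick the child whose
 * subtree covers the key.
 */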
61int lower_bound(struct node *n, uint64_t key)
62{
63 return bsearch(n, key, 0);
64}
65
66void inc_children(struct dm_transaction_manager *tm, struct node *n,
67 struct dm_btree_value_type *vt)
68{
69 unsigned i;
70 uint32_t nr_entries = le32_to_cpu(n->header.nr_entries);
71
72 if (le32_to_cpu(n->header.flags) & INTERNAL_NODE)
73 for (i = 0; i < nr_entries; i++)
74 dm_tm_inc(tm, value64(n, i));
75 else if (vt->inc)
76 for (i = 0; i < nr_entries; i++)
77 vt->inc(vt->context,
78 value_ptr(n, i, vt->size));
79}
80
81static int insert_at(size_t value_size, struct node *node, unsigned index,
82 uint64_t key, void *value)
83 __dm_written_to_disk(value)
84{
85 uint32_t nr_entries = le32_to_cpu(node->header.nr_entries);
86 __le64 key_le = cpu_to_le64(key);
87
88 if (index > nr_entries ||
89 index >= le32_to_cpu(node->header.max_entries)) {
90 DMERR("too many entries in btree node for insert");
91 __dm_unbless_for_disk(value);
92 return -ENOMEM;
93 }
94
95 __dm_bless_for_disk(&key_le);
96
97 array_insert(node->keys, sizeof(*node->keys), nr_entries, index, &key_le);
98 array_insert(value_base(node), value_size, nr_entries, index, value);
99 node->header.nr_entries = cpu_to_le32(nr_entries + 1);
100
101 return 0;
102}
103
104/*----------------------------------------------------------------*/
105
106/*
107 * We want 3n entries (for some n). This works more nicely for repeated
 108 * insert/remove loops than (2n + 1).
109 */
110static uint32_t calc_max_entries(size_t value_size, size_t block_size)
111{
112 uint32_t total, n;
113 size_t elt_size = sizeof(uint64_t) + value_size; /* key + value */
114
115 block_size -= sizeof(struct node_header);
116 total = block_size / elt_size;
117 n = total / 3; /* rounds down */
118
119 return 3 * n;
120}
121
122int dm_btree_empty(struct dm_btree_info *info, dm_block_t *root)
123{
124 int r;
125 struct dm_block *b;
126 struct node *n;
127 size_t block_size;
128 uint32_t max_entries;
129
130 r = new_block(info, &b);
131 if (r < 0)
132 return r;
133
134 block_size = dm_bm_block_size(dm_tm_get_bm(info->tm));
135 max_entries = calc_max_entries(info->value_type.size, block_size);
136
137 n = dm_block_data(b);
138 memset(n, 0, block_size);
139 n->header.flags = cpu_to_le32(LEAF_NODE);
140 n->header.nr_entries = cpu_to_le32(0);
141 n->header.max_entries = cpu_to_le32(max_entries);
142 n->header.value_size = cpu_to_le32(info->value_type.size);
143
144 *root = dm_block_location(b);
145 return unlock_block(info, b);
146}
147EXPORT_SYMBOL_GPL(dm_btree_empty);
148
149/*----------------------------------------------------------------*/
150
151/*
 152 * Deletion uses a recursive algorithm. Since we have limited stack
 153 * space, we explicitly manage our own stack on the heap.
154 */
155#define MAX_SPINE_DEPTH 64
156struct frame {
157 struct dm_block *b;
158 struct node *n;
159 unsigned level;
160 unsigned nr_children;
161 unsigned current_child;
162};
163
164struct del_stack {
165 struct dm_transaction_manager *tm;
166 int top;
167 struct frame spine[MAX_SPINE_DEPTH];
168};
169
170static int top_frame(struct del_stack *s, struct frame **f)
171{
172 if (s->top < 0) {
173 DMERR("btree deletion stack empty");
174 return -EINVAL;
175 }
176
177 *f = s->spine + s->top;
178
179 return 0;
180}
181
182static int unprocessed_frames(struct del_stack *s)
183{
184 return s->top >= 0;
185}
186
187static int push_frame(struct del_stack *s, dm_block_t b, unsigned level)
188{
189 int r;
190 uint32_t ref_count;
191
192 if (s->top >= MAX_SPINE_DEPTH - 1) {
193 DMERR("btree deletion stack out of memory");
194 return -ENOMEM;
195 }
196
197 r = dm_tm_ref(s->tm, b, &ref_count);
198 if (r)
199 return r;
200
201 if (ref_count > 1)
202 /*
 203 * This is a shared node, so we can just decrement its
 204 * reference count and leave the children.
205 */
206 dm_tm_dec(s->tm, b);
207
208 else {
209 struct frame *f = s->spine + ++s->top;
210
211 r = dm_tm_read_lock(s->tm, b, &btree_node_validator, &f->b);
212 if (r) {
213 s->top--;
214 return r;
215 }
216
217 f->n = dm_block_data(f->b);
218 f->level = level;
219 f->nr_children = le32_to_cpu(f->n->header.nr_entries);
220 f->current_child = 0;
221 }
222
223 return 0;
224}
225
226static void pop_frame(struct del_stack *s)
227{
228 struct frame *f = s->spine + s->top--;
229
230 dm_tm_dec(s->tm, dm_block_location(f->b));
231 dm_tm_unlock(s->tm, f->b);
232}
233
234int dm_btree_del(struct dm_btree_info *info, dm_block_t root)
235{
236 int r;
237 struct del_stack *s;
238
239 s = kmalloc(sizeof(*s), GFP_KERNEL);
240 if (!s)
241 return -ENOMEM;
242 s->tm = info->tm;
243 s->top = -1;
244
245 r = push_frame(s, root, 1);
246 if (r)
247 goto out;
248
249 while (unprocessed_frames(s)) {
250 uint32_t flags;
251 struct frame *f;
252 dm_block_t b;
253
254 r = top_frame(s, &f);
255 if (r)
256 goto out;
257
258 if (f->current_child >= f->nr_children) {
259 pop_frame(s);
260 continue;
261 }
262
263 flags = le32_to_cpu(f->n->header.flags);
264 if (flags & INTERNAL_NODE) {
265 b = value64(f->n, f->current_child);
266 f->current_child++;
267 r = push_frame(s, b, f->level);
268 if (r)
269 goto out;
270
271 } else if (f->level != (info->levels - 1)) {
272 b = value64(f->n, f->current_child);
273 f->current_child++;
274 r = push_frame(s, b, f->level + 1);
275 if (r)
276 goto out;
277
278 } else {
279 if (info->value_type.dec) {
280 unsigned i;
281
282 for (i = 0; i < f->nr_children; i++)
283 info->value_type.dec(info->value_type.context,
284 value_ptr(f->n, i, info->value_type.size));
285 }
286 f->current_child = f->nr_children;
287 }
288 }
289
290out:
291 kfree(s);
292 return r;
293}
294EXPORT_SYMBOL_GPL(dm_btree_del);
295
296/*----------------------------------------------------------------*/
297
298static int btree_lookup_raw(struct ro_spine *s, dm_block_t block, uint64_t key,
299 int (*search_fn)(struct node *, uint64_t),
300 uint64_t *result_key, void *v, size_t value_size)
301{
302 int i, r;
303 uint32_t flags, nr_entries;
304
305 do {
306 r = ro_step(s, block);
307 if (r < 0)
308 return r;
309
310 i = search_fn(ro_node(s), key);
311
312 flags = le32_to_cpu(ro_node(s)->header.flags);
313 nr_entries = le32_to_cpu(ro_node(s)->header.nr_entries);
314 if (i < 0 || i >= nr_entries)
315 return -ENODATA;
316
317 if (flags & INTERNAL_NODE)
318 block = value64(ro_node(s), i);
319
320 } while (!(flags & LEAF_NODE));
321
322 *result_key = le64_to_cpu(ro_node(s)->keys[i]);
323 memcpy(v, value_ptr(ro_node(s), i, value_size), value_size);
324
325 return 0;
326}
327
328int dm_btree_lookup(struct dm_btree_info *info, dm_block_t root,
329 uint64_t *keys, void *value_le)
330{
331 unsigned level, last_level = info->levels - 1;
332 int r = -ENODATA;
333 uint64_t rkey;
334 __le64 internal_value_le;
335 struct ro_spine spine;
336
337 init_ro_spine(&spine, info);
338 for (level = 0; level < info->levels; level++) {
339 size_t size;
340 void *value_p;
341
342 if (level == last_level) {
343 value_p = value_le;
344 size = info->value_type.size;
345
346 } else {
347 value_p = &internal_value_le;
348 size = sizeof(uint64_t);
349 }
350
351 r = btree_lookup_raw(&spine, root, keys[level],
352 lower_bound, &rkey,
353 value_p, size);
354
355 if (!r) {
356 if (rkey != keys[level]) {
357 exit_ro_spine(&spine);
358 return -ENODATA;
359 }
360 } else {
361 exit_ro_spine(&spine);
362 return r;
363 }
364
365 root = le64_to_cpu(internal_value_le);
366 }
367 exit_ro_spine(&spine);
368
369 return r;
370}
371EXPORT_SYMBOL_GPL(dm_btree_lookup);
372
373/*
 374 * Splits a node by creating a sibling node and shifting half the node's
 375 * contents across. Assumes there is a parent node, and that it has room for
376 * another child.
377 *
378 * Before:
379 * +--------+
380 * | Parent |
381 * +--------+
382 * |
383 * v
384 * +----------+
385 * | A ++++++ |
386 * +----------+
387 *
388 *
389 * After:
390 * +--------+
391 * | Parent |
392 * +--------+
393 * | |
394 * v +------+
395 * +---------+ |
396 * | A* +++ | v
397 * +---------+ +-------+
398 * | B +++ |
399 * +-------+
400 *
401 * Where A* is a shadow of A.
402 */
403static int btree_split_sibling(struct shadow_spine *s, dm_block_t root,
404 unsigned parent_index, uint64_t key)
405{
406 int r;
407 size_t size;
408 unsigned nr_left, nr_right;
409 struct dm_block *left, *right, *parent;
410 struct node *ln, *rn, *pn;
411 __le64 location;
412
413 left = shadow_current(s);
414
415 r = new_block(s->info, &right);
416 if (r < 0)
417 return r;
418
419 ln = dm_block_data(left);
420 rn = dm_block_data(right);
421
422 nr_left = le32_to_cpu(ln->header.nr_entries) / 2;
423 nr_right = le32_to_cpu(ln->header.nr_entries) - nr_left;
424
425 ln->header.nr_entries = cpu_to_le32(nr_left);
426
427 rn->header.flags = ln->header.flags;
428 rn->header.nr_entries = cpu_to_le32(nr_right);
429 rn->header.max_entries = ln->header.max_entries;
430 rn->header.value_size = ln->header.value_size;
431 memcpy(rn->keys, ln->keys + nr_left, nr_right * sizeof(rn->keys[0]));
432
433 size = le32_to_cpu(ln->header.flags) & INTERNAL_NODE ?
434 sizeof(uint64_t) : s->info->value_type.size;
435 memcpy(value_ptr(rn, 0, size), value_ptr(ln, nr_left, size),
436 size * nr_right);
437
438 /*
439 * Patch up the parent
440 */
441 parent = shadow_parent(s);
442
443 pn = dm_block_data(parent);
444 location = cpu_to_le64(dm_block_location(left));
445 __dm_bless_for_disk(&location);
446 memcpy_disk(value_ptr(pn, parent_index, sizeof(__le64)),
447 &location, sizeof(__le64));
448
449 location = cpu_to_le64(dm_block_location(right));
450 __dm_bless_for_disk(&location);
451
452 r = insert_at(sizeof(__le64), pn, parent_index + 1,
453 le64_to_cpu(rn->keys[0]), &location);
454 if (r)
455 return r;
456
457 if (key < le64_to_cpu(rn->keys[0])) {
458 unlock_block(s->info, right);
459 s->nodes[1] = left;
460 } else {
461 unlock_block(s->info, left);
462 s->nodes[1] = right;
463 }
464
465 return 0;
466}
467
468/*
469 * Splits a node by creating two new children beneath the given node.
470 *
471 * Before:
472 * +----------+
473 * | A ++++++ |
474 * +----------+
475 *
476 *
477 * After:
478 * +------------+
479 * | A (shadow) |
480 * +------------+
481 * | |
482 * +------+ +----+
483 * | |
484 * v v
485 * +-------+ +-------+
486 * | B +++ | | C +++ |
487 * +-------+ +-------+
488 */
489static int btree_split_beneath(struct shadow_spine *s, uint64_t key)
490{
491 int r;
492 size_t size;
493 unsigned nr_left, nr_right;
494 struct dm_block *left, *right, *new_parent;
495 struct node *pn, *ln, *rn;
496 __le64 val;
497
498 new_parent = shadow_current(s);
499
500 r = new_block(s->info, &left);
501 if (r < 0)
502 return r;
503
504 r = new_block(s->info, &right);
505 if (r < 0) {
506 /* FIXME: put left */
507 return r;
508 }
509
510 pn = dm_block_data(new_parent);
511 ln = dm_block_data(left);
512 rn = dm_block_data(right);
513
514 nr_left = le32_to_cpu(pn->header.nr_entries) / 2;
515 nr_right = le32_to_cpu(pn->header.nr_entries) - nr_left;
516
517 ln->header.flags = pn->header.flags;
518 ln->header.nr_entries = cpu_to_le32(nr_left);
519 ln->header.max_entries = pn->header.max_entries;
520 ln->header.value_size = pn->header.value_size;
521
522 rn->header.flags = pn->header.flags;
523 rn->header.nr_entries = cpu_to_le32(nr_right);
524 rn->header.max_entries = pn->header.max_entries;
525 rn->header.value_size = pn->header.value_size;
526
527 memcpy(ln->keys, pn->keys, nr_left * sizeof(pn->keys[0]));
528 memcpy(rn->keys, pn->keys + nr_left, nr_right * sizeof(pn->keys[0]));
529
530 size = le32_to_cpu(pn->header.flags) & INTERNAL_NODE ?
531 sizeof(__le64) : s->info->value_type.size;
532 memcpy(value_ptr(ln, 0, size), value_ptr(pn, 0, size), nr_left * size);
533 memcpy(value_ptr(rn, 0, size), value_ptr(pn, nr_left, size),
534 nr_right * size);
535
536 /* new_parent should just point to l and r now */
537 pn->header.flags = cpu_to_le32(INTERNAL_NODE);
538 pn->header.nr_entries = cpu_to_le32(2);
539 pn->header.max_entries = cpu_to_le32(
540 calc_max_entries(sizeof(__le64),
541 dm_bm_block_size(
542 dm_tm_get_bm(s->info->tm))));
543 pn->header.value_size = cpu_to_le32(sizeof(__le64));
544
545 val = cpu_to_le64(dm_block_location(left));
546 __dm_bless_for_disk(&val);
547 pn->keys[0] = ln->keys[0];
548 memcpy_disk(value_ptr(pn, 0, sizeof(__le64)), &val, sizeof(__le64));
549
550 val = cpu_to_le64(dm_block_location(right));
551 __dm_bless_for_disk(&val);
552 pn->keys[1] = rn->keys[0];
553 memcpy_disk(value_ptr(pn, 1, sizeof(__le64)), &val, sizeof(__le64));
554
555 /*
 556 * Rejig the spine. This is ugly, since it knows too
 557 * much about the spine.
558 */
559 if (s->nodes[0] != new_parent) {
560 unlock_block(s->info, s->nodes[0]);
561 s->nodes[0] = new_parent;
562 }
563 if (key < le64_to_cpu(rn->keys[0])) {
564 unlock_block(s->info, right);
565 s->nodes[1] = left;
566 } else {
567 unlock_block(s->info, left);
568 s->nodes[1] = right;
569 }
570 s->count = 2;
571
572 return 0;
573}
574
575static int btree_insert_raw(struct shadow_spine *s, dm_block_t root,
576 struct dm_btree_value_type *vt,
577 uint64_t key, unsigned *index)
578{
579 int r, i = *index, top = 1;
580 struct node *node;
581
582 for (;;) {
583 r = shadow_step(s, root, vt);
584 if (r < 0)
585 return r;
586
587 node = dm_block_data(shadow_current(s));
588
589 /*
 590 * We have to patch up the parent node. This is ugly, but I don't
 591 * see a way to do it automatically as part of the spine
592 * op.
593 */
 594		if (shadow_has_parent(s) && i >= 0) { /* FIXME: second clause is unnecessary. */
595 __le64 location = cpu_to_le64(dm_block_location(shadow_current(s)));
596
597 __dm_bless_for_disk(&location);
598 memcpy_disk(value_ptr(dm_block_data(shadow_parent(s)), i, sizeof(uint64_t)),
599 &location, sizeof(__le64));
600 }
601
602 node = dm_block_data(shadow_current(s));
603
604 if (node->header.nr_entries == node->header.max_entries) {
605 if (top)
606 r = btree_split_beneath(s, key);
607 else
608 r = btree_split_sibling(s, root, i, key);
609
610 if (r < 0)
611 return r;
612 }
613
614 node = dm_block_data(shadow_current(s));
615
616 i = lower_bound(node, key);
617
618 if (le32_to_cpu(node->header.flags) & LEAF_NODE)
619 break;
620
621 if (i < 0) {
622 /* change the bounds on the lowest key */
623 node->keys[0] = cpu_to_le64(key);
624 i = 0;
625 }
626
627 root = value64(node, i);
628 top = 0;
629 }
630
631 if (i < 0 || le64_to_cpu(node->keys[i]) != key)
632 i++;
633
634 *index = i;
635 return 0;
636}
637
638static int insert(struct dm_btree_info *info, dm_block_t root,
639 uint64_t *keys, void *value, dm_block_t *new_root,
640 int *inserted)
641 __dm_written_to_disk(value)
642{
643 int r, need_insert;
644 unsigned level, index = -1, last_level = info->levels - 1;
645 dm_block_t block = root;
646 struct shadow_spine spine;
647 struct node *n;
648 struct dm_btree_value_type le64_type;
649
650 le64_type.context = NULL;
651 le64_type.size = sizeof(__le64);
652 le64_type.inc = NULL;
653 le64_type.dec = NULL;
654 le64_type.equal = NULL;
655
656 init_shadow_spine(&spine, info);
657
658 for (level = 0; level < (info->levels - 1); level++) {
659 r = btree_insert_raw(&spine, block, &le64_type, keys[level], &index);
660 if (r < 0)
661 goto bad;
662
663 n = dm_block_data(shadow_current(&spine));
664 need_insert = ((index >= le32_to_cpu(n->header.nr_entries)) ||
665 (le64_to_cpu(n->keys[index]) != keys[level]));
666
667 if (need_insert) {
668 dm_block_t new_tree;
669 __le64 new_le;
670
671 r = dm_btree_empty(info, &new_tree);
672 if (r < 0)
673 goto bad;
674
675 new_le = cpu_to_le64(new_tree);
676 __dm_bless_for_disk(&new_le);
677
678 r = insert_at(sizeof(uint64_t), n, index,
679 keys[level], &new_le);
680 if (r)
681 goto bad;
682 }
683
684 if (level < last_level)
685 block = value64(n, index);
686 }
687
688 r = btree_insert_raw(&spine, block, &info->value_type,
689 keys[level], &index);
690 if (r < 0)
691 goto bad;
692
693 n = dm_block_data(shadow_current(&spine));
694 need_insert = ((index >= le32_to_cpu(n->header.nr_entries)) ||
695 (le64_to_cpu(n->keys[index]) != keys[level]));
696
697 if (need_insert) {
698 if (inserted)
699 *inserted = 1;
700
701 r = insert_at(info->value_type.size, n, index,
702 keys[level], value);
703 if (r)
704 goto bad_unblessed;
705 } else {
706 if (inserted)
707 *inserted = 0;
708
709 if (info->value_type.dec &&
710 (!info->value_type.equal ||
711 !info->value_type.equal(
712 info->value_type.context,
713 value_ptr(n, index, info->value_type.size),
714 value))) {
715 info->value_type.dec(info->value_type.context,
716 value_ptr(n, index, info->value_type.size));
717 }
718 memcpy_disk(value_ptr(n, index, info->value_type.size),
719 value, info->value_type.size);
720 }
721
722 *new_root = shadow_root(&spine);
723 exit_shadow_spine(&spine);
724
725 return 0;
726
727bad:
728 __dm_unbless_for_disk(value);
729bad_unblessed:
730 exit_shadow_spine(&spine);
731 return r;
732}
733
734int dm_btree_insert(struct dm_btree_info *info, dm_block_t root,
735 uint64_t *keys, void *value, dm_block_t *new_root)
736 __dm_written_to_disk(value)
737{
738 return insert(info, root, keys, value, new_root, NULL);
739}
740EXPORT_SYMBOL_GPL(dm_btree_insert);
741
742int dm_btree_insert_notify(struct dm_btree_info *info, dm_block_t root,
743 uint64_t *keys, void *value, dm_block_t *new_root,
744 int *inserted)
745 __dm_written_to_disk(value)
746{
747 return insert(info, root, keys, value, new_root, inserted);
748}
749EXPORT_SYMBOL_GPL(dm_btree_insert_notify);
750
751/*----------------------------------------------------------------*/
752
753static int find_highest_key(struct ro_spine *s, dm_block_t block,
754 uint64_t *result_key, dm_block_t *next_block)
755{
756 int i, r;
757 uint32_t flags;
758
759 do {
760 r = ro_step(s, block);
761 if (r < 0)
762 return r;
763
764 flags = le32_to_cpu(ro_node(s)->header.flags);
765 i = le32_to_cpu(ro_node(s)->header.nr_entries);
766 if (!i)
767 return -ENODATA;
768 else
769 i--;
770
771 *result_key = le64_to_cpu(ro_node(s)->keys[i]);
772 if (next_block || flags & INTERNAL_NODE)
773 block = value64(ro_node(s), i);
774
775 } while (flags & INTERNAL_NODE);
776
777 if (next_block)
778 *next_block = block;
779 return 0;
780}
781
782int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
783 uint64_t *result_keys)
784{
785 int r = 0, count = 0, level;
786 struct ro_spine spine;
787
788 init_ro_spine(&spine, info);
789 for (level = 0; level < info->levels; level++) {
790 r = find_highest_key(&spine, root, result_keys + level,
791 level == info->levels - 1 ? NULL : &root);
792 if (r == -ENODATA) {
793 r = 0;
794 break;
795
796 } else if (r)
797 break;
798
799 count++;
800 }
801 exit_ro_spine(&spine);
802
803 return r ? r : count;
804}
805EXPORT_SYMBOL_GPL(dm_btree_find_highest_key);
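As a concrete illustration of the calc_max_entries() arithmetic above, the userspace sketch below computes the entry count for a 4096-byte metadata block holding 8-byte (__le64) values. The 32-byte header size is an assumption made for the example; the real figure comes from struct node_header in dm-btree-internal.h, which is not shown in this patch.

#include <stdint.h>
#include <stdio.h>

static uint32_t calc_max_entries(size_t value_size, size_t block_size,
				 size_t header_size)
{
	uint32_t total, n;
	size_t elt_size = sizeof(uint64_t) + value_size;	/* key + value */

	block_size -= header_size;
	total = block_size / elt_size;
	n = total / 3;			/* rounds down */

	return 3 * n;			/* always a multiple of 3 */
}

int main(void)
{
	printf("max_entries = %u\n",
	       calc_max_entries(sizeof(uint64_t), 4096, 32 /* assumed */));
	return 0;
}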
diff --git a/drivers/md/persistent-data/dm-btree.h b/drivers/md/persistent-data/dm-btree.h
new file mode 100644
index 00000000000..ae02c84410f
--- /dev/null
+++ b/drivers/md/persistent-data/dm-btree.h
@@ -0,0 +1,145 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6#ifndef _LINUX_DM_BTREE_H
7#define _LINUX_DM_BTREE_H
8
9#include "dm-block-manager.h"
10
11struct dm_transaction_manager;
12
13/*----------------------------------------------------------------*/
14
15/*
 16 * Annotations used to check that on-disk metadata is handled as little-endian.
17 */
18#ifdef __CHECKER__
19# define __dm_written_to_disk(x) __releases(x)
20# define __dm_reads_from_disk(x) __acquires(x)
21# define __dm_bless_for_disk(x) __acquire(x)
22# define __dm_unbless_for_disk(x) __release(x)
23#else
24# define __dm_written_to_disk(x)
25# define __dm_reads_from_disk(x)
26# define __dm_bless_for_disk(x)
27# define __dm_unbless_for_disk(x)
28#endif
29
30/*----------------------------------------------------------------*/
31
32/*
33 * Manipulates hierarchical B+ trees with 64-bit keys and arbitrary-sized
34 * values.
35 */
36
37/*
 38 * Information about the values stored within the btree.
39 */
40struct dm_btree_value_type {
41 void *context;
42
43 /*
44 * The size in bytes of each value.
45 */
46 uint32_t size;
47
48 /*
49 * Any of these methods can be safely set to NULL if you do not
50 * need the corresponding feature.
51 */
52
53 /*
54 * The btree is making a duplicate of the value, for instance
55 * because previously-shared btree nodes have now diverged.
56 * @value argument is the new copy that the copy function may modify.
57 * (Probably it just wants to increment a reference count
58 * somewhere.) This method is _not_ called for insertion of a new
59 * value: It is assumed the ref count is already 1.
60 */
61 void (*inc)(void *context, void *value);
62
63 /*
64 * This value is being deleted. The btree takes care of freeing
65 * the memory pointed to by @value. Often the del function just
66 * needs to decrement a reference count somewhere.
67 */
68 void (*dec)(void *context, void *value);
69
70 /*
71 * A test for equality between two values. When a value is
72 * overwritten with a new one, the old one has the dec method
73 * called _unless_ the new and old value are deemed equal.
74 */
75 int (*equal)(void *context, void *value1, void *value2);
76};
77
78/*
79 * The shape and contents of a btree.
80 */
81struct dm_btree_info {
82 struct dm_transaction_manager *tm;
83
84 /*
85 * Number of nested btrees. (Not the depth of a single tree.)
86 */
87 unsigned levels;
88 struct dm_btree_value_type value_type;
89};
90
91/*
92 * Set up an empty tree. O(1).
93 */
94int dm_btree_empty(struct dm_btree_info *info, dm_block_t *root);
95
96/*
97 * Delete a tree. O(n) - this is the slow one! It can also block, so
98 * please don't call it on an IO path.
99 */
100int dm_btree_del(struct dm_btree_info *info, dm_block_t root);
101
102/*
103 * All the lookup functions return -ENODATA if the key cannot be found.
104 */
105
106/*
107 * Tries to find a key that matches exactly. O(ln(n))
108 */
109int dm_btree_lookup(struct dm_btree_info *info, dm_block_t root,
110 uint64_t *keys, void *value_le);
111
112/*
113 * Insertion (or overwrite an existing value). O(ln(n))
114 */
115int dm_btree_insert(struct dm_btree_info *info, dm_block_t root,
116 uint64_t *keys, void *value, dm_block_t *new_root)
117 __dm_written_to_disk(value);
118
119/*
120 * A variant of insert that indicates whether it actually inserted or just
121 * overwrote. Useful if you're keeping track of the number of entries in a
122 * tree.
123 */
124int dm_btree_insert_notify(struct dm_btree_info *info, dm_block_t root,
125 uint64_t *keys, void *value, dm_block_t *new_root,
126 int *inserted)
127 __dm_written_to_disk(value);
128
129/*
 130 * Remove a key if present. This doesn't remove empty subtrees. Normally
131 * subtrees represent a separate entity, like a snapshot map, so this is
132 * correct behaviour. O(ln(n)).
133 */
134int dm_btree_remove(struct dm_btree_info *info, dm_block_t root,
135 uint64_t *keys, dm_block_t *new_root);
136
137/*
138 * Returns < 0 on failure. Otherwise the number of key entries that have
139 * been filled out. Remember trees can have zero entries, and as such have
140 * no highest key.
141 */
142int dm_btree_find_highest_key(struct dm_btree_info *info, dm_block_t root,
143 uint64_t *result_keys);
144
145#endif /* _LINUX_DM_BTREE_H */
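A usage sketch of this interface (not part of the patch): it builds a single-level btree mapping a 64-bit key to a little-endian 64-bit value, and assumes the caller already has a struct dm_transaction_manager set up via the block manager, which is outside this header's scope.

#include "dm-btree.h"

static int example_btree(struct dm_transaction_manager *tm)
{
	int r;
	dm_block_t root;
	uint64_t key = 42;
	__le64 value = cpu_to_le64(1234), value_out;
	struct dm_btree_info info = {
		.tm = tm,
		.levels = 1,			/* one tree, not nested */
		.value_type = {
			.context = NULL,
			.size = sizeof(__le64),
			.inc = NULL,		/* plain values: no refcounting needed */
			.dec = NULL,
			.equal = NULL,
		},
	};

	r = dm_btree_empty(&info, &root);
	if (r)
		return r;

	__dm_bless_for_disk(&value);
	r = dm_btree_insert(&info, root, &key, &value, &root);
	if (r)
		return r;

	/* value_out comes back little-endian, exactly as stored */
	return dm_btree_lookup(&info, root, &key, &value_out);
}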
diff --git a/drivers/md/persistent-data/dm-persistent-data-internal.h b/drivers/md/persistent-data/dm-persistent-data-internal.h
new file mode 100644
index 00000000000..c49e26fff36
--- /dev/null
+++ b/drivers/md/persistent-data/dm-persistent-data-internal.h
@@ -0,0 +1,19 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef _DM_PERSISTENT_DATA_INTERNAL_H
8#define _DM_PERSISTENT_DATA_INTERNAL_H
9
10#include "dm-block-manager.h"
11
12static inline unsigned dm_hash_block(dm_block_t b, unsigned hash_mask)
13{
14 const unsigned BIG_PRIME = 4294967291UL;
15
16 return (((unsigned) b) * BIG_PRIME) & hash_mask;
17}
18
 19#endif /* _DM_PERSISTENT_DATA_INTERNAL_H */
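A small usage sketch (not part of the patch): dm_hash_block() expects hash_mask to be one less than a power-of-two table size, so the multiplied block number folds into a valid bucket index. The table size of 64 is invented for the example.

#include "dm-persistent-data-internal.h"

#define EXAMPLE_HASH_SIZE 64	/* must be a power of two */

static unsigned example_bucket(dm_block_t b)
{
	return dm_hash_block(b, EXAMPLE_HASH_SIZE - 1);	/* yields 0..63 */
}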
diff --git a/drivers/md/persistent-data/dm-space-map-checker.c b/drivers/md/persistent-data/dm-space-map-checker.c
new file mode 100644
index 00000000000..bb44a937fe6
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-checker.c
@@ -0,0 +1,437 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-space-map-checker.h"
8
9#include <linux/device-mapper.h>
10
11#ifdef CONFIG_DM_DEBUG_SPACE_MAPS
12
13#define DM_MSG_PREFIX "space map checker"
14
15/*----------------------------------------------------------------*/
16
17struct count_array {
18 dm_block_t nr;
19 dm_block_t nr_free;
20
21 uint32_t *counts;
22};
23
24static int ca_get_count(struct count_array *ca, dm_block_t b, uint32_t *count)
25{
26 if (b >= ca->nr)
27 return -EINVAL;
28
29 *count = ca->counts[b];
30 return 0;
31}
32
33static int ca_count_more_than_one(struct count_array *ca, dm_block_t b, int *r)
34{
35 if (b >= ca->nr)
36 return -EINVAL;
37
38 *r = ca->counts[b] > 1;
39 return 0;
40}
41
42static int ca_set_count(struct count_array *ca, dm_block_t b, uint32_t count)
43{
44 uint32_t old_count;
45
46 if (b >= ca->nr)
47 return -EINVAL;
48
49 old_count = ca->counts[b];
50
51 if (!count && old_count)
52 ca->nr_free++;
53
54 else if (count && !old_count)
55 ca->nr_free--;
56
57 ca->counts[b] = count;
58 return 0;
59}
60
61static int ca_inc_block(struct count_array *ca, dm_block_t b)
62{
63 if (b >= ca->nr)
64 return -EINVAL;
65
66 ca_set_count(ca, b, ca->counts[b] + 1);
67 return 0;
68}
69
70static int ca_dec_block(struct count_array *ca, dm_block_t b)
71{
72 if (b >= ca->nr)
73 return -EINVAL;
74
75 BUG_ON(ca->counts[b] == 0);
76 ca_set_count(ca, b, ca->counts[b] - 1);
77 return 0;
78}
79
80static int ca_create(struct count_array *ca, struct dm_space_map *sm)
81{
82 int r;
83 dm_block_t nr_blocks;
84
85 r = dm_sm_get_nr_blocks(sm, &nr_blocks);
86 if (r)
87 return r;
88
89 ca->nr = nr_blocks;
90 ca->nr_free = nr_blocks;
91 ca->counts = kzalloc(sizeof(*ca->counts) * nr_blocks, GFP_KERNEL);
92 if (!ca->counts)
93 return -ENOMEM;
94
95 return 0;
96}
97
98static int ca_load(struct count_array *ca, struct dm_space_map *sm)
99{
100 int r;
101 uint32_t count;
102 dm_block_t nr_blocks, i;
103
104 r = dm_sm_get_nr_blocks(sm, &nr_blocks);
105 if (r)
106 return r;
107
108 BUG_ON(ca->nr != nr_blocks);
109
110 DMWARN("Loading debug space map from disk. This may take some time");
111 for (i = 0; i < nr_blocks; i++) {
112 r = dm_sm_get_count(sm, i, &count);
113 if (r) {
114 DMERR("load failed");
115 return r;
116 }
117
118 ca_set_count(ca, i, count);
119 }
120 DMWARN("Load complete");
121
122 return 0;
123}
124
125static int ca_extend(struct count_array *ca, dm_block_t extra_blocks)
126{
127 dm_block_t nr_blocks = ca->nr + extra_blocks;
128 uint32_t *counts = kzalloc(sizeof(*counts) * nr_blocks, GFP_KERNEL);
129 if (!counts)
130 return -ENOMEM;
131
132 memcpy(counts, ca->counts, sizeof(*counts) * ca->nr);
133 kfree(ca->counts);
134 ca->nr = nr_blocks;
135 ca->nr_free += extra_blocks;
136 ca->counts = counts;
137 return 0;
138}
139
140static int ca_commit(struct count_array *old, struct count_array *new)
141{
142 if (old->nr != new->nr) {
143 BUG_ON(old->nr > new->nr);
144 ca_extend(old, new->nr - old->nr);
145 }
146
147 BUG_ON(old->nr != new->nr);
148 old->nr_free = new->nr_free;
149 memcpy(old->counts, new->counts, sizeof(*old->counts) * old->nr);
150 return 0;
151}
152
153static void ca_destroy(struct count_array *ca)
154{
155 kfree(ca->counts);
156}
157
158/*----------------------------------------------------------------*/
159
160struct sm_checker {
161 struct dm_space_map sm;
162
163 struct count_array old_counts;
164 struct count_array counts;
165
166 struct dm_space_map *real_sm;
167};
168
169static void sm_checker_destroy(struct dm_space_map *sm)
170{
171 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
172
173 dm_sm_destroy(smc->real_sm);
174 ca_destroy(&smc->old_counts);
175 ca_destroy(&smc->counts);
176 kfree(smc);
177}
178
179static int sm_checker_get_nr_blocks(struct dm_space_map *sm, dm_block_t *count)
180{
181 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
182 int r = dm_sm_get_nr_blocks(smc->real_sm, count);
183 if (!r)
184 BUG_ON(smc->old_counts.nr != *count);
185 return r;
186}
187
188static int sm_checker_get_nr_free(struct dm_space_map *sm, dm_block_t *count)
189{
190 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
191 int r = dm_sm_get_nr_free(smc->real_sm, count);
192 if (!r) {
193 /*
194 * Slow, but we know it's correct.
195 */
196 dm_block_t b, n = 0;
197 for (b = 0; b < smc->old_counts.nr; b++)
198 if (smc->old_counts.counts[b] == 0 &&
199 smc->counts.counts[b] == 0)
200 n++;
201
202 if (n != *count)
203 DMERR("free block counts differ, checker %u, sm-disk:%u",
204 (unsigned) n, (unsigned) *count);
205 }
206 return r;
207}
208
209static int sm_checker_new_block(struct dm_space_map *sm, dm_block_t *b)
210{
211 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
212 int r = dm_sm_new_block(smc->real_sm, b);
213
214 if (!r) {
215 BUG_ON(*b >= smc->old_counts.nr);
216 BUG_ON(smc->old_counts.counts[*b] != 0);
217 BUG_ON(*b >= smc->counts.nr);
218 BUG_ON(smc->counts.counts[*b] != 0);
219 ca_set_count(&smc->counts, *b, 1);
220 }
221
222 return r;
223}
224
225static int sm_checker_inc_block(struct dm_space_map *sm, dm_block_t b)
226{
227 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
228 int r = dm_sm_inc_block(smc->real_sm, b);
229 int r2 = ca_inc_block(&smc->counts, b);
230 BUG_ON(r != r2);
231 return r;
232}
233
234static int sm_checker_dec_block(struct dm_space_map *sm, dm_block_t b)
235{
236 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
237 int r = dm_sm_dec_block(smc->real_sm, b);
238 int r2 = ca_dec_block(&smc->counts, b);
239 BUG_ON(r != r2);
240 return r;
241}
242
243static int sm_checker_get_count(struct dm_space_map *sm, dm_block_t b, uint32_t *result)
244{
245 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
246 uint32_t result2 = 0;
247 int r = dm_sm_get_count(smc->real_sm, b, result);
248 int r2 = ca_get_count(&smc->counts, b, &result2);
249
250 BUG_ON(r != r2);
251 if (!r)
252 BUG_ON(*result != result2);
253 return r;
254}
255
256static int sm_checker_count_more_than_one(struct dm_space_map *sm, dm_block_t b, int *result)
257{
258 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
259 int result2 = 0;
260 int r = dm_sm_count_is_more_than_one(smc->real_sm, b, result);
261 int r2 = ca_count_more_than_one(&smc->counts, b, &result2);
262
263 BUG_ON(r != r2);
264 if (!r)
265 BUG_ON(!(*result) && result2);
266 return r;
267}
268
269static int sm_checker_set_count(struct dm_space_map *sm, dm_block_t b, uint32_t count)
270{
271 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
272 uint32_t old_rc;
273 int r = dm_sm_set_count(smc->real_sm, b, count);
274 int r2;
275
276 BUG_ON(b >= smc->counts.nr);
277 old_rc = smc->counts.counts[b];
278 r2 = ca_set_count(&smc->counts, b, count);
279 BUG_ON(r != r2);
280
281 return r;
282}
283
284static int sm_checker_commit(struct dm_space_map *sm)
285{
286 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
287 int r;
288
289 r = dm_sm_commit(smc->real_sm);
290 if (r)
291 return r;
292
293 r = ca_commit(&smc->old_counts, &smc->counts);
294 if (r)
295 return r;
296
297 return 0;
298}
299
300static int sm_checker_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
301{
302 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
303 int r = dm_sm_extend(smc->real_sm, extra_blocks);
304 if (r)
305 return r;
306
307 return ca_extend(&smc->counts, extra_blocks);
308}
309
310static int sm_checker_root_size(struct dm_space_map *sm, size_t *result)
311{
312 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
313 return dm_sm_root_size(smc->real_sm, result);
314}
315
316static int sm_checker_copy_root(struct dm_space_map *sm, void *copy_to_here_le, size_t len)
317{
318 struct sm_checker *smc = container_of(sm, struct sm_checker, sm);
319 return dm_sm_copy_root(smc->real_sm, copy_to_here_le, len);
320}
321
322/*----------------------------------------------------------------*/
323
324static struct dm_space_map ops_ = {
325 .destroy = sm_checker_destroy,
326 .get_nr_blocks = sm_checker_get_nr_blocks,
327 .get_nr_free = sm_checker_get_nr_free,
328 .inc_block = sm_checker_inc_block,
329 .dec_block = sm_checker_dec_block,
330 .new_block = sm_checker_new_block,
331 .get_count = sm_checker_get_count,
332 .count_is_more_than_one = sm_checker_count_more_than_one,
333 .set_count = sm_checker_set_count,
334 .commit = sm_checker_commit,
335 .extend = sm_checker_extend,
336 .root_size = sm_checker_root_size,
337 .copy_root = sm_checker_copy_root
338};
339
340struct dm_space_map *dm_sm_checker_create(struct dm_space_map *sm)
341{
342 int r;
343 struct sm_checker *smc;
344
345 if (!sm)
346 return NULL;
347
348 smc = kmalloc(sizeof(*smc), GFP_KERNEL);
349 if (!smc)
350 return NULL;
351
352 memcpy(&smc->sm, &ops_, sizeof(smc->sm));
353 r = ca_create(&smc->old_counts, sm);
354 if (r) {
355 kfree(smc);
356 return NULL;
357 }
358
359 r = ca_create(&smc->counts, sm);
360 if (r) {
361 ca_destroy(&smc->old_counts);
362 kfree(smc);
363 return NULL;
364 }
365
366 smc->real_sm = sm;
367
368 r = ca_load(&smc->counts, sm);
369 if (r) {
370 ca_destroy(&smc->counts);
371 ca_destroy(&smc->old_counts);
372 kfree(smc);
373 return NULL;
374 }
375
376 r = ca_commit(&smc->old_counts, &smc->counts);
377 if (r) {
378 ca_destroy(&smc->counts);
379 ca_destroy(&smc->old_counts);
380 kfree(smc);
381 return NULL;
382 }
383
384 return &smc->sm;
385}
386EXPORT_SYMBOL_GPL(dm_sm_checker_create);
387
388struct dm_space_map *dm_sm_checker_create_fresh(struct dm_space_map *sm)
389{
390 int r;
391 struct sm_checker *smc;
392
393 if (!sm)
394 return NULL;
395
396 smc = kmalloc(sizeof(*smc), GFP_KERNEL);
397 if (!smc)
398 return NULL;
399
400 memcpy(&smc->sm, &ops_, sizeof(smc->sm));
401 r = ca_create(&smc->old_counts, sm);
402 if (r) {
403 kfree(smc);
404 return NULL;
405 }
406
407 r = ca_create(&smc->counts, sm);
408 if (r) {
409 ca_destroy(&smc->old_counts);
410 kfree(smc);
411 return NULL;
412 }
413
414 smc->real_sm = sm;
415 return &smc->sm;
416}
417EXPORT_SYMBOL_GPL(dm_sm_checker_create_fresh);
418
419/*----------------------------------------------------------------*/
420
421#else
422
423struct dm_space_map *dm_sm_checker_create(struct dm_space_map *sm)
424{
425 return sm;
426}
427EXPORT_SYMBOL_GPL(dm_sm_checker_create);
428
429struct dm_space_map *dm_sm_checker_create_fresh(struct dm_space_map *sm)
430{
431 return sm;
432}
433EXPORT_SYMBOL_GPL(dm_sm_checker_create_fresh);
434
435/*----------------------------------------------------------------*/
436
437#endif
diff --git a/drivers/md/persistent-data/dm-space-map-checker.h b/drivers/md/persistent-data/dm-space-map-checker.h
new file mode 100644
index 00000000000..444dccf6688
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-checker.h
@@ -0,0 +1,26 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef SNAPSHOTS_SPACE_MAP_CHECKER_H
8#define SNAPSHOTS_SPACE_MAP_CHECKER_H
9
10#include "dm-space-map.h"
11
12/*----------------------------------------------------------------*/
13
14/*
15 * This space map wraps a real on-disk space map, and verifies all of its
16 * operations. It uses a lot of memory, so only use if you have a specific
17 * problem that you're debugging.
18 *
19 * Ownership of @sm passes.
20 */
21struct dm_space_map *dm_sm_checker_create(struct dm_space_map *sm);
22struct dm_space_map *dm_sm_checker_create_fresh(struct dm_space_map *sm);
23
24/*----------------------------------------------------------------*/
25
26#endif
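A usage sketch (not part of the patch) of wrapping an existing space map with the checker; where real_sm comes from is assumed and outside this header's scope. With CONFIG_DM_DEBUG_SPACE_MAPS disabled the call simply returns the space map it was given.

#include "dm-space-map-checker.h"

static struct dm_space_map *wrap_with_checker(struct dm_space_map *real_sm)
{
	/*
	 * On success the checker takes ownership of real_sm and verifies
	 * every reference-count operation against its shadow counts.  On
	 * failure it returns NULL and real_sm is left untouched.
	 */
	return dm_sm_checker_create(real_sm);
}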
diff --git a/drivers/md/persistent-data/dm-space-map-common.c b/drivers/md/persistent-data/dm-space-map-common.c
new file mode 100644
index 00000000000..df2494c06cd
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-common.c
@@ -0,0 +1,705 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-space-map-common.h"
8#include "dm-transaction-manager.h"
9
10#include <linux/bitops.h>
11#include <linux/device-mapper.h>
12
13#define DM_MSG_PREFIX "space map common"
14
15/*----------------------------------------------------------------*/
16
17/*
18 * Index validator.
19 */
20#define INDEX_CSUM_XOR 160478
21
22static void index_prepare_for_write(struct dm_block_validator *v,
23 struct dm_block *b,
24 size_t block_size)
25{
26 struct disk_metadata_index *mi_le = dm_block_data(b);
27
28 mi_le->blocknr = cpu_to_le64(dm_block_location(b));
29 mi_le->csum = cpu_to_le32(dm_bm_checksum(&mi_le->padding,
30 block_size - sizeof(__le32),
31 INDEX_CSUM_XOR));
32}
33
34static int index_check(struct dm_block_validator *v,
35 struct dm_block *b,
36 size_t block_size)
37{
38 struct disk_metadata_index *mi_le = dm_block_data(b);
39 __le32 csum_disk;
40
41 if (dm_block_location(b) != le64_to_cpu(mi_le->blocknr)) {
42 DMERR("index_check failed blocknr %llu wanted %llu",
43 le64_to_cpu(mi_le->blocknr), dm_block_location(b));
44 return -ENOTBLK;
45 }
46
47 csum_disk = cpu_to_le32(dm_bm_checksum(&mi_le->padding,
48 block_size - sizeof(__le32),
49 INDEX_CSUM_XOR));
50 if (csum_disk != mi_le->csum) {
51 DMERR("index_check failed csum %u wanted %u",
52 le32_to_cpu(csum_disk), le32_to_cpu(mi_le->csum));
53 return -EILSEQ;
54 }
55
56 return 0;
57}
58
59static struct dm_block_validator index_validator = {
60 .name = "index",
61 .prepare_for_write = index_prepare_for_write,
62 .check = index_check
63};
64
65/*----------------------------------------------------------------*/
66
67/*
68 * Bitmap validator
69 */
70#define BITMAP_CSUM_XOR 240779
71
72static void bitmap_prepare_for_write(struct dm_block_validator *v,
73 struct dm_block *b,
74 size_t block_size)
75{
76 struct disk_bitmap_header *disk_header = dm_block_data(b);
77
78 disk_header->blocknr = cpu_to_le64(dm_block_location(b));
79 disk_header->csum = cpu_to_le32(dm_bm_checksum(&disk_header->not_used,
80 block_size - sizeof(__le32),
81 BITMAP_CSUM_XOR));
82}
83
84static int bitmap_check(struct dm_block_validator *v,
85 struct dm_block *b,
86 size_t block_size)
87{
88 struct disk_bitmap_header *disk_header = dm_block_data(b);
89 __le32 csum_disk;
90
91 if (dm_block_location(b) != le64_to_cpu(disk_header->blocknr)) {
92 DMERR("bitmap check failed blocknr %llu wanted %llu",
93 le64_to_cpu(disk_header->blocknr), dm_block_location(b));
94 return -ENOTBLK;
95 }
96
97 csum_disk = cpu_to_le32(dm_bm_checksum(&disk_header->not_used,
98 block_size - sizeof(__le32),
99 BITMAP_CSUM_XOR));
100 if (csum_disk != disk_header->csum) {
101 DMERR("bitmap check failed csum %u wanted %u",
102 le32_to_cpu(csum_disk), le32_to_cpu(disk_header->csum));
103 return -EILSEQ;
104 }
105
106 return 0;
107}
108
109static struct dm_block_validator dm_sm_bitmap_validator = {
110 .name = "sm_bitmap",
111 .prepare_for_write = bitmap_prepare_for_write,
112 .check = bitmap_check
113};
114
115/*----------------------------------------------------------------*/
116
117#define ENTRIES_PER_WORD 32
118#define ENTRIES_SHIFT 5
119
120static void *dm_bitmap_data(struct dm_block *b)
121{
122 return dm_block_data(b) + sizeof(struct disk_bitmap_header);
123}
124
125#define WORD_MASK_HIGH 0xAAAAAAAAAAAAAAAAULL
126
127static unsigned bitmap_word_used(void *addr, unsigned b)
128{
129 __le64 *words_le = addr;
130 __le64 *w_le = words_le + (b >> ENTRIES_SHIFT);
131
132 uint64_t bits = le64_to_cpu(*w_le);
133 uint64_t mask = (bits + WORD_MASK_HIGH + 1) & WORD_MASK_HIGH;
134
135 return !(~bits & mask);
136}
137
138static unsigned sm_lookup_bitmap(void *addr, unsigned b)
139{
140 __le64 *words_le = addr;
141 __le64 *w_le = words_le + (b >> ENTRIES_SHIFT);
142 unsigned hi, lo;
143
144 b = (b & (ENTRIES_PER_WORD - 1)) << 1;
145 hi = !!test_bit_le(b, (void *) w_le);
146 lo = !!test_bit_le(b + 1, (void *) w_le);
147 return (hi << 1) | lo;
148}
149
150static void sm_set_bitmap(void *addr, unsigned b, unsigned val)
151{
152 __le64 *words_le = addr;
153 __le64 *w_le = words_le + (b >> ENTRIES_SHIFT);
154
155 b = (b & (ENTRIES_PER_WORD - 1)) << 1;
156
157 if (val & 2)
158 __set_bit_le(b, (void *) w_le);
159 else
160 __clear_bit_le(b, (void *) w_le);
161
162 if (val & 1)
163 __set_bit_le(b + 1, (void *) w_le);
164 else
165 __clear_bit_le(b + 1, (void *) w_le);
166}
167
168static int sm_find_free(void *addr, unsigned begin, unsigned end,
169 unsigned *result)
170{
171 while (begin < end) {
172 if (!(begin & (ENTRIES_PER_WORD - 1)) &&
173 bitmap_word_used(addr, begin)) {
174 begin += ENTRIES_PER_WORD;
175 continue;
176 }
177
178 if (!sm_lookup_bitmap(addr, begin)) {
179 *result = begin;
180 return 0;
181 }
182
183 begin++;
184 }
185
186 return -ENOSPC;
187}
188
189/*----------------------------------------------------------------*/
190
191static int sm_ll_init(struct ll_disk *ll, struct dm_transaction_manager *tm)
192{
193 ll->tm = tm;
194
195 ll->bitmap_info.tm = tm;
196 ll->bitmap_info.levels = 1;
197
198 /*
199 * Because the new bitmap blocks are created via a shadow
200 * operation, the old entry has already had its reference count
201 * decremented and we don't need the btree to do any bookkeeping.
202 */
203 ll->bitmap_info.value_type.size = sizeof(struct disk_index_entry);
204 ll->bitmap_info.value_type.inc = NULL;
205 ll->bitmap_info.value_type.dec = NULL;
206 ll->bitmap_info.value_type.equal = NULL;
207
208 ll->ref_count_info.tm = tm;
209 ll->ref_count_info.levels = 1;
210 ll->ref_count_info.value_type.size = sizeof(uint32_t);
211 ll->ref_count_info.value_type.inc = NULL;
212 ll->ref_count_info.value_type.dec = NULL;
213 ll->ref_count_info.value_type.equal = NULL;
214
215 ll->block_size = dm_bm_block_size(dm_tm_get_bm(tm));
216
217 if (ll->block_size > (1 << 30)) {
218 DMERR("block size too big to hold bitmaps");
219 return -EINVAL;
220 }
221
222 ll->entries_per_block = (ll->block_size - sizeof(struct disk_bitmap_header)) *
223 ENTRIES_PER_BYTE;
224 ll->nr_blocks = 0;
225 ll->bitmap_root = 0;
226 ll->ref_count_root = 0;
227
228 return 0;
229}
230
231int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks)
232{
233 int r;
234 dm_block_t i, nr_blocks, nr_indexes;
235 unsigned old_blocks, blocks;
236
237 nr_blocks = ll->nr_blocks + extra_blocks;
238 old_blocks = dm_sector_div_up(ll->nr_blocks, ll->entries_per_block);
239 blocks = dm_sector_div_up(nr_blocks, ll->entries_per_block);
240
241 nr_indexes = dm_sector_div_up(nr_blocks, ll->entries_per_block);
242 if (nr_indexes > ll->max_entries(ll)) {
243 DMERR("space map too large");
244 return -EINVAL;
245 }
246
247 for (i = old_blocks; i < blocks; i++) {
248 struct dm_block *b;
249 struct disk_index_entry idx;
250
251 r = dm_tm_new_block(ll->tm, &dm_sm_bitmap_validator, &b);
252 if (r < 0)
253 return r;
254 idx.blocknr = cpu_to_le64(dm_block_location(b));
255
256 r = dm_tm_unlock(ll->tm, b);
257 if (r < 0)
258 return r;
259
260 idx.nr_free = cpu_to_le32(ll->entries_per_block);
261 idx.none_free_before = 0;
262
263 r = ll->save_ie(ll, i, &idx);
264 if (r < 0)
265 return r;
266 }
267
268 ll->nr_blocks = nr_blocks;
269 return 0;
270}
271
272int sm_ll_lookup_bitmap(struct ll_disk *ll, dm_block_t b, uint32_t *result)
273{
274 int r;
275 dm_block_t index = b;
276 struct disk_index_entry ie_disk;
277 struct dm_block *blk;
278
279 b = do_div(index, ll->entries_per_block);
280 r = ll->load_ie(ll, index, &ie_disk);
281 if (r < 0)
282 return r;
283
284 r = dm_tm_read_lock(ll->tm, le64_to_cpu(ie_disk.blocknr),
285 &dm_sm_bitmap_validator, &blk);
286 if (r < 0)
287 return r;
288
289 *result = sm_lookup_bitmap(dm_bitmap_data(blk), b);
290
291 return dm_tm_unlock(ll->tm, blk);
292}
293
294int sm_ll_lookup(struct ll_disk *ll, dm_block_t b, uint32_t *result)
295{
296 __le32 le_rc;
297 int r = sm_ll_lookup_bitmap(ll, b, result);
298
299 if (r)
300 return r;
301
302 if (*result != 3)
303 return r;
304
305 r = dm_btree_lookup(&ll->ref_count_info, ll->ref_count_root, &b, &le_rc);
306 if (r < 0)
307 return r;
308
309 *result = le32_to_cpu(le_rc);
310
311 return r;
312}
313
314int sm_ll_find_free_block(struct ll_disk *ll, dm_block_t begin,
315 dm_block_t end, dm_block_t *result)
316{
317 int r;
318 struct disk_index_entry ie_disk;
319 dm_block_t i, index_begin = begin;
320 dm_block_t index_end = dm_sector_div_up(end, ll->entries_per_block);
321
322 /*
323 * FIXME: Use shifts
324 */
325 begin = do_div(index_begin, ll->entries_per_block);
326 end = do_div(end, ll->entries_per_block);
327
328 for (i = index_begin; i < index_end; i++, begin = 0) {
329 struct dm_block *blk;
330 unsigned position;
331 uint32_t bit_end;
332
333 r = ll->load_ie(ll, i, &ie_disk);
334 if (r < 0)
335 return r;
336
337 if (le32_to_cpu(ie_disk.nr_free) == 0)
338 continue;
339
340 r = dm_tm_read_lock(ll->tm, le64_to_cpu(ie_disk.blocknr),
341 &dm_sm_bitmap_validator, &blk);
342 if (r < 0)
343 return r;
344
345 bit_end = (i == index_end - 1) ? end : ll->entries_per_block;
346
347 r = sm_find_free(dm_bitmap_data(blk),
348 max_t(unsigned, begin, le32_to_cpu(ie_disk.none_free_before)),
349 bit_end, &position);
350 if (r == -ENOSPC) {
351 /*
352 * This might happen because we started searching
353 * part way through the bitmap.
354 */
355 dm_tm_unlock(ll->tm, blk);
356 continue;
357
358 } else if (r < 0) {
359 dm_tm_unlock(ll->tm, blk);
360 return r;
361 }
362
363 r = dm_tm_unlock(ll->tm, blk);
364 if (r < 0)
365 return r;
366
367 *result = i * ll->entries_per_block + (dm_block_t) position;
368 return 0;
369 }
370
371 return -ENOSPC;
372}
373
374int sm_ll_insert(struct ll_disk *ll, dm_block_t b,
375 uint32_t ref_count, enum allocation_event *ev)
376{
377 int r;
378 uint32_t bit, old;
379 struct dm_block *nb;
380 dm_block_t index = b;
381 struct disk_index_entry ie_disk;
382 void *bm_le;
383 int inc;
384
385 bit = do_div(index, ll->entries_per_block);
386 r = ll->load_ie(ll, index, &ie_disk);
387 if (r < 0)
388 return r;
389
390 r = dm_tm_shadow_block(ll->tm, le64_to_cpu(ie_disk.blocknr),
391 &dm_sm_bitmap_validator, &nb, &inc);
392 if (r < 0) {
393 DMERR("dm_tm_shadow_block() failed");
394 return r;
395 }
396 ie_disk.blocknr = cpu_to_le64(dm_block_location(nb));
397
398 bm_le = dm_bitmap_data(nb);
399 old = sm_lookup_bitmap(bm_le, bit);
400
401 if (ref_count <= 2) {
402 sm_set_bitmap(bm_le, bit, ref_count);
403
404 r = dm_tm_unlock(ll->tm, nb);
405 if (r < 0)
406 return r;
407
408#if 0
409 /* FIXME: dm_btree_remove doesn't handle this yet */
410 if (old > 2) {
411 r = dm_btree_remove(&ll->ref_count_info,
412 ll->ref_count_root,
413 &b, &ll->ref_count_root);
414 if (r)
415 return r;
416 }
417#endif
418
419 } else {
420 __le32 le_rc = cpu_to_le32(ref_count);
421
422 sm_set_bitmap(bm_le, bit, 3);
423 r = dm_tm_unlock(ll->tm, nb);
424 if (r < 0)
425 return r;
426
427 __dm_bless_for_disk(&le_rc);
428 r = dm_btree_insert(&ll->ref_count_info, ll->ref_count_root,
429 &b, &le_rc, &ll->ref_count_root);
430 if (r < 0) {
431 DMERR("ref count insert failed");
432 return r;
433 }
434 }
435
436 if (ref_count && !old) {
437 *ev = SM_ALLOC;
438 ll->nr_allocated++;
439 ie_disk.nr_free = cpu_to_le32(le32_to_cpu(ie_disk.nr_free) - 1);
440 if (le32_to_cpu(ie_disk.none_free_before) == bit)
441 ie_disk.none_free_before = cpu_to_le32(bit + 1);
442
443 } else if (old && !ref_count) {
444 *ev = SM_FREE;
445 ll->nr_allocated--;
446 ie_disk.nr_free = cpu_to_le32(le32_to_cpu(ie_disk.nr_free) + 1);
447 ie_disk.none_free_before = cpu_to_le32(min(le32_to_cpu(ie_disk.none_free_before), bit));
448 }
449
450 return ll->save_ie(ll, index, &ie_disk);
451}
452
453int sm_ll_inc(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev)
454{
455 int r;
456 uint32_t rc;
457
458 r = sm_ll_lookup(ll, b, &rc);
459 if (r)
460 return r;
461
462 return sm_ll_insert(ll, b, rc + 1, ev);
463}
464
465int sm_ll_dec(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev)
466{
467 int r;
468 uint32_t rc;
469
470 r = sm_ll_lookup(ll, b, &rc);
471 if (r)
472 return r;
473
474 if (!rc)
475 return -EINVAL;
476
477 return sm_ll_insert(ll, b, rc - 1, ev);
478}
479
480int sm_ll_commit(struct ll_disk *ll)
481{
482 return ll->commit(ll);
483}
484
485/*----------------------------------------------------------------*/
486
487static int metadata_ll_load_ie(struct ll_disk *ll, dm_block_t index,
488 struct disk_index_entry *ie)
489{
490 memcpy(ie, ll->mi_le.index + index, sizeof(*ie));
491 return 0;
492}
493
494static int metadata_ll_save_ie(struct ll_disk *ll, dm_block_t index,
495 struct disk_index_entry *ie)
496{
497 memcpy(ll->mi_le.index + index, ie, sizeof(*ie));
498 return 0;
499}
500
501static int metadata_ll_init_index(struct ll_disk *ll)
502{
503 int r;
504 struct dm_block *b;
505
506 r = dm_tm_new_block(ll->tm, &index_validator, &b);
507 if (r < 0)
508 return r;
509
510 memcpy(dm_block_data(b), &ll->mi_le, sizeof(ll->mi_le));
511 ll->bitmap_root = dm_block_location(b);
512
513 return dm_tm_unlock(ll->tm, b);
514}
515
516static int metadata_ll_open(struct ll_disk *ll)
517{
518 int r;
519 struct dm_block *block;
520
521 r = dm_tm_read_lock(ll->tm, ll->bitmap_root,
522 &index_validator, &block);
523 if (r)
524 return r;
525
526 memcpy(&ll->mi_le, dm_block_data(block), sizeof(ll->mi_le));
527 return dm_tm_unlock(ll->tm, block);
528}
529
530static dm_block_t metadata_ll_max_entries(struct ll_disk *ll)
531{
532 return MAX_METADATA_BITMAPS;
533}
534
535static int metadata_ll_commit(struct ll_disk *ll)
536{
537 int r, inc;
538 struct dm_block *b;
539
540 r = dm_tm_shadow_block(ll->tm, ll->bitmap_root, &index_validator, &b, &inc);
541 if (r)
542 return r;
543
544 memcpy(dm_block_data(b), &ll->mi_le, sizeof(ll->mi_le));
545 ll->bitmap_root = dm_block_location(b);
546
547 return dm_tm_unlock(ll->tm, b);
548}
549
550int sm_ll_new_metadata(struct ll_disk *ll, struct dm_transaction_manager *tm)
551{
552 int r;
553
554 r = sm_ll_init(ll, tm);
555 if (r < 0)
556 return r;
557
558 ll->load_ie = metadata_ll_load_ie;
559 ll->save_ie = metadata_ll_save_ie;
560 ll->init_index = metadata_ll_init_index;
561 ll->open_index = metadata_ll_open;
562 ll->max_entries = metadata_ll_max_entries;
563 ll->commit = metadata_ll_commit;
564
565 ll->nr_blocks = 0;
566 ll->nr_allocated = 0;
567
568 r = ll->init_index(ll);
569 if (r < 0)
570 return r;
571
572 r = dm_btree_empty(&ll->ref_count_info, &ll->ref_count_root);
573 if (r < 0)
574 return r;
575
576 return 0;
577}
578
579int sm_ll_open_metadata(struct ll_disk *ll, struct dm_transaction_manager *tm,
580 void *root_le, size_t len)
581{
582 int r;
583 struct disk_sm_root *smr = root_le;
584
585 if (len < sizeof(struct disk_sm_root)) {
586 DMERR("sm_metadata root too small");
587 return -ENOMEM;
588 }
589
590 r = sm_ll_init(ll, tm);
591 if (r < 0)
592 return r;
593
594 ll->load_ie = metadata_ll_load_ie;
595 ll->save_ie = metadata_ll_save_ie;
596 ll->init_index = metadata_ll_init_index;
597 ll->open_index = metadata_ll_open;
598 ll->max_entries = metadata_ll_max_entries;
599 ll->commit = metadata_ll_commit;
600
601 ll->nr_blocks = le64_to_cpu(smr->nr_blocks);
602 ll->nr_allocated = le64_to_cpu(smr->nr_allocated);
603 ll->bitmap_root = le64_to_cpu(smr->bitmap_root);
604 ll->ref_count_root = le64_to_cpu(smr->ref_count_root);
605
606 return ll->open_index(ll);
607}
608
609/*----------------------------------------------------------------*/
610
611static int disk_ll_load_ie(struct ll_disk *ll, dm_block_t index,
612 struct disk_index_entry *ie)
613{
614 return dm_btree_lookup(&ll->bitmap_info, ll->bitmap_root, &index, ie);
615}
616
617static int disk_ll_save_ie(struct ll_disk *ll, dm_block_t index,
618 struct disk_index_entry *ie)
619{
620 __dm_bless_for_disk(ie);
621 return dm_btree_insert(&ll->bitmap_info, ll->bitmap_root,
622 &index, ie, &ll->bitmap_root);
623}
624
625static int disk_ll_init_index(struct ll_disk *ll)
626{
627 return dm_btree_empty(&ll->bitmap_info, &ll->bitmap_root);
628}
629
630static int disk_ll_open(struct ll_disk *ll)
631{
632 /* nothing to do */
633 return 0;
634}
635
636static dm_block_t disk_ll_max_entries(struct ll_disk *ll)
637{
638 return -1ULL;
639}
640
641static int disk_ll_commit(struct ll_disk *ll)
642{
643 return 0;
644}
645
646int sm_ll_new_disk(struct ll_disk *ll, struct dm_transaction_manager *tm)
647{
648 int r;
649
650 r = sm_ll_init(ll, tm);
651 if (r < 0)
652 return r;
653
654 ll->load_ie = disk_ll_load_ie;
655 ll->save_ie = disk_ll_save_ie;
656 ll->init_index = disk_ll_init_index;
657 ll->open_index = disk_ll_open;
658 ll->max_entries = disk_ll_max_entries;
659 ll->commit = disk_ll_commit;
660
661 ll->nr_blocks = 0;
662 ll->nr_allocated = 0;
663
664 r = ll->init_index(ll);
665 if (r < 0)
666 return r;
667
668 r = dm_btree_empty(&ll->ref_count_info, &ll->ref_count_root);
669 if (r < 0)
670 return r;
671
672 return 0;
673}
674
675int sm_ll_open_disk(struct ll_disk *ll, struct dm_transaction_manager *tm,
676 void *root_le, size_t len)
677{
678 int r;
679 struct disk_sm_root *smr = root_le;
680
681 if (len < sizeof(struct disk_sm_root)) {
682 DMERR("sm_disk root too small");
683 return -ENOMEM;
684 }
685
686 r = sm_ll_init(ll, tm);
687 if (r < 0)
688 return r;
689
690 ll->load_ie = disk_ll_load_ie;
691 ll->save_ie = disk_ll_save_ie;
692 ll->init_index = disk_ll_init_index;
693 ll->open_index = disk_ll_open;
694 ll->max_entries = disk_ll_max_entries;
695 ll->commit = disk_ll_commit;
696
697 ll->nr_blocks = le64_to_cpu(smr->nr_blocks);
698 ll->nr_allocated = le64_to_cpu(smr->nr_allocated);
699 ll->bitmap_root = le64_to_cpu(smr->bitmap_root);
700 ll->ref_count_root = le64_to_cpu(smr->ref_count_root);
701
702 return ll->open_index(ll);
703}
704
705/*----------------------------------------------------------------*/
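
The functions above (sm_lookup_bitmap, sm_set_bitmap, sm_find_free) pack one 2-bit reference count per block into little-endian 64-bit words: 0 means unused, 1 and 2 are literal counts, and 3 means "many, see the ref count btree". A minimal userspace sketch of that packing follows, assuming host-endian words and invented sketch_* names; it ignores the little-endian bitops, the disk_bitmap_header and the btree that the kernel code layers on top.

/* Illustrative userspace model of the 2-bits-per-entry space map bitmap.
 * Not the kernel code: host-endian words, no on-disk header, no btree. */
#include <stdint.h>
#include <stdio.h>

#define SKETCH_ENTRIES_PER_WORD 32
#define SKETCH_ENTRIES_SHIFT    5

static unsigned sketch_lookup(const uint64_t *words, unsigned b)
{
        unsigned shift = (b & (SKETCH_ENTRIES_PER_WORD - 1)) << 1;

        /* 0 = unused, 1, 2, 3 = "many" (real count lives elsewhere) */
        return (words[b >> SKETCH_ENTRIES_SHIFT] >> shift) & 3;
}

static void sketch_set(uint64_t *words, unsigned b, unsigned val)
{
        uint64_t *w = words + (b >> SKETCH_ENTRIES_SHIFT);
        unsigned shift = (b & (SKETCH_ENTRIES_PER_WORD - 1)) << 1;

        *w = (*w & ~(3ULL << shift)) | ((uint64_t)(val & 3) << shift);
}

static int sketch_find_free(const uint64_t *words, unsigned begin,
                            unsigned end, unsigned *result)
{
        for (; begin < end; begin++)
                if (!sketch_lookup(words, begin)) {
                        *result = begin;
                        return 0;
                }
        return -1;      /* nothing free, analogous to -ENOSPC */
}

int main(void)
{
        uint64_t words[2] = { 0, 0 };   /* room for 64 entries */
        unsigned free_block = 0;

        sketch_set(words, 0, 1);
        sketch_set(words, 1, 3);
        sketch_find_free(words, 0, 64, &free_block);
        printf("entry 1 = %u, first free entry = %u\n",
               sketch_lookup(words, 1), free_block);
        return 0;
}

The real code also keeps a per-bitmap none_free_before hint and a per-word fast path (bitmap_word_used) so fully used words are skipped; the sketch omits both.
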
diff --git a/drivers/md/persistent-data/dm-space-map-common.h b/drivers/md/persistent-data/dm-space-map-common.h
new file mode 100644
index 00000000000..8f220821a9a
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-common.h
@@ -0,0 +1,126 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef DM_SPACE_MAP_COMMON_H
8#define DM_SPACE_MAP_COMMON_H
9
10#include "dm-btree.h"
11
12/*----------------------------------------------------------------*/
13
14/*
15 * Low level disk format
16 *
17 * Bitmap btree
18 * ------------
19 *
20 * Each value stored in the btree is an index_entry. This points to a
21 * block that is used as a bitmap. Within the bitmap we hold 2 bits per
22 * entry, which represent UNUSED = 0, REF_COUNT = 1, REF_COUNT = 2 and
23 * REF_COUNT = many.
24 *
25 * Refcount btree
26 * --------------
27 *
28 * Any entry that has a ref count higher than 2 gets entered in the ref
29 * count tree. The leaf values for this tree are the 32-bit ref counts.
30 */
31
32struct disk_index_entry {
33 __le64 blocknr;
34 __le32 nr_free;
35 __le32 none_free_before;
36} __packed;
37
38
39#define MAX_METADATA_BITMAPS 255
40struct disk_metadata_index {
41 __le32 csum;
42 __le32 padding;
43 __le64 blocknr;
44
45 struct disk_index_entry index[MAX_METADATA_BITMAPS];
46} __packed;
47
48struct ll_disk;
49
50typedef int (*load_ie_fn)(struct ll_disk *ll, dm_block_t index, struct disk_index_entry *result);
51typedef int (*save_ie_fn)(struct ll_disk *ll, dm_block_t index, struct disk_index_entry *ie);
52typedef int (*init_index_fn)(struct ll_disk *ll);
53typedef int (*open_index_fn)(struct ll_disk *ll);
54typedef dm_block_t (*max_index_entries_fn)(struct ll_disk *ll);
55typedef int (*commit_fn)(struct ll_disk *ll);
56
57struct ll_disk {
58 struct dm_transaction_manager *tm;
59 struct dm_btree_info bitmap_info;
60 struct dm_btree_info ref_count_info;
61
62 uint32_t block_size;
63 uint32_t entries_per_block;
64 dm_block_t nr_blocks;
65 dm_block_t nr_allocated;
66
67 /*
68 * bitmap_root may be a btree root or a simple index.
69 */
70 dm_block_t bitmap_root;
71
72 dm_block_t ref_count_root;
73
74 struct disk_metadata_index mi_le;
75 load_ie_fn load_ie;
76 save_ie_fn save_ie;
77 init_index_fn init_index;
78 open_index_fn open_index;
79 max_index_entries_fn max_entries;
80 commit_fn commit;
81};
82
83struct disk_sm_root {
84 __le64 nr_blocks;
85 __le64 nr_allocated;
86 __le64 bitmap_root;
87 __le64 ref_count_root;
88} __packed;
89
90#define ENTRIES_PER_BYTE 4
91
92struct disk_bitmap_header {
93 __le32 csum;
94 __le32 not_used;
95 __le64 blocknr;
96} __packed;
97
98enum allocation_event {
99 SM_NONE,
100 SM_ALLOC,
101 SM_FREE,
102};
103
104/*----------------------------------------------------------------*/
105
106int sm_ll_extend(struct ll_disk *ll, dm_block_t extra_blocks);
107int sm_ll_lookup_bitmap(struct ll_disk *ll, dm_block_t b, uint32_t *result);
108int sm_ll_lookup(struct ll_disk *ll, dm_block_t b, uint32_t *result);
109int sm_ll_find_free_block(struct ll_disk *ll, dm_block_t begin,
110 dm_block_t end, dm_block_t *result);
111int sm_ll_insert(struct ll_disk *ll, dm_block_t b, uint32_t ref_count, enum allocation_event *ev);
112int sm_ll_inc(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev);
113int sm_ll_dec(struct ll_disk *ll, dm_block_t b, enum allocation_event *ev);
114int sm_ll_commit(struct ll_disk *ll);
115
116int sm_ll_new_metadata(struct ll_disk *ll, struct dm_transaction_manager *tm);
117int sm_ll_open_metadata(struct ll_disk *ll, struct dm_transaction_manager *tm,
118 void *root_le, size_t len);
119
120int sm_ll_new_disk(struct ll_disk *ll, struct dm_transaction_manager *tm);
121int sm_ll_open_disk(struct ll_disk *ll, struct dm_transaction_manager *tm,
122 void *root_le, size_t len);
123
124/*----------------------------------------------------------------*/
125
126#endif /* DM_SPACE_MAP_COMMON_H */
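
The index and bitmap validators in dm-space-map-common.c both follow the same prepare_for_write/check pattern: stamp the block with its own location, checksum everything after the csum field, and verify both on read so a block read from the wrong place or corrupted on disk is rejected. Below is a hedged userspace sketch of that pattern, assuming an invented sketch_header layout and a toy checksum in place of the kernel's dm_bm_checksum().

/* Sketch of the prepare_for_write/check validator pattern.  The checksum
 * below is a toy stand-in, not the kernel's dm_bm_checksum(). */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct sketch_header {
        uint32_t csum;          /* covers everything after this field */
        uint32_t flags;
        uint64_t blocknr;       /* where this block believes it lives */
        char     payload[48];
};

static uint32_t toy_csum(const void *data, size_t len, uint32_t xor_val)
{
        const uint8_t *p = data;
        uint32_t sum = 0;

        while (len--)
                sum = (sum << 5) + sum + *p++;  /* djb2-style accumulate */
        return sum ^ xor_val;
}

static void prepare_for_write(struct sketch_header *h, uint64_t location)
{
        h->blocknr = location;
        h->csum = toy_csum(&h->flags, sizeof(*h) - sizeof(h->csum), 160478);
}

static int check(const struct sketch_header *h, uint64_t location)
{
        if (h->blocknr != location)
                return -1;      /* block read from the wrong place */
        if (h->csum != toy_csum(&h->flags, sizeof(*h) - sizeof(h->csum), 160478))
                return -2;      /* corrupted contents */
        return 0;
}

int main(void)
{
        struct sketch_header h;

        memset(&h, 0, sizeof(h));
        strcpy(h.payload, "metadata");
        prepare_for_write(&h, 42);
        printf("check ok? %d\n", check(&h, 42));
        h.payload[0] ^= 1;      /* simulate corruption */
        printf("check after corruption? %d\n", check(&h, 42));
        return 0;
}
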
diff --git a/drivers/md/persistent-data/dm-space-map-disk.c b/drivers/md/persistent-data/dm-space-map-disk.c
new file mode 100644
index 00000000000..aeff7852cf7
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-disk.c
@@ -0,0 +1,335 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-space-map-checker.h"
8#include "dm-space-map-common.h"
9#include "dm-space-map-disk.h"
10#include "dm-space-map.h"
11#include "dm-transaction-manager.h"
12
13#include <linux/list.h>
14#include <linux/slab.h>
15#include <linux/module.h>
16#include <linux/device-mapper.h>
17
18#define DM_MSG_PREFIX "space map disk"
19
20/*----------------------------------------------------------------*/
21
22/*
23 * Space map interface.
24 */
25struct sm_disk {
26 struct dm_space_map sm;
27
28 struct ll_disk ll;
29 struct ll_disk old_ll;
30
31 dm_block_t begin;
32 dm_block_t nr_allocated_this_transaction;
33};
34
35static void sm_disk_destroy(struct dm_space_map *sm)
36{
37 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
38
39 kfree(smd);
40}
41
42static int sm_disk_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
43{
44 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
45
46 return sm_ll_extend(&smd->ll, extra_blocks);
47}
48
49static int sm_disk_get_nr_blocks(struct dm_space_map *sm, dm_block_t *count)
50{
51 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
52 *count = smd->old_ll.nr_blocks;
53
54 return 0;
55}
56
57static int sm_disk_get_nr_free(struct dm_space_map *sm, dm_block_t *count)
58{
59 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
60 *count = (smd->old_ll.nr_blocks - smd->old_ll.nr_allocated) - smd->nr_allocated_this_transaction;
61
62 return 0;
63}
64
65static int sm_disk_get_count(struct dm_space_map *sm, dm_block_t b,
66 uint32_t *result)
67{
68 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
69 return sm_ll_lookup(&smd->ll, b, result);
70}
71
72static int sm_disk_count_is_more_than_one(struct dm_space_map *sm, dm_block_t b,
73 int *result)
74{
75 int r;
76 uint32_t count;
77
78 r = sm_disk_get_count(sm, b, &count);
79 if (r)
80 return r;
81 *result = count > 1;
82 return 0;
83}
84
85static int sm_disk_set_count(struct dm_space_map *sm, dm_block_t b,
86 uint32_t count)
87{
88 int r;
89 uint32_t old_count;
90 enum allocation_event ev;
91 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
92
93 r = sm_ll_insert(&smd->ll, b, count, &ev);
94 if (!r) {
95 switch (ev) {
96 case SM_NONE:
97 break;
98
99 case SM_ALLOC:
100 /*
101 * This _must_ be free in the prior transaction
102 * otherwise we've lost atomicity.
103 */
104 smd->nr_allocated_this_transaction++;
105 break;
106
107 case SM_FREE:
108 /*
109 * It's only free if it's also free in the last
110 * transaction.
111 */
112 r = sm_ll_lookup(&smd->old_ll, b, &old_count);
113 if (r)
114 return r;
115
116 if (!old_count)
117 smd->nr_allocated_this_transaction--;
118 break;
119 }
120 }
121
122 return r;
123}
124
125static int sm_disk_inc_block(struct dm_space_map *sm, dm_block_t b)
126{
127 int r;
128 enum allocation_event ev;
129 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
130
131 r = sm_ll_inc(&smd->ll, b, &ev);
132 if (!r && (ev == SM_ALLOC))
133 /*
134 * This _must_ be free in the prior transaction
135 * otherwise we've lost atomicity.
136 */
137 smd->nr_allocated_this_transaction++;
138
139 return r;
140}
141
142static int sm_disk_dec_block(struct dm_space_map *sm, dm_block_t b)
143{
144 int r;
145 uint32_t old_count;
146 enum allocation_event ev;
147 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
148
149 r = sm_ll_dec(&smd->ll, b, &ev);
150 if (!r && (ev == SM_FREE)) {
151 /*
152 * It's only free if it's also free in the last
153 * transaction.
154 */
155 r = sm_ll_lookup(&smd->old_ll, b, &old_count);
156 if (r)
157 return r;
158
159 if (!old_count)
160 smd->nr_allocated_this_transaction--;
161 }
162
163 return r;
164}
165
166static int sm_disk_new_block(struct dm_space_map *sm, dm_block_t *b)
167{
168 int r;
169 enum allocation_event ev;
170 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
171
172 /* FIXME: we should loop round a couple of times */
173 r = sm_ll_find_free_block(&smd->old_ll, smd->begin, smd->old_ll.nr_blocks, b);
174 if (r)
175 return r;
176
177 smd->begin = *b + 1;
178 r = sm_ll_inc(&smd->ll, *b, &ev);
179 if (!r) {
180 BUG_ON(ev != SM_ALLOC);
181 smd->nr_allocated_this_transaction++;
182 }
183
184 return r;
185}
186
187static int sm_disk_commit(struct dm_space_map *sm)
188{
189 int r;
190 dm_block_t nr_free;
191 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
192
193 r = sm_disk_get_nr_free(sm, &nr_free);
194 if (r)
195 return r;
196
197 r = sm_ll_commit(&smd->ll);
198 if (r)
199 return r;
200
201 memcpy(&smd->old_ll, &smd->ll, sizeof(smd->old_ll));
202 smd->begin = 0;
203 smd->nr_allocated_this_transaction = 0;
204
205 r = sm_disk_get_nr_free(sm, &nr_free);
206 if (r)
207 return r;
208
209 return 0;
210}
211
212static int sm_disk_root_size(struct dm_space_map *sm, size_t *result)
213{
214 *result = sizeof(struct disk_sm_root);
215
216 return 0;
217}
218
219static int sm_disk_copy_root(struct dm_space_map *sm, void *where_le, size_t max)
220{
221 struct sm_disk *smd = container_of(sm, struct sm_disk, sm);
222 struct disk_sm_root root_le;
223
224 root_le.nr_blocks = cpu_to_le64(smd->ll.nr_blocks);
225 root_le.nr_allocated = cpu_to_le64(smd->ll.nr_allocated);
226 root_le.bitmap_root = cpu_to_le64(smd->ll.bitmap_root);
227 root_le.ref_count_root = cpu_to_le64(smd->ll.ref_count_root);
228
229 if (max < sizeof(root_le))
230 return -ENOSPC;
231
232 memcpy(where_le, &root_le, sizeof(root_le));
233
234 return 0;
235}
236
237/*----------------------------------------------------------------*/
238
239static struct dm_space_map ops = {
240 .destroy = sm_disk_destroy,
241 .extend = sm_disk_extend,
242 .get_nr_blocks = sm_disk_get_nr_blocks,
243 .get_nr_free = sm_disk_get_nr_free,
244 .get_count = sm_disk_get_count,
245 .count_is_more_than_one = sm_disk_count_is_more_than_one,
246 .set_count = sm_disk_set_count,
247 .inc_block = sm_disk_inc_block,
248 .dec_block = sm_disk_dec_block,
249 .new_block = sm_disk_new_block,
250 .commit = sm_disk_commit,
251 .root_size = sm_disk_root_size,
252 .copy_root = sm_disk_copy_root
253};
254
255static struct dm_space_map *dm_sm_disk_create_real(
256 struct dm_transaction_manager *tm,
257 dm_block_t nr_blocks)
258{
259 int r;
260 struct sm_disk *smd;
261
262 smd = kmalloc(sizeof(*smd), GFP_KERNEL);
263 if (!smd)
264 return ERR_PTR(-ENOMEM);
265
266 smd->begin = 0;
267 smd->nr_allocated_this_transaction = 0;
268 memcpy(&smd->sm, &ops, sizeof(smd->sm));
269
270 r = sm_ll_new_disk(&smd->ll, tm);
271 if (r)
272 goto bad;
273
274 r = sm_ll_extend(&smd->ll, nr_blocks);
275 if (r)
276 goto bad;
277
278 r = sm_disk_commit(&smd->sm);
279 if (r)
280 goto bad;
281
282 return &smd->sm;
283
284bad:
285 kfree(smd);
286 return ERR_PTR(r);
287}
288
289struct dm_space_map *dm_sm_disk_create(struct dm_transaction_manager *tm,
290 dm_block_t nr_blocks)
291{
292 struct dm_space_map *sm = dm_sm_disk_create_real(tm, nr_blocks);
293 return dm_sm_checker_create_fresh(sm);
294}
295EXPORT_SYMBOL_GPL(dm_sm_disk_create);
296
297static struct dm_space_map *dm_sm_disk_open_real(
298 struct dm_transaction_manager *tm,
299 void *root_le, size_t len)
300{
301 int r;
302 struct sm_disk *smd;
303
304 smd = kmalloc(sizeof(*smd), GFP_KERNEL);
305 if (!smd)
306 return ERR_PTR(-ENOMEM);
307
308 smd->begin = 0;
309 smd->nr_allocated_this_transaction = 0;
310 memcpy(&smd->sm, &ops, sizeof(smd->sm));
311
312 r = sm_ll_open_disk(&smd->ll, tm, root_le, len);
313 if (r)
314 goto bad;
315
316 r = sm_disk_commit(&smd->sm);
317 if (r)
318 goto bad;
319
320 return &smd->sm;
321
322bad:
323 kfree(smd);
324 return ERR_PTR(r);
325}
326
327struct dm_space_map *dm_sm_disk_open(struct dm_transaction_manager *tm,
328 void *root_le, size_t len)
329{
330 return dm_sm_checker_create(
331 dm_sm_disk_open_real(tm, root_le, len));
332}
333EXPORT_SYMBOL_GPL(dm_sm_disk_open);
334
335/*----------------------------------------------------------------*/
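
sm_disk keeps two copies of the low-level state: old_ll, frozen at the last commit, and ll, the in-flight transaction. Allocation searches old_ll and sm_disk_get_nr_free() subtracts nr_allocated_this_transaction, so a block freed in this transaction is never handed out again before a commit (which would break rollback). The toy host-side model below illustrates that bookkeeping with invented toy_* names and plain counters standing in for the bitmaps and btrees.

/* Toy model of the sm_disk transaction rule: never re-allocate a block
 * that was still in use at the last commit.  All names are invented. */
#include <stdint.h>
#include <stdio.h>

struct toy_sm {
        uint64_t nr_blocks;
        uint64_t old_nr_allocated;              /* state at last commit */
        uint64_t nr_allocated_this_transaction;
};

static uint64_t toy_nr_free(const struct toy_sm *sm)
{
        /* Mirrors sm_disk_get_nr_free(): free space as of the last commit,
         * minus what we have already claimed in this transaction. */
        return sm->nr_blocks - sm->old_nr_allocated
                - sm->nr_allocated_this_transaction;
}

static int toy_new_block(struct toy_sm *sm)
{
        if (!toy_nr_free(sm))
                return -1;                      /* -ENOSPC in the kernel */
        sm->nr_allocated_this_transaction++;
        return 0;
}

static void toy_commit(struct toy_sm *sm)
{
        /* Mirrors sm_disk_commit(): the in-flight state becomes the
         * committed state and the per-transaction counter resets. */
        sm->old_nr_allocated += sm->nr_allocated_this_transaction;
        sm->nr_allocated_this_transaction = 0;
}

int main(void)
{
        struct toy_sm sm = { .nr_blocks = 4, .old_nr_allocated = 2 };

        printf("free before: %llu\n", (unsigned long long)toy_nr_free(&sm));
        toy_new_block(&sm);
        toy_new_block(&sm);
        printf("free mid-transaction: %llu\n",
               (unsigned long long)toy_nr_free(&sm));
        toy_commit(&sm);
        printf("free after commit: %llu\n",
               (unsigned long long)toy_nr_free(&sm));
        return 0;
}

The model ignores frees entirely; in the real code a block freed this transaction only shows up as free after sm_disk_commit() copies ll over old_ll.
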
diff --git a/drivers/md/persistent-data/dm-space-map-disk.h b/drivers/md/persistent-data/dm-space-map-disk.h
new file mode 100644
index 00000000000..447a0a9a2d9
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-disk.h
@@ -0,0 +1,25 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef _LINUX_DM_SPACE_MAP_DISK_H
8#define _LINUX_DM_SPACE_MAP_DISK_H
9
10#include "dm-block-manager.h"
11
12struct dm_space_map;
13struct dm_transaction_manager;
14
15/*
16 * Unfortunately we have to use two-phase construction due to the cycle
17 * between the tm and sm.
18 */
19struct dm_space_map *dm_sm_disk_create(struct dm_transaction_manager *tm,
20 dm_block_t nr_blocks);
21
22struct dm_space_map *dm_sm_disk_open(struct dm_transaction_manager *tm,
23 void *root, size_t len);
24
25#endif /* _LINUX_DM_SPACE_MAP_DISK_H */
diff --git a/drivers/md/persistent-data/dm-space-map-metadata.c b/drivers/md/persistent-data/dm-space-map-metadata.c
new file mode 100644
index 00000000000..e89ae5e7a51
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-metadata.c
@@ -0,0 +1,596 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#include "dm-space-map.h"
8#include "dm-space-map-common.h"
9#include "dm-space-map-metadata.h"
10
11#include <linux/list.h>
12#include <linux/slab.h>
13#include <linux/device-mapper.h>
14
15#define DM_MSG_PREFIX "space map metadata"
16
17/*----------------------------------------------------------------*/
18
19/*
20 * Space map interface.
21 *
22 * The low level disk format is written using the standard btree and
23 * transaction manager. This means that performing disk operations may
24 * cause us to recurse into the space map in order to allocate new blocks.
25 * For this reason nested inc/dec requests are queued as block ops and
26 * replayed once the outermost operation completes (see add_bop()/out()).
27 */
28
29/*
30 * FIXME: we should calculate this based on the size of the device.
31 * Only the metadata space map needs this functionality.
32 */
33#define MAX_RECURSIVE_ALLOCATIONS 1024
34
35enum block_op_type {
36 BOP_INC,
37 BOP_DEC
38};
39
40struct block_op {
41 enum block_op_type type;
42 dm_block_t block;
43};
44
45struct sm_metadata {
46 struct dm_space_map sm;
47
48 struct ll_disk ll;
49 struct ll_disk old_ll;
50
51 dm_block_t begin;
52
53 unsigned recursion_count;
54 unsigned allocated_this_transaction;
55 unsigned nr_uncommitted;
56 struct block_op uncommitted[MAX_RECURSIVE_ALLOCATIONS];
57};
58
59static int add_bop(struct sm_metadata *smm, enum block_op_type type, dm_block_t b)
60{
61 struct block_op *op;
62
63 if (smm->nr_uncommitted == MAX_RECURSIVE_ALLOCATIONS) {
64 DMERR("too many recursive allocations");
65 return -ENOMEM;
66 }
67
68 op = smm->uncommitted + smm->nr_uncommitted++;
69 op->type = type;
70 op->block = b;
71
72 return 0;
73}
74
75static int commit_bop(struct sm_metadata *smm, struct block_op *op)
76{
77 int r = 0;
78 enum allocation_event ev;
79
80 switch (op->type) {
81 case BOP_INC:
82 r = sm_ll_inc(&smm->ll, op->block, &ev);
83 break;
84
85 case BOP_DEC:
86 r = sm_ll_dec(&smm->ll, op->block, &ev);
87 break;
88 }
89
90 return r;
91}
92
93static void in(struct sm_metadata *smm)
94{
95 smm->recursion_count++;
96}
97
98static int out(struct sm_metadata *smm)
99{
100 int r = 0;
101
102 /*
103 * If we're not recursing then very bad things are happening.
104 */
105 if (!smm->recursion_count) {
106 DMERR("lost track of recursion depth");
107 return -ENOMEM;
108 }
109
110 if (smm->recursion_count == 1 && smm->nr_uncommitted) {
111 while (smm->nr_uncommitted && !r) {
112 smm->nr_uncommitted--;
113 r = commit_bop(smm, smm->uncommitted +
114 smm->nr_uncommitted);
115 if (r)
116 break;
117 }
118 }
119
120 smm->recursion_count--;
121
122 return r;
123}
124
125/*
126 * When using the out() function above, we often want to combine an error
127 * code for the operation run in the recursive context with that from
128 * out().
129 */
130static int combine_errors(int r1, int r2)
131{
132 return r1 ? r1 : r2;
133}
134
135static int recursing(struct sm_metadata *smm)
136{
137 return smm->recursion_count;
138}
139
140static void sm_metadata_destroy(struct dm_space_map *sm)
141{
142 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
143
144 kfree(smm);
145}
146
147static int sm_metadata_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
148{
149 DMERR("doesn't support extend");
150 return -EINVAL;
151}
152
153static int sm_metadata_get_nr_blocks(struct dm_space_map *sm, dm_block_t *count)
154{
155 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
156
157 *count = smm->ll.nr_blocks;
158
159 return 0;
160}
161
162static int sm_metadata_get_nr_free(struct dm_space_map *sm, dm_block_t *count)
163{
164 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
165
166 *count = smm->old_ll.nr_blocks - smm->old_ll.nr_allocated -
167 smm->allocated_this_transaction;
168
169 return 0;
170}
171
172static int sm_metadata_get_count(struct dm_space_map *sm, dm_block_t b,
173 uint32_t *result)
174{
175 int r, i;
176 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
177 unsigned adjustment = 0;
178
179 /*
180 * We may have some uncommitted adjustments to add. This list
181 * should always be really short.
182 */
183 for (i = 0; i < smm->nr_uncommitted; i++) {
184 struct block_op *op = smm->uncommitted + i;
185
186 if (op->block != b)
187 continue;
188
189 switch (op->type) {
190 case BOP_INC:
191 adjustment++;
192 break;
193
194 case BOP_DEC:
195 adjustment--;
196 break;
197 }
198 }
199
200 r = sm_ll_lookup(&smm->ll, b, result);
201 if (r)
202 return r;
203
204 *result += adjustment;
205
206 return 0;
207}
208
209static int sm_metadata_count_is_more_than_one(struct dm_space_map *sm,
210 dm_block_t b, int *result)
211{
212 int r, i, adjustment = 0;
213 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
214 uint32_t rc;
215
216 /*
217 * We may have some uncommitted adjustments to add. This list
218 * should always be really short.
219 */
220 for (i = 0; i < smm->nr_uncommitted; i++) {
221 struct block_op *op = smm->uncommitted + i;
222
223 if (op->block != b)
224 continue;
225
226 switch (op->type) {
227 case BOP_INC:
228 adjustment++;
229 break;
230
231 case BOP_DEC:
232 adjustment--;
233 break;
234 }
235 }
236
237 if (adjustment > 1) {
238 *result = 1;
239 return 0;
240 }
241
242 r = sm_ll_lookup_bitmap(&smm->ll, b, &rc);
243 if (r)
244 return r;
245
246 if (rc == 3)
247 /*
248 * We err on the side of caution, and always return true.
249 */
250 *result = 1;
251 else
252 *result = rc + adjustment > 1;
253
254 return 0;
255}
256
257static int sm_metadata_set_count(struct dm_space_map *sm, dm_block_t b,
258 uint32_t count)
259{
260 int r, r2;
261 enum allocation_event ev;
262 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
263
264 if (smm->recursion_count) {
265 DMERR("cannot recurse set_count()");
266 return -EINVAL;
267 }
268
269 in(smm);
270 r = sm_ll_insert(&smm->ll, b, count, &ev);
271 r2 = out(smm);
272
273 return combine_errors(r, r2);
274}
275
276static int sm_metadata_inc_block(struct dm_space_map *sm, dm_block_t b)
277{
278 int r, r2 = 0;
279 enum allocation_event ev;
280 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
281
282 if (recursing(smm))
283 r = add_bop(smm, BOP_INC, b);
284 else {
285 in(smm);
286 r = sm_ll_inc(&smm->ll, b, &ev);
287 r2 = out(smm);
288 }
289
290 return combine_errors(r, r2);
291}
292
293static int sm_metadata_dec_block(struct dm_space_map *sm, dm_block_t b)
294{
295 int r, r2 = 0;
296 enum allocation_event ev;
297 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
298
299 if (recursing(smm))
300 r = add_bop(smm, BOP_DEC, b);
301 else {
302 in(smm);
303 r = sm_ll_dec(&smm->ll, b, &ev);
304 r2 = out(smm);
305 }
306
307 return combine_errors(r, r2);
308}
309
310static int sm_metadata_new_block_(struct dm_space_map *sm, dm_block_t *b)
311{
312 int r, r2 = 0;
313 enum allocation_event ev;
314 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
315
316 r = sm_ll_find_free_block(&smm->old_ll, smm->begin, smm->old_ll.nr_blocks, b);
317 if (r)
318 return r;
319
320 smm->begin = *b + 1;
321
322 if (recursing(smm))
323 r = add_bop(smm, BOP_INC, *b);
324 else {
325 in(smm);
326 r = sm_ll_inc(&smm->ll, *b, &ev);
327 r2 = out(smm);
328 }
329
330 if (!r)
331 smm->allocated_this_transaction++;
332
333 return combine_errors(r, r2);
334}
335
336static int sm_metadata_new_block(struct dm_space_map *sm, dm_block_t *b)
337{
338 int r = sm_metadata_new_block_(sm, b);
339 if (r)
340 DMERR("out of metadata space");
341 return r;
342}
343
344static int sm_metadata_commit(struct dm_space_map *sm)
345{
346 int r;
347 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
348
349 r = sm_ll_commit(&smm->ll);
350 if (r)
351 return r;
352
353 memcpy(&smm->old_ll, &smm->ll, sizeof(smm->old_ll));
354 smm->begin = 0;
355 smm->allocated_this_transaction = 0;
356
357 return 0;
358}
359
360static int sm_metadata_root_size(struct dm_space_map *sm, size_t *result)
361{
362 *result = sizeof(struct disk_sm_root);
363
364 return 0;
365}
366
367static int sm_metadata_copy_root(struct dm_space_map *sm, void *where_le, size_t max)
368{
369 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
370 struct disk_sm_root root_le;
371
372 root_le.nr_blocks = cpu_to_le64(smm->ll.nr_blocks);
373 root_le.nr_allocated = cpu_to_le64(smm->ll.nr_allocated);
374 root_le.bitmap_root = cpu_to_le64(smm->ll.bitmap_root);
375 root_le.ref_count_root = cpu_to_le64(smm->ll.ref_count_root);
376
377 if (max < sizeof(root_le))
378 return -ENOSPC;
379
380 memcpy(where_le, &root_le, sizeof(root_le));
381
382 return 0;
383}
384
385static struct dm_space_map ops = {
386 .destroy = sm_metadata_destroy,
387 .extend = sm_metadata_extend,
388 .get_nr_blocks = sm_metadata_get_nr_blocks,
389 .get_nr_free = sm_metadata_get_nr_free,
390 .get_count = sm_metadata_get_count,
391 .count_is_more_than_one = sm_metadata_count_is_more_than_one,
392 .set_count = sm_metadata_set_count,
393 .inc_block = sm_metadata_inc_block,
394 .dec_block = sm_metadata_dec_block,
395 .new_block = sm_metadata_new_block,
396 .commit = sm_metadata_commit,
397 .root_size = sm_metadata_root_size,
398 .copy_root = sm_metadata_copy_root
399};
400
401/*----------------------------------------------------------------*/
402
403/*
404 * When a new space map is created that manages its own space, we use
405 * this tiny bootstrap allocator.
406 */
407static void sm_bootstrap_destroy(struct dm_space_map *sm)
408{
409}
410
411static int sm_bootstrap_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
412{
413 DMERR("bootstrap doesn't support extend");
414
415 return -EINVAL;
416}
417
418static int sm_bootstrap_get_nr_blocks(struct dm_space_map *sm, dm_block_t *count)
419{
420 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
421 *count = smm->ll.nr_blocks;
422 return 0;
423}
424
425static int sm_bootstrap_get_nr_free(struct dm_space_map *sm, dm_block_t *count)
426{
427 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
428
429 *count = smm->ll.nr_blocks - smm->begin;
430
431 return 0;
432}
433
434static int sm_bootstrap_get_count(struct dm_space_map *sm, dm_block_t b,
435 uint32_t *result)
436{
437 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
438 *result = (b < smm->begin) ? 1 : 0;
439 return 0;
440}
441
442static int sm_bootstrap_count_is_more_than_one(struct dm_space_map *sm,
443 dm_block_t b, int *result)
444{
445 *result = 0;
446
447 return 0;
448}
449
450static int sm_bootstrap_set_count(struct dm_space_map *sm, dm_block_t b,
451 uint32_t count)
452{
453 DMERR("bootstrap doesn't support set_count");
454
455 return -EINVAL;
456}
457
458static int sm_bootstrap_new_block(struct dm_space_map *sm, dm_block_t *b)
459{
460 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
461
462 /*
463 * We know the entire device is unused.
464 */
465 if (smm->begin == smm->ll.nr_blocks)
466 return -ENOSPC;
467
468 *b = smm->begin++;
469
470 return 0;
471}
472
473static int sm_bootstrap_inc_block(struct dm_space_map *sm, dm_block_t b)
474{
475 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
476
477 return add_bop(smm, BOP_INC, b);
478}
479
480static int sm_bootstrap_dec_block(struct dm_space_map *sm, dm_block_t b)
481{
482 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
483
484 return add_bop(smm, BOP_DEC, b);
485}
486
487static int sm_bootstrap_commit(struct dm_space_map *sm)
488{
489 return 0;
490}
491
492static int sm_bootstrap_root_size(struct dm_space_map *sm, size_t *result)
493{
494 DMERR("bootstrap doesn't support root_size");
495
496 return -EINVAL;
497}
498
499static int sm_bootstrap_copy_root(struct dm_space_map *sm, void *where,
500 size_t max)
501{
502 DMERR("bootstrap doesn't support copy_root");
503
504 return -EINVAL;
505}
506
507static struct dm_space_map bootstrap_ops = {
508 .destroy = sm_bootstrap_destroy,
509 .extend = sm_bootstrap_extend,
510 .get_nr_blocks = sm_bootstrap_get_nr_blocks,
511 .get_nr_free = sm_bootstrap_get_nr_free,
512 .get_count = sm_bootstrap_get_count,
513 .count_is_more_than_one = sm_bootstrap_count_is_more_than_one,
514 .set_count = sm_bootstrap_set_count,
515 .inc_block = sm_bootstrap_inc_block,
516 .dec_block = sm_bootstrap_dec_block,
517 .new_block = sm_bootstrap_new_block,
518 .commit = sm_bootstrap_commit,
519 .root_size = sm_bootstrap_root_size,
520 .copy_root = sm_bootstrap_copy_root
521};
522
523/*----------------------------------------------------------------*/
524
525struct dm_space_map *dm_sm_metadata_init(void)
526{
527 struct sm_metadata *smm;
528
529 smm = kmalloc(sizeof(*smm), GFP_KERNEL);
530 if (!smm)
531 return ERR_PTR(-ENOMEM);
532
533 memcpy(&smm->sm, &ops, sizeof(smm->sm));
534
535 return &smm->sm;
536}
537
538int dm_sm_metadata_create(struct dm_space_map *sm,
539 struct dm_transaction_manager *tm,
540 dm_block_t nr_blocks,
541 dm_block_t superblock)
542{
543 int r;
544 dm_block_t i;
545 enum allocation_event ev;
546 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
547
548 smm->begin = superblock + 1;
549 smm->recursion_count = 0;
550 smm->allocated_this_transaction = 0;
551 smm->nr_uncommitted = 0;
552
553 memcpy(&smm->sm, &bootstrap_ops, sizeof(smm->sm));
554
555 r = sm_ll_new_metadata(&smm->ll, tm);
556 if (r)
557 return r;
558
559 r = sm_ll_extend(&smm->ll, nr_blocks);
560 if (r)
561 return r;
562
563 memcpy(&smm->sm, &ops, sizeof(smm->sm));
564
565 /*
566 * Now we need to update the newly created data structures with the
567 * allocated blocks that they were built from.
568 */
569 for (i = superblock; !r && i < smm->begin; i++)
570 r = sm_ll_inc(&smm->ll, i, &ev);
571
572 if (r)
573 return r;
574
575 return sm_metadata_commit(sm);
576}
577
578int dm_sm_metadata_open(struct dm_space_map *sm,
579 struct dm_transaction_manager *tm,
580 void *root_le, size_t len)
581{
582 int r;
583 struct sm_metadata *smm = container_of(sm, struct sm_metadata, sm);
584
585 r = sm_ll_open_metadata(&smm->ll, tm, root_le, len);
586 if (r)
587 return r;
588
589 smm->begin = 0;
590 smm->recursion_count = 0;
591 smm->allocated_this_transaction = 0;
592 smm->nr_uncommitted = 0;
593
594 memcpy(&smm->old_ll, &smm->ll, sizeof(smm->old_ll));
595 return 0;
596}
diff --git a/drivers/md/persistent-data/dm-space-map-metadata.h b/drivers/md/persistent-data/dm-space-map-metadata.h
new file mode 100644
index 00000000000..39bba0801cf
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map-metadata.h
@@ -0,0 +1,33 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef DM_SPACE_MAP_METADATA_H
8#define DM_SPACE_MAP_METADATA_H
9
10#include "dm-transaction-manager.h"
11
12/*
13 * Unfortunately we have to use two-phase construction due to the cycle
14 * between the tm and sm.
15 */
16struct dm_space_map *dm_sm_metadata_init(void);
17
18/*
19 * Create a fresh space map.
20 */
21int dm_sm_metadata_create(struct dm_space_map *sm,
22 struct dm_transaction_manager *tm,
23 dm_block_t nr_blocks,
24 dm_block_t superblock);
25
26/*
27 * Open from a previously-recorded root.
28 */
29int dm_sm_metadata_open(struct dm_space_map *sm,
30 struct dm_transaction_manager *tm,
31 void *root_le, size_t len);
32
33#endif /* DM_SPACE_MAP_METADATA_H */
diff --git a/drivers/md/persistent-data/dm-space-map.h b/drivers/md/persistent-data/dm-space-map.h
new file mode 100644
index 00000000000..1cbfc6b1638
--- /dev/null
+++ b/drivers/md/persistent-data/dm-space-map.h
@@ -0,0 +1,134 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef _LINUX_DM_SPACE_MAP_H
8#define _LINUX_DM_SPACE_MAP_H
9
10#include "dm-block-manager.h"
11
12/*
13 * struct dm_space_map keeps a record of how many times each block in a device
14 * is referenced. It needs to be fixed on disk as part of the transaction.
15 */
16struct dm_space_map {
17 void (*destroy)(struct dm_space_map *sm);
18
19 /*
20 * You must commit before allocating the newly added space.
21 */
22 int (*extend)(struct dm_space_map *sm, dm_block_t extra_blocks);
23
24 /*
25 * Extensions do not appear in this count until after commit has
26 * been called.
27 */
28 int (*get_nr_blocks)(struct dm_space_map *sm, dm_block_t *count);
29
30 /*
31 * Space maps must never allocate a block from the previous
32 * transaction, in case we need to rollback. This complicates the
33 * semantics of get_nr_free(): it should return the number of blocks
34 * that are available for allocation _now_. For instance you may
35 * have blocks with a zero reference count that will not be
36 * available for allocation until after the next commit.
37 */
38 int (*get_nr_free)(struct dm_space_map *sm, dm_block_t *count);
39
40 int (*get_count)(struct dm_space_map *sm, dm_block_t b, uint32_t *result);
41 int (*count_is_more_than_one)(struct dm_space_map *sm, dm_block_t b,
42 int *result);
43 int (*set_count)(struct dm_space_map *sm, dm_block_t b, uint32_t count);
44
45 int (*commit)(struct dm_space_map *sm);
46
47 int (*inc_block)(struct dm_space_map *sm, dm_block_t b);
48 int (*dec_block)(struct dm_space_map *sm, dm_block_t b);
49
50 /*
51 * new_block will increment the returned block.
52 */
53 int (*new_block)(struct dm_space_map *sm, dm_block_t *b);
54
55 /*
56 * The root contains all the information needed to fix the space map.
57 * Generally this info is small, so squirrel it away in a disk block
58 * along with other info.
59 */
60 int (*root_size)(struct dm_space_map *sm, size_t *result);
61 int (*copy_root)(struct dm_space_map *sm, void *copy_to_here_le, size_t len);
62};
63
64/*----------------------------------------------------------------*/
65
66static inline void dm_sm_destroy(struct dm_space_map *sm)
67{
68 sm->destroy(sm);
69}
70
71static inline int dm_sm_extend(struct dm_space_map *sm, dm_block_t extra_blocks)
72{
73 return sm->extend(sm, extra_blocks);
74}
75
76static inline int dm_sm_get_nr_blocks(struct dm_space_map *sm, dm_block_t *count)
77{
78 return sm->get_nr_blocks(sm, count);
79}
80
81static inline int dm_sm_get_nr_free(struct dm_space_map *sm, dm_block_t *count)
82{
83 return sm->get_nr_free(sm, count);
84}
85
86static inline int dm_sm_get_count(struct dm_space_map *sm, dm_block_t b,
87 uint32_t *result)
88{
89 return sm->get_count(sm, b, result);
90}
91
92static inline int dm_sm_count_is_more_than_one(struct dm_space_map *sm,
93 dm_block_t b, int *result)
94{
95 return sm->count_is_more_than_one(sm, b, result);
96}
97
98static inline int dm_sm_set_count(struct dm_space_map *sm, dm_block_t b,
99 uint32_t count)
100{
101 return sm->set_count(sm, b, count);
102}
103
104static inline int dm_sm_commit(struct dm_space_map *sm)
105{
106 return sm->commit(sm);
107}
108
109static inline int dm_sm_inc_block(struct dm_space_map *sm, dm_block_t b)
110{
111 return sm->inc_block(sm, b);
112}
113
114static inline int dm_sm_dec_block(struct dm_space_map *sm, dm_block_t b)
115{
116 return sm->dec_block(sm, b);
117}
118
119static inline int dm_sm_new_block(struct dm_space_map *sm, dm_block_t *b)
120{
121 return sm->new_block(sm, b);
122}
123
124static inline int dm_sm_root_size(struct dm_space_map *sm, size_t *result)
125{
126 return sm->root_size(sm, result);
127}
128
129static inline int dm_sm_copy_root(struct dm_space_map *sm, void *copy_to_here_le, size_t len)
130{
131 return sm->copy_root(sm, copy_to_here_le, len);
132}
133
134#endif /* _LINUX_DM_SPACE_MAP_H */
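
dm_space_map is just a table of function pointers; the dm_sm_*() helpers above are one-line static inline dispatchers, and each implementation (sm_disk, sm_metadata, the bootstrap allocator) embeds the struct and memcpy()s its own ops table into it. A small userspace sketch of the same dispatch pattern follows, using an invented toy_flat_sm that keeps reference counts in a flat array.

/* Sketch of the dm_space_map-style vtable: a struct of function pointers
 * plus a concrete implementation embedding it.  Names are invented. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct toy_space_map {
        int (*get_count)(struct toy_space_map *sm, uint64_t b, uint32_t *result);
        int (*inc_block)(struct toy_space_map *sm, uint64_t b);
};

/* container_of in miniature: recover the implementation from the interface. */
#define toy_container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

struct toy_flat_sm {
        struct toy_space_map sm;        /* embedded interface, as in sm_disk */
        uint32_t counts[16];
};

static int flat_get_count(struct toy_space_map *sm, uint64_t b, uint32_t *result)
{
        struct toy_flat_sm *f = toy_container_of(sm, struct toy_flat_sm, sm);

        *result = f->counts[b & 15];
        return 0;
}

static int flat_inc_block(struct toy_space_map *sm, uint64_t b)
{
        struct toy_flat_sm *f = toy_container_of(sm, struct toy_flat_sm, sm);

        f->counts[b & 15]++;
        return 0;
}

static const struct toy_space_map flat_ops = {
        .get_count = flat_get_count,
        .inc_block = flat_inc_block,
};

int main(void)
{
        struct toy_flat_sm f = { .sm = flat_ops };
        uint32_t count;

        f.sm.inc_block(&f.sm, 7);
        f.sm.get_count(&f.sm, 7, &count);
        printf("block 7 refcount = %u\n", count);
        return 0;
}
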
diff --git a/drivers/md/persistent-data/dm-transaction-manager.c b/drivers/md/persistent-data/dm-transaction-manager.c
new file mode 100644
index 00000000000..728e89a3f97
--- /dev/null
+++ b/drivers/md/persistent-data/dm-transaction-manager.c
@@ -0,0 +1,400 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6#include "dm-transaction-manager.h"
7#include "dm-space-map.h"
8#include "dm-space-map-checker.h"
9#include "dm-space-map-disk.h"
10#include "dm-space-map-metadata.h"
11#include "dm-persistent-data-internal.h"
12
13#include <linux/module.h>
14#include <linux/slab.h>
15#include <linux/device-mapper.h>
16
17#define DM_MSG_PREFIX "transaction manager"
18
19/*----------------------------------------------------------------*/
20
21struct shadow_info {
22 struct hlist_node hlist;
23 dm_block_t where;
24};
25
26/*
27 * It would be nice if this hash scaled with the size of the transaction.
28 */
29#define HASH_SIZE 256
30#define HASH_MASK (HASH_SIZE - 1)
31
32struct dm_transaction_manager {
33 int is_clone;
34 struct dm_transaction_manager *real;
35
36 struct dm_block_manager *bm;
37 struct dm_space_map *sm;
38
39 spinlock_t lock;
40 struct hlist_head buckets[HASH_SIZE];
41};
42
43/*----------------------------------------------------------------*/
44
45static int is_shadow(struct dm_transaction_manager *tm, dm_block_t b)
46{
47 int r = 0;
48 unsigned bucket = dm_hash_block(b, HASH_MASK);
49 struct shadow_info *si;
50 struct hlist_node *n;
51
52 spin_lock(&tm->lock);
53 hlist_for_each_entry(si, n, tm->buckets + bucket, hlist)
54 if (si->where == b) {
55 r = 1;
56 break;
57 }
58 spin_unlock(&tm->lock);
59
60 return r;
61}
62
63/*
64 * This can silently fail if there's no memory. We're ok with this since
65 * creating redundant shadows causes no harm.
66 */
67static void insert_shadow(struct dm_transaction_manager *tm, dm_block_t b)
68{
69 unsigned bucket;
70 struct shadow_info *si;
71
72 si = kmalloc(sizeof(*si), GFP_NOIO);
73 if (si) {
74 si->where = b;
75 bucket = dm_hash_block(b, HASH_MASK);
76 spin_lock(&tm->lock);
77 hlist_add_head(&si->hlist, tm->buckets + bucket);
78 spin_unlock(&tm->lock);
79 }
80}
81
82static void wipe_shadow_table(struct dm_transaction_manager *tm)
83{
84 struct shadow_info *si;
85 struct hlist_node *n, *tmp;
86 struct hlist_head *bucket;
87 int i;
88
89 spin_lock(&tm->lock);
90 for (i = 0; i < HASH_SIZE; i++) {
91 bucket = tm->buckets + i;
92 hlist_for_each_entry_safe(si, n, tmp, bucket, hlist)
93 kfree(si);
94
95 INIT_HLIST_HEAD(bucket);
96 }
97
98 spin_unlock(&tm->lock);
99}
100
101/*----------------------------------------------------------------*/
102
103static struct dm_transaction_manager *dm_tm_create(struct dm_block_manager *bm,
104 struct dm_space_map *sm)
105{
106 int i;
107 struct dm_transaction_manager *tm;
108
109 tm = kmalloc(sizeof(*tm), GFP_KERNEL);
110 if (!tm)
111 return ERR_PTR(-ENOMEM);
112
113 tm->is_clone = 0;
114 tm->real = NULL;
115 tm->bm = bm;
116 tm->sm = sm;
117
118 spin_lock_init(&tm->lock);
119 for (i = 0; i < HASH_SIZE; i++)
120 INIT_HLIST_HEAD(tm->buckets + i);
121
122 return tm;
123}
124
125struct dm_transaction_manager *dm_tm_create_non_blocking_clone(struct dm_transaction_manager *real)
126{
127 struct dm_transaction_manager *tm;
128
129 tm = kmalloc(sizeof(*tm), GFP_KERNEL);
130 if (tm) {
131 tm->is_clone = 1;
132 tm->real = real;
133 }
134
135 return tm;
136}
137EXPORT_SYMBOL_GPL(dm_tm_create_non_blocking_clone);
138
139void dm_tm_destroy(struct dm_transaction_manager *tm)
140{
141 kfree(tm);
142}
143EXPORT_SYMBOL_GPL(dm_tm_destroy);
144
145int dm_tm_pre_commit(struct dm_transaction_manager *tm)
146{
147 int r;
148
149 if (tm->is_clone)
150 return -EWOULDBLOCK;
151
152 r = dm_sm_commit(tm->sm);
153 if (r < 0)
154 return r;
155
156 return 0;
157}
158EXPORT_SYMBOL_GPL(dm_tm_pre_commit);
159
160int dm_tm_commit(struct dm_transaction_manager *tm, struct dm_block *root)
161{
162 if (tm->is_clone)
163 return -EWOULDBLOCK;
164
165 wipe_shadow_table(tm);
166
167 return dm_bm_flush_and_unlock(tm->bm, root);
168}
169EXPORT_SYMBOL_GPL(dm_tm_commit);
170
171int dm_tm_new_block(struct dm_transaction_manager *tm,
172 struct dm_block_validator *v,
173 struct dm_block **result)
174{
175 int r;
176 dm_block_t new_block;
177
178 if (tm->is_clone)
179 return -EWOULDBLOCK;
180
181 r = dm_sm_new_block(tm->sm, &new_block);
182 if (r < 0)
183 return r;
184
185 r = dm_bm_write_lock_zero(tm->bm, new_block, v, result);
186 if (r < 0) {
187 dm_sm_dec_block(tm->sm, new_block);
188 return r;
189 }
190
191 /*
192 * New blocks count as shadows in that they don't need to be
193 * shadowed again.
194 */
195 insert_shadow(tm, new_block);
196
197 return 0;
198}
199
200static int __shadow_block(struct dm_transaction_manager *tm, dm_block_t orig,
201 struct dm_block_validator *v,
202 struct dm_block **result)
203{
204 int r;
205 dm_block_t new;
206 struct dm_block *orig_block;
207
208 r = dm_sm_new_block(tm->sm, &new);
209 if (r < 0)
210 return r;
211
212 r = dm_sm_dec_block(tm->sm, orig);
213 if (r < 0)
214 return r;
215
216 r = dm_bm_read_lock(tm->bm, orig, v, &orig_block);
217 if (r < 0)
218 return r;
219
220 r = dm_bm_unlock_move(orig_block, new);
221 if (r < 0) {
222 dm_bm_unlock(orig_block);
223 return r;
224 }
225
226 return dm_bm_write_lock(tm->bm, new, v, result);
227}
228
229int dm_tm_shadow_block(struct dm_transaction_manager *tm, dm_block_t orig,
230 struct dm_block_validator *v, struct dm_block **result,
231 int *inc_children)
232{
233 int r;
234
235 if (tm->is_clone)
236 return -EWOULDBLOCK;
237
238 r = dm_sm_count_is_more_than_one(tm->sm, orig, inc_children);
239 if (r < 0)
240 return r;
241
242 if (is_shadow(tm, orig) && !*inc_children)
243 return dm_bm_write_lock(tm->bm, orig, v, result);
244
245 r = __shadow_block(tm, orig, v, result);
246 if (r < 0)
247 return r;
248 insert_shadow(tm, dm_block_location(*result));
249
250 return r;
251}
252
253int dm_tm_read_lock(struct dm_transaction_manager *tm, dm_block_t b,
254 struct dm_block_validator *v,
255 struct dm_block **blk)
256{
257 if (tm->is_clone)
258 return dm_bm_read_try_lock(tm->real->bm, b, v, blk);
259
260 return dm_bm_read_lock(tm->bm, b, v, blk);
261}
262
263int dm_tm_unlock(struct dm_transaction_manager *tm, struct dm_block *b)
264{
265 return dm_bm_unlock(b);
266}
267EXPORT_SYMBOL_GPL(dm_tm_unlock);
268
269void dm_tm_inc(struct dm_transaction_manager *tm, dm_block_t b)
270{
271 /*
272 * The non-blocking clone doesn't support this.
273 */
274 BUG_ON(tm->is_clone);
275
276 dm_sm_inc_block(tm->sm, b);
277}
278EXPORT_SYMBOL_GPL(dm_tm_inc);
279
280void dm_tm_dec(struct dm_transaction_manager *tm, dm_block_t b)
281{
282 /*
283 * The non-blocking clone doesn't support this.
284 */
285 BUG_ON(tm->is_clone);
286
287 dm_sm_dec_block(tm->sm, b);
288}
289EXPORT_SYMBOL_GPL(dm_tm_dec);
290
291int dm_tm_ref(struct dm_transaction_manager *tm, dm_block_t b,
292 uint32_t *result)
293{
294 if (tm->is_clone)
295 return -EWOULDBLOCK;
296
297 return dm_sm_get_count(tm->sm, b, result);
298}
299
300struct dm_block_manager *dm_tm_get_bm(struct dm_transaction_manager *tm)
301{
302 return tm->bm;
303}
304
305/*----------------------------------------------------------------*/
306
307static int dm_tm_create_internal(struct dm_block_manager *bm,
308 dm_block_t sb_location,
309 struct dm_block_validator *sb_validator,
310 size_t root_offset, size_t root_max_len,
311 struct dm_transaction_manager **tm,
312 struct dm_space_map **sm,
313 struct dm_block **sblock,
314 int create)
315{
316 int r;
317 struct dm_space_map *inner;
318
319 inner = dm_sm_metadata_init();
320 if (IS_ERR(inner))
321 return PTR_ERR(inner);
322
323 *tm = dm_tm_create(bm, inner);
324 if (IS_ERR(*tm)) {
325 dm_sm_destroy(inner);
326 return PTR_ERR(*tm);
327 }
328
329 if (create) {
330 r = dm_bm_write_lock_zero(dm_tm_get_bm(*tm), sb_location,
331 sb_validator, sblock);
332 if (r < 0) {
333 DMERR("couldn't lock superblock");
334 goto bad1;
335 }
336
337 r = dm_sm_metadata_create(inner, *tm, dm_bm_nr_blocks(bm),
338 sb_location);
339 if (r) {
340 DMERR("couldn't create metadata space map");
341 goto bad2;
342 }
343
344 *sm = dm_sm_checker_create(inner);
345 if (!*sm)
346 goto bad2;
347
348 } else {
349 r = dm_bm_write_lock(dm_tm_get_bm(*tm), sb_location,
350 sb_validator, sblock);
351 if (r < 0) {
352 DMERR("couldn't lock superblock");
353 goto bad1;
354 }
355
356 r = dm_sm_metadata_open(inner, *tm,
357 dm_block_data(*sblock) + root_offset,
358 root_max_len);
359 if (r) {
360 DMERR("couldn't open metadata space map");
361 goto bad2;
362 }
363
364 *sm = dm_sm_checker_create(inner);
365 if (!*sm)
366 goto bad2;
367 }
368
369 return 0;
370
371bad2:
372 dm_tm_unlock(*tm, *sblock);
373bad1:
374 dm_tm_destroy(*tm);
375 dm_sm_destroy(inner);
376 return r;
377}
378
379int dm_tm_create_with_sm(struct dm_block_manager *bm, dm_block_t sb_location,
380 struct dm_block_validator *sb_validator,
381 struct dm_transaction_manager **tm,
382 struct dm_space_map **sm, struct dm_block **sblock)
383{
384 return dm_tm_create_internal(bm, sb_location, sb_validator,
385 0, 0, tm, sm, sblock, 1);
386}
387EXPORT_SYMBOL_GPL(dm_tm_create_with_sm);
388
389int dm_tm_open_with_sm(struct dm_block_manager *bm, dm_block_t sb_location,
390 struct dm_block_validator *sb_validator,
391 size_t root_offset, size_t root_max_len,
392 struct dm_transaction_manager **tm,
393 struct dm_space_map **sm, struct dm_block **sblock)
394{
395 return dm_tm_create_internal(bm, sb_location, sb_validator, root_offset,
396 root_max_len, tm, sm, sblock, 0);
397}
398EXPORT_SYMBOL_GPL(dm_tm_open_with_sm);
399
400/*----------------------------------------------------------------*/
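
To make the create/open split in dm_tm_create_internal() concrete, here is a minimal caller sketch. It is not part of this patch: example_format_metadata() and the sb_location of 0 are illustrative, and the block manager and superblock validator are assumed to have been set up elsewhere by the client (as dm-thin-metadata.c does).

#include "dm-transaction-manager.h"

/* Illustrative only: format a fresh metadata area whose superblock lives
 * at block 0.  Teardown of tm and sm is omitted for brevity. */
static int example_format_metadata(struct dm_block_manager *bm,
				   struct dm_block_validator *sb_validator)
{
	int r;
	struct dm_transaction_manager *tm;
	struct dm_space_map *sm;
	struct dm_block *sblock;

	/* Opens a transaction and returns the superblock write-locked. */
	r = dm_tm_create_with_sm(bm, 0, sb_validator, &tm, &sm, &sblock);
	if (r < 0)
		return r;

	/*
	 * A real client would now serialise its own superblock fields and
	 * the space map root into dm_block_data(sblock).
	 */

	r = dm_tm_pre_commit(tm);	/* phase (i): flush space map changes */
	if (r < 0) {
		dm_tm_unlock(tm, sblock);
		return r;
	}

	return dm_tm_commit(tm, sblock);	/* phase (ii): drops the write lock */
}
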
diff --git a/drivers/md/persistent-data/dm-transaction-manager.h b/drivers/md/persistent-data/dm-transaction-manager.h
new file mode 100644
index 00000000000..6da784871db
--- /dev/null
+++ b/drivers/md/persistent-data/dm-transaction-manager.h
@@ -0,0 +1,130 @@
1/*
2 * Copyright (C) 2011 Red Hat, Inc.
3 *
4 * This file is released under the GPL.
5 */
6
7#ifndef _LINUX_DM_TRANSACTION_MANAGER_H
8#define _LINUX_DM_TRANSACTION_MANAGER_H
9
10#include "dm-block-manager.h"
11
12struct dm_transaction_manager;
13struct dm_space_map;
14
15/*----------------------------------------------------------------*/
16
17/*
18 * This manages the scope of a transaction. It also enforces immutability
19 * of the on-disk data structures by limiting access to writeable blocks.
20 *
21 * Clients should not fiddle with the block manager directly.
22 */
23
24void dm_tm_destroy(struct dm_transaction_manager *tm);
25
26/*
27 * The non-blocking version of a transaction manager is intended for use in
28 * fast path code that needs to do lookups, e.g. a dm mapping function.
29 * You create the non-blocking variant from a normal tm. The interface is
30 * the same, except that most functions will just return -EWOULDBLOCK.
31 * Methods that return void but may block (viz. dm_tm_inc and dm_tm_dec)
32 * must not be called on a clone. Call dm_tm_destroy() as you would with a
33 * normal tm when you've finished with it. You must not destroy the original
34 * before its clones.
35 */
36struct dm_transaction_manager *dm_tm_create_non_blocking_clone(struct dm_transaction_manager *real);
37
38/*
39 * We use a 2-phase commit here.
40 *
41 * i) In the first phase the block manager is told to start flushing, and
42 * the changes to the space map are written to disk. You should interrogate
43 * your particular space map to get details of its root node etc. to be
44 * included in your superblock.
45 *
46 * ii) @root will be committed last. You shouldn't use more than the
47 * first 512 bytes of @root if you wish the transaction to survive a power
48 * failure. You *must* have a write lock held on @root for both stages (i)
49 * and (ii). The commit will drop the write lock.
50 */
51int dm_tm_pre_commit(struct dm_transaction_manager *tm);
52int dm_tm_commit(struct dm_transaction_manager *tm, struct dm_block *root);
53
54/*
55 * These methods are the only way to get hold of a writeable block.
56 */
57
58/*
59 * dm_tm_new_block() allocates a fresh block within the current
60 * transaction, zeroes it and returns it with the write lock held.
61 * The up-front zeroing means stale data cannot leak if the caller does
62 * not write to the whole block before unlocking; callers that do
63 * overwrite the whole block simply pay for a redundant memset of the
64 * new block.
65 */
66int dm_tm_new_block(struct dm_transaction_manager *tm,
67 struct dm_block_validator *v,
68 struct dm_block **result);
69
70/*
71 * dm_tm_shadow_block() allocates a new block and copies the data from @orig
72 * to it. It then decrements the reference count on the original block. Use
73 * this to update the contents of a block in a data structure; don't
74 * confuse this with a clone - you shouldn't access the orig block after
75 * this operation. Because the tm knows the scope of the transaction it
76 * can optimise requests for a shadow of a shadow to a no-op. Don't forget
77 * to unlock when you've finished with the shadow.
78 *
79 * The @inc_children flag is used to tell the caller whether it needs to
80 * adjust reference counts for children. (Data in the block may refer to
81 * other blocks.)
82 *
83 * Shadowing implicitly drops a reference on @orig so you must not have
84 * it locked when you call this.
85 */
86int dm_tm_shadow_block(struct dm_transaction_manager *tm, dm_block_t orig,
87 struct dm_block_validator *v,
88 struct dm_block **result, int *inc_children);
89
90/*
91 * Read access. You can lock any block you want; if there's an
92 * outstanding write lock on it, this call will block.
93 */
94int dm_tm_read_lock(struct dm_transaction_manager *tm, dm_block_t b,
95 struct dm_block_validator *v,
96 struct dm_block **result);
97
98int dm_tm_unlock(struct dm_transaction_manager *tm, struct dm_block *b);
99
100/*
101 * Functions for altering the reference count of a block directly.
102 */
103void dm_tm_inc(struct dm_transaction_manager *tm, dm_block_t b);
104
105void dm_tm_dec(struct dm_transaction_manager *tm, dm_block_t b);
106
107int dm_tm_ref(struct dm_transaction_manager *tm, dm_block_t b,
108 uint32_t *result);
109
110struct dm_block_manager *dm_tm_get_bm(struct dm_transaction_manager *tm);
111
112/*
113 * A little utility that ties the knot by producing a transaction manager
114 * that has a space map managed by the transaction manager...
115 *
116 * Returns a tm that has an open transaction to write the new disk sm.
117 * Caller should store the new sm root and commit.
118 */
119int dm_tm_create_with_sm(struct dm_block_manager *bm, dm_block_t sb_location,
120 struct dm_block_validator *sb_validator,
121 struct dm_transaction_manager **tm,
122 struct dm_space_map **sm, struct dm_block **sblock);
123
124int dm_tm_open_with_sm(struct dm_block_manager *bm, dm_block_t sb_location,
125 struct dm_block_validator *sb_validator,
126 size_t root_offset, size_t root_max_len,
127 struct dm_transaction_manager **tm,
128 struct dm_space_map **sm, struct dm_block **sblock);
129
130#endif /* _LINUX_DM_TRANSACTION_MANAGER_H */
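
As a usage illustration of the shadowing interface declared above (hypothetical, not from this patch; example_update_block() and node_validator are placeholder names, and dm_block_data()/dm_block_location() come from dm-block-manager.h), a client updating one block inside a transaction would do roughly this:

static int example_update_block(struct dm_transaction_manager *tm,
				struct dm_block_validator *node_validator,
				dm_block_t orig, dm_block_t *new_location)
{
	int r, inc_children;
	struct dm_block *b;

	/* Copy-on-write: get a writeable copy of 'orig' in this transaction. */
	r = dm_tm_shadow_block(tm, orig, node_validator, &b, &inc_children);
	if (r < 0)
		return r;

	/*
	 * inc_children is set the first time the block is shadowed in this
	 * transaction; the caller must then dm_tm_inc() every block that
	 * the shadowed data refers to.
	 */

	/* ... modify dm_block_data(b) here ... */

	*new_location = dm_block_location(b);
	return dm_tm_unlock(tm, b);
}
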
diff --git a/include/linux/device-mapper.h b/include/linux/device-mapper.h
index 99e3e50b5c5..98f34b886f9 100644
--- a/include/linux/device-mapper.h
+++ b/include/linux/device-mapper.h
@@ -10,6 +10,7 @@
10 10
11#include <linux/bio.h> 11#include <linux/bio.h>
12#include <linux/blkdev.h> 12#include <linux/blkdev.h>
13#include <linux/ratelimit.h>
13 14
14struct dm_dev; 15struct dm_dev;
15struct dm_target; 16struct dm_target;
@@ -127,10 +128,6 @@ void dm_put_device(struct dm_target *ti, struct dm_dev *d);
127 * Information about a target type 128 * Information about a target type
128 */ 129 */
129 130
130/*
131 * Target features
132 */
133
134struct target_type { 131struct target_type {
135 uint64_t features; 132 uint64_t features;
136 const char *name; 133 const char *name;
@@ -159,6 +156,30 @@ struct target_type {
159 struct list_head list; 156 struct list_head list;
160}; 157};
161 158
159/*
160 * Target features
161 */
162
163/*
164 * Any table that contains an instance of this target must have only one.
165 */
166#define DM_TARGET_SINGLETON 0x00000001
167#define dm_target_needs_singleton(type) ((type)->features & DM_TARGET_SINGLETON)
168
169/*
170 * Indicates that a target does not support read-only devices.
171 */
172#define DM_TARGET_ALWAYS_WRITEABLE 0x00000002
173#define dm_target_always_writeable(type) \
174 ((type)->features & DM_TARGET_ALWAYS_WRITEABLE)
175
176/*
177 * Any device that contains a table with an instance of this target may never
178 * have tables containing any different target type.
179 */
180#define DM_TARGET_IMMUTABLE 0x00000004
181#define dm_target_is_immutable(type) ((type)->features & DM_TARGET_IMMUTABLE)
182
162struct dm_target { 183struct dm_target {
163 struct dm_table *table; 184 struct dm_table *table;
164 struct target_type *type; 185 struct target_type *type;
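
For reference, a hypothetical target registration using the new flags could look like the sketch below; "example-pool" and the elided callbacks are illustrative, not part of this patch. dm-table.c checks the flags through the accessor macros above while a table is being built.

static struct target_type example_pool_target = {
	.name	  = "example-pool",
	.version  = {1, 0, 0},
	.module   = THIS_MODULE,
	/*
	 * Only one instance per table, the device may never be read-only,
	 * and no table on this device may ever use a different target type.
	 */
	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
		    DM_TARGET_IMMUTABLE,
	/* .ctr, .dtr, .map, .status, ... as for any other target. */
};
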
@@ -375,6 +396,14 @@ void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size);
375 *---------------------------------------------------------------*/ 396 *---------------------------------------------------------------*/
376#define DM_NAME "device-mapper" 397#define DM_NAME "device-mapper"
377 398
399#ifdef CONFIG_PRINTK
400extern struct ratelimit_state dm_ratelimit_state;
401
402#define dm_ratelimit() __ratelimit(&dm_ratelimit_state)
403#else
404#define dm_ratelimit() 0
405#endif
406
378#define DMCRIT(f, arg...) \ 407#define DMCRIT(f, arg...) \
379 printk(KERN_CRIT DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg) 408 printk(KERN_CRIT DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
380 409
@@ -382,7 +411,7 @@ void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size);
382 printk(KERN_ERR DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg) 411 printk(KERN_ERR DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
383#define DMERR_LIMIT(f, arg...) \ 412#define DMERR_LIMIT(f, arg...) \
384 do { \ 413 do { \
385 if (printk_ratelimit()) \ 414 if (dm_ratelimit()) \
386 printk(KERN_ERR DM_NAME ": " DM_MSG_PREFIX ": " \ 415 printk(KERN_ERR DM_NAME ": " DM_MSG_PREFIX ": " \
387 f "\n", ## arg); \ 416 f "\n", ## arg); \
388 } while (0) 417 } while (0)
@@ -391,7 +420,7 @@ void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size);
391 printk(KERN_WARNING DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg) 420 printk(KERN_WARNING DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
392#define DMWARN_LIMIT(f, arg...) \ 421#define DMWARN_LIMIT(f, arg...) \
393 do { \ 422 do { \
394 if (printk_ratelimit()) \ 423 if (dm_ratelimit()) \
395 printk(KERN_WARNING DM_NAME ": " DM_MSG_PREFIX ": " \ 424 printk(KERN_WARNING DM_NAME ": " DM_MSG_PREFIX ": " \
396 f "\n", ## arg); \ 425 f "\n", ## arg); \
397 } while (0) 426 } while (0)
@@ -400,7 +429,7 @@ void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size);
400 printk(KERN_INFO DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg) 429 printk(KERN_INFO DM_NAME ": " DM_MSG_PREFIX ": " f "\n", ## arg)
401#define DMINFO_LIMIT(f, arg...) \ 430#define DMINFO_LIMIT(f, arg...) \
402 do { \ 431 do { \
403 if (printk_ratelimit()) \ 432 if (dm_ratelimit()) \
404 printk(KERN_INFO DM_NAME ": " DM_MSG_PREFIX ": " f \ 433 printk(KERN_INFO DM_NAME ": " DM_MSG_PREFIX ": " f \
405 "\n", ## arg); \ 434 "\n", ## arg); \
406 } while (0) 435 } while (0)
@@ -410,7 +439,7 @@ void *dm_vcalloc(unsigned long nmemb, unsigned long elem_size);
410 printk(KERN_DEBUG DM_NAME ": " DM_MSG_PREFIX " DEBUG: " f "\n", ## arg) 439 printk(KERN_DEBUG DM_NAME ": " DM_MSG_PREFIX " DEBUG: " f "\n", ## arg)
411# define DMDEBUG_LIMIT(f, arg...) \ 440# define DMDEBUG_LIMIT(f, arg...) \
412 do { \ 441 do { \
413 if (printk_ratelimit()) \ 442 if (dm_ratelimit()) \
414 printk(KERN_DEBUG DM_NAME ": " DM_MSG_PREFIX ": " f \ 443 printk(KERN_DEBUG DM_NAME ": " DM_MSG_PREFIX ": " f \
415 "\n", ## arg); \ 444 "\n", ## arg); \
416 } while (0) 445 } while (0)
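
The net effect of the hunks above is that every DM*_LIMIT() caller now shares a single dm-wide ratelimit state (dm_ratelimit_state) instead of the kernel-global printk_ratelimit(). A hypothetical caller is unchanged; the "example" prefix and the function name below are illustrative:

#define DM_MSG_PREFIX "example"

#include <linux/device-mapper.h>

static void example_report_io_error(int err)
{
	/* Expands to a KERN_ERR printk guarded by dm_ratelimit(). */
	DMERR_LIMIT("metadata I/O error %d", err);
}
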
diff --git a/include/linux/dm-ioctl.h b/include/linux/dm-ioctl.h
index 0cb8eff76bd..75fd5573516 100644
--- a/include/linux/dm-ioctl.h
+++ b/include/linux/dm-ioctl.h
@@ -267,9 +267,9 @@ enum {
267#define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl) 267#define DM_DEV_SET_GEOMETRY _IOWR(DM_IOCTL, DM_DEV_SET_GEOMETRY_CMD, struct dm_ioctl)
268 268
269#define DM_VERSION_MAJOR 4 269#define DM_VERSION_MAJOR 4
270#define DM_VERSION_MINOR 21 270#define DM_VERSION_MINOR 22
271#define DM_VERSION_PATCHLEVEL 0 271#define DM_VERSION_PATCHLEVEL 0
272#define DM_VERSION_EXTRA "-ioctl (2011-07-06)" 272#define DM_VERSION_EXTRA "-ioctl (2011-10-19)"
273 273
274/* Status bits */ 274/* Status bits */
275#define DM_READONLY_FLAG (1 << 0) /* In/Out */ 275#define DM_READONLY_FLAG (1 << 0) /* In/Out */
diff --git a/include/linux/dm-kcopyd.h b/include/linux/dm-kcopyd.h
index 5e54458e920..47d9d376e4e 100644
--- a/include/linux/dm-kcopyd.h
+++ b/include/linux/dm-kcopyd.h
@@ -57,5 +57,9 @@ void *dm_kcopyd_prepare_callback(struct dm_kcopyd_client *kc,
57 dm_kcopyd_notify_fn fn, void *context); 57 dm_kcopyd_notify_fn fn, void *context);
58void dm_kcopyd_do_callback(void *job, int read_err, unsigned long write_err); 58void dm_kcopyd_do_callback(void *job, int read_err, unsigned long write_err);
59 59
60int dm_kcopyd_zero(struct dm_kcopyd_client *kc,
61 unsigned num_dests, struct dm_io_region *dests,
62 unsigned flags, dm_kcopyd_notify_fn fn, void *context);
63
60#endif /* __KERNEL__ */ 64#endif /* __KERNEL__ */
61#endif /* _LINUX_DM_KCOPYD_H */ 65#endif /* _LINUX_DM_KCOPYD_H */
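
dm_kcopyd_zero() follows the same asynchronous pattern as dm_kcopyd_copy() but has no source region. A hypothetical caller (zero_done() and example_zero_region() are illustrative names; 'kc' is a previously created kcopyd client) might look like:

#include <linux/dm-io.h>
#include <linux/dm-kcopyd.h>

/* Completion callback: write_err is a bitset of destinations that failed. */
static void zero_done(int read_err, unsigned long write_err, void *context)
{
	/* e.g. wake up whoever queued the zeroing job */
}

static int example_zero_region(struct dm_kcopyd_client *kc,
			       struct block_device *bdev,
			       sector_t begin, sector_t len)
{
	struct dm_io_region dest = {
		.bdev   = bdev,
		.sector = begin,
		.count  = len,
	};

	/* Asynchronous: zero_done() runs once the whole region is zeroed. */
	return dm_kcopyd_zero(kc, 1, &dest, 0, zero_done, NULL);
}
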
diff --git a/include/linux/dm-log-userspace.h b/include/linux/dm-log-userspace.h
index eeace7d3ff1..0678c2adc42 100644
--- a/include/linux/dm-log-userspace.h
+++ b/include/linux/dm-log-userspace.h
@@ -52,15 +52,20 @@
52 * Payload-to-userspace: 52 * Payload-to-userspace:
53 * A single string containing all the argv arguments separated by ' 's 53 * A single string containing all the argv arguments separated by ' 's
54 * Payload-to-kernel: 54 * Payload-to-kernel:
55 * None. ('data_size' in the dm_ulog_request struct should be 0.) 55 * A NUL-terminated string that is the name of the device that is used
56 * as the backing store for the log data. 'dm_get_device' will be called
57 * on this device. ('dm_put_device' will be called on this device
58 * automatically after calling DM_ULOG_DTR.) If there is no device needed
59 * for log data, 'data_size' in the dm_ulog_request struct should be 0.
56 * 60 *
57 * The UUID contained in the dm_ulog_request structure is the reference that 61 * The UUID contained in the dm_ulog_request structure is the reference that
58 * will be used by all request types to a specific log. The constructor must 62 * will be used by all request types to a specific log. The constructor must
59 * record this assotiation with instance created. 63 * record this association with the instance created.
60 * 64 *
61 * When the request has been processed, user-space must return the 65 * When the request has been processed, user-space must return the
62 * dm_ulog_request to the kernel - setting the 'error' field and 66 * dm_ulog_request to the kernel - setting the 'error' field, filling the
63 * 'data_size' appropriately. 67 * data field with the log device if necessary, and setting 'data_size'
68 * appropriately.
64 */ 69 */
65#define DM_ULOG_CTR 1 70#define DM_ULOG_CTR 1
66 71
@@ -377,8 +382,11 @@
377 * dm_ulog_request or a change in the way requests are 382 * dm_ulog_request or a change in the way requests are
378 * issued/handled. Changes are outlined here: 383 * issued/handled. Changes are outlined here:
379 * version 1: Initial implementation 384 * version 1: Initial implementation
385 * version 2: DM_ULOG_CTR allowed to return a string containing a
386 * device name that is to be registered with DM via
387 * 'dm_get_device'.
380 */ 388 */
381#define DM_ULOG_REQUEST_VERSION 1 389#define DM_ULOG_REQUEST_VERSION 2
382 390
383struct dm_ulog_request { 391struct dm_ulog_request {
384 /* 392 /*