author: Mike Snitzer <snitzer@redhat.com> 2014-02-14 11:58:41 -0500
committer: Mike Snitzer <snitzer@redhat.com> 2014-03-05 15:25:35 -0500
commit: 07f2b6e0382ec4c59887d5954683f1a0b265574e
tree: 4863593c0fb83c54cb2865f4f0f6a44969aa28a6
parent: cdc2b4158405f1975f9d5205096f08430eda1c0e
dm thin: ensure user takes action to validate data and metadata consistency
If a thin metadata operation fails the current transaction will abort,
whereby causing potential for IO layers up the stack (e.g. filesystems)
to have data loss. As such, set THIN_METADATA_NEEDS_CHECK_FLAG in the
thin metadata's superblock which:
1) requires the user verify the thin metadata is consistent (e.g. use
   thin_check, etc)
2) suggests the user verify the thin data is consistent (e.g. use fsck)

The only way to clear the superblock's THIN_METADATA_NEEDS_CHECK_FLAG is
to run thin_repair.

On metadata operation failure: abort current metadata transaction, set
pool in read-only mode, and now set the needs_check flag.

As part of this change, constraints are introduced or relaxed:
* don't allow a pool to transition to write mode if needs_check is set
* don't allow data or metadata space to be resized if needs_check is set
* if a thin pool's metadata space is exhausted: the kernel will now
  force the user to take the pool offline for repair before the kernel
  will allow the metadata space to be extended.

Also, update Documentation to include information about when the thin
provisioning target commits metadata, how it handles metadata failures
and running out of space.

Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Joe Thornber <ejt@redhat.com>
-rw-r--r-- Documentation/device-mapper/cache.txt | 11
-rw-r--r-- Documentation/device-mapper/thin-provisioning.txt | 29
-rw-r--r-- drivers/md/dm-thin-metadata.c | 37
-rw-r--r-- drivers/md/dm-thin-metadata.h | 11
-rw-r--r-- drivers/md/dm-thin.c | 76
5 files changed, 135 insertions, 29 deletions
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
index e6b72d355151..68c0f517c60e 100644
--- a/Documentation/device-mapper/cache.txt
+++ b/Documentation/device-mapper/cache.txt
@@ -124,12 +124,11 @@ the default being 204800 sectors (or 100MB).
 Updating on-disk metadata
 -------------------------
 
-On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is
-written. If no such requests are made then commits will occur every
-second. This means the cache behaves like a physical disk that has a
-write cache (the same is true of the thin-provisioning target). If
-power is lost you may lose some recent writes. The metadata should
-always be consistent in spite of any crash.
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second. This
+means the cache behaves like a physical disk that has a volatile write
+cache. If power is lost you may lose some recent writes. The metadata
+should always be consistent in spite of any crash.
 
 The 'dirty' state for a cache block changes far too frequently for us
 to keep updating it on the fly. So we treat it as a hint. In normal
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
index 8a7a3d46e0da..3b34b4fbb54f 100644
--- a/Documentation/device-mapper/thin-provisioning.txt
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -116,6 +116,35 @@ Resuming a device with a new table itself triggers an event so the
 userspace daemon can use this to detect a situation where a new table
 already exceeds the threshold.
 
+A low water mark for the metadata device is maintained in the kernel and
+will trigger a dm event if free space on the metadata device drops below
+it.
+
+Updating on-disk metadata
+-------------------------
+
+On-disk metadata is committed every time a FLUSH or FUA bio is written.
+If no such requests are made then commits will occur every second. This
+means the thin-provisioning target behaves like a physical disk that has
+a volatile write cache. If power is lost you may lose some recent
+writes. The metadata should always be consistent in spite of any crash.
+
+If data space is exhausted the pool will either error or queue IO
+according to the configuration (see: error_if_no_space). If metadata
+space is exhausted or a metadata operation fails: the pool will error IO
+until the pool is taken offline and repair is performed to 1) fix any
+potential inconsistencies and 2) clear the flag that imposes repair.
+Once the pool's metadata device is repaired it may be resized, which
+will allow the pool to return to normal operation. Note that if a pool
+is flagged as needing repair, the pool's data and metadata devices
+cannot be resized until repair is performed. It should also be noted
+that when the pool's metadata space is exhausted the current metadata
+transaction is aborted. Given that the pool will cache IO whose
+completion may have already been acknowledged to upper IO layers
+(e.g. filesystem) it is strongly suggested that consistency checks
+(e.g. fsck) be performed on those layers when repair of the pool is
+required.
+
 Thin provisioning
 -----------------
 
diff --git a/drivers/md/dm-thin-metadata.c b/drivers/md/dm-thin-metadata.c
index baa87ff12816..fb9efc829182 100644
--- a/drivers/md/dm-thin-metadata.c
+++ b/drivers/md/dm-thin-metadata.c
@@ -76,7 +76,7 @@
 
 #define THIN_SUPERBLOCK_MAGIC 27022010
 #define THIN_SUPERBLOCK_LOCATION 0
-#define THIN_VERSION 1
+#define THIN_VERSION 2
 #define THIN_METADATA_CACHE_SIZE 64
 #define SECTOR_TO_BLOCK_SHIFT 3
 
@@ -1755,3 +1755,38 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
 
 	return r;
 }
+
+int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd)
+{
+	int r;
+	struct dm_block *sblock;
+	struct thin_disk_superblock *disk_super;
+
+	down_write(&pmd->root_lock);
+	pmd->flags |= THIN_METADATA_NEEDS_CHECK_FLAG;
+
+	r = superblock_lock(pmd, &sblock);
+	if (r) {
+		DMERR("couldn't read superblock");
+		goto out;
+	}
+
+	disk_super = dm_block_data(sblock);
+	disk_super->flags = cpu_to_le32(pmd->flags);
+
+	dm_bm_unlock(sblock);
+out:
+	up_write(&pmd->root_lock);
+	return r;
+}
+
+bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd)
+{
+	bool needs_check;
+
+	down_read(&pmd->root_lock);
+	needs_check = pmd->flags & THIN_METADATA_NEEDS_CHECK_FLAG;
+	up_read(&pmd->root_lock);
+
+	return needs_check;
+}
diff --git a/drivers/md/dm-thin-metadata.h b/drivers/md/dm-thin-metadata.h
index 82ea384d36ff..e3c857db195a 100644
--- a/drivers/md/dm-thin-metadata.h
+++ b/drivers/md/dm-thin-metadata.h
@@ -25,6 +25,11 @@
 
 /*----------------------------------------------------------------*/
 
+/*
+ * Thin metadata superblock flags.
+ */
+#define THIN_METADATA_NEEDS_CHECK_FLAG (1 << 0)
+
 struct dm_pool_metadata;
 struct dm_thin_device;
 
@@ -202,6 +207,12 @@ int dm_pool_register_metadata_threshold(struct dm_pool_metadata *pmd,
 					dm_sm_threshold_fn fn,
 					void *context);
 
+/*
+ * Updates the superblock immediately.
+ */
+int dm_pool_metadata_set_needs_check(struct dm_pool_metadata *pmd);
+bool dm_pool_metadata_needs_check(struct dm_pool_metadata *pmd);
+
 /*----------------------------------------------------------------*/
 
 #endif
diff --git a/drivers/md/dm-thin.c b/drivers/md/dm-thin.c
index dfa48ec7b8ea..a04eba905922 100644
--- a/drivers/md/dm-thin.c
+++ b/drivers/md/dm-thin.c
@@ -1403,7 +1403,28 @@ static void set_pool_mode(struct pool *pool, enum pool_mode new_mode)
 {
 	int r;
 	struct pool_c *pt = pool->ti->private;
-	enum pool_mode old_mode = pool->pf.mode;
+	bool needs_check = dm_pool_metadata_needs_check(pool->pmd);
+	enum pool_mode old_mode = get_pool_mode(pool);
+
+	/*
+	 * Never allow the pool to transition to PM_WRITE mode if user
+	 * intervention is required to verify metadata and data consistency.
+	 */
+	if (new_mode == PM_WRITE && needs_check) {
+		DMERR("%s: unable to switch pool to write mode until repaired.",
+		      dm_device_name(pool->pool_md));
+		if (old_mode != new_mode)
+			new_mode = old_mode;
+		else
+			new_mode = PM_READ_ONLY;
+	}
+	/*
+	 * If we were in PM_FAIL mode, rollback of metadata failed. We're
+	 * not going to recover without a thin_repair. So we never let the
+	 * pool move out of the old mode.
+	 */
+	if (old_mode == PM_FAIL)
+		new_mode = old_mode;
 
 	switch (new_mode) {
 	case PM_FAIL:
@@ -1467,19 +1488,28 @@ static void out_of_data_space(struct pool *pool)
 	set_pool_mode(pool, PM_READ_ONLY);
 }
 
-static void metadata_operation_failed(struct pool *pool, const char *op, int r)
+static void abort_transaction(struct pool *pool)
 {
-	dm_block_t free_blocks;
+	const char *dev_name = dm_device_name(pool->pool_md);
+
+	DMERR_LIMIT("%s: aborting current metadata transaction", dev_name);
+	if (dm_pool_abort_metadata(pool->pmd)) {
+		DMERR("%s: failed to abort metadata transaction", dev_name);
+		set_pool_mode(pool, PM_FAIL);
+	}
+
+	if (dm_pool_metadata_set_needs_check(pool->pmd)) {
+		DMERR("%s: failed to set 'needs_check' flag in metadata", dev_name);
+		set_pool_mode(pool, PM_FAIL);
+	}
+}
 
+static void metadata_operation_failed(struct pool *pool, const char *op, int r)
+{
 	DMERR_LIMIT("%s: metadata operation '%s' failed: error = %d",
 		    dm_device_name(pool->pool_md), op, r);
 
-	if (r == -ENOSPC &&
-	    !dm_pool_get_free_metadata_block_count(pool->pmd, &free_blocks) &&
-	    !free_blocks)
-		DMERR_LIMIT("%s: no free metadata space available.",
-			    dm_device_name(pool->pool_md));
-
+	abort_transaction(pool);
 	set_pool_mode(pool, PM_READ_ONLY);
 }
 
@@ -1693,7 +1723,7 @@ static int bind_control_target(struct pool *pool, struct dm_target *ti)
 	/*
 	 * We want to make sure that a pool in PM_FAIL mode is never upgraded.
 	 */
-	enum pool_mode old_mode = pool->pf.mode;
+	enum pool_mode old_mode = get_pool_mode(pool);
 	enum pool_mode new_mode = pt->adjusted_pf.mode;
 
 	/*
@@ -1707,16 +1737,6 @@ static int bind_control_target(struct pool *pool, struct dm_target *ti)
 	pool->pf = pt->adjusted_pf;
 	pool->low_water_blocks = pt->low_water_blocks;
 
-	/*
-	 * If we were in PM_FAIL mode, rollback of metadata failed. We're
-	 * not going to recover without a thin_repair. So we never let the
-	 * pool move out of the old mode. On the other hand a PM_READ_ONLY
-	 * may have been due to a lack of metadata or data space, and may
-	 * now work (ie. if the underlying devices have been resized).
-	 */
-	if (old_mode == PM_FAIL)
-		new_mode = old_mode;
-
 	set_pool_mode(pool, new_mode);
 
 	return 0;
@@ -2259,6 +2279,12 @@ static int maybe_resize_data_dev(struct dm_target *ti, bool *need_commit)
 		return -EINVAL;
 
 	} else if (data_size > sb_data_size) {
+		if (dm_pool_metadata_needs_check(pool->pmd)) {
+			DMERR("%s: unable to grow the data device until repaired.",
+			      dm_device_name(pool->pool_md));
+			return 0;
+		}
+
 		if (sb_data_size)
 			DMINFO("%s: growing the data device from %llu to %llu blocks",
 			       dm_device_name(pool->pool_md),
@@ -2300,6 +2326,12 @@ static int maybe_resize_metadata_dev(struct dm_target *ti, bool *need_commit)
 		return -EINVAL;
 
 	} else if (metadata_dev_size > sb_metadata_dev_size) {
+		if (dm_pool_metadata_needs_check(pool->pmd)) {
+			DMERR("%s: unable to grow the metadata device until repaired.",
+			      dm_device_name(pool->pool_md));
+			return 0;
+		}
+
 		warn_if_metadata_device_too_big(pool->md_dev);
 		DMINFO("%s: growing the metadata device from %llu to %llu blocks",
 		       dm_device_name(pool->pool_md),
@@ -2801,7 +2833,7 @@ static struct target_type pool_target = {
 	.name = "thin-pool",
 	.features = DM_TARGET_SINGLETON | DM_TARGET_ALWAYS_WRITEABLE |
 		    DM_TARGET_IMMUTABLE,
-	.version = {1, 10, 0},
+	.version = {1, 11, 0},
 	.module = THIS_MODULE,
 	.ctr = pool_ctr,
 	.dtr = pool_dtr,
@@ -3091,7 +3123,7 @@ static int thin_iterate_devices(struct dm_target *ti,
 
 static struct target_type thin_target = {
 	.name = "thin",
-	.version = {1, 10, 0},
+	.version = {1, 11, 0},
 	.module = THIS_MODULE,
 	.ctr = thin_ctr,
 	.dtr = thin_dtr,